Advanced Web Scraping with R
Introduction
You have just watched a sci-fi movie in which the protagonist builds a humanoid robot that can converse with people and even express feelings the way humans do. Excited, you now want to build one yourself. But wait! Did you know that intelligence is built on information? So how do you get that information?
Web scraping offers one route to such information. To start, you need to learn the different approaches to fetching data from the web using R.
Fetching Data from a Single Table or Multiple Tables on an HTML Webpage
Yahoo! Finance hosts stock market data on stocks, commodities, futures, and more. Once on the site, search for "Pluralsight" or "PS" in the search box. This opens a page dedicated to Pluralsight's stock market data. Since the page already offers an upfront option to download historical data, there is no need to scrape it. But what about the company's shareholders?
Click the "Holders" tab, which lists three sections:
- Major Holders
- Top Institutional Holders
- Top Mutual Fund Holders
Each section contains tabular data. To scrape these tables, use the rvest and xml2 libraries.
The code below gets the job done. Read the comments to understand what each command does:
# --
# Importing the rvest library
# It internally imports xml2 library too
# --
library(rvest)
# --
# Store the URL of the Holders tab in a variable, here called link
# --
link <- "https://finance.yahoo.com/quote/PS/holders?p=PS"
# --
# Read the HTML webpage using the xml2 package function read_html()
# --
driver <- read_html(link)
# --
# Since we know the webpage contains tabular data, pass "table" as the CSS selector
# The variable "allTables" will hold all three tables
# --
allTables <- html_nodes(driver, css = "table")
# --
# Fetch any of the three tables based on their index
# 1. Major Holders
# --
majorHolders <- html_table(allTables)[[1]]
majorHolders
# X1 X2
# 1 5.47% % of Shares Held by All Insider
# 2 110.24% % of Shares Held by Institutions
# 3 116.62% % of Float Held by Institutions
# 4 275 Number of Institutions Holding Shares
# --
# 2. Top Institutional Holders
# --
topInstHolders <- html_table(allTables)[[2]]
topInstHolders
# Holder Shares Date Reported % Out Value
# 1 Insight Holdings Group, Llc 18,962,692 Dec 30, 2019 17.99% 326,347,929
# 2 FMR, LLC 10,093,850 Dec 30, 2019 9.58% 173,715,158
# 3 Vanguard Group, Inc. (The) 7,468,146 Dec 30, 2019 7.09% 128,526,792
# 4 Mackenzie Financial Corporation 4,837,441 Dec 30, 2019 4.59% 83,252,359
# 5 Crewe Advisors LLC 4,761,680 Dec 30, 2019 4.52% 81,948,512
# 6 Ensign Peak Advisors, Inc 4,461,122 Dec 30, 2019 4.23% 76,775,909
# 7 Riverbridge Partners LLC 4,021,869 Mar 30, 2020 3.82% 44,160,121
# 8 First Trust Advisors LP 3,970,327 Dec 30, 2019 3.77% 68,329,327
# 9 Fred Alger Management, LLC 3,875,827 Dec 30, 2019 3.68% 66,702,982
# 10 ArrowMark Colorado Holdings LLC 3,864,321 Dec 30, 2019 3.67% 66,504,964
# --
# 3. Top Mutual Fund Holders
# --
topMutualFundHolders <- html_table(allTables)[[3]]
topMutualFundHolders
# Holder Shares Date Reported % Out Value
# 1 First Trust Dow Jones Internet Index (SM) Fund 3,964,962 Dec 30, 2019 3.76% 68,236,996
# 2 Alger Small Cap Focus Fund 3,527,274 Oct 30, 2019 3.35% 63,773,113
# 3 Fidelity Select Portfolios - Software & IT Services Portfolio 3,297,900 Jan 30, 2020 3.13% 63,946,281
# 4 Vanguard Total Stock Market Index Fund 2,264,398 Dec 30, 2019 2.15% 38,970,289
# 5 Vanguard Small-Cap Index Fund 2,094,866 Dec 30, 2019 1.99% 36,052,643
# 6 Ivy Small Cap Growth Fund 1,302,887 Sep 29, 2019 1.24% 21,881,987
# 7 Vanguard Small Cap Value Index Fund 1,278,504 Dec 30, 2019 1.21% 22,003,053
# 8 Vanguard Extended Market Index Fund 1,186,015 Dec 30, 2019 1.13% 20,411,318
# 9 Franklin Strategic Series-Franklin Small Cap Growth Fund 1,134,200 Oct 30, 2019 1.08% 20,506,336
# 10 Fidelity Stock Selector All Cap Fund 1,018,833 Jan 30, 2020 0.97% 19,755,171
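The Shares, % Out, and Value columns above come back as character strings. If you plan to analyze them, a minimal cleanup sketch like the one below can help. Note that the helper function cleanHolders is not part of the original code; it is an illustrative assumption based on the column names shown in the output above:
# --
# Hypothetical helper (illustration only): converts the character columns
# of a scraped holders table into numeric types
# --
cleanHolders <- function(df) {
  # Strip thousands separators from share counts and values
  df$Shares <- as.numeric(gsub(",", "", df$Shares))
  df$Value <- as.numeric(gsub(",", "", df$Value))
  # Turn "17.99%"-style strings into plain numbers
  df$`% Out` <- as.numeric(gsub("%", "", df$`% Out`))
  return(df)
}
# Example usage:
# topInstHolders <- cleanHolders(topInstHolders)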
Fetching Different Nodes from a Webpage Using CSS Selectors
You can learn how to fetch data using CSS selectors from my GitHub blog.
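As a quick, self-contained illustration (the HTML fragment and class names below are invented for demonstration and are not taken from that blog post), here is how CSS selectors pick out nodes with rvest:
library(rvest)
# A made-up HTML fragment for demonstration only
page <- read_html('<div class="course">
  <h2 class="course-title">Machine Learning Basics</h2>
  <span class="course-author">Jane Doe</span>
</div>')
# "h2.course-title" matches <h2> elements carrying the class "course-title"
html_nodes(page, "h2.course-title") %>% html_text()
# [1] "Machine Learning Basics"
# ".course-author" matches any element carrying the class "course-author"
html_nodes(page, ".course-author") %>% html_text()
# [1] "Jane Doe"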
The section above helps you understand how to fetch entities when a single webpage is dedicated to one skill. But Pluralsight offers far more skills than just machine learning. Take a look at the main skills fetched from this URL, shown in the figure below:
You can observe a total of 10 main skills, each with a different URL. Ignore the "Browse all courses" section, as it redirects back to the same webpage.
The goal here is to feed the R program just one URL, https://www.pluralsight.com/browse, and have it automatically navigate to each of the 10 skill webpages and extract all the course details, as shown below:
library(rvest)
library(stringr) # For data cleaning
link <- "https://www.pluralsight.com/browse"
driver <- read_html(link)
# Extracting sub-URLs
# Here, tile-box is the parent class which holds the content in nested classes.
# First, go inside the sub-class using html_children(), then fetch the URL of each skill page
subURLs <- html_nodes(driver, 'div.tile-box') %>%
  html_children() %>%
  html_attr('href')
# Removing NA values and the last `/browse` URL
subURLs <- subURLs[!is.na(subURLs)][1:10]
# Main URL - used to complete the sub-URLs above
mainURL <- "https://www.pluralsight.com"
# This function fetches the four entities you learned about in the previous section of this guide
entity <- function(s){
  # Course Title
  # Since the number of courses may differ from skill to skill,
  # the course names are fetched dynamically
  v <- html_nodes(s, "div.course-item__info") %>%
    html_children()
  titles <- gsub("<|>", "", str_extract(v[!is.na(str_match(v, "course-item__title"))], ">.*<"))
  # Course Authors
  authors <- html_nodes(s, "div.course--item__list.course-item__author") %>% html_text()
  # Course Level
  level <- html_nodes(s, "div.course--item__list.course-item__level") %>% html_text()
  # Course Duration
  duration <- html_nodes(s, "div.course--item__list.course-item__duration") %>% html_text()
  # Creating the final data frame
  courses <- data.frame(titles, authors, level, duration)
  return(courses)
}
# A for loop that goes through all the URLs, fetches the entities, and displays them on screen
for (i in 1:10) {
  subDriver <- read_html(paste(mainURL, subURLs[i], sep = ""))
  print(entity(subDriver))
}
In the code above, note the roles of html_children() and html_attr(). The inline comments briefly explain what each command does. The output will look similar to that of the previous section, produced once for each skill.
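If the behavior of these two functions is still unclear, here is a small self-contained sketch. The HTML fragment below is invented to mimic the tile-box structure and is not taken from the Pluralsight page:
library(rvest)
# A made-up fragment mimicking a parent class with nested links
page <- read_html('<div class="tile-box">
  <a href="/browse/data-professional">Data Professional</a>
  <a href="/browse/software-development">Software Development</a>
</div>')
box <- html_nodes(page, "div.tile-box")
# html_children() steps one level down to the nested <a> elements
links <- html_children(box)
# html_attr() reads an attribute from each node, here the href
html_attr(links, "href")
# [1] "/browse/data-professional"    "/browse/software-development"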
Controlling a Browser from R and Scraping Data
Suppose you want to scrape the latest Google News about Pluralsight. You could manually open www.google.com, search for the keyword "Pluralsight", and then click "News".
What if all these steps could be automated, so that you fetch the latest news with nothing more than a small R script?
Note: Before moving on to the R code, make sure Docker is set up on your system by following these steps:
- Download and install Docker
- Open the Docker Terminal and run docker pull selenium/standalone-chrome. If you are a Firefox user, replace chrome with firefox.
- Then run docker run -d -p 4445:4444 selenium/standalone-chrome
- If the two commands above succeed, run docker-machine ip and note down the IP address to be used in the R code
The detailed code using the RSelenium library is given below:
library(RSelenium)
# Initiate the connection, remember remoteServerAddr needs to be replaced with the IP address you have
# received from the Docker Terminal
driver <- remoteDriver(browserName = "chrome", remoteServerAddr = "192.168.99.100", port = 4445L)
driver$open()
# Provide the URL and let the driver load it for you
driver$navigate("https://www.google.com/")
# Google's search box carries the attribute name="q". Locate this element.
init <- driver$findElement(using = 'name', "q")
# Enter the search keyword and press the Enter key
init$sendKeysToElement(list("Pluralsight", key = "enter"))
# You have now landed on the page with the "All" results for Pluralsight. Select the XPath of the News tab and click it.
News_tab <- driver$findElement(using = 'xpath', "//*[@id=\"hdtb-msb-vis\"]/div[2]/a")
News_tab$clickElement()
# You are now on the News results. Use the CSS selector that matches all the news links (here, a.l)
# Note that you must use findElements (with an s), not findElement; the latter returns only one result.
res <- driver$findElements(using = 'css selector', 'a.l')
# List out the latest headlines
headlines <- unlist(lapply(res, function(x){x$getElementText()}))
headlines
# [1] "Pluralsight has free courses to help you learn Microsoft Azure ..."
# [2] "Pluralsight offers free access to full portfolio of skill ..."
# [3] "Will Pluralsight Continue to Surge Higher?"
# [4] "The CEO of Pluralsight explains why the online tech skills ..."
# [5] "Pluralsight Is Free For the Month of April"
# [6] "This Pluralsight deal lets you learn some new skills from home ..."
# [7] "Pluralsight One Commits Over $1 Million to Strategic Nonprofit ..."
# [8] "Pluralsight Announces First Quarter 2020 Results"
# [9] "Pluralsight Announces Date for its First Quarter 20