Advanced Web Scraping with R
Introduction
You have just watched a sci-fi movie in which the protagonist builds a humanoid robot that can converse with people and even express feelings the way humans do. Excited, you now want to build one yourself. But wait! Did you know that intelligence is built on information? So how do you get that information?
Web scraping offers one route to such information. To start, you need to learn the different approaches to fetching data from the web using R.
Fetching Data from a Single Table or Multiple Tables on an HTML Webpage
Yahoo! Finance hosts stock market data on stocks, commodities, futures, and more. Once on the site, search for "Pluralsight" or "PS" in the search box. This opens a page dedicated to Pluralsight's stock market data. Since the page already offers an upfront option to download historical data, there is no need to scrape it. But what about the company's shareholders?
Click the "Holders" tab, which lists three sections:
- Major Holders
- Top Institutional Holders
- Top Mutual Fund Holders
Each section contains tabular data. To scrape these tables, use the rvest and xml2 libraries.
The code below gets the job done. Read the comments to understand what each command does:
# --
# Importing the rvest library
# It internally imports xml2 library too
# --
library(rvest)
# --
# Store the URL of the Holders tab in a variable, here called link
# --
link <- "https://finance.yahoo.com/quote/PS/holders?p=PS"
# --
# Read the HTML webpage using the xml2 package function read_html()
# --
driver <- read_html(link)
# --
# Since we know the webpage contains tabular data, pass "table" as the CSS selector
# The variable "allTables" will hold all three tables
# --
allTables <- html_nodes(driver, css = "table")
# --
# Fetch any of the three tables based on their index
# 1. Major Holders
# --
majorHolders <- html_table(allTables)[[1]]
majorHolders
# X1 X2
# 1 5.47% % of Shares Held by All Insider
# 2 110.24% % of Shares Held by Institutions
# 3 116.62% % of Float Held by Institutions
# 4 275 Number of Institutions Holding Shares
# --
# 2. Top Institutional Holders
# --
topInstHolders <- html_table(allTables)[[2]]
topInstHolders
# Holder Shares Date Reported % Out Value
# 1 Insight Holdings Group, Llc 18,962,692 Dec 30, 2019 17.99% 326,347,929
# 2 FMR, LLC 10,093,850 Dec 30, 2019 9.58% 173,715,158
# 3 Vanguard Group, Inc. (The) 7,468,146 Dec 30, 2019 7.09% 128,526,792
# 4 Mackenzie Financial Corporation 4,837,441 Dec 30, 2019 4.59% 83,252,359
# 5 Crewe Advisors LLC 4,761,680 Dec 30, 2019 4.52% 81,948,512
# 6 Ensign Peak Advisors, Inc 4,461,122 Dec 30, 2019 4.23% 76,775,909
# 7 Riverbridge Partners LLC 4,021,869 Mar 30, 2020 3.82% 44,160,121
# 8 First Trust Advisors LP 3,970,327 Dec 30, 2019 3.77% 68,329,327
# 9 Fred Alger Management, LLC 3,875,827 Dec 30, 2019 3.68% 66,702,982
# 10 ArrowMark Colorado Holdings LLC 3,864,321 Dec 30, 2019 3.67% 66,504,964
# --
# 3. Top Mutual Fund Holders
# --
topMutualFundHolders <- html_table(allTables)[[3]]
topMutualFundHolders
# Holder Shares Date Reported % Out Value
# 1 First Trust Dow Jones Internet Index (SM) Fund 3,964,962 Dec 30, 2019 3.76% 68,236,996
# 2 Alger Small Cap Focus Fund 3,527,274 Oct 30, 2019 3.35% 63,773,113
# 3 Fidelity Select Portfolios - Software & IT Services Portfolio 3,297,900 Jan 30, 2020 3.13% 63,946,281
# 4 Vanguard Total Stock Market Index Fund 2,264,398 Dec 30, 2019 2.15% 38,970,289
# 5 Vanguard Small-Cap Index Fund 2,094,866 Dec 30, 2019 1.99% 36,052,643
# 6 Ivy Small Cap Growth Fund 1,302,887 Sep 29, 2019 1.24% 21,881,987
# 7 Vanguard Small Cap Value Index Fund 1,278,504 Dec 30, 2019 1.21% 22,003,053
# 8 Vanguard Extended Market Index Fund 1,186,015 Dec 30, 2019 1.13% 20,411,318
# 9 Franklin Strategic Series-Franklin Small Cap Growth Fund 1,134,200 Oct 30, 2019 1.08% 20,506,336
# 10 Fidelity Stock Selector All Cap Fund 1,018,833 Jan 30, 2020 0.97% 19,755,171
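The Shares, % Out, and Value columns above come back as character strings. If you plan to analyze them, a minimal cleanup sketch like the one below can help. Note that the helper function cleanHolders is not part of the original code; it is an illustrative assumption based on the column names shown in the output above:
# --
# Hypothetical helper (illustration only): converts the character columns
# of a scraped holders table into numeric types
# --
cleanHolders <- function(df) {
  # Strip thousands separators from share counts and values
  df$Shares <- as.numeric(gsub(",", "", df$Shares))
  df$Value <- as.numeric(gsub(",", "", df$Value))
  # Turn "17.99%"-style strings into plain numbers
  df$`% Out` <- as.numeric(gsub("%", "", df$`% Out`))
  return(df)
}
# Example usage:
# topInstHolders <- cleanHolders(topInstHolders)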
Fetching Different Nodes from a Webpage Using CSS Selectors
You can learn how to fetch data using CSS selectors from my GitHub blog.
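As a quick, self-contained illustration (the HTML fragment and class names below are invented for demonstration and are not taken from that blog post), here is how CSS selectors pick out nodes with rvest:
library(rvest)
# A made-up HTML fragment for demonstration only
page <- read_html('<div class="course">
  <h2 class="course-title">Machine Learning Basics</h2>
  <span class="course-author">Jane Doe</span>
</div>')
# "h2.course-title" matches <h2> elements carrying the class "course-title"
html_nodes(page, "h2.course-title") %>% html_text()
# [1] "Machine Learning Basics"
# ".course-author" matches any element carrying the class "course-author"
html_nodes(page, ".course-author") %>% html_text()
# [1] "Jane Doe"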
The section above helps you understand how to fetch entities when a single webpage is dedicated to one skill. But Pluralsight offers far more skills than just machine learning. Take a look at the main skills fetched from this URL, shown in the figure below:
You can observe a total of 10 main skills, each with a different URL. Ignore the "Browse all courses" section, as it redirects back to the same webpage.
The goal here is to feed the R program just one URL, https://www.pluralsight.com/browse, and have it automatically navigate to each of the 10 skill webpages and extract all the course details, as shown below:
library(rvest)
library(stringr) # For data cleaning
link <- "https://www.pluralsight.com/browse"
driver <- read_html(link)
# Extracting sub-URLs
# Here, tile-box is the parent class which holds the content in nested classes.
# First, go inside the sub-class using html_children(), then fetch the URL of each skill page
subURLs <- html_nodes(driver, 'div.tile-box') %>%
  html_children() %>%
  html_attr('href')
# Removing NA values and the last `/browse` URL
subURLs <- subURLs[!is.na(subURLs)][1:10]
# Main URL - used to complete the sub-URLs above
mainURL <- "https://www.pluralsight.com"
# This function fetches the four entities you learned about in the previous section of this guide
entity <- function(s){
  # Course Title
  # Since the number of courses may differ from skill to skill,
  # the course names are fetched dynamically
  v <- html_nodes(s, "div.course-item__info") %>%
    html_children()
  titles <- gsub("<|>", "", str_extract(v[!is.na(str_match(v, "course-item__title"))], ">.*<"))
  # Course Authors
  authors <- html_nodes(s, "div.course--item__list.course-item__author") %>% html_text()
  # Course Level
  level <- html_nodes(s, "div.course--item__list.course-item__level") %>% html_text()
  # Course Duration
  duration <- html_nodes(s, "div.course--item__list.course-item__duration") %>% html_text()
  # Creating the final data frame
  courses <- data.frame(titles, authors, level, duration)
  return(courses)
}
# A for loop that goes through all the URLs, fetches the entities, and displays them on screen
for (i in 1:10) {
  subDriver <- read_html(paste(mainURL, subURLs[i], sep = ""))
  print(entity(subDriver))
}
In the code above, note the roles of html_children() and html_attr(). The inline comments briefly explain what each command does. The output will look similar to that of the previous section, produced once for each skill.
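If the behavior of these two functions is still unclear, here is a small self-contained sketch. The HTML fragment below is invented to mimic the tile-box structure and is not taken from the Pluralsight page:
library(rvest)
# A made-up fragment mimicking a parent class with nested links
page <- read_html('<div class="tile-box">
  <a href="/browse/data-professional">Data Professional</a>
  <a href="/browse/software-development">Software Development</a>
</div>')
box <- html_nodes(page, "div.tile-box")
# html_children() steps one level down to the nested <a> elements
links <- html_children(box)
# html_attr() reads an attribute from each node, here the href
html_attr(links, "href")
# [1] "/browse/data-professional"    "/browse/software-development"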
Controlling a Browser from R and Scraping Data
Suppose you want to scrape the latest Google News about Pluralsight. You could manually open www.google.com, search for the keyword "Pluralsight", and then click "News".
What if all these steps could be automated, so that you fetch the latest news with nothing more than a small R script?
Note: Before moving on to the R code, make sure Docker is set up on your system by following these steps:
- Download and install Docker
- Open the Docker Terminal and run docker pull selenium/standalone-chrome. If you are a Firefox user, replace chrome with firefox.
- Then run docker run -d -p 4445:4444 selenium/standalone-chrome
- If the two commands above succeed, run docker-machine ip and note down the IP address to be used in the R code
The detailed code using the RSelenium library is given below:
library(RSelenium)
# Initiate the connection, remember remoteServerAddr needs to be replaced with the IP address you have
# received from the Docker Terminal
driver <- remoteDriver(browserName = "chrome", remoteServerAddr = "192.168.99.100", port = 4445L)
driver$open()
# Provide the URL and let the driver load it for you
driver$navigate("https://www.google.com/")
# Google's search box carries the attribute name="q". Locate this element.
init <- driver$findElement(using = 'name', "q")
# Enter the search keyword and press the Enter key
init$sendKeysToElement(list("Pluralsight", key = "enter"))
# You have now landed on the page with the "All" results for Pluralsight. Select the XPath of the News tab and click it.
News_tab <- driver$findElement(using = 'xpath', "//*[@id=\"hdtb-msb-vis\"]/div[2]/a")
News_tab$clickElement()
# You are now on the News results. Use the CSS selector that matches all the news links (here, a.l)
# Note that you must use findElements (with an s), not findElement; the latter returns only one result.
res <- driver$findElements(using = 'css selector', 'a.l')
# List out the latest headlines
headlines <- unlist(lapply(res, function(x){x$getElementText()}))
headlines
# [1] "Pluralsight has free courses to help you learn Microsoft Azure ..."
# [2] "Pluralsight offers free access to full portfolio of skill ..."
# [3] "Will Pluralsight Continue to Surge Higher?"
# [4] "The CEO of Pluralsight explains why the online tech skills ..."
# [5] "Pluralsight Is Free For the Month of April"
# [6] "This Pluralsight deal lets you learn some new skills from home ..."
# [7] "Pluralsight One Commits Over $1 Million to Strategic Nonprofit ..."
# [8] "Pluralsight Announces First Quarter 2020 Results"
# [9] "Pluralsight Announces Date for its First Quarter 20