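Web Scraping with Requests
Introduction
Web scraping is the technique of extracting data from websites. The internet revolution has led to an explosion of data, and the ability to extract that data has become an important prerequisite skill for data scientists. One of the most popular ways to scrape the web in Python is to send HTTP requests with the Requests package and then parse the returned HTML with the BeautifulSoup package.
Requests is one of the most downloaded Python packages. It covers two common tasks: making requests to an API and retrieving raw HTML content.
In this guide, you will learn the basics of web scraping with the Requests package, which is used to make HTTP requests. HTTP stands for Hypertext Transfer Protocol and is the foundation of data communication on the web.
Let's begin by loading the required libraries.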
import requests
import urllib
import urllib.request
from urllib.request import urlretrieve
from urllib.request import urlopen, Request
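Exploring the Requests Package
In this guide, we will scrape data from the Wikipedia article on the movie Avengers: Endgame. We will use the URL of that web page. (URL stands for Uniform Resource Locator.)
A URL has two main components, which together form the complete address.
Protocol identifier: https: in this example
Resource name: en.wikipedia.org/wiki/Avengers:_Endgame in this example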
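The two components can also be checked programmatically. The following is a minimal sketch using urlsplit() from the standard library's urllib.parse module, which is not otherwise used in this guide. Note that urlsplit() divides the resource name further into a host and a path.

```python
from urllib.parse import urlsplit

url = "https://en.wikipedia.org/wiki/Avengers:_Endgame"
parts = urlsplit(url)

print(parts.scheme)  # protocol identifier
print(parts.netloc)  # host portion of the resource name
print(parts.path)    # path portion of the resource name

# Joining host and path gives the resource name described above.
resource_name = parts.netloc + parts.path
print(resource_name)
```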
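The first line of code below specifies the URL of the movie's Wikipedia page and stores it in the variable url. The second line packages and sends the request, capturing the response with the function requests.get(). We store the response in the variable req1.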
url = "https://en.wikipedia.org/wiki/Avengers:_Endgame"
req1 = requests.get(url)
The same task can also be done in a single line of code:
req1 = requests.get('https://en.wikipedia.org/wiki/Avengers:_Endgame')
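In practice a request can fail or hang, so it may help to wrap the call in a small helper. The sketch below is one possible approach, not part of the original guide: the function name fetch_html and the 10-second default timeout are assumptions chosen for illustration, while timeout and raise_for_status() are standard Requests features.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page HTML as text, raising on HTTP or network errors."""
    # timeout (in seconds) keeps the call from waiting indefinitely.
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

Calling fetch_html(url) with the Wikipedia URL above would return the page's HTML as a string, or raise an exception if the server reports an error.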
We can view the contents of the object created above with the code below.
req1.content
Output:
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Avengers: Endgame - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XjSKaApAADkAAIpdQNsAAAEX","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Avengers:_Endgame","wgTitle":"Avengers: Endgame","wgCurRevisionId":938381569,"wgRevisionId":938381569,"wgArticleId":44254295,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 uses Russian-language script (ru)","CS1 Russian-language sources (ru)","Wikipedia pages semi-protected against vandalism","Articles with short description","Use American English from October 2019",\n"All Wikipedia articles written in American English","Use mdy dates from January 2020","Use list-defined references from October 2019","Pages using multiple image with manual scaled images","Articles with Encyclop\xc3\xa6dia Britannica links","Comics navigational boxes purge","2019 films","English-language films","2010s science fiction action films","2010s sequel films","2010s superhero films","2019 3D films","Alien invasions in films","Alternate timeline films","American 3D films","American films","American science fiction action films","American sequel films","Avengers (film series)","Crossover films","Films about extraterrestrial life","Films about quantum mechanics","Films about size change","Films about time travel","Films directed by Anthony and Joe Russo","Films featuring anthropomorphic characters","Films scored by Alan Silvestri","Films 
set in 1970","Films set in 2012","Films set in 2013","Films set in 2014","Films set in 2018","Films set in 2023","Films set in New Jersey",\n"Films set in New York (state)","Films set in New York City","Films set in Norway","Films set in San Francisco","Films set in Tokyo","Films set in Wakanda","Films set in Africa","Films set in the 1940s","Films set on fictional planets","Films shot at Pinewood Atlanta Studios","Films shot in Atlanta","Films shot in County Durham","Films shot in New York (state)","Films shot in Scotland","Films using computer-generated imagery","Films with screenplays by Christopher Markus \\u0026 Stephen McFeely","IMAX films","Intergalactic travel in fiction","Marvel Cinematic Universe films","Motion capture in film","Nanotechnology in fiction","Post-apocalyptic films","Sequel films"],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Avengers:_Endgame","wgRelevantArticleId":44254295,"wgIsProbablyEditabl
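The b prefix in the output above indicates that content holds raw bytes rather than a string, which is why the æ in "Encyclopædia" appears as the escape sequence \xc3\xa6. A small sketch of the decoding step, using a short fragment copied from that output (the variable names raw and decoded are illustrative):

```python
# A fragment of the bytes shown in the output above.
raw = b"Articles with Encyclop\xc3\xa6dia Britannica links"

# Decoding the UTF-8 bytes yields a regular Python string.
decoded = raw.decode("utf-8")

print(type(raw).__name__)      # bytes
print(type(decoded).__name__)  # str
print(decoded)
```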
We can inspect the response headers with the code below.
req1.headers
Output:
{'Date': 'Fri, 31 Jan 2020 20:13:29 GMT', 'Content-Type': 'text/html; charset=UTF-8', 'Server': 'mw1262.eqiad.wmnet', 'X-Powered-By': 'PHP/7.2.26-1+0~20191218.33+debian9~1.gbpb5a340+wmf1', 'X-Content-Type-Options': 'nosniff', 'P3P': 'CP="See https://en.wikipedia.org/wiki/Special:CentralAutoLogin/P3P for more info."', 'Content-language': 'en', 'Vary': 'Accept-Encoding,Cookie,Authorization', 'Content-Encoding': 'gzip', 'Last-Modified': 'Fri, 31 Jan 2020 20:09:01 GMT', 'Backend-Timing': 'D=211902 t=1580501608833754', 'X-ATS-Timestamp': '1580501609', 'X-Varnish': '569435268 464972321', 'Age': '39789', 'X-Cache': 'cp2004 miss, cp2010 hit/166', 'X-Cache-Status': 'hit-front', 'Server-Timing': 'cache;desc="hit-front"', 'Strict-Transport-Security': 'max-age=106384710; includeSubDomains; preload', 'Set-Cookie': 'WMF-Last-Access=01-Feb-2020;Path=/;HttpOnly;secure;Expires=Wed, 04 Mar 2020 00:00:00 GMT, WMF-Last-Access-Global=01-Feb-2020;Path=/;Domain=.wikipedia.org;HttpOnly;secure;Expires=Wed, 04 Mar 2020 00:00:00 GMT, GeoIP=US:TX:San_Antonio:29.42:-98.49:v4; Path=/; secure; Domain=.wikipedia.org', 'X-Client-IP': '13.84.209.100', 'Cache-Control': 'private, s-maxage=0, max-age=0, must-revalidate', 'Accept-Ranges': 'bytes', 'Content-Length': '110083', 'Connection': 'keep-alive'}
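The headers object behaves like a dictionary whose keys are matched case-insensitively. The sketch below builds a CaseInsensitiveDict (the class Requests uses for response headers) from three values copied from the output above, so it runs without a network call. With a live response, req1.headers could be used in place of the hand-built headers variable.

```python
from requests.structures import CaseInsensitiveDict

# Sample values copied from the header output above.
headers = CaseInsensitiveDict({
    "Content-Type": "text/html; charset=UTF-8",
    "Content-Encoding": "gzip",
    "Content-Length": "110083",
})

# Lookups ignore the capitalization of the header name.
print(headers["content-type"])
print(headers.get("CONTENT-LENGTH"))

# get() returns None for a header that is not present.
print(headers.get("X-Not-Present"))
```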
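We can also extract the response using the object's text attribute, as in the first line of code below. This returns the HTML content of the web page as a string, which we store in the variable text_object. The second line prints the contents of the text object.
text_object = req1.text
print(text_object)
Output: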
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>Avengers: Endgame - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XjSKaApAADkAAIpdQNsAAAEX","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Avengers:_Endgame","wgTitle":"Avengers: Endgame","wgCurRevisionId":938381569,"wgRevisionId":938381569,"wgArticleId":44254295,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 uses Russian-language script (ru)","CS1 Russian-language sources (ru)","Wikipedia pages semi-protected against vandalism","Articles with short description","Use American English from October 2019",
"All Wikipedia articles written in American English","Use mdy dates from January 2020","Use list-defined references from October 2019","Pages using multiple image with manual scaled images","Articles with Encyclopædia Britannica links","Comics navigational boxes purge","2019 films","English-language films","2010s science fiction action films","2010s sequel films","2010s superhero films","2019 3D films","Alien invasions in films","Alternate timeline films","American 3D films","American films","American science fiction action films","American sequel films","Avengers (film series)","Crossover films","Films about extraterrestrial life","Films about quantum mechanics","Films about size change","Films about time travel","Films directed by Anthony and Joe Russo","Films featuring anthropomorphic characters","Films scored by Alan Silvestri","Films set in 1970","Films set in 2012","Films set in 2013","Films set in 2014","Films set in 2018","Films set in 2023","Films set in New Jersey",
"Films set in New York (state)","Films set in New York City","Films set in Norway","Films set in San Francisco","Films set in Tokyo","Films set in Wakanda","Films set in Africa","Films set in the 1940s","Films set on fictional planets","Films shot at Pinewood Atlanta Studios","Films shot in Atlanta","Films shot in County Durham","Films shot in New York (state)","Films shot in Scotland","Films using computer-generated imagery","Films with screenplays by Christopher Markus \u0026 Stephen McFeely","IMAX films","Intergalactic travel in fiction","Marvel Cinematic Universe films","Motion capture in film","Nanotechnology in fiction","Post-apocalyptic films","Sequel films"],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Avengers:_Endgame","wgRelevantArticleId":44254295,"wgIsProbablyEditable":!1,"wgRelevantPageIsProbablyEditable":!1,"wgRestrictionEdit":["autoconfirmed"],"wgRestrictionMove":["extendedconfirmed"],"wgMediaViewerOnClick":!0,
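The urllib Package
Another useful package for retrieving web data is the urllib library. It is not as popular as Requests, but it is worth knowing. Its urllib.request module provides functions such as urlretrieve() and urlopen() for performing GET requests; the example below uses urlopen().
The first line of code below specifies the URL. The second line packages the request using the Request() function, and the third line captures the response using the urlopen() function.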
url = "https://en.wikipedia.org/wiki/Avengers:_Endgame"
req1 = Request(url)
response = urlopen(req1)
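Request() also accepts a headers argument, which is one way to attach a User-Agent string that identifies the client. The sketch below builds such a request and inspects it without sending anything. The User-Agent value is a made-up placeholder, not something taken from this guide.

```python
from urllib.request import Request

url = "https://en.wikipedia.org/wiki/Avengers:_Endgame"

# The User-Agent value is a placeholder chosen for illustration.
req = Request(url, headers={"User-Agent": "requests-guide-example/0.1"})

print(req.get_method())  # GET, because no data payload was supplied
print(req.full_url)
# Request stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))
```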
We can print the response object and its data type with the code below.
print(response)
print(type(response))
Output:
<http.client.HTTPResponse object at 0x7f0b261a5588>
<class 'http.client.HTTPResponse'>
Finally, we can extract the response body with the lines of code below.
html_1 = response.read()
print(html_1)
Output:
b'<!DOCTYPE html>\n<html class="client-nojs" lang="en" dir="ltr">\n<head>\n<meta charset="UTF-8"/>\n<title>Avengers: Endgame - Wikipedia</title>\n<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XjSKaApAADkAAIpdQNsAAAEX","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Avengers:_Endgame","wgTitle":"Avengers: Endgame","wgCurRevisionId":938381569,"wgRevisionId":938381569,"wgArticleId":44254295,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 uses Russian-language script (ru)","CS1 Russian-language sources (ru)","Wikipedia pages semi-protected against vandalism","Articles with short description","Use American English from October 2019",\n"All Wikipedia articles written in American English","Use mdy dates from January 2020","Use list-defined references from October 2019","Pages using multiple image with manual scaled images","Articles with Encyclop\xc3\xa6dia Britannica links","Comics navigational boxes purge","2019 films","English-language films","2010s science fiction action films","2010s sequel films","2010s superhero films","2019 3D films","Alien invasions in films","Alternate timeline films","American 3D films","American films","American science fiction action films","American sequel films","Avengers (film series)","Crossover films","Films about extraterrestrial life","Films about quantum mechanics","Films about size change","Films about time travel","Films directed by Anthony and Joe Russo","Films featuring anthropomorphic characters","Films scored by Alan Silvestri","Films 
set in 1970","Films set in 2012","Films set in 2013","Films set in 2014","Films set in 2018","Films set in 2023","Films set in New Jersey",\n"Films set in New York (state)","Films set in New York City","Films set in Norway","Films set in San Francisco","Films set in Tokyo","Films set in Wakanda","Films set in Africa","Films set in the 1940s","Films set on fictional planets","Films shot at Pinewood Atlanta Studios","Films shot in Atlanta","Films shot in County Durham","Films shot in New York (state)","Films shot in Scotland","Films using computer-generated imagery","Films with screenplays by Christopher Markus \\u0026 Stephen McFeely","IMAX films","Intergalactic travel in fiction","Marvel Cinematic Universe films","Motion capture in film","Nanotechnology in fiction","Post-apocalyptic films","Sequel films"],"wgPageContentLanguage":"en","wgPageContentModel":"wikitext","wgRelevantPageName":"Avengers:_Endgame","wgRelevantArticleId":44254295,"wgIsProbablyEditable":!1,"wgRelevantPageIsProbablyEditable":!1,"wgRestrictionEdit":["autoconfirmed"],"wgRestrictionMove":["extendedconfirmed"],"wgMediaViewerOnClick":!0,\n"wgMediaViewerEnabledByDefault":!0,"wgPopupsReferencePreviews":!1,"wgPopupsConflictsWithNavPopupGadget":!1,"wgVisualEditor":{"pageLanguageCode":"en","pageLanguageDir":"ltr","pageVariantFallbacks":"en"},"wgMFDisplayWikibaseDescriptions":{"search":!0,"nearby":!0,"watchlist":!0,"tagline":!1},"wgWMESchemaEditAttemptStepOversample":!1,"wgULSCurrentAutonym":"English","wgNoticeProject":"wikipedia","wgWikibaseItemId":"Q23781155","wgCentralAuthMobileDomain":!1,"wgEditSubmitButtonLabelPublish":!0};RLSTATE={"ext.globalCssJs.user.styles":"ready","site.styles":"ready","noscript":"ready","user.styles":"ready","ext.globalCssJs.user":"ready","user":"ready"
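As with the content attribute in Requests, read() returns bytes. A possible helper that decodes them into a string and closes the connection afterwards is sketched below. The function name read_html, the 10-second timeout, and the UTF-8 fallback are assumptions made for this example.

```python
from urllib.request import Request, urlopen


def read_html(url, timeout=10):
    """Fetch a URL with urllib and return the body decoded to a string."""
    # The with block closes the response once the body has been read.
    with urlopen(Request(url), timeout=timeout) as response:
        # Use the charset from the Content-Type header, else assume UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)
```

Calling read_html(url) with the Wikipedia URL would return the same HTML as a string rather than the bytes printed above.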
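Conclusion
In this guide, you learned the basics of web scraping with the popular Requests library, the de facto standard for making HTTP requests in Python. You also learned about the urllib package and explored both libraries.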