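Web Scraping with BeautifulSoup
Introduction
The internet revolution has led to explosive growth in data, and many companies are trying to extract and analyze as much data from the web as they can. The process of fetching data from websites and extracting information from it is called web scraping. In this guide, you will learn how to perform web scraping with BeautifulSoup, a powerful Python package for parsing HTML and XML documents.
Let's start by loading the required libraries.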
# Standard library
import re
from urllib.request import Request, urlopen, urlretrieve

# Third-party libraries
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Only BeautifulSoup and urlopen are used in the examples in this guide.
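Fetching Web Data
In this guide, we will scrape data from the Wikipedia article on the film Avengers: Endgame. The first line of the code below specifies the URL of the web page. URL stands for Uniform Resource Locator, that is, a web address, and it has two main parts:
Protocol identifier: https in this example
Resource name: en.wikipedia.org/wiki/Avengers:_Endgame in this example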
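As a quick check of these two parts, the standard library can split the URL for us. This is a minimal sketch using urllib.parse.urlsplit; note that the standard library further divides what this guide calls the resource name into a host (netloc) and a path.

```python
from urllib.parse import urlsplit

url = "https://en.wikipedia.org/wiki/Avengers:_Endgame"
parts = urlsplit(url)

# Protocol identifier (urllib calls this the "scheme")
print(parts.scheme)   # https

# Resource name = host + path
resource_name = parts.netloc + parts.path
print(resource_name)  # en.wikipedia.org/wiki/Avengers:_Endgame
```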
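Together, these two components fully specify the web address. In the code below, the first line sets the URL of the film's Wikipedia article, and the second line opens it, returning an HTTP response object whose body is the page's HTML. HTML stands for HyperText Markup Language, the standard markup language for web pages. The third line passes the response to the BeautifulSoup constructor, which parses the HTML document using the lxml parser (the lxml package must be installed). The fourth line returns the type of the resulting object, which an interactive session such as a notebook displays.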
url = "https://en.wikipedia.org/wiki/Avengers:_Endgame"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
Output:
bs4.BeautifulSoup
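Some sites reject requests that do not identify the client, and a network call can fail for many reasons. The sketch below wraps the fetch in two small helpers. The names build_request and fetch_soup, the User-Agent string, and the timeout value are illustrative assumptions, not part of the original code. It uses Python's built-in html.parser so that it also works when lxml is not installed.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def build_request(url, user_agent="bs4-tutorial-example/0.1"):
    """Return a Request with an explicit User-Agent header (hypothetical value)."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_soup(url, timeout=10):
    """Download url and parse it; return None if the request fails."""
    try:
        with urlopen(build_request(url), timeout=timeout) as response:
            return BeautifulSoup(response.read(), "html.parser")
    except OSError as err:
        # URLError, HTTPError, and socket timeouts all derive from OSError
        print(f"Could not fetch {url}: {err}")
        return None
```

Calling fetch_soup("https://en.wikipedia.org/wiki/Avengers:_Endgame") should return a BeautifulSoup object like the one above, or None if the request fails.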
We can inspect the structure of the object created above with the following code.
print(soup.prettify())
Output:
<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>
Avengers: Endgame - Wikipedia
</title>
<script>
document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgMonthNamesShort":["","Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"],"wgRequestId":"XjNILwpAAEIAAJRf3S8AAAAU","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"Avengers:_Endgame","wgTitle":"Avengers: Endgame","wgCurRevisionId":938381569,"wgRevisionId":938381569,"wgArticleId":44254295,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 uses Russian-language script (ru)","CS1 Russian-language sources (ru)","Wikipedia pages semi-protected against vandalism","Articles with short description","Use American English from October 2019",
"All Wikipedia articles written in American English","Use mdy dates from January 2020","Use list-defined references from October 2019","Pages using multiple image with manual scaled images","Articles with Encyclopædia Britannica links","Comics navigational boxes purge","2019 films","English-language films","2010s science fiction action films","2010s sequel films","2010s superhero films","2019 3D films","Alien invasions in films","Alternate timeline films","American 3D films","American films","American science fiction action films","American sequel films","Avengers (film series)","Crossover films","Films about extraterrestrial life","Films about quantum mechanics","Films about size change","Films about time travel","Films directed by Anthony and Joe Russo","Films featuring anthropomorphic characters","Films scored by Alan Silvestri","Films set in 1970","Films set in 2012","Films set in 2013","Films set in 2014","Films set in 2018","Films set in 2023","Films set in New Jersey",
</script>
<script>
(RLQ=window.RLQ||[]).push(function(){mw.loader.implement("user.tokens@tffin",function($,jQuery,require,module){/*@nomin*/mw.user.tokens.set({"patrolToken":"+\\","watchToken":"+\\","csrfToken":"+\\"});
});});
</script>
<link href="/w/load.php?lang=en&modules=ext.cite.styles%7Cext.uls.interlanguage%7Cext.visualEditor.desktopArticleTarget.noscript%7Cext.wikimediaBadges%7Cjquery.makeCollapsible.styles%7Cmediawiki.legacy.commonPrint%2Cshared%7Cmediawiki.skinning.interface%7Cmediawiki.toc.styles%7Cskins.vector.styles%7Cwikibase.client.init&only=styles&skin=vector" rel="stylesheet"/>
<script async="" src="/w/load.php?lang=en&modules=startup&only=scripts&raw=1&skin=vector">
</script>
<meta content="" name="ResourceLoaderDynamicStyles"/>
<link href="/w/load.php?lang=en&modules=site.styles&only=styles&skin=vector" rel="stylesheet"/>
<meta content="MediaWiki 1.35.0-wmf.15" name="generator"/>
<meta content="origin" name="referrer"/>
<meta content="origin-when-crossorigin" name="referrer"/>
<meta content="origin-when-cross-origin" name="referrer"/>
The print(soup.prettify()) command produces very long output, which has been truncated above for brevity.
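When exploring a large page, it can help to print only the first few lines of the prettified markup. The sketch below does this on a small inline HTML string that stands in for the downloaded page; the sample markup and the max_lines value are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Small stand-in for the downloaded page
sample_html = (
    "<html><head><title>Avengers: Endgame - Wikipedia</title></head>"
    "<body><p>Sample paragraph.</p></body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

# Keep only the first few lines of the indented markup
max_lines = 5
preview = soup.prettify().splitlines()[:max_lines]
print("\n".join(preview))
```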
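The structure we created above can be navigated in several ways, some of which are highlighted below. The following line of code extracts the title of the web page.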
print(soup.title)
Output:
<title>Avengers: Endgame - Wikipedia</title>
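The title above is printed with its tags. To get just the text, or to move around the tree from that tag, a Tag object exposes attributes such as name, string, and parent. A minimal sketch on an inline sample document (the sample markup is an assumption standing in for the real page):

```python
from bs4 import BeautifulSoup

sample_html = (
    "<html><head><title>Avengers: Endgame - Wikipedia</title></head>"
    "<body><h1>Avengers: Endgame</h1></body></html>"
)
soup = BeautifulSoup(sample_html, "html.parser")

title_tag = soup.title
print(title_tag.name)         # tag name: title
print(title_tag.string)       # text inside the tag, without the markup
print(title_tag.parent.name)  # enclosing tag: head
```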
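The soup.text attribute, which is equivalent to calling soup.get_text(), extracts all of the text from the parsed HTML document with the tags removed, as shown below.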
print(soup.text)
Output:
Avengers: EndgameTheatrical release posterDirected byAnthony RussoJoe RussoProduced byKevin FeigeScreenplay byChristopher MarkusStephen McFeelyBased onThe Avengersby Stan LeeJack KirbyStarring
Robert Downey Jr.
Chris Evans
Mark Ruffalo
Chris Hemsworth
Scarlett Johansson
Jeremy Renner
Don Cheadle
Paul Rudd
Brie Larson
Karen Gillan
Danai Gurira
Benedict Wong
Jon Favreau
Bradley Cooper
Gwyneth Paltrow
Josh Brolin
Music byAlan SilvestriCinematographyTrent OpalochEdited by
Jeffrey Ford
Matthew Schmidt
Productioncompany Marvel Studios Distributed byWalt Disney StudiosMotion PicturesRelease date
April 22, 2019 (2019-04-22) (Los Angeles Convention Center)
April 26, 2019 (2019-04-26) (United States)
Running time181 minutes[1]CountryUnited States[2]LanguageEnglishBudget$356 million[3]Box office$2.8 billion[3]
Avengers: Endgame is a 2019 American superhero film based on the Marvel Comics superhero team the Avengers, produced by Marvel Studios and distributed by Walt Disney Studios Motion Pictures. It is the sequel to 2012's The Avengers, 2015's Avengers: Age of Ultron, and 2018's Avengers: Infinity War, and the twenty-second film in the Marvel Cinematic Universe (MCU). It was directed by Anthony and Joe Russo and written by Christopher Markus and Stephen McFeely, and features an ensemble cast including Robert Downey Jr., Chris Evans, Mark Ruffalo, Chris Hemsworth, Scarlett Johansson, Jeremy Renner, Don Cheadle, Paul Rudd, Brie Larson, Karen Gillan, Danai Gurira, Benedict Wong, Jon Favreau, Bradley Cooper, Gwyneth Paltrow, and Josh Brolin. In the film, the surviving members of the Avengers and their allies attempt to reverse the damage caused by Thanos in Infinity War.
The film was announced in October 2014 as Avengers: Infinity War – Part 2, but Marvel later removed this title. The Russo brothers joined as directors in April 2015, with Markus and McFeely signing on to write the script a month later. The film serves as a conclusion to the story of the MCU up to that point, ending the story arcs for several main characters. Filming began in August 2017 at Pinewood Atlanta Studios in Fayette County, Georgia, shooting back-to-back with Infinity War, and ended in January 2018. Additional filming took place in the Metro and Downtown Atlanta areas, New York, Scotland, and England. The story revisits several moments from earlier films, bringing back actors and settings from throughout the franchise as well as music from previous films. The official title was revealed in December 2018. With an estimated budget of $356 million, it is one of the most expensive films ever made.
Avengers: Endgame was widely anticipated, and Disney backed the film with Marvel's largest marketing campaign. It premiered in Los Angeles on April 22, 2019, and was theatrically released in the United States on April 26. The film received praise for its direction, acting, musical score, action sequences, visual effects, and emotional weight, with critics lauding its culmination of the 22-film story. It grossed nearly $2.8 billion worldwide, surpassing Infinity War's entire theatrical run in just eleven days and breaking numerous box office records, including becoming the highest-grossing film of all time. The film received numerous awards and nominations, including a nomination for Best Visual Effects at the 92nd Academy Awards, three nominations at the 25th Critics' Choice Awards, and a nomination for Special Visual Effects at the 73rd British Academy Film Awards.
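Notice that text from adjacent tags runs together in the output above, as in "Directed byAnthony Russo". The get_text() method accepts separator and strip arguments that help with this. A minimal sketch on an inline fragment (the sample markup is an assumption modeled on the infobox):

```python
from bs4 import BeautifulSoup

sample_html = "<table><tr><th>Directed by</th><td>Anthony Russo</td></tr></table>"
soup = BeautifulSoup(sample_html, "html.parser")

# Default: text nodes are concatenated with nothing between them
run_together = soup.get_text()
# separator inserts a string between text nodes; strip trims each one
spaced = soup.get_text(separator=" ", strip=True)

print(run_together)  # Directed byAnthony Russo
print(spaced)        # Directed by Anthony Russo
```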
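BeautifulSoup offers many more powerful features that make it easy to scrape information from websites. One of them is the find_all() method, which returns every tag that matches a given name. For example, the <a> tag defines a hyperlink. The for loop below finds every <a> tag on the page and prints its href attribute; tags that have no href attribute print None.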
web_links = soup.find_all("a")
for link in web_links:
    print(link.get("href"))
Output:
None
/wiki/Wikipedia:Protection_policy#semi
#mw-head
#p-search
/wiki/File:Avengers_Endgame_poster.jpg
/wiki/Russo_brothers
/wiki/Kevin_Feige
/wiki/Christopher_Markus_and_Stephen_McFeely
/wiki/Avengers_(comics)
/wiki/Stan_Lee
/wiki/Jack_Kirby
/wiki/Robert_Downey_Jr.
/wiki/Chris_Evans_(actor)
/wiki/Mark_Ruffalo
/wiki/Chris_Hemsworth
/wiki/Scarlett_Johansson
#cite_ref-LebowsChris_27-1
https://variety.com/2019/film/news/chris-hemsworth-fat-thor-avengers-endgame-1203226429/
/wiki/Variety_(magazine)
https://web.archive.org/web/20190529212103/https://variety.com/2019/film/news/chris-hemsworth-fat-thor-avengers-endgame-1203226429/
#cite_ref-JohanssonAvengers4_28-0
http://screenrant.com/avengers-4-images-japan-black-widow/
https://web.archive.org/web/20170823002522/http://screenrant.com/avengers-4-images-japan-black-widow/
#cite_ref-ControWidow_29-0
#cite_ref-ControWidow_29-1
https://ew.com/movies/2019/05/02/avengers-endgame-directors-black-widow-scene/
/wiki/Entertainment_Weekly
https://web.archive.org/web/20190507013118/https://ew.com/movies/2019/05/02/avengers-endgame-directors-black-widow-scene/
#cite_ref-JohanssonPrepare_30-0
https://www.hollywoodreporter.com/news/scarlett-johanssons-avengers-workout-how-get-a-black-widow-body-1204043
/wiki/The_Hollywood_Reporter
https://web.archive.org/web/20190502183033/https://www.hollywoodreporter.com/news/scarlett-johanssons-avengers-workout-how-get-a-black-widow-body-1204043
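The raw list mixes missing values (None), in-page anchors (starting with #), site-relative paths, and absolute URLs. One way to clean it up, sketched below on an inline sample, is to request only tags that have an href, skip in-page anchors, and resolve relative paths with urllib.parse.urljoin. The sample markup and the variable names are assumptions for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://en.wikipedia.org/wiki/Avengers:_Endgame"
sample_html = """
<a id="top"></a>
<a href="#mw-head">Jump to navigation</a>
<a href="/wiki/Stan_Lee">Stan Lee</a>
<a href="https://variety.com/">Variety</a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

absolute_links = []
# href=True matches only <a> tags that have an href attribute
for link in soup.find_all("a", href=True):
    href = link["href"]
    if href.startswith("#"):
        continue  # skip in-page anchors
    # urljoin resolves relative paths against the page URL and
    # leaves absolute URLs unchanged
    absolute_links.append(urljoin(base_url, href))

print(absolute_links)
```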
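Conclusion
In this guide, you learned the basics of web scraping with BeautifulSoup, a popular Python library. You saw how to fetch a web page, parse its HTML into a BeautifulSoup object, and use a few basic methods to navigate and extract data from it.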