Extracting Data from HTML with BeautifulSoup

2019-12-19 08:00:00 · 飞浪

To get the most out of BeautifulSoup, you only need the basic knowledge of HTML covered in this guide.

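Introduction

Nowadays everyone is talking about data and how it helps uncover hidden patterns and new insights. The right dataset can help a business improve its marketing strategy and thereby increase overall sales. And let's not forget the popular example of a politician who can gauge public opinion before an election. Data is powerful, but it is not free. Gathering the right data is always expensive; think of surveys or marketing campaigns.

The internet is a pool of data, and with the right skills one can use it to gain a lot of new information. You can always copy and paste the data into an Excel or CSV file, but that is time-consuming and expensive. Why not hire a software developer who can get the data into a readable format by writing some jibber-jabber? Yes, it is possible to extract data from the web, and this "jibber-jabber" is called web scraping.

According to Wikipedia, web scraping is:

Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.

BeautifulSoup is a popular Python library for scraping data from the web. To get the most out of it, you only need a basic knowledge of HTML, which this guide covers.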

Components of a Web Page

If you already know basic HTML, you can skip this section.

The basic syntax of any web page is:

    <!DOCTYPE html>
    <html>
        <head>
            <meta charset="utf-8" />
            <meta http-equiv="X-UA-Compatible" content="IE=edge" />
        </head>
        <body>
            <h1 class="heading">My first Web Scraping with Beautiful Soup</h1>
            <p>Let's scrape the website using Python.</p>
        </body>
    </html>
    

Every tag in HTML can carry attributes (such as class, id, and href, among others) that help identify an element uniquely.
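
As a small illustration of how attributes identify elements, the sketch below parses a page similar to the sample above and reads the attributes of two tags. The HTML string, the id value, the variable names, and the choice of Python's built-in html.parser are assumptions made for this example only.

```python
from bs4 import BeautifulSoup

# A tiny page, similar to the sample above (made up for this sketch).
html = """
<html>
  <body>
    <h1 class="heading" id="main-title">My first Web Scraping with Beautiful Soup</h1>
    <p>Let's scrape the website using Python.</p>
    <a href="https://www.w3schools.com/" title="HTML tutorials">w3schools</a>
  </body>
</html>
"""

# "html.parser" ships with Python, so no extra parser is needed here.
soup = BeautifulSoup(html, "html.parser")

heading = soup.find("h1", attrs={"class": "heading"})
print(heading["class"])  # class is multi-valued, so it comes back as a list
print(heading["id"])     # id is single-valued, so it comes back as a string
print(heading.attrs)     # all attributes of the tag as a dict

link = soup.find("a")
print(link.get("href"))  # the value of the href attribute
print(link.get("rel"))   # .get() returns None when the attribute is missing
```

Indexing a tag like a dictionary raises KeyError for a missing attribute, while .get() returns None, which is usually the safer choice when scraping pages you do not control.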

For more information about basic HTML tags, check out w3schools.

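Steps for Scraping Any Website

To scrape a website with Python, you need to perform the following four basic steps:

1. Send an HTTP GET request to the URL of the web page you want to scrape; the server responds with HTML content. We can do this with Python's requests library.
2. Fetch and parse the data using BeautifulSoup, and keep it in a data structure such as a dict or list.
3. Analyze the HTML tags and their attributes, such as class, id, and other tag attributes, and identify the HTML tags in which your content lives.
4. Output the data in any file format, such as CSV, XLSX, or JSON.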
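
The four steps above can be sketched as follows. To keep the sketch runnable offline, step 1 is replaced by a hard-coded HTML string standing in for the response body; the markup, the function names parse_rows and to_json, and the choice of JSON as the output format are all assumptions made for illustration.

```python
import json

from bs4 import BeautifulSoup

# Step 1 (stand-in): in a real run this string would be the body of an
# HTTP GET response. The markup below is invented for this sketch.
HTML = """
<table class="demo">
  <tr><th>Rank</th><th>Country</th></tr>
  <tr><td>1</td><td>Atlantis</td></tr>
  <tr><td>2</td><td>Utopia</td></tr>
</table>
"""


def parse_rows(html):
    """Steps 2 and 3: parse the HTML, locate the table, collect rows as dicts."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"class": "demo"})
    headers = [th.text.strip() for th in table.find_all("th")]
    rows = []
    for tr in table.find_all("tr"):
        cells = [td.text.strip() for td in tr.find_all("td")]
        if cells:  # the header row has no td cells, so it is skipped
            rows.append(dict(zip(headers, cells)))
    return rows


def to_json(rows):
    """Step 4: serialize the collected rows (JSON here, CSV works as well)."""
    return json.dumps(rows, indent=2)


rows = parse_rows(HTML)
print(to_json(rows))
```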

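Understanding and Inspecting the Data

Now that you know basic HTML and its tags, you first need to inspect the page you want to scrape. Inspection is the most important job in web scraping; without knowing the structure of the web page, it is very hard to get the information you need. To help with inspection, every browser, such as Google Chrome or Mozilla Firefox, ships with a handy set of developer tools.

In this guide, we will work with Wikipedia and extract some table data from the page List of countries by GDP (nominal). The page has a Lists section containing three tables of countries ranked by GDP, as reported by the International Monetary Fund, the World Bank, and the United Nations. Note that these three tables are enclosed in an outer table.

To learn about any element you wish to scrape, just right-click its text and inspect the element's tags and attributes.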
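
The browser's developer tools are the main way to inspect a page, but a quick structural summary can also be printed from Python. The sketch below lists every table in a document with its class attribute and its number of direct child rows; the sample markup and the helper name summarize_tables are hypothetical.

```python
from bs4 import BeautifulSoup

# Invented markup: an outer table that wraps one inner table.
html = """
<table class="wikitable outer">
  <tbody>
    <tr><td><b>Heading</b></td></tr>
    <tr><td>
      <table class="inner">
        <tbody>
          <tr><th>Rank</th></tr>
          <tr><td>1</td></tr>
          <tr><td>2</td></tr>
        </tbody>
      </table>
    </td></tr>
  </tbody>
</table>
"""


def summarize_tables(soup):
    """Return (class list, number of direct rows) for every table."""
    summary = []
    for table in soup.find_all("table"):
        # Fall back to the table itself when the markup has no tbody.
        body = table.tbody if table.tbody is not None else table
        direct_rows = body.find_all("tr", recursive=False)
        summary.append((table.get("class", []), len(direct_rows)))
    return summary


soup = BeautifulSoup(html, "html.parser")
for classes, n_rows in summarize_tables(soup):
    print(classes, n_rows)
```

Passing recursive=False counts only a table's own rows, so the rows of a nested table are not attributed to the table that wraps it.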

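Jumping into the Code

In this guide, we will learn how to do simple web scraping using Python and BeautifulSoup.

Install the required Python libraries

    pip3 install requests beautifulsoup4 lxml

Note: If you are using Windows, use pip instead of pip3.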

Import the required libraries

Import the requests library to fetch the page content, and bs4 (Beautiful Soup) to parse the HTML page content.

    from bs4 import BeautifulSoup
    import requests
    

Collecting and Parsing a Web Page

In the next step, we will make a GET request to the URL and, with the help of BeautifulSoup and the lxml parser (a third-party package, not part of the Python standard library), create a parse tree object (soup).

    # import the libraries
    from bs4 import BeautifulSoup
    import requests

    url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

    # Make a GET request to fetch the raw HTML content
    html_content = requests.get(url).text

    # Parse the HTML content
    soup = BeautifulSoup(html_content, "lxml")
    print(soup.prettify())  # print the parsed HTML
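
A slightly more defensive variant of the fetch-and-parse step is sketched below. It assumes you want a timeout and an exception on HTTP error codes, and it falls back to Python's built-in html.parser when lxml is not installed. The helper names fetch_html and make_soup and the 10-second timeout are choices made for this sketch, not part of the original code.

```python
import requests
from bs4 import BeautifulSoup, FeatureNotFound


def fetch_html(url, timeout=10.0):
    """Download a page and return its HTML text."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise requests.HTTPError on 4xx/5xx
    return response.text


def make_soup(html):
    """Parse with lxml when available, else with the built-in parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        return BeautifulSoup(html, "html.parser")


# Offline check of the parsing half, using a made-up document.
demo_soup = make_soup("<html><head><title>Demo page</title></head></html>")
print(demo_soup.title.text)
```

Calling make_soup(fetch_html(url)) with the URL used in this guide would then produce a soup object equivalent to the one built above.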
    

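With our BeautifulSoup object, soup, we can move on and collect the required table data.

Before going to the actual code, let's first play with the soup object and print some basic information from it:

Example 1:

Let's first print the title of the web page.

    print(soup.title)

It will give an output like this:

    <title>List of countries by GDP (nominal) - Wikipedia</title>

To get the text without the HTML tags, we just use .text:

    print(soup.title.text)

This results in:

    List of countries by GDP (nominal) - Wikipedia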
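
Besides .text, Beautiful Soup offers get_text(), which accepts a separator and a strip flag and is handy when a tag contains nested tags or stray whitespace. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up snippet with a nested tag and extra whitespace.
html = "<p>  GDP <b>nominal</b>\n figures </p>"
soup = BeautifulSoup(html, "html.parser")
p = soup.p

# Text of the tag and all its children, whitespace kept as-is.
raw = p.text
# Each text piece is stripped, then the pieces are joined with one space.
clean = p.get_text(separator=" ", strip=True)

print(repr(raw))
print(repr(clean))
```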

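Example 2:

Now, let's get all the links in the page along with their attributes, such as href and title, and their inner text.

    for link in soup.find_all("a"):
        print("Inner Text: {}".format(link.text))
        print("Title: {}".format(link.get("title")))
        print("href: {}".format(link.get("href")))

This will output all the links on the page along with the attributes mentioned.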
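
Instead of printing, the links can be collected into a list of dictionaries for later use. The sketch below also skips anchors without an href and resolves relative URLs against the page address with urljoin from the standard library; the sample anchors and the BASE_URL constant are assumptions for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"

# Made-up anchors: a relative link, an absolute link, and one without href.
html = """
<a href="/wiki/Gross_domestic_product" title="Gross domestic product">GDP</a>
<a href="https://www.imf.org/">IMF</a>
<a name="top">no target</a>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
# href=True keeps only the a tags that actually have an href attribute
for link in soup.find_all("a", href=True):
    links.append({
        "text": link.text.strip(),
        "title": link.get("title"),               # None when there is no title
        "href": urljoin(BASE_URL, link["href"]),  # resolve relative URLs
    })

for item in links:
    print(item)
```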

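Now, let's get back on track and find our target table.

Analyzing the outer table, we can see that it has distinguishing attributes: its class is wikitable, and it has two tr tags inside tbody.

If you expand the tr tags, you will find that the first tr tag holds the headings of all three tables, and the next tr tag holds the table data for all three inner tables.

Let's first get all three table headings:

Note that we use simple Python string methods to replace newlines with spaces and to strip whitespace from the left and right of the text.

    gdp_table = soup.find("table", attrs={"class": "wikitable"})
    gdp_table_data = gdp_table.tbody.find_all("tr")  # contains 2 rows

    # Get the headings of all the lists
    headings = []
    for td in gdp_table_data[0].find_all("td"):
        # replace newlines with spaces and strip whitespace on both sides
        headings.append(td.b.text.replace('\n', ' ').strip())

    print(headings)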
    

This will print the list of table headings; the first entry is:

    'Per the International Monetary Fund (2018)'
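
The whitespace cleanup applied to the headings can be tried on its own, without any HTML. The sketch below compares the replace-and-strip approach from this guide with an alternative that collapses every run of whitespace; the function names and the sample strings are made up for illustration.

```python
def clean_replace_strip(text):
    """The approach used in this guide: newlines become spaces, then trim."""
    return text.replace("\n", " ").strip()


def clean_collapse(text):
    """Alternative: split on any whitespace and rejoin with single spaces."""
    return " ".join(text.split())


raw = "\n Per the International\nMonetary Fund (2018) \n"
print(repr(clean_replace_strip(raw)))
print(repr(clean_collapse(raw)))

raw_double = "Sample\n\nheading"
print(repr(clean_replace_strip(raw_double)))  # keeps two spaces in the middle
print(repr(clean_collapse(raw_double)))       # collapses them to one
```

The two functions agree on simple input and differ only when several whitespace characters appear in a row.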

Moving on to the second tr tag, let's collect the rows of each of the three inner tables and store them under the corresponding heading:

    data = {}
    for table, heading in zip(gdp_table_data[1].find_all("table"), headings):
        # Get the headers of the table, i.e., Rank, Country, GDP.
        t_headers = []
        for th in table.find_all("th"):
            # replace newlines with spaces and strip whitespace on both sides
            t_headers.append(th.text.replace('\n', ' ').strip())
        # Get all the rows of the table
        table_data = []
        for tr in table.tbody.find_all("tr"):  # find all tr's in the table's tbody
            t_row = {}
            # Each table row is stored in the form of
            # t_row = {'Rank': '', 'Country/Territory': '', 'GDP(US$million)': ''}

            # find all td's (3) in the tr and zip them with t_headers
            for td, th in zip(tr.find_all("td"), t_headers):
                t_row[th] = td.text.replace('\n', '').strip()
            table_data.append(t_row)

        # Store the table's data under its heading.
        data[heading] = table_data

    print(data)
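
Because a live page can change its layout at any time, it can help to rehearse the extraction on a small, fixed document first. The sketch below applies the same idea (headings in the first outer row, inner tables in the second) to invented markup with two tiny tables; the HTML, the country names, and the use of html.parser are assumptions for this example.

```python
from bs4 import BeautifulSoup

# Invented miniature of the layout described above: an outer table whose
# first row holds the headings and whose second row holds the inner tables.
html = """
<table class="wikitable">
 <tbody>
  <tr><td><b>List A</b></td><td><b>List B</b></td></tr>
  <tr>
   <td><table><tbody>
     <tr><th>Rank</th><th>Country</th></tr>
     <tr><td>1</td><td>Atlantis</td></tr>
   </tbody></table></td>
   <td><table><tbody>
     <tr><th>Rank</th><th>Country</th></tr>
     <tr><td>1</td><td>Utopia</td></tr>
   </tbody></table></td>
  </tr>
 </tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
outer = soup.find("table", attrs={"class": "wikitable"})
# recursive=False returns only the outer table's own two rows
heading_row, tables_row = outer.tbody.find_all("tr", recursive=False)

headings = [td.b.text.strip() for td in heading_row.find_all("td")]

data = {}
for table, heading in zip(tables_row.find_all("table"), headings):
    t_headers = [th.text.strip() for th in table.find_all("th")]
    rows = []
    for tr in table.tbody.find_all("tr"):
        cells = [td.text.strip() for td in tr.find_all("td")]
        if cells:  # skip the header row, which has only th cells
            rows.append(dict(zip(t_headers, cells)))
    data[heading] = rows

print(data)
```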
    

Writing the Data to CSV

Now that we have created our data structure, we can export it to CSV files by simply iterating over it.

    import csv

    for topic, table in data.items():
        # Create a CSV file for each table
        with open(f"{topic}.csv", "w", newline="", encoding="utf-8") as out_file:
            # All three tables share the following headers
            headers = [
                "Country/Territory",
                "GDP(US$million)",
                "Rank"
            ]  # same names as t_headers
            writer = csv.DictWriter(out_file, headers)
            # write the header row
            writer.writeheader()
            for row in table:
                if row:  # skip empty rows (rows that had no td cells)
                    writer.writerow(row)
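
CSV is only one of the output formats mentioned earlier. The sketch below writes a hand-made dictionary of the same shape to both JSON and CSV inside a temporary directory and reads the CSV back as a check; the sample rows, file names, and variable names are assumptions for this example.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hand-written sample in the same shape as the data dictionary built above.
data = {
    "Sample list": [
        {},  # an empty row, as produced for a row without td cells
        {"Rank": "1", "Country/Territory": "Atlantis", "GDP(US$million)": "1,000"},
        {"Rank": "2", "Country/Territory": "Utopia", "GDP(US$million)": "900"},
    ]
}

out_dir = Path(tempfile.mkdtemp())

# JSON keeps the whole dictionary in one file; empty rows are dropped first.
cleaned = {topic: [row for row in rows if row] for topic, rows in data.items()}
json_path = out_dir / "gdp.json"
json_path.write_text(json.dumps(cleaned, indent=2), encoding="utf-8")

# Write one CSV for the sample topic, then read it back to check the result.
headers = ["Rank", "Country/Territory", "GDP(US$million)"]
csv_path = out_dir / "Sample list.csv"
with open(csv_path, "w", newline="", encoding="utf-8") as out_file:
    writer = csv.DictWriter(out_file, headers)
    writer.writeheader()
    writer.writerows(cleaned["Sample list"])

with open(csv_path, newline="", encoding="utf-8") as in_file:
    rows_read = list(csv.DictReader(in_file))

json_read = json.loads(json_path.read_text(encoding="utf-8"))
print(json_read == cleaned)
print(rows_read)
```

Opening the file with newline="" lets the csv module control line endings itself, which avoids blank lines between rows on Windows.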
    

Putting It Together

Let's join all of the above code snippets.

Our complete code looks like this:

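    # import the libraries
    from bs4 import BeautifulSoup
    import requests
    import csv


    # Step 1: Send an HTTP request to the URL
    url = "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)"
    # Make a GET request to fetch the raw HTML content
    html_content = requests.get(url).text


    # Step 2: Parse the HTML content
    soup = BeautifulSoup(html_content, "lxml")
    # print(soup.prettify())  # print the parsed HTML


    # Step 3: Analyze the HTML tags where your content lives
    # Create a data dictionary to store the data.
    data = {}
    # Get the table having the class wikitable
    gdp_table = soup.find("table", attrs={"class": "wikitable"})
    gdp_table_data = gdp_table.tbody.find_all("tr")  # contains 2 rows

    # Get the headings of all the lists
    headings = []
    for td in gdp_table_data[0].find_all("td"):
        # replace newlines with spaces and strip whitespace on both sides
        headings.append(td.b.text.replace('\n', ' ').strip())

    # Get all 3 tables contained in "gdp_table"
    for table, heading in zip(gdp_table_data[1].find_all("table"), headings):
        # Get the headers of the table, i.e., Rank, Country, GDP.
        t_headers = []
        for th in table.find_all("th"):
            # replace newlines with spaces and strip whitespace on both sides
            t_headers.append(th.text.replace('\n', ' ').strip())

        # Get all the rows of the table
        table_data = []
        for tr in table.tbody.find_all("tr"):  # find all tr's in the table's tbody
            t_row = {}
            # find all td's (3) in the tr and zip them with t_headers
            for td, th in zip(tr.find_all("td"), t_headers):
                t_row[th] = td.text.replace('\n', '').strip()
            table_data.append(t_row)

        # Store the table's data under its heading.
        data[heading] = table_data


    # Step 4: Export the data to CSV
    for topic, table in data.items():
        # Create a CSV file for each table
        with open(f"{topic}.csv", "w", newline="", encoding="utf-8") as out_file:
            # All three tables share the following headers
            headers = [
                "Country/Territory",
                "GDP(US$million)",
                "Rank"
            ]  # same names as t_headers
            writer = csv.DictWriter(out_file, headers)
            # write the header row
            writer.writeheader()
            for row in table:
                if row:  # skip empty rows (rows that had no td cells)
                    writer.writerow(row)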
