Web Scraping with Selenium

2020-02-15 08:00:00 · 飞浪

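Introduction

In recent years, front-end frameworks such as Angular, React, and Vue have multiplied and grown in popularity. Dynamically generated pages can offer a faster user experience, because the elements of the page are created and modified on the fly. Such sites have real benefits, but they can be troublesome when we want to scrape data from them. The simplest way to scrape this kind of site is to drive an automated web browser such as Selenium WebDriver, which can be controlled from several languages, including Python.

Selenium is a framework designed for automated testing of web applications. The Selenium Python API gives you straightforward access to all the functionality of Selenium WebDriver, and a convenient way to drive browser-specific drivers such as ChromeDriver and Firefox's geckodriver.

In this guide we explore how to scrape a web page with the help of Selenium WebDriver and BeautifulSoup. The guide is demonstrated with an example script that scrapes authors and courses from pluralsight.com for a given keyword.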

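Installation

Download the Driver

Selenium requires a driver to interface with the chosen browser. The table below links to the drivers for some of the most popular browsers.

Make sure the driver is in a directory on your PATH; on Linux, for example, place it in /usr/bin or /usr/local/bin. Alternatively, keep the driver in a known location and pass that path when creating the WebDriver (the executable_path argument in Selenium 3, a Service object in Selenium 4).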

Browser	Download link
Edge	https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Firefox	https://github.com/mozilla/geckodriver/releases
Safari	https://webkit.org/blog/6900/webdriver-support-in-safari-10/
Chrome	https://sites.google.com/a/chromium.org/chromedriver/downloads
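
Before going further, it can help to confirm that a downloaded driver is actually discoverable. The following is a minimal sketch using only the Python standard library; the helper name find_driver is made up for illustration, and the executable names chromedriver and geckodriver are the usual ones but may differ on your system.

```python
import shutil


def find_driver(executable_name):
    """Return the full path of a driver found on PATH, or None if it is absent."""
    # shutil.which searches the directories listed in the PATH environment variable.
    return shutil.which(executable_name)


for name in ("chromedriver", "geckodriver"):
    location = find_driver(name)
    print(f"{name}: {location if location else 'not found on PATH'}")
```

If a driver is reported as not found, either move it into a directory on PATH or pass its location explicitly when creating the WebDriver.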

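Install the Required Packages

Install the Selenium Python package if you have not already, along with Beautiful Soup (bs4) and the lxml parser used later in this guide.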

    pip install selenium
    pip install bs4
    pip install lxml
    

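Initialize the WebDriver

Let's create a function that initializes the WebDriver with a few options, such as headless mode. The code below defines two separate functions, one for Chrome and one for Firefox.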

    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options as ChromeOptions
    from selenium.webdriver.chrome.service import Service as ChromeService
    from selenium.webdriver.firefox.options import Options as FirefoxOptions
    from selenium.webdriver.firefox.service import Service as FirefoxService


    # Configure the Chrome WebDriver
    def configure_chrome_driver():
        chrome_options = ChromeOptions()
        # Run Chrome without opening a visible window.
        chrome_options.add_argument("--headless")
        # Selenium 4 takes the driver path through a Service object
        # (Selenium 3 accepted executable_path="..." on the constructor instead).
        # If the driver is on PATH, webdriver.Chrome(options=chrome_options) is enough.
        service = ChromeService(executable_path="./chromedriver.exe")
        driver = webdriver.Chrome(service=service, options=chrome_options)
        return driver


    # Configure the Firefox WebDriver
    def configure_firefox_driver():
        firefox_options = FirefoxOptions()
        # Run Firefox without opening a visible window.
        firefox_options.add_argument("--headless")
        # As above: pass the geckodriver path through a Service object,
        # or omit it entirely when the driver is on PATH.
        service = FirefoxService(executable_path="./geckodriver.exe")
        driver = webdriver.Firefox(service=service, options=firefox_options)
        return driver
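
Whichever function creates the driver, the browser it starts keeps running until the driver is shut down, including when the scraping code raises an exception. One way to guard against that is a small context manager, sketched below. The name managed_driver is hypothetical and not part of Selenium; the sketch only assumes that the object returned by the factory has a quit() method, as Selenium WebDriver instances do.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(create_driver):
    """Create a driver with create_driver() and always quit it afterwards."""
    driver = create_driver()
    try:
        yield driver
    finally:
        # quit() ends the browser session even if the body raised an exception.
        driver.quit()
```

With the functions above, this could be used as `with managed_driver(configure_chrome_driver) as driver:` followed by the scraping calls.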
    

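Make the Browser Headless

A headless browser runs without displaying a graphical user interface, so the script is the only thing interacting with it; this suits unattended scraping jobs, for example on a server with no display. Selenium lets you run a browser headless by adding the --headless option argument. You can set multiple option arguments on a Selenium WebDriver; Chrome, for instance, accepts many other command-line switches in the same way.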
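
As a rough illustration of setting several option arguments at once, the sketch below builds a list of switches and applies them to any options object that exposes add_argument(), which Selenium's ChromeOptions and FirefoxOptions both do. The helper names headless_arguments and apply_arguments are invented for this example, and --window-size is used only as one example of an additional Chrome switch.

```python
def headless_arguments(width=1920, height=1080):
    """Build a list of command-line switches for a headless Chrome session."""
    return ["--headless", f"--window-size={width},{height}"]


def apply_arguments(options, arguments):
    """Add each switch to an options object that exposes add_argument()."""
    for argument in arguments:
        options.add_argument(argument)
    return options


print(headless_arguments())
```

With Selenium installed, `apply_arguments(ChromeOptions(), headless_arguments())` yields an options object that can be passed to `webdriver.Chrome(options=...)`.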

Locating Elements on the Page

Selenium offers several strategies for locating elements on a web page, illustrated here against a small piece of markup:

    <div id="search-field">
      <input type="text" name="search-container" id="id_search_input" class="search_input" autocomplete="off">
      <input type="submit" class="search_submit btn btn-default">
    </div>
    

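    from selenium.webdriver.common.by import By

    # find_element(By..., value) works in Selenium 3 and 4; the find_element_by_* helpers were removed in Selenium 4.
    element = driver.find_element(By.ID, "id_search_input")  # by id
    element = driver.find_element(By.CLASS_NAME, "search_input")  # by class
    element = driver.find_element(By.NAME, "search-container")  # by name
    element = driver.find_element(By.XPATH, "//input[@type='text']")  # by XPath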
    

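If the element cannot be found, a NoSuchElementException is raised. The Selenium documentation describes the full set of element-locating strategies.

XPath is a powerful query language that is commonly used when scraping web data.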
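
To get a feel for XPath without starting a browser, the sketch below runs two expressions against the sample markup using Python's built-in xml.etree.ElementTree. This is a stand-in rather than Selenium itself: ElementTree supports only a subset of XPath and needs well-formed XML, so the input tags are written as self-closing here.

```python
import xml.etree.ElementTree as ET

# The sample markup, written as well-formed XML so ElementTree can parse it.
MARKUP = """
<div id="search-field">
  <input type="text" name="search-container" id="id_search_input" class="search_input" autocomplete="off" />
  <input type="submit" class="search_submit btn btn-default" />
</div>
"""

root = ET.fromstring(MARKUP)

# ".//input[@type='text']" selects every <input> descendant whose type is "text".
text_inputs = root.findall(".//input[@type='text']")
# A predicate can test any attribute, here the name.
by_name = root.find(".//input[@name='search-container']")

print([node.get("id") for node in text_inputs])
print(by_name.get("class"))
```

The same predicate syntax applies when an XPath expression is passed to Selenium, which evaluates it in the browser against the live DOM.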

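Besides locating elements, you can also fill in forms by sending key input, add cookies, switch tabs, and more; the Selenium documentation covers these operations.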

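Data Extraction

Now let's see how to extract the required data from a web page. The code below defines two functions, getCourses and getAuthors, which print the courses and authors, respectively, for a given search keyword.

Beautiful Soup remains one of the most convenient ways to traverse the DOM and extract data, so once the browser has loaded the URL we convert the page source into a BeautifulSoup object. Before that, we can wait for an element to load, and we can also load all the paginated content by clicking "Load More" repeatedly (uncomment loadAllContent(driver) to see this in action). After that, the select method makes it quick to pull the required information out of the page source.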

    from bs4 import BeautifulSoup
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait


    def getCourses(driver, search_keyword):
        # Step 1: Open the course search results on pluralsight.com
        driver.get(f"https://www.pluralsight.com/search?q={search_keyword}&categories=course")
        # Wait up to 5 seconds for the results container to become visible
        WebDriverWait(driver, 5).until(
            lambda d: d.find_element(By.ID, "search-results-category-target").is_displayed()
        )

        # Optionally load every page of results by clicking "Load More" repeatedly
        # loadAllContent(driver)  # Uncomment to load all the content of the page

        # Step 2: Build a parse tree from the rendered page source
        soup = BeautifulSoup(driver.page_source, "lxml")

        # Step 3: Iterate over the search results and print each course
        for course_page in soup.select("div.search-results-page"):
            for course in course_page.select("div.search-result"):
                # CSS selectors for the required fields
                title_selector = "div.search-result__info div.search-result__title a"
                author_selector = "div.search-result__details div.search-result__author"
                level_selector = "div.search-result__details div.search-result__level"
                length_selector = "div.search-result__details div.search-result__length"
                print({
                    "title": course.select_one(title_selector).text,
                    "author": course.select_one(author_selector).text,
                    "level": course.select_one(level_selector).text,
                    "length": course.select_one(length_selector).text,
                })


    # Driver code: create the driver, run the search, then shut the browser down
    driver = configure_chrome_driver()
    search_keyword = "Machine Learning"
    getCourses(driver, search_keyword)
    # quit() ends the whole session; close() would only close the current window
    driver.quit()
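
One thing to watch in code like this: select_one returns None when nothing matches, so reading .text on the result raises AttributeError for any search result that lacks a field. The sketch below shows one possible way to tolerate missing fields. The markup is invented for the example and only mimics the class names used above, text_or_none is a hypothetical helper, and the built-in html.parser is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics one complete and one incomplete search result.
HTML = """
<div class="search-result">
  <div class="search-result__title"><a href="/courses/a">Course A</a></div>
  <div class="search-result__author">Author A</div>
</div>
<div class="search-result">
  <div class="search-result__title"><a href="/courses/b">Course B</a></div>
</div>
"""


def text_or_none(node, selector):
    """Return the stripped text of the first match for selector, or None."""
    match = node.select_one(selector)
    return match.get_text(strip=True) if match is not None else None


soup = BeautifulSoup(HTML, "html.parser")
rows = [
    {
        "title": text_or_none(result, "div.search-result__title a"),
        "author": text_or_none(result, "div.search-result__author"),
    }
    for result in soup.select("div.search-result")
]
print(rows)
```

Returning None keeps the loop going and makes gaps visible in the output, at the cost of having to handle None wherever the values are used later.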
    

The getAuthors function follows the same pattern.

    from bs4 import BeautifulSoup
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait


    def getAuthors(driver, search_keyword):
        # Open the author search results on pluralsight.com
        driver.get(f"https://www.pluralsight.com/search?q={search_keyword}&categories=aem-author")
        # Wait up to 5 seconds for the author list to become visible
        WebDriverWait(driver, 5).until(
            lambda d: d.find_element(By.ID, "author-list-target").is_displayed()
        )

        # Optionally load every page of results by clicking "Load More" repeatedly
        # loadAllContent(driver)  # Uncomment to load all the content of the page

        # Step 1: Build a parse tree from the rendered page source
        soup = BeautifulSoup(driver.page_source, "lxml")
        # Step 2: Iterate over the search results and print each author
        for author_page in soup.select("div.author-list-page"):
            for author in author_page.select("div.columns"):
                # CSS selectors for the required fields
                author_name = "div.author-name"
                author_img = "div.author-list-thumbnail img"
                author_profile = "a.cludo-result"
                print({
                    "name": author.select_one(author_name).text,
                    "img": author.select_one(author_img)["src"],
                    "profile": author.select_one(author_profile)["href"],
                })


    # Driver code: create the driver, run the search, then shut the browser down
    driver = configure_chrome_driver()
    search_keyword = "Machine Learning"
    getAuthors(driver, search_keyword)
    # quit() ends the whole session; close() would only close the current window
    driver.quit()
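
Both search functions place the keyword into the URL with an f-string, which leaves it to the browser to escape spaces and reserved characters such as & in the keyword. A sketch of building the same URLs with explicit encoding follows; build_search_url and SEARCH_URL are names chosen for this example, while the query parameters q and categories are the ones used in the functions above.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://www.pluralsight.com/search"


def build_search_url(keyword, category):
    """Return the search URL with the keyword and category safely encoded."""
    # urlencode escapes spaces, ampersands and other reserved characters.
    query = urlencode({"q": keyword, "categories": category})
    return f"{SEARCH_URL}?{query}"


print(build_search_url("Machine Learning", "course"))
print(build_search_url("C++ & Rust", "aem-author"))
```

The result can be passed straight to driver.get(); a keyword containing & no longer spills into a separate query parameter.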
    

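Waits

Most web pages today use dynamic loading techniques such as AJAX. When the browser loads a page, its elements may appear at different times, which makes locating them harder; sometimes the script throws an ElementNotVisibleException.

Waits solve this problem. There are two types: implicit and explicit. An explicit wait pauses until a specific condition is met before continuing, whereas an implicit wait tells WebDriver to keep polling the DOM for up to a set amount of time whenever it tries to locate an element.

For our example, I use the explicit WebDriverWait approach to wait for elements to load.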

    from selenium.common.exceptions import TimeoutException
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait


    def loadAllContent(driver):
        # Wait for the cookie notification to appear, then accept it
        WebDriverWait(driver, 5).until(
            lambda d: d.find_element(By.CLASS_NAME, "cookie_notification").is_displayed()
        )
        driver.find_element(By.CLASS_NAME, "cookie_notification--opt_in").click()
        # Keep clicking "Load More" until the button fails to appear within 3 seconds
        while True:
            try:
                WebDriverWait(driver, 3).until(
                    lambda d: d.find_element(By.ID, "search-results-section-load-more").is_displayed()
                )
            except TimeoutException:
                break
            driver.find_element(By.ID, "search-results-section-load-more").click()
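
The idea behind an explicit wait can be sketched in plain Python: call a condition repeatedly until it returns something truthy or a deadline passes. The function below is a simplified illustration, not Selenium's implementation; wait_until and its parameters are names chosen for this example. Selenium's WebDriverWait additionally passes the driver to the condition and, by default, ignores NoSuchElementException while polling, which this sketch leaves out.

```python
import time


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Raises TimeoutError if no truthy value is seen within timeout seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Stand-in for a page element that only becomes "visible" on the third check.
attempts = {"count": 0}


def ready_on_third_call():
    attempts["count"] += 1
    return attempts["count"] >= 3


wait_until(ready_on_third_call, timeout=1.0, poll_interval=0.01)
print(attempts["count"])
```

The loop in loadAllContent relies on the failure path of this pattern: when the "Load More" button never shows up, the wait times out and the exception ends the loop.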
    

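Filling in Forms

Filling in a form on a web page typically involves setting the values of text boxes, perhaps selecting options from drop-down lists or radio buttons, and then clicking a submit button. We have already seen how to locate elements; data is then sent to them with methods such as send_keys and click.

The Selenium documentation describes these methods in more detail.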

    from selenium.webdriver.common.by import By


    def login(driver, credentials):
        driver.get("https://app.pluralsight.com/")
        # Type the username and password into their input fields
        uname_element = driver.find_element(By.NAME, "Username")
        uname_element.send_keys(credentials["username"])

        pwd_element = driver.find_element(By.NAME, "Password")
        pwd_element.send_keys(credentials["password"])

        # Submit the form by clicking the login button
        login_btn = driver.find_element(By.ID, "login")
        login_btn.click()
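
The same locate, type, and click steps generalize to other forms. The sketch below is one hypothetical way to express that: fill_form is not a Selenium function, and it assumes only that the driver has find_element(by, value) and that the returned elements have clear(), send_keys(), and click(), as Selenium's do.

```python
def fill_form(driver, field_values, submit_locator):
    """Type a value into each located field, then click the submit element.

    field_values maps (strategy, value) locator tuples to the text to enter.
    """
    for locator, text in field_values.items():
        field = driver.find_element(*locator)
        field.clear()  # remove any pre-filled text first
        field.send_keys(text)
    driver.find_element(*submit_locator).click()
```

With Selenium, the login above could then be written as `fill_form(driver, {(By.NAME, "Username"): "...", (By.NAME, "Password"): "..."}, (By.ID, "login"))` after loading the page.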
    

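Conclusion

Web scraping with Selenium and BeautifulSoup can be a handy tool in your Python and data toolkit, especially when you face dynamic pages and sites that rely heavily on JavaScript rendering. This guide has covered only some aspects of Selenium and web scraping. To learn more about scraping advanced sites, see the official documentation for Selenium with Python.

If you want to dig deeper into web scraping, check out some of my other published guides on the topic.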

