Test a Selenium Web Scraper

In previous posts we considered a few approaches for testing scrapers targeting static sites. Sometimes you won’t be able to get away with these static tools and you’ll be forced to use browser automation. In this post I’ll look at some options for testing a Selenium web scraper.

The Scraper

Below is a simple Selenium scraper for https://books.toscrape.com/. 📢 The target site is static, so it doesn’t actually require Selenium. However, our focus is on testing, so any target site will do.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
import polars as pl


class BooksScraper:
    def __init__(self):
        self.options = Options()
        self.options.add_argument("--headless")
        self.options.add_argument("--disable-gpu")

        self.driver = webdriver.Chrome(options=self.options)

    def __del__(self):
        self.driver.quit()

    def download(self, url: str) -> str:
        # This is where the network requests happen.
        self.driver.get(url)
        WebDriverWait(self.driver, 10).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, "ul.breadcrumb"))
        )
        return self.driver.page_source

    def parse(self, html: str) -> list:
        soup = BeautifulSoup(html, "html.parser")
        # Use a CSS attribute selector; select() doesn't accept find_all()-style
        # keyword filters like href=True.
        return [a["href"] for a in soup.select("h3 > a[href]")]

    def transform(self, links: list) -> pl.DataFrame:
        return pl.DataFrame({"url": links})


if __name__ == "__main__":
    scraper = BooksScraper()

    url = "https://books.toscrape.com/"

    html = scraper.download(url)
    links = scraper.parse(html)

    for link in links:
        print(link)

    df = scraper.transform(links)

    with pl.Config(fmt_str_lengths=120, tbl_formatting="MARKDOWN"):
        print(df.head())

The scraper class, BooksScraper, has three methods:

  • download() — retrieves the HTML content;
  • parse() — extracts the link URLs from the HTML; and
  • transform() — creates a table of links.

In a more complete example I would persist both the HTML (as .html) and the parsed output (as .json). Keeping a copy of the HTML is good practice because you can re-parse it later, potentially extracting additional information or fixing errors.
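
A minimal sketch of what that persistence might look like (the persist() helper and file names are illustrative, not part of the scraper above):

import json
from pathlib import Path


def persist(html: str, links: list, stem: str = "books-to-scrape"):
    # Keep the raw HTML so it can be re-parsed later.
    Path(f"{stem}.html").write_text(html, encoding="utf-8")
    # Keep the parsed output alongside it.
    Path(f"{stem}.json").write_text(json.dumps(links, indent=2))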

For the purpose of testing we need to handle the network requests being generated in the download() method.

Sample Output

The scraper doesn’t implement pagination, so it only returns the first page of results. Here are some links from the parse() method:

catalogue/a-light-in-the-attic_1000/index.html
catalogue/tipping-the-velvet_999/index.html
catalogue/soumission_998/index.html
catalogue/sharp-objects_997/index.html
catalogue/sapiens-a-brief-history-of-humankind_996/index.html

And this is the top of the data frame returned by the transform() method.

| url                                                           |
|---------------------------------------------------------------|
| catalogue/a-light-in-the-attic_1000/index.html                |
| catalogue/tipping-the-velvet_999/index.html                   |
| catalogue/soumission_998/index.html                           |
| catalogue/sharp-objects_997/index.html                        |
| catalogue/sapiens-a-brief-history-of-humankind_996/index.html |

Testing the Scraper

When testing we need to simulate the scraper’s interaction with the website. The tests will then be self-contained and won’t rely on any network interactions.

Since Selenium interacts with a browser, which in turn communicates with the website, there’s no simple way to simulate the site’s responses. The cleanest approach is to simulate the entire download process, effectively removing Selenium. This approach won’t actually test the Selenium automation. It will, however, allow you to robustly test the parse() and transform() methods.

The simulation will need a local copy of the page.

curl https://books.toscrape.com/ | tidy -i >books-to-scrape.html

The tidy tool will clean up the raw HTML without making any changes to the content or structure. You could equally just capture the output from curl.
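
Another option is to capture the rendered page with the scraper itself, which guarantees the fixture matches what Selenium actually sees. A quick sketch:

from pathlib import Path
from scraper import BooksScraper

# One-off capture of the rendered page for use as a test fixture.
html = BooksScraper().download("https://books.toscrape.com/")
Path("books-to-scrape.html").write_text(html, encoding="utf-8")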

Mocking

We’ll start with mocking. This technique was explored in a previous post.

import pytest
from unittest.mock import MagicMock
import polars as pl
from scraper import BooksScraper

# The URL should not really matter since we're not actually making a network request.
URL = "https://books.toscrape.com/"

# Load the reference data.
LINKS = pl.read_csv("books-to-scrape.csv")


@pytest.fixture
def scraper():
    bs = BooksScraper()

    with open("books-to-scrape.html", "r") as f:
        # Mock the download() method, loading HTML from file.
        bs.download = MagicMock(return_value=f.read())

    return bs


def test_scraper(scraper):
    html = scraper.download(URL)

    links = scraper.parse(html)
    assert links == LINKS["url"].to_list()

    df = scraper.transform(links)
    assert df.equals(LINKS)

The scraper() fixture encapsulates the creation of a BooksScraper object and uses MagicMock to replace the download() method. The mocked method returns the contents of the HTML file.
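
The books-to-scrape.csv reference file isn’t shown above; it could be generated once from a known-good parse of the same HTML fixture. A hypothetical sketch:

from scraper import BooksScraper

# One-off generation of the reference CSV from the saved HTML fixture.
# (This still launches a browser in __init__, which is fine for a one-off.)
scraper = BooksScraper()
with open("books-to-scrape.html") as f:
    links = scraper.parse(f.read())
scraper.transform(links).write_csv("books-to-scrape.csv")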

Patching

Now let’s look at patching. This technique was also explored in a previous post.

import pytest
from unittest.mock import patch
import polars as pl
from scraper import BooksScraper

# The URL should not really matter since we're not actually making a network request.
URL = "https://books.toscrape.com/"

# Load the reference data.
LINKS = pl.read_csv("books-to-scrape.csv")


@pytest.fixture
def scraper():
    return BooksScraper()


@patch.object(BooksScraper, "download")
def test_scraper(patched_download, scraper):
    with open("books-to-scrape.html", "r") as f:
        patched_download.return_value = f.read()

    html = scraper.download(URL)

    links = scraper.parse(html)
    assert links == LINKS["url"].to_list()

    df = scraper.transform(links)
    assert df.equals(LINKS)

This does not differ substantially from the approach using mocking: in both cases the HTML content is loaded from a file rather than being retrieved via Selenium.

Selenium Wire

The test coverage obtained using mocking and patching will always be incomplete: in neither case have we actually tested the download() method. There’s another approach that makes it possible to test download(). Well, sort of. We’ll intercept the requests and responses being routed through Selenium.

Selenium Wire is an extension of Selenium that lets you inspect and modify HTTP(S) requests and responses made by the automated browser. It works by redirecting browser traffic through a local mitmproxy (man-in-the-middle proxy) instance.

Intercepting

The Selenium Wire driver is a drop-in replacement for the Selenium driver. It has attributes that make it possible to view and manipulate browser traffic:

  • request_interceptor — intercepts requests (on the way out) and
  • response_interceptor — intercepts responses (on the way in); see the sketch below.
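
While the tests in this post only use request_interceptor, a response_interceptor is attached in the same way. It receives both the request and the response, and can modify the response in flight. A minimal sketch (the header name is arbitrary):

from seleniumwire import webdriver

driver = webdriver.Chrome()

def response_interceptor(request, response):
    # Tag every response passing back through the proxy.
    response.headers["X-Intercepted"] = "yes"

driver.response_interceptor = response_interceptor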

The test below uses request_interceptor to intercept the browser’s requests, returning the same static HTML response as above. The remaining requests, which are really just for resources like images, CSS and JavaScript, are blocked.

from seleniumwire import webdriver
import pytest
import polars as pl
from scraper import BooksScraper

URL = "https://books.toscrape.com/"

LINKS = pl.read_csv("books-to-scrape.csv")


@pytest.fixture
def mock_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)

    with open("books-to-scrape.html") as f:
        MOCK_RESPONSE = f.read()

    def interceptor(request):
        print(f"URL: {request.url}.")
        if request.url == URL:
            # Intercept the request for main page and return mock response.
            request.create_response(
                status_code=200,
                headers={"Content-Type": "text/html"},
                body=MOCK_RESPONSE.encode("utf-8"),
            )
            print("🟢 Mock response.")
        else:
            print("🔴 Blocked!")
            request.abort()

    driver.request_interceptor = interceptor
    return driver


@pytest.fixture
def scraper(mock_driver):
    scraper = BooksScraper()
    # Quit the real driver created in __init__ before swapping in the mock.
    scraper.driver.quit()
    scraper.driver = mock_driver
    return scraper


def test_scraper(scraper):
    html = scraper.download(URL)

    links = scraper.parse(html)
    assert links == LINKS["url"].to_list()

    df = scraper.transform(links)
    assert df.equals(LINKS)

The test has two fixtures:

  • scraper — creates a BooksScraper object and replaces its driver attribute with the result of the mock_driver fixture; and
  • mock_driver — creates a driver object using Selenium Wire that returns HTML from a file for a specific request URL and aborts all others. The call to request.abort() returns a 403 (Forbidden) response.

The request to https://books.toscrape.com/ is intercepted and the content of the HTML file is returned. All of the other requests are blocked. You could, however, go all the way down the rabbit hole and intercept other requests. It just depends on how much functionality of the site you need to replicate for your tests to make sense.

The timing statistics below reflect the relative execution time of these three options. Mocking and patching are blazing! Request interception takes more time but also gives a more complete test. Choose your poison.

============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-8.3.5, pluggy-1.5.0
configfile: pytest.ini
plugins: socket-0.7.0, timer-1.0.0, timeout-2.3.1
timeout: 5.0s
timeout method: signal
timeout func_only: False
collected 3 items                                                              

test_books_scraper_intercepted.py .                                      [ 33%]
test_books_scraper_mocked.py .                                           [ 66%]
test_books_scraper_patched.py .                                          [100%]

================================= pytest-timer =================================
[success] 97.03% test_books_scraper_intercepted.py::test_scraper: 0.7645s
[success] 1.63% test_books_scraper_mocked.py::test_scraper: 0.0128s
[success] 1.34% test_books_scraper_patched.py::test_scraper: 0.0105s
======================= 3 passed, 617 warnings in 2.13s ========================

Proxies

If these approaches don’t suffice then a proxy might be the next best alternative. To use a proxy you’d have to do something like this:

  1. Set up a proxy service that mimics the functionality of the site.
  2. Direct the tests to route their requests through the proxy (a minimal sketch follows).
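
As a sketch of step 2, assuming a hypothetical mock service listening on localhost:8080, you could route the browser’s traffic through it with Chrome’s --proxy-server argument:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
# Route all browser traffic through the local mock proxy (hypothetical address).
options.add_argument("--proxy-server=http://localhost:8080")

driver = webdriver.Chrome(options=options)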

The mechanics of how this would be done will depend on which proxy you use. These options will be discussed in future posts.

Real World Test

Do these approaches to testing work in the real world? Sure! Of course they do. It might just require a bit more work. Suppose, for example, that we wanted to scrape product data for Citric Acid from Avantor’s site.

[Screenshot: Citric Acid product page with a table of item sizes and corresponding prices.]

Here’s a test for a scraper that uses a similar framework to the Books to Scrape example above. Now, rather than replacing the response for a single request, we intercept six distinct requests and provide responses from local files. The remaining requests are aborted.

from seleniumwire import webdriver
import json
import pytest
import polars as pl
from urllib.parse import urlparse
from avantor import AvantorScraper

PRODUCTS_DICT = json.load(open("products.json"))
PRODUCTS_TABLE = pl.read_csv("products.csv")

urls = {
    "/store/product/725074/citric-acid-anhydrous-99-5-acs": (
        "product.html",
        "text/html",
    ),
    "/cdn-cgi/scripts/7d0fa10a/cloudflare-static/rocket-loader.min.js": (
        "rocket-loader.min.js",
        "application/javascript",
    ),
    "/responsive/js/vendor/jquery-3.6.1.min.js": (
        "jquery.min.js",
        "application/javascript",
    ),
    "/responsive/js/unified-responsive-2022.min.js": (
        "unified-responsive.min.js",
        "application/javascript",
    ),
    "/store/services/catalog/json/stiboOrderTableRender.jsp": (
        "product-table.html",
        "text/html",
    ),
    "/store/services/pricing/json/skuPricing.jsp": (
        "sku-pricing.json",
        "application/json",
    ),
}


@pytest.fixture
def mock_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)

    def interceptor(request):
        try:
            path, mime = urls[urlparse(request.url).path]
            request.create_response(
                status_code=200,
                headers={"Content-Type": mime},
                body=open(path, "rb").read(),  # The body must be bytes.
            )
        except KeyError:
            request.abort()

    driver.request_interceptor = interceptor
    return driver


@pytest.fixture
def scraper(mock_driver):
    scraper = AvantorScraper("725074/citric-acid-anhydrous-99-5-acs")
    # Quit the real driver created in __init__ before swapping in the mock.
    scraper.driver.quit()
    scraper.driver = mock_driver
    return scraper


def test_scraper(scraper):
    html = scraper.download()

    products = scraper.parse(html)
    assert products == PRODUCTS_DICT

    df = scraper.transform(products)
    assert df.equals(PRODUCTS_TABLE)

The selection of intercepted requests was determined by trial and error. We need to provide those resources required to make the page “work”. Nothing more, nothing less.
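
Selenium Wire can help with that trial and error: the driver records all captured traffic in its requests attribute, so you can load the page once (for real) and enumerate everything it fetches. A minimal sketch, using Books to Scrape as a stand-in:

from seleniumwire import webdriver

driver = webdriver.Chrome()
driver.get("https://books.toscrape.com/")

# List every request the page triggered, with its response status (if any).
for request in driver.requests:
    status = request.response.status_code if request.response else "-"
    print(status, request.url)

driver.quit()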

Conclusion

We have looked at three options for testing a Selenium scraper. The first two options, mocking and patching, are rather crude and simply override the method that retrieves page content. They will not exercise any of the Selenium code. The third option, request interception, will allow you to emulate essentially all of the functionality of the target site and thereby make it possible to test the Selenium code too. This could be as simple as intercepting a request for an HTML page, but might also handle requests for JavaScript or data from an API. The rabbit hole is as deep as you want to make it.

📌 For reference, I’m using selenium==4.30.0 and selenium-wire==5.1.0 with Chrome 131.0.6778.69 and ChromeDriver 134.0.6998.165.