Unit Test a Web Scraper using Responses

As mentioned in the introduction to web scraper testing, unit tests should be self-contained and not involve direct access to the target website. The responses package allows you to easily mock the responses returned by a website, so it’s well suited to the job. The package is stable and well documented.

Install

Getting the responses package installed is simple.

pip3 install responses

It requires recent versions of Python and the requests package.

The Scraper

Here’s a simple scraper for the Quotes to Scrape website. It’s not the way that I would structure a production web scraper, but it’s convenient for illustrating the testing process. Shortly we’ll build some unit tests.

import requests
import re
from bs4 import BeautifulSoup
import pandas as pd


URL = "https://quotes.toscrape.com/"


class QuotesScraper:
    def __init__(self):
        self.client = requests.Session()
        self.html: str | None = None
        self.quotes: list | None = None

    def __del__(self):
        self.client.close()

    def download(self) -> None:
        response = self.client.get(URL)
        response.raise_for_status()
        self.html = response.text

    def parse(self) -> None:
        soup = BeautifulSoup(self.html, "html.parser")
        self.quotes = [
            {
                "quote": quote.select_one(".text").text,
                "author": quote.select_one(".author").text,
            }
            for quote in soup.select(".quote")
        ]

    def normalise(self) -> None:
        for quote in self.quotes:
            # Remove enclosing quotes.
            quote["quote"] = re.sub(r"[“”]", "", quote["quote"])
            # Clean up whitespace.
            quote["quote"] = re.sub(r"\s+", " ", quote["quote"]).strip()

    def transform(self) -> pd.DataFrame:
        return pd.DataFrame(self.quotes)


if __name__ == "__main__":
    scraper = QuotesScraper()

    scraper.download()
    scraper.parse()
    scraper.normalise()

    df = scraper.transform()
    print(df.author)

The QuotesScraper class has four methods:

  • download() — Download the page content using requests and store the response HTML.
  • parse() — Parse the stored HTML using Beautiful Soup.
  • normalise() — Clean up the data.
  • transform() — Convert the data to a data frame.

Running the module as a script dumps a data frame to the console. Here’s the author column from the data frame. It looks like the scraper is doing its job.

0      Albert Einstein
1         J.K. Rowling
2      Albert Einstein
3          Jane Austen
4       Marilyn Monroe
5      Albert Einstein
6           André Gide
7     Thomas A. Edison
8    Eleanor Roosevelt
9         Steve Martin

The Tests

The test below uses the pytest framework to test that the transform() method returns the expected data frame. The scraped data are compared to reference data loaded from a CSV file.

import pytest
import pandas as pd
from scraper import QuotesScraper

# Load the reference data.
QUOTES = pd.read_csv("quotes-to-scrape.csv")


@pytest.fixture
def scraper():
    return QuotesScraper()


def test_scraper(scraper):
    scraper.download()
    scraper.parse()
    scraper.normalise()

    assert scraper.transform().equals(QUOTES)

This test should pass ✅ provided that there are no major changes to the target website. However, it has a couple of problems because it retrieves content from the live website:

  1. It’s slow. 🚧 Even with a blazing-fast network connection the HTTP request takes a little time. That’s insignificant for a single test, but the delays accumulate across multiple tests.
  2. It’s fragile. 🚧 The test will break if either the structure or the content of the site changes. Suppose the target site were a news feed: the reference content would rapidly go out of date and the test would soon fail.

We can get around both of these issues by eliminating the HTTP request and using a mock instead.

Mocking Responses

The responses package mocks the responses returned for requests made with the requests package. I’ll illustrate how it works with a simple example.

The DummyJSON API provides endpoints which return JSON objects. Hit the API from the command line using curl.

curl -X GET "https://dummyjson.com/test"
{"status":"ok","method":"GET"}

Here’s the equivalent in Python, made slightly more interesting by adding the corresponding status code to the response payload.

from http import HTTPStatus
import requests


def dummyjson() -> dict:
    response = requests.get("https://dummyjson.com/test")
    response.raise_for_status()
    data = response.json()
    data["status"] = data["status"].upper()
    data["status_code"] = HTTPStatus[data["status"]].value
    return data


if __name__ == "__main__":
    print(dummyjson())
{'status': 'OK', 'method': 'GET', 'status_code': 200}

Test that using the pytest framework.

from dummyjson import dummyjson


def test_scraper():
    data = dummyjson()

    assert data == {"status": "OK", "method": "GET", "status_code": 200}

This test should pass ✅, but it suffers from the same issues as the earlier test: it still hits the live API. We can use the responses package to mock the response instead.

import responses
from dummyjson import dummyjson


@responses.activate
def test_scraper():
    responses.add(
        responses.GET,
        "https://dummyjson.com/test",
        json={"status": "ok", "method": "GET"},
        status=200,
    )

    data = dummyjson()

    assert data == {"status": "OK", "method": "GET", "status_code": 200}

Dissecting the updated test:

  1. The @responses.activate decorator, as the name implies, activates the responses package for the following test.
  2. The responses.add() function links a specific URL (and HTTP method) to a response status (200) and content (JSON). Within the context of this test, any request to that URL using the specified HTTP method will return the specified content, while requests to unregistered URLs are refused (see the sketch below).
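
To see the second point in action, here’s a minimal sketch (the test name and the unregistered URL are my own, illustrative choices): while responses is active, a request to any URL that hasn’t been registered is refused with a ConnectionError.

import pytest
import requests
import responses


@responses.activate
def test_unregistered_url():
    responses.add(
        responses.GET,
        "https://dummyjson.com/test",
        json={"status": "ok", "method": "GET"},
        status=200,
    )

    # Only the registered URL is mocked; anything else is refused.
    with pytest.raises(requests.exceptions.ConnectionError):
        requests.get("https://dummyjson.com/unregistered")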

Here are the results of running both tests.

pytest
============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-8.3.4, pluggy-1.5.0
plugins: timer-1.0.0, anyio-4.8.0
collected 2 items                                                              

test_dummyjson.py .                                                      [ 50%]
test_dummyjson_responses.py .                                            [100%]

================================= pytest-timer =================================
[success] 99.15% test_dummyjson.py::test_scraper: 0.2610s
[success] 0.85% test_dummyjson_responses.py::test_scraper: 0.0022s
============================== 2 passed in 0.33s ===============================

Both tests pass. ✅ I’m using the pytest-timer plugin for pytest to dump timing data. The test that uses responses to mock the request is enormously faster (by a factor of over a hundred)!
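
The pytest-timer plugin isn’t part of pytest itself, so it needs a separate install.

pip3 install pytest-timer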

The Responses

In the example above the reference response was a short JSON document. For the Quotes to Scrape website the response content is an entire HTML document. Fortunately the responses package has a _recorder utility (still in beta) that can be used to easily dump the contents of a response to a YAML file.

import requests
from responses import _recorder


@_recorder.record(file_path="quotes-response.yaml")
def record():
    requests.get("https://quotes.toscrape.com/")


if __name__ == "__main__":
    record()

The @_recorder.record() decorator causes all responses within its context to be written to a YAML file. In this case there’s just a single GET request but, in principle, there can be a series of HTTP requests, and the responses to all of them will be written to the file.
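
As a concrete illustration, here’s a sketch that records the first three pages of Quotes to Scrape to a single file (the page URLs follow the site’s pagination scheme; the file name is my own).

import requests
from responses import _recorder


@_recorder.record(file_path="quotes-pages.yaml")
def record():
    # Each response is appended to the same YAML file.
    for page in range(1, 4):
        requests.get(f"https://quotes.toscrape.com/page/{page}/")


if __name__ == "__main__":
    record()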

The YAML for the Quotes to Scrape response is too chunky to include here. But here’s the YAML for the DummyJSON response.

responses:
  - response:
      auto_calculate_content_length: false
      body: '{"status":"ok","method":"GET"}'
      content_type: text/plain
      headers:
        Access-Control-Allow-Origin: '*'
        Etag: W/"1e-X/ZTgL0+qpgHuGmBISFBqxNg62E"
        Strict-Transport-Security: max-age=15552000; includeSubDomains
        Vary: Accept-Encoding
        Via: 1.1 vegur
        X-Content-Type-Options: nosniff
        X-Dns-Prefetch-Control: 'off'
        X-Download-Options: noopen
        X-Frame-Options: SAMEORIGIN
        X-Powered-By: Cats on Keyboards
        X-Ratelimit-Limit: '100'
        X-Ratelimit-Remaining: '99'
        X-Ratelimit-Reset: '1738395977'
        X-Xss-Protection: 1; mode=block
      method: GET
      status: 200
      url: https://dummyjson.com/test

The YAML captures all of the response details, including headers (and cookies if present). This means it can be used to replicate the response faithfully in future test runs.
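
As a quick check, here’s a sketch that replays the recorded DummyJSON response (assuming the YAML above was saved to a file called dummyjson-response.yaml, a name I’ve made up).

import requests
import responses


@responses.activate
def test_replay():
    # Register the recorded response; no network request is made.
    responses._add_from_file(file_path="dummyjson-response.yaml")

    response = requests.get("https://dummyjson.com/test")

    assert response.status_code == 200
    assert response.json() == {"status": "ok", "method": "GET"}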

The Tests with Mocking

The test can be updated to use the response stored in the YAML file. The _add_from_file() function (also still in beta) is used to load the response from a file.

import pytest
import responses
import pandas as pd
from scraper import QuotesScraper

# Load the reference data.
QUOTES = pd.read_csv("quotes-to-scrape.csv")


@pytest.fixture
def scraper():
    return QuotesScraper()


@responses.activate
def test_scraper(scraper):
    responses._add_from_file(file_path="quotes-response.yaml")

    scraper.download()
    scraper.parse()
    scraper.normalise()

    assert scraper.transform().equals(QUOTES)

Here are the results from the two tests.

pytest
============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-8.3.4, pluggy-1.5.0
plugins: timer-1.0.0, anyio-4.8.0
collected 2 items                                                              

test_quotes_scraper.py .                                                 [ 50%]
test_quotes_scraper_responses.py .                                       [100%]

================================= pytest-timer =================================
[success] 97.66% test_quotes_scraper.py::test_scraper: 0.3808s
[success] 2.34% test_quotes_scraper_responses.py::test_scraper: 0.0091s
============================== 2 passed in 0.69s ===============================

Both tests pass. ✅ The mocked test won’t be affected by changes to the target website. It’s also substantially faster (roughly 40 times, going by the timings above). The difference between the two test times is less dramatic than in the earlier example because loading the mocked response from a file takes a little longer than calling the responses.add() function.

What about HTTPX?

If you’ve moved from requests to HTTPX then you will be disappointed to learn that it’s not supported by the responses package. However, RESPX looks like a promising alternative to responses that’s tailored for HTTPX.
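
Here’s a rough sketch of what a mocked DummyJSON request might look like with RESPX (based on its documented decorator and routing API; check the RESPX documentation for current details).

import httpx
import respx


@respx.mock
def test_dummyjson():
    # Register a mocked route for the GET request.
    respx.get("https://dummyjson.com/test").mock(
        return_value=httpx.Response(200, json={"status": "ok", "method": "GET"})
    )

    response = httpx.get("https://dummyjson.com/test")

    assert response.json() == {"status": "ok", "method": "GET"}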

Conclusion

Effectively testing a web scraper means striking a balance between realism and reliability. Hitting the live website is the most realistic approach, but it’s slow and fragile. Using the responses package you can create fast, reliable tests that don’t depend on the target website.

If you’re still running unmocked tests against live websites, now is the time to rethink your approach.