As mentioned in the introduction to web scraper testing, unit tests should be self-contained and not involve direct access to the target website. The responses package allows you to easily mock the responses returned by a website, so it’s well suited to the job. The package is stable and well documented.
Install
Getting the responses package installed is simple.
pip3 install responses
It requires a recent version of Python and the requests package.
The Scraper
Here’s a simple scraper for the Quotes to Scrape website. It’s not the way that I would structure a production web scraper, but it’s convenient for illustrating the testing process. Shortly we’ll build some unit tests.
import requests
import re
from bs4 import BeautifulSoup
import pandas as pd

URL = "https://quotes.toscrape.com/"

class QuotesScraper:
    def __init__(self):
        self.client = requests.Session()
        self.html: str | None = None
        self.quotes: list | None = None

    def __del__(self):
        self.client.close()

    def download(self) -> None:
        response = self.client.get(URL)
        response.raise_for_status()
        self.html = response.text

    def parse(self) -> None:
        soup = BeautifulSoup(self.html, "html.parser")
        self.quotes = [
            {
                "quote": quote.select_one(".text").text,
                "author": quote.select_one(".author").text,
            }
            for quote in soup.select(".quote")
        ]

    def normalise(self) -> None:
        for quote in self.quotes:
            # Remove enclosing quotes.
            quote["quote"] = re.sub(r"[“”]", "", quote["quote"])
            # Clean up whitespace.
            quote["quote"] = re.sub(r"\s+", " ", quote["quote"]).strip()

    def transform(self) -> pd.DataFrame:
        return pd.DataFrame(self.quotes)

if __name__ == "__main__":
    scraper = QuotesScraper()
    scraper.download()
    scraper.parse()
    scraper.normalise()
    df = scraper.transform()
    print(df.author)
The QuotesScraper class has four methods:
- download() — Download the page content using requests and store the response HTML.
- parse() — Parse the stored HTML using Beautiful Soup.
- normalise() — Clean up the data.
- transform() — Convert the data to a data frame.
Running the module as a script dumps a data frame to the console. Here’s the author column from the data frame. Looks like the scraper is doing its job.
0 Albert Einstein
1 J.K. Rowling
2 Albert Einstein
3 Jane Austen
4 Marilyn Monroe
5 Albert Einstein
6 André Gide
7 Thomas A. Edison
8 Eleanor Roosevelt
9 Steve Martin
The Tests
The test below uses the pytest framework to test that the transform() method returns the expected data frame. The scraped data are compared to reference data loaded from a CSV file.
import pytest
import pandas as pd

from scraper import QuotesScraper

# Load the reference data.
QUOTES = pd.read_csv("quotes-to-scrape.csv")

@pytest.fixture
def scraper():
    return QuotesScraper()

def test_scraper(scraper):
    scraper.download()
    scraper.parse()
    scraper.normalise()
    assert scraper.transform().equals(QUOTES)
This test should pass ✅ provided that there are no major changes to the target website. However, it has a couple of problems because it retrieves content from the live website:
- It’s slow. 🚧 Even with a blazing fast network connection the HTTP request takes a little time. That’s insignificant for a single test, but the delays accumulate across multiple tests.
- It’s fragile. 🚧 The test will break if either the structure or the content of the site changes. Suppose the target site were a news feed: the reference content would rapidly go stale and the test would soon fail.
We can get around both of these issues by eliminating the HTTP request and using a mock instead.
Mocking Responses
The responses package mocks the responses to requests made with the requests package. I’ll illustrate how it works with a simple example.
The DummyJSON API provides endpoints which return JSON objects. Hit the API from the command line using curl.
curl -X GET "https://dummyjson.com/test"
{"status":"ok","method":"GET"}
Equivalently in Python, but making things slightly more interesting by adding the appropriate status code to the response payload.
from http import HTTPStatus

import requests

def dummyjson() -> dict:
    response = requests.get("https://dummyjson.com/test")
    response.raise_for_status()
    data = response.json()
    # Upper-case the status and add the corresponding HTTP status code.
    data["status"] = data["status"].upper()
    data["status_code"] = HTTPStatus[data["status"]].value
    return data

if __name__ == "__main__":
    print(dummyjson())
{'status': 'OK', 'method': 'GET', 'status_code': 200}
Test this function using the pytest framework.
from dummyjson import dummyjson

def test_scraper():
    data = dummyjson()
    assert data == {"status": "OK", "method": "GET", "status_code": 200}
This test should pass ✅. But it suffers from the same issues as the earlier test. However, we can use the responses package to mock the response.
import responses

from dummyjson import dummyjson

@responses.activate
def test_scraper():
    responses.add(
        responses.GET,
        "https://dummyjson.com/test",
        json={"status": "ok", "method": "GET"},
        status=200,
    )
    data = dummyjson()
    assert data == {"status": "OK", "method": "GET", "status_code": 200}
Dissecting the updated test:
- The @responses.activate decorator, as the name implies, activates the responses package for the decorated test.
- The responses.add() function links a specific URL (and HTTP method) to a response status (200) and content (JSON). Within the context of this test, any request to that URL (using the specified HTTP method) will return the specified content.
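One useful consequence of this design: while the mock is active, responses intercepts all traffic from the requests package, so a request to any URL that hasn’t been registered is refused rather than quietly hitting the network. Here’s a minimal sketch (the /unregistered path is just a hypothetical endpoint):

import pytest
import requests
import responses

@responses.activate
def test_unregistered_url():
    # No mock has been registered for this URL, so responses refuses
    # the connection instead of letting the request reach the network.
    with pytest.raises(requests.exceptions.ConnectionError):
        requests.get("https://dummyjson.com/unregistered")

This makes mocked tests fail loudly if the code under test starts requesting URLs you didn’t anticipate.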
Here are the results from the two tests.
pytest
============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-8.3.4, pluggy-1.5.0
plugins: timer-1.0.0, anyio-4.8.0
collected 2 items
test_dummyjson.py . [ 50%]
test_dummyjson_responses.py . [100%]
================================= pytest-timer =================================
[success] 99.15% test_dummyjson.py::test_scraper: 0.2610s
[success] 0.85% test_dummyjson_responses.py::test_scraper: 0.0022s
============================== 2 passed in 0.33s ===============================
Both tests pass. ✅ I’m using the pytest-timer plugin for pytest to dump timing data. The test that uses responses to mock the request is enormously faster: over a hundred times in this run!
The Responses
In the example above the reference response was a short JSON document. For the Quotes to Scrape website the response content is an entire HTML document. Fortunately the responses package has a _recorder utility (still in beta) that can be used to easily dump the contents of a response to a YAML file.
import requests
from responses import _recorder

@_recorder.record(file_path="quotes-response.yaml")
def record():
    requests.get("https://quotes.toscrape.com/")

if __name__ == "__main__":
    record()
The @_recorder.record() decorator causes all responses within its context to be written to a YAML file. In this case there’s just a single GET request but, in principle, there can be a series of HTTP requests, and the responses from all of them will be written to the file.
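To sketch what that might look like, here’s a hypothetical recording session that hits two endpoints (the second endpoint and the output file name are assumptions for illustration):

import requests
from responses import _recorder

@_recorder.record(file_path="dummyjson-responses.yaml")
def record_all():
    # Both responses end up in the same YAML file.
    requests.get("https://dummyjson.com/test")
    requests.get("https://dummyjson.com/quotes/1")

if __name__ == "__main__":
    record_all()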
The YAML for the Quotes to Scrape response is too chunky to include here. But here’s the YAML for the DummyJSON response.
responses:
- response:
    auto_calculate_content_length: false
    body: '{"status":"ok","method":"GET"}'
    content_type: text/plain
    headers:
      Access-Control-Allow-Origin: '*'
      Etag: W/"1e-X/ZTgL0+qpgHuGmBISFBqxNg62E"
      Strict-Transport-Security: max-age=15552000; includeSubDomains
      Vary: Accept-Encoding
      Via: 1.1 vegur
      X-Content-Type-Options: nosniff
      X-Dns-Prefetch-Control: 'off'
      X-Download-Options: noopen
      X-Frame-Options: SAMEORIGIN
      X-Powered-By: Cats on Keyboards
      X-Ratelimit-Limit: '100'
      X-Ratelimit-Remaining: '99'
      X-Ratelimit-Reset: '1738395977'
      X-Xss-Protection: 1; mode=block
    method: GET
    status: 200
    url: https://dummyjson.com/test
The YAML captures all of the response details, including headers (and cookies if present). This means that it can be used to perfectly replicate this response in future.
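To convince yourself that headers survive the round trip, here’s a quick sketch that registers one of the recorded headers via responses.add() (using its headers argument, and picking X-Powered-By as an arbitrary example) and checks that it comes back on the mocked response:

import requests
import responses

@responses.activate
def test_mocked_headers():
    # Register a response carrying one of the headers captured above.
    responses.add(
        responses.GET,
        "https://dummyjson.com/test",
        json={"status": "ok", "method": "GET"},
        status=200,
        headers={"X-Powered-By": "Cats on Keyboards"},
    )
    response = requests.get("https://dummyjson.com/test")
    assert response.headers["X-Powered-By"] == "Cats on Keyboards"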
The Tests with Mocking
The test can be updated to use the response stored in the YAML file. The _add_from_file() function (also still in beta) is used to load the response from the file.
import pytest
import responses
import pandas as pd

from scraper import QuotesScraper

# Load the reference data.
QUOTES = pd.read_csv("quotes-to-scrape.csv")

@pytest.fixture
def scraper():
    return QuotesScraper()

@responses.activate
def test_scraper(scraper):
    # Register the recorded response so that no network request is made.
    responses._add_from_file(file_path="quotes-response.yaml")
    scraper.download()
    scraper.parse()
    scraper.normalise()
    assert scraper.transform().equals(QUOTES)
Here are the results from the two tests.
pytest
============================= test session starts ==============================
platform linux -- Python 3.12.3, pytest-8.3.4, pluggy-1.5.0
plugins: timer-1.0.0, anyio-4.8.0
collected 2 items
test_quotes_scraper.py . [ 50%]
test_quotes_scraper_responses.py . [100%]
================================= pytest-timer =================================
[success] 97.66% test_quotes_scraper.py::test_scraper: 0.3808s
[success] 2.34% test_quotes_scraper_responses.py::test_scraper: 0.0091s
============================== 2 passed in 0.69s ===============================
Both tests pass. ✅ The mocked test won’t be affected by any changes in the target website. It is also substantially (around 40 times) faster. The difference between these two test times is less dramatic than in the earlier example because loading the mocked response from a file takes a little longer than calling the responses.add() function.
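If that per-test overhead matters in a large suite, one option is to parse the recorded YAML once at import time and register the body with responses.add(). This is just a sketch, assuming the YAML layout shown earlier and the PyYAML package (which responses itself depends on):

import pathlib

import pandas as pd
import pytest
import responses
import yaml

from scraper import QuotesScraper

# Parse the recorded file once, when the test module is imported.
RECORDED = yaml.safe_load(pathlib.Path("quotes-response.yaml").read_text())
RESPONSE = RECORDED["responses"][0]["response"]

# Load the reference data.
QUOTES = pd.read_csv("quotes-to-scrape.csv")

@pytest.fixture
def scraper():
    return QuotesScraper()

@responses.activate
def test_scraper(scraper):
    # Register the recorded body directly, avoiding the per-test file read.
    responses.add(
        responses.GET,
        RESPONSE["url"],
        body=RESPONSE["body"],
        status=RESPONSE["status"],
    )
    scraper.download()
    scraper.parse()
    scraper.normalise()
    assert scraper.transform().equals(QUOTES)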
What about HTTPX?
If you’ve moved from requests to HTTPX then you’ll be disappointed to learn that HTTPX is not supported by the responses package. However, RESPX looks like a promising alternative to responses that’s tailored for HTTPX.
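I haven’t tested RESPX here, but based on its documentation the DummyJSON mock would look something along these lines (treat the exact API as an assumption):

import httpx
import respx

@respx.mock
def test_dummyjson():
    # Register a mocked route for the DummyJSON endpoint.
    respx.get("https://dummyjson.com/test").mock(
        return_value=httpx.Response(200, json={"status": "ok", "method": "GET"})
    )
    response = httpx.get("https://dummyjson.com/test")
    assert response.json() == {"status": "ok", "method": "GET"}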
Conclusion
Effectively testing a web scraper means striking a balance between realism and reliability. Hitting a live website might be the most realistic approach, but it’s unstable and introduces delays. Using the responses package you can create fast, stable and robust tests that don’t depend on the target website.
If you’re still running unmocked tests against live websites, now is the time to rethink your approach.