Test a Web Scraper using Mocking

Previous posts in this series used the responses and vcr packages to mock HTTP responses. Now we’re going to look at the capabilities for mocking in the unittest package, which is part of the Python Standard Library. Relative to responses and vcr this functionality is rather low-level. There’s more work required, but as a result there’s potential for greater control.

Mocks are (Deceptively) Simple

The unittest.mock module has two main classes for mocking: Mock and MagicMock.

Mock Objects

The Mock class offers a number of constructor arguments. These are the most important ones:

return_value: Specifies the value to be returned when the mock is called.
side_effect: Can be a function, an iterable or an exception.
- Function: The function is called each time the mock is invoked, and its return value becomes the mock’s return value.
- Iterable: Each time the mock is called it returns the next value from the iterable.
- Exception: The specified exception is raised when the mock is called.

There are a few other arguments (spec, spec_set, wraps, name and unsafe) that are a bit niche for the moment.

A Vanilla Mock

We’ll start by creating a plain Mock object.

from unittest.mock import Mock, MagicMock

rng = Mock()
rng

<Mock id='131922784248480'>

The ID in the __repr__() string is simply the result of the standard Python id() function and is not specific to the class.

We can call the Mock object like a function.

rng()

<Mock name='mock()' id='131922849355712'>

The result is another Mock object (it’s a distinct object because it has a different ID). We can provide arbitrary arguments.

rng(min=0, max=100, quantum_fluctuation=99, luck_modifier="rabbit foot")

<Mock name='mock()' id='131922849355712'>

The ID has not changed, so the arguments at present have no effect: we get back the same object as without arguments.

Seems flexible… but also somewhat meaningless. Why? How could this be remotely useful? Bear with me.

Mocked Return Value

Add some real functionality by using the return_value argument to set the value returned by the mock.

rng = Mock(return_value=0.42)
rng()

0.42

And if we call it again?

rng()

0.42

Not at all random. But it does what we asked: it returns the specified value.

Mocked Side Effect

If we want a function to be called whenever we use the mock then use the side_effect argument. Here, for example, is a mock that counts the number of times that it’s been called. Forgive the heinous use of global. 😕

counter = 0

def mock_random():
    global counter
    counter += 1
    return 0.42

rng = Mock(side_effect=mock_random)

Call the mock a couple of times.

rng()
rng()

Then check on the counter.

counter

2
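The global counter isn’t strictly necessary, because every Mock records how it has been called. Here is a minimal sketch using the call_count and call_args attributes and the assert_called_with() method that unittest.mock provides. The arguments passed to the mock are only illustrative.

```python
from unittest.mock import Mock

rng = Mock(return_value=0.42)

rng()
rng(min=0, max=100)

# The mock keeps its own tally of calls.
print(rng.call_count)
# The arguments of the most recent call are recorded too.
print(rng.call_args)
# Raises AssertionError if the most recent call used different arguments.
rng.assert_called_with(min=0, max=100)
```

This is often all the bookkeeping a test needs, without any custom side effect.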

As mentioned earlier, the side_effect argument can be used for a host of other purposes. Here are a couple of examples.

# Returns each value in turn, then raises StopIteration once exhausted.
rng = Mock(side_effect=[0.135, 0.52, 0.9])
# Raising an exception.
rng = Mock(side_effect=RuntimeError("Randomness tank empty. Refill with chaos and try again!"))
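To see both variants in action, here’s a short sketch. The values and the (shortened) error message are only illustrative.

```python
from unittest.mock import Mock

# Iterable: each call returns the next value.
rng = Mock(side_effect=[0.135, 0.52, 0.9])
values = [rng(), rng(), rng()]

# A fourth call finds the iterable exhausted.
exhausted = False
try:
    rng()
except StopIteration:
    exhausted = True

# Exception: raised every time the mock is called.
rng = Mock(side_effect=RuntimeError("Randomness tank empty."))
message = None
try:
    rng()
except RuntimeError as error:
    message = str(error)
```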

Mock Attributes & Methods

You can access arbitrary attributes and methods on the mock object.

rng.state

<Mock name='mock.state' id='131923072201504'>

rng.random()

<Mock name='mock.random()' id='131923193252848'>

But they don’t do anything meaningful until you define them, simply returning other Mock objects.

rng.state = 299792
rng.random = lambda: 0.13

Now let’s try them again.

rng.state

299792

rng.random()

0.13

You could also set these via constructor arguments.

rng = Mock(state=299792, random=lambda: 0.13)

Magic Mock Objects 🧙

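The MagicMock class is a subclass of Mock that comes with default implementations of most dunder methods, so it can stand in for objects that are indexed, iterated over or measured with len().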

days = MagicMock()
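Even before configuring anything, the difference from a plain Mock is visible with any operation that relies on a dunder method. A small sketch using len():

```python
from unittest.mock import MagicMock, Mock

# MagicMock ships with defaults for most dunder methods; __len__ gives 0.
magic_length = len(MagicMock())

# A plain Mock has no __len__ unless you configure one yourself.
try:
    len(Mock())
    plain_has_len = True
except TypeError:
    plain_has_len = False
```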

For the purpose of illustration I’m going to need a sorted list of the days of the week.

import calendar

SORTED_DAYS_OF_WEEK = sorted(list(calendar.day_name))
SORTED_DAYS_OF_WEEK

['Friday', 'Monday', 'Saturday', 'Sunday', 'Thursday', 'Tuesday', 'Wednesday']

Mock the __getitem__ method so that the mock can be indexed.

days.__getitem__.side_effect = lambda key: SORTED_DAYS_OF_WEEK[key]
days[2]

'Saturday'

Mock the __iter__ method so that the mock can be iterated over.

days.__iter__.return_value = iter(SORTED_DAYS_OF_WEEK)

for day in days:
  print(day)

Friday
Monday
Saturday
Sunday
Thursday
Tuesday
Wednesday
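One caveat: an iterator can only be consumed once, so a second loop over a mock configured this way produces nothing. If you’d rather have a mock that can be iterated repeatedly, one option is to hand __iter__ the list itself. A sketch of the difference, with placeholder data:

```python
from unittest.mock import MagicMock

letters = ["a", "b", "c"]

# Return value is an iterator: it is used up by the first pass.
once = MagicMock()
once.__iter__.return_value = iter(letters)
first_pass = list(once)
second_pass = list(once)

# Return value is a list: every pass starts from the beginning.
many = MagicMock()
many.__iter__.return_value = letters
repeat_one = list(many)
repeat_two = list(many)
```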

Give a meaningful value for the str() function.

days.__str__.return_value = "Sorted Days of the Week"

str(days)

'Sorted Days of the Week'

Right, that’s quite enough background. Let’s get down to testing some scrapers.

The Scrapers

For continuity we’ll build tests for the Quotes to Scrape scraper considered in previous posts. However, in order to illustrate a wider range of capabilities we’ll introduce a second scraper with a different architecture. The scraper below extracts the paginated data from Books to Scrape. Feel free to skip over this code for the moment.

import logging
import re
from typing import Iterator
from urllib.parse import urljoin

import bs4
import pandas as pd
import requests

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)7s] %(message)s",
)

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)

BASE_URL = "https://books.toscrape.com/catalogue/"


class BooksScraper:
    def __init__(self):
        self.client = requests.Session()

    def __del__(self):
        self.client.close()

    def download(self) -> Iterator[str]:
        page = 1
        while True:
            logging.info(f"Get page {page}.")
            url = urljoin(BASE_URL, f"page-{page}.html")
            response = self.client.get(url)
            if response.status_code != 200:
                break
            yield response.text
            page += 1

    def parse(self, html: str) -> list[dict]:
        soup = bs4.BeautifulSoup(html, "html.parser")

        def stars(tag):
            ratings = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}
            return ratings.get(tag.get("class")[1])

        return [
            {
                "title": book.select_one("h3 a")["title"],
                "url": urljoin(BASE_URL, book.select_one("h3 a")["href"]),
                "price": book.select_one(".product_price > p").text,
                "img": urljoin(BASE_URL, book.select_one("img")["src"]),
                "in_stock": book.select_one(".availability").text.strip() == "In stock",
                "stars": stars(book.select_one(".star-rating")),
            }
            for book in soup.select("article")
        ]

    def normalise(self, books: list[dict]) -> list[dict]:
        for book in books:
            # Remove currency.
            book["price"] = re.sub(r"^[^0-9]+", "", book["price"])
            # Convert to float.
            book["price"] = float(book["price"])

        return books

    def transform(self, books: list[dict]) -> pd.DataFrame:
        return pd.DataFrame(books).sort_values(by="title").reset_index(drop=True)

    def crawl(self) -> pd.DataFrame:
        parsed = []
        for html in self.download():
            parsed.extend(self.parse(html))

        normalised = self.normalise(parsed)
        return self.transform(normalised)


if __name__ == "__main__":
    scraper = BooksScraper()

    df = scraper.crawl()
    print(df.head())

The QuotesScraper downloader stored the HTML content in an object attribute, making the method a procedure with a side effect (but no return value) rather than a function. The BooksScraper downloader, by contrast, actually returns the HTML content.

The BooksScraper class has a crawl() method that orchestrates the scraping process. It retrieves HTML from download(), passing it to parse(), which extracts the required data and returns a list of dictionaries. That in turn is passed to normalise() for cleaning and another list of dictionaries is returned. Finally that’s passed to transform(), which returns a data frame. No data is stored in the object.

Reference HTML

Both the responses and vcr packages provide functionality for storing the reference HTML response in a YAML file. We don’t have that luxury now and we need to do it ourselves. We’ll use curl to harvest copies of the HTML content from the target sites and redirect the output to files. You could equally do this by saving the page directly from your browser.

curl https://quotes.toscrape.com/ >quotes-to-scrape.html
curl https://books.toscrape.com/ >books-to-scrape.html

Tests with Mocking

Mocking a Return Value

Let’s test the BooksScraper class. In order to decouple our tests from the target site we need to ensure that the download() method doesn’t actually issue a network request. We’ll mock the download() method to load the expected content from the file downloaded a moment ago.

from unittest.mock import Mock

import pandas as pd
import pytest

from scraper.books import BooksScraper

BOOKS = pd.read_csv("books-to-scrape.csv")
HTML = "books-to-scrape.html"


@pytest.fixture
def scraper():
    # Load the HTML from file.
    with open(HTML, "r") as f:
        html = f.read()

    bs = BooksScraper()
    # Mock the download() method, setting the return value to the HTML string.
    bs.download = Mock(return_value=html)

    return bs


def test_scraper(scraper):
    # This method is mocked.
    html = scraper.download()
    # The remaining methods use original implementation.
    parsed = scraper.parse(html)
    normalised = scraper.normalise(parsed)
    books = scraper.transform(normalised)

    assert books.equals(BOOKS)

Relative to the implementations using responses and vcr the test itself, test_scraper(), is clean and simple: no decorators or code for loading data from a YAML file.

All of the action is in the scraper() fixture. First the HTML is loaded from the HTML file. Then a BooksScraper object is created. Normally the download() method on this object retrieves HTML from https://books.toscrape.com/. However, in the test it’s replaced with a mocked method via the Mock class. When creating the Mock() object the return_value parameter is set to the content of the HTML file. Now, rather than making a network request, the download() method simply returns the loaded HTML. It’s fast and robust.

Mocking with Side Effects

As discussed earlier, the QuotesScraper class works differently, so the approach used above won’t work. The download() method doesn’t return a value, but instead sets an attribute.

from unittest.mock import Mock

import pandas as pd
import pytest

from scraper.quotes import QuotesScraper

QUOTES = pd.read_csv("quotes-to-scrape.csv")
HTML = "quotes-to-scrape.html"


@pytest.fixture
def scraper():
    with open(HTML, "r") as f:
        html = f.read()

    def mock_download():
        bs.html = html

    bs = QuotesScraper()
    bs.download = Mock(side_effect=mock_download)

    return bs


def test_scraper(scraper):
    scraper.download()

    scraper.parse()
    scraper.normalise()
    quotes = scraper.transform()

    assert quotes.equals(QUOTES)

The implementation of the scraper() fixture is analogous to that in the previous example. However, rather than using the return_value parameter when creating the Mock object we use side_effect. The mocked download() method assigns the loaded HTML to the html attribute on the QuotesScraper object. The inner function, mock_download(), is needed because the assignment has to happen when the mock is called, and passing a callable to side_effect is how a Mock runs code on each call.

Mocking a Generator

There’s a weakness in the BooksScraper test above: the mock returns a single HTML document. If you look back at the code for the download() method then you’ll see that it’s actually a generator, yielding HTML for each of a series of pages. To properly test this class we should really mock this behaviour.

from unittest.mock import Mock

import pandas as pd
import pytest

from scraper.books import BooksScraper

BOOKS = pd.read_csv("books-to-scrape.csv")
HTML = "books-to-scrape.html"


@pytest.fixture
def scraper():
    # Load the HTML from file.
    with open(HTML, "r") as f:
        html = f.read()

    def mock_download():
        yield html
        yield html
        yield html

    bs = BooksScraper()
    # Mock the download() method, using the generator function as the side effect.
    bs.download = Mock(side_effect=mock_download)

    return bs


def test_scraper(scraper):
    parsed = []
    for html in scraper.download():
        parsed.extend(scraper.parse(html))

    assert len(parsed) == 60

Once again we use the side_effect argument to create a Mock object, this time passing a generator function which yields three copies of the HTML document. The test simply checks that the content from each of those documents is parsed. A more complete test would also check that the normalise() and transform() methods work as intended.

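📌 This could be written more concisely by setting return_value to a list of pages, because the test only iterates over whatever download() returns. Passing the list to side_effect would not be equivalent: the mock would then return one page per call rather than something that yields all of the pages. I prefer the generator function because it’s explicit and mirrors the real method.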
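The difference between handing a list to side_effect and handing it to return_value is easy to check in isolation. In this sketch the page strings are placeholders standing in for the downloaded HTML.

```python
from unittest.mock import Mock

# Placeholder pages standing in for the downloaded HTML.
pages = ["<html>page 1</html>", "<html>page 2</html>", "<html>page 3</html>"]

# side_effect with a list: each call returns the next item.
per_call = Mock(side_effect=pages)
first_result = per_call()

# return_value with a list: a single call returns all items, ready for a for loop.
all_pages = Mock(return_value=pages)
collected = [page for page in all_pages()]
```

Only the second form behaves like a download() method whose result is looped over.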

Mocks are Flexible

The examples above have just scratched the surface of what’s possible using the Mock class. Here are some other bells and whistles.

Mocking with Reckless Abandon

Suppose we wanted to mock the response from https://dummyjson.com/user/1. To keep things simple we’ll only implement part of the response payload. First create a mocked response object. Start with a plain vanilla Mock object and then add status_code and text attributes.

import json
from unittest.mock import Mock

emily = {
  "id": 1,
  "firstName": "Emily",
  "lastName": "Johnson",
  "age": 28,
  "gender": "female"
}

mock_response = Mock()
mock_response.status_code = 200
mock_response.text = json.dumps(emily)

Assign the mock to the requests.get() function.

import requests

# Mock the requests.get() function.
requests.get = Mock(return_value=mock_response)

📢 Due to the simple way that we have mocked requests.get() we’ll get the same response regardless of the provided URL, so https://www.example.com/ will yield the same response.

Now when you call requests.get(), it returns the mock_response object.

response = requests.get("https://dummyjson.com/user/1")

Check the status code.

response.status_code

200

Looks good. What about the text?

response.text

'{"id": 1, "firstName": "Emily", "lastName": "Johnson", "age": 28, "gender": "female"}'

Nice!

For future reference, let’s see what happens if we request another attribute on the mock object.

response.previous

<Mock name='mock.previous' id='131922784306816'>

It works but it doesn’t return anything meaningful. Hold this in your working memory for a short while.
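Assigning directly to requests.get leaves the mock in place for the rest of the session. If you’d prefer the replacement to be undone automatically, unittest.mock offers patch.object() (and patch(), which takes a dotted name such as "requests.get"). The sketch below uses a hypothetical stand-in object called http, with a hypothetical real_get() function, so that it runs without the requests package or a network connection.

```python
from types import SimpleNamespace
from unittest.mock import Mock, patch


def real_get(url):
    # Hypothetical "real" function; it should never run inside the patch.
    raise RuntimeError("Network access is not expected in this sketch.")


# Hypothetical stand-in for the requests module.
http = SimpleNamespace(get=real_get)

scoped_response = Mock(status_code=200, text='{"id": 1}')

# Inside the with block http.get is a mock; afterwards the original is restored.
with patch.object(http, "get", return_value=scoped_response) as mock_get:
    status_inside = http.get("https://dummyjson.com/user/1").status_code
    mock_get.assert_called_once_with("https://dummyjson.com/user/1")

restored = http.get is real_get
```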

Mocking with Rules

The previous example illustrates just how flexible the Mock class can be. But perhaps you don’t want to be quite that flexible. Maybe you want to ensure that the mocked response object only has the attributes and methods of a “proper” response? Enter the spec parameter.

from requests import Response

mock_response = Mock(
  # The Response class is used as the specification template for the mocked object.
  spec=Response,
  # Mock a few attributes and methods.
  status_code=200,
  text=json.dumps(emily),
  json=lambda: emily
)

The text attribute works as before.

mock_response.text

'{"id": 1, "firstName": "Emily", "lastName": "Johnson", "age": 28, "gender": "female"}'

And I’ve added in a mocked json() method as well.

mock_response.json()

{'id': 1, 'firstName': 'Emily', 'lastName': 'Johnson', 'age': 28, 'gender': 'female'}

Both of these work because they’re part of the Response interface. What if we stray a little further?

# Will work because .next is part of the Response spec (won't return anything meaningful!).
mock_response.next
# Will not work because .previous is not part of the Response spec.
mock_response.previous

We can access the next attribute because it’s part of the Response interface, although it doesn’t return any useful information because it was not explicitly mocked. There is no previous attribute on a Response object, so trying to access it on the mock now raises an AttributeError 🚧.
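The spec parameter only guards reads. The stricter spec_set parameter, mentioned in passing earlier, guards writes as well. Here’s a sketch of the difference, using a hypothetical ResponseLike class as the specification so that it runs without the requests package.

```python
from unittest.mock import Mock


class ResponseLike:
    """Hypothetical stand-in for requests.Response, used only as a spec."""

    status_code = None
    text = None

    def json(self):
        raise NotImplementedError


# spec: reading an unknown attribute fails, but assigning one is still allowed.
loose = Mock(spec=ResponseLike, status_code=200)
try:
    loose.previous
    read_failed = False
except AttributeError:
    read_failed = True
loose.previous = "allowed"

# spec_set: assigning an unknown attribute fails as well.
strict = Mock(spec_set=ResponseLike, status_code=200)
try:
    strict.previous = "rejected"
    write_failed = False
except AttributeError:
    write_failed = True
```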

Conclusion

The mocking capabilities in unittest.mock are an excellent alternative to the responses and vcr packages if you want more granular control over what’s happening in your tests.

References:

There are two relevant sections in Effective Python by Brett Slatkin (second edition), “Use Mocks to Test Code with Complex Dependencies” and “Encapsulate Dependencies to Facilitate Mocking and Testing”.