Test a Web Scraper using VCR

In the previous post I used the responses package to mock HTTP responses, producing tests that were quick and stable. Now I’ll look at an alternative approach to mocking using VCR.py.

VCR.py is the Python version of the original VCR library for Ruby. It supports requests from a selection of HTTP libraries, including

boto3
http.client
requests
urllib3 and
httpx.

📢 Like responses, it doesn’t support requests made via browser-based tools like Selenium and Playwright.

The documentation for VCR.py is comprehensive and well worth consulting if you plan on using it for your tests.

Install

How you install the vcr package will depend on your package manager. I’ll assume that you’re using pip3.

pip3 install vcrpy

After that you will be able to import the package.

import vcr

How VCR.py Works

The vcr package is like a video recorder for HTTP requests. It records network requests and their associated responses, and then replays the responses during tests so you don’t have to make repeated HTTP requests.

I realise that this explanation might not help if you’re not familiar with what a “video recorder” actually is! This is now an antiquated device, wildly popular in my youth, used to record television programmes so that you could watch (and rewatch and potentially re-rewatch) at a later date. Recordings were made onto cassettes (image below). These cassettes contain spools of magnetic tape. The VCR reads and writes analog recordings on the tape. As is the case with many analog technologies, the quality of the recordings typically deteriorates with time and use.

The implementation of the vcr package is different in the sense that recordings are digital. The underlying principles are the same though.

This is typically the way that vcr works for testing:

You run a test for the first time. It makes an HTTP request and vcr records both the request parameters and the response into a cassette file.
Every subsequent time you run the test, vcr will locate the matching request in the cassette file and simply return the recorded response without having to make an HTTP request.
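To make the record-then-replay cycle concrete, here is a toy sketch of the idea in plain Python. It is not how vcr is implemented: the ToyCassette class, the fake_fetch function, the placeholder URL and the JSON file layout are all hypothetical, chosen only to illustrate the two steps above without touching the network.

```python
import json
import os
import tempfile


class ToyCassette:
    """A toy stand-in for a cassette: maps "METHOD url" to a stored response body."""

    def __init__(self, path, fetch):
        self.path = path
        self.fetch = fetch  # Hypothetical callable that performs the real request.
        self.recordings = {}
        if os.path.exists(path):
            with open(path) as f:
                self.recordings = json.load(f)

    def request(self, method, url):
        key = f"{method} {url}"
        if key not in self.recordings:
            # First run: make the real request and record the response.
            self.recordings[key] = self.fetch(method, url)
            with open(self.path, "w") as f:
                json.dump(self.recordings, f)
        # Subsequent runs: replay the recorded response.
        return self.recordings[key]


calls = []


def fake_fetch(method, url):
    # Stands in for a real HTTP request so that the sketch runs offline.
    calls.append((method, url))
    return "<html>quotes</html>"


path = os.path.join(tempfile.mkdtemp(), "toy-cassette.json")
url = "https://example.com/quotes"  # Placeholder URL.

first = ToyCassette(path, fake_fetch).request("GET", url)
second = ToyCassette(path, fake_fetch).request("GET", url)

print(first == second)  # The replayed response matches the recorded one.
print(len(calls))       # Number of times the "network" was actually hit.
```

The second ToyCassette finds the recording on disk and never calls fake_fetch, which is the behaviour you want from a test that runs repeatedly.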

The Scraper

We’ll build tests for the same scraper considered in the previous post, which extracted data from the Quotes to Scrape website.

Tests with VCR

Let’s jump right into it. Below are the tests using the vcr package to manage mocking of responses. The structure differs slightly from the previous post:

There’s now a test for the export() method.
I’m using class-based rather than function-based tests.

import logging
import os

import pandas as pd
import pytest
import vcr

from scraper.quotes import QuotesScraper

# Load the reference data.
QUOTES = pd.read_csv("quotes-to-scrape.csv")

CASSETTE = "quotes-cassette.yaml"


@pytest.fixture
def scraper():
    return QuotesScraper()


@pytest.fixture
def filename(tmp_path):
    # Uses tmp_path fixture provided by pytest.
    path = os.path.join(tmp_path, "quotes.csv")
    yield path
    # Clean up, but only if the test actually wrote the file.
    if os.path.exists(path):
        os.remove(path)


@pytest.mark.usefixtures("scraper")
class TestQuotesScraper:
    def test_transform(self, scraper):
        with vcr.use_cassette(CASSETTE, record_mode="once"):
            scraper.download()
            scraper.parse()
            scraper.normalise()
            data = scraper.transform()

        assert data.equals(QUOTES)

    def test_export(self, scraper, filename):
        with vcr.use_cassette(CASSETTE, record_mode="once"):
            scraper.download()
            scraper.parse()
            scraper.normalise()
            logging.info(f"Writing data to {filename}.")
            scraper.export(filename)

        assert pd.read_csv(filename).equals(QUOTES)

Within each test the vcr.use_cassette() method creates a context within which all HTTP requests are managed by the vcr package. This method has one mandatory argument, which is the path to the cassette file where the mocked responses are stored. In this case it’s quotes-cassette.yaml. The optional record_mode argument specifies how responses are recorded and accepts the following values:

"once" — only record responses if there’s no existing cassette file (this is what I generally use);
"new_episodes" — only record new responses but return any previously recorded responses;
"none" — don’t record any responses; and
"all" — record all responses but never return previously recorded responses (this effectively just keeps a record of the most recent responses but doesn’t do any mocking).

The above example illustrates the vcr features that should satisfy the majority of your testing requirements.
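The record modes listed above can be summarised as a small decision table. The sketch below is a simplified model of those descriptions, not code taken from vcr: the decide() function and its arguments are hypothetical. The "error" outcome reflects my understanding that vcr raises an exception when it sees a request that it can neither replay nor record.

```python
def decide(record_mode, cassette_exists, request_in_cassette):
    """Return "replay", "record" or "error" for a single request (simplified model)."""
    if record_mode == "all":
        # Always record, never replay.
        return "record"
    if request_in_cassette:
        # Every other mode replays a response that has already been recorded.
        return "replay"
    if record_mode == "new_episodes":
        # Unseen requests are recorded alongside the existing ones.
        return "record"
    if record_mode == "once" and not cassette_exists:
        # No cassette file yet, so this is the initial recording run.
        return "record"
    # "none", or "once" with an existing cassette that lacks this request.
    return "error"


for mode in ["once", "new_episodes", "none", "all"]:
    for exists, recorded in [(False, False), (True, False), (True, True)]:
        outcome = decide(mode, exists, recorded)
        print(f"{mode:13} cassette={exists!s:5} recorded={recorded!s:5} -> {outcome}")
```

Reading the table for "once" shows why it is a sensible default: the first run records, later runs replay, and an unexpected new request is flagged rather than silently hitting the network.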

Recording a Cassette

As mentioned a little earlier, the first time that you run the test there’s no cassette to load. Unless you have specified "none" for the record mode, vcr will make a real HTTP request and record the response in a cassette. With "none" the test will fail, because vcr has nothing to replay and is not allowed to record.

A couple of pytest plugins might interfere with this process though.

If you’re using the pytest-socket plugin then the request will likely be blocked. You can temporarily allow the request using the --allow-hosts option. If there’s more than one IP then just provide a comma-separated list.

pytest --allow-hosts=35.211.122.109

If you’re using the pytest-timeout plugin with a stringent timeout setting and you’re on a sluggish network then you might also run into a timeout error. You can temporarily disable this by running pytest with the --timeout=0 argument.

Configuration

The vcr package defines a VCR class. The vcr.use_cassette() method uses a global instance of the VCR class. You can, however, create your own VCR object with specific configuration options.

# Example of creating a VCR object with specific configuration options.
#
vhs = vcr.VCR(
    serializer="yaml",
    cassette_library_dir=".",
    record_mode="once",
    match_on=["method", "scheme", "host", "port", "path", "query"],
)

Now, rather than using vcr.use_cassette() to create a context you’d use vhs.use_cassette(). You would not need to specify the record_mode argument since it’s already handled via the constructor arguments.

Useful constructor parameters:

serializer — the format used for recording responses (either "yaml" or "json");
cassette_library_dir — the location of the cassette files;
record_mode — how responses are recorded (see above for more details); and
match_on — how requests are matched to responses in the cassette (the above options are the default). This can be used to implement more or less flexible matching. There are a few other options ("uri", "raw_body", "body" and "headers") which can be used at your discretion.

If neither YAML nor JSON is suitable for serialising your data then you can add your own custom serialiser. You can also create bespoke matchers if the existing options don’t quite cut the mustard.

Conclusion

Using the vcr package will result in faster and more reliable tests. You’ll also be able to run your tests as often as you like without any concerns about getting blocked or rate limited.

The vcr package feels more streamlined than responses. The fact that the same code effectively creates and then subsequently reads the cassettes is extremely convenient. If I had to choose between the two, I’d definitely go with vcr.

Some related projects that are worth considering:

pytest-vcr — A pytest plugin for using VCR, providing a decorator which simplifies cassette management.
pytest-recording — Another pytest plugin for using VCR, which provides some useful fixtures.
betamax — A project like VCR but only supporting the requests HTTP library.

FAQ

Q. I’m getting an error message about decompressing data.

The error message might look something like this:

Received response with content-encoding: gzip, but failed to decode it.
Error -3 while decompressing data: incorrect header check

I had a case like this and found that in the cassette response I had

    headers:
      Content-Encoding:
      - gzip
      Content-Type:
      - application/json

The content encoding indicated that the response was compressed. However, the actual data in the response was not compressed! Removing the Content-Encoding header did the job.
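If you want to confirm that this is what’s happening before editing a cassette, you can check whether the recorded body really is gzip data. Gzip streams always start with the two bytes 0x1f 0x8b. The looks_gzipped() helper below is hypothetical and not part of vcr, and how you extract the body bytes from your cassette is left out.

```python
import gzip

# Every gzip stream starts with these two magic bytes.
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(body: bytes) -> bool:
    """Return True if the body starts with the gzip magic bytes."""
    return body[:2] == GZIP_MAGIC


plain = b'{"quotes": []}'
compressed = gzip.compress(plain)

print(looks_gzipped(plain))       # False: a gzip header on this body would be wrong.
print(looks_gzipped(compressed))  # True: a gzip header on this body is correct.
```

If the cassette declares Content-Encoding: gzip but the body fails this check, the header and the data disagree, which is the situation described above.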