Web Scraper Testing

Site evolution. DOM drift. Selector decay. XPath extinction. Scraper rot. CAPTCHA catastrophe. Anti-bot apocalypse.

Inevitably, even a carefully crafted web scraper will fail because the target site has changed in some way. Regular, systematic testing is vital to ensure that you don’t lose valuable data.

The various types of testing are broken down below. In upcoming posts I’ll be digging into each of these and showing simple examples of how they can be applied in your workflows.

Unit Tests

Unit tests ensure that individual components (like request handling, parsing logic and data extraction) work as expected, independent of external factors. In the context of web scraping those “external factors” are primarily the network and (sometimes) unpredictable websites.

These tests will generally use a package like unittest.mock, responses or vcr to simulate HTTP requests. A static, controlled HTML response allows the test to focus exclusively on things that you can control (as opposed to the network and website, which are almost completely out of your control!). The goal of the tests is to ensure the scraper parses and transforms known content correctly without relying on the live site.
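For illustration, here’s a minimal sketch using the responses package to serve a canned HTML response. The URL, the parse_product() function and the HTML snippet are all hypothetical:

```python
import responses
import requests
from bs4 import BeautifulSoup

# Hypothetical parsing logic under test.
def parse_product(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    return {
        "name": soup.select_one("h1.product-name").get_text(strip=True),
        "price": float(soup.select_one("span.price").get_text(strip=True).lstrip("$")),
    }

@responses.activate
def test_parse_product():
    # Serve a static, controlled response instead of hitting the live site.
    responses.add(
        responses.GET,
        "https://example.com/product/1",
        body='<h1 class="product-name">Widget</h1><span class="price">$19.99</span>',
        status=200,
        content_type="text/html",
    )
    html = requests.get("https://example.com/product/1").text
    assert parse_product(html) == {"name": "Widget", "price": 19.99}
```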

As their name implies, unit tests are intended to validate the smallest individual components of the web scraper. Some things that should be considered in unit tests:

  • correct selection and parsing of elements;
  • handling of missing or unexpected elements; and
  • consistency of data types (for example, prices should be floats, dates should parse).
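Tests along these lines might look like the following, reusing the hypothetical parse_product() function from the sketch above (failing loudly on a missing element is an assumption about the desired behaviour):

```python
import pytest

def test_price_is_float():
    html = '<h1 class="product-name">Widget</h1><span class="price">$19.99</span>'
    assert isinstance(parse_product(html)["price"], float)

def test_missing_price_fails_loudly():
    # No price element at all: select_one() returns None and the parser blows up.
    html = '<h1 class="product-name">Widget</h1>'
    with pytest.raises(AttributeError):
        parse_product(html)
```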

Unit tests should be run whenever the scraper is modified. The more complete the suite of unit tests, the more likely it is to catch any regressions in the code.

Integration Tests

Unit tests operate in isolation and exercise the individual components of the web scraper. Integration tests, by contrast, ensure that all of those components work correctly together, taking into account the interaction between the scraper and the actual website. They should also ensure that the scraped data are being stored correctly, either in files or a database.

Typically integration tests include coverage of:

  • network requests (Can the scraper correctly fetch a webpage without getting blocked?);
  • parsing (Are data extracted correctly?);
  • consistency and completeness (Do the extracted data match a defined schema? Are there missing data?);
  • error handling (What happens if the site structure changes or access is denied?); and
  • persistence (Is the scraper able to write data to disk or a database?).
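Here’s what an end-to-end sketch might look like with pytest, assuming a hypothetical scrape() entry point that fetches, parses and persists results to a SQLite database:

```python
import sqlite3

import pytest

from myscraper import scrape  # hypothetical: fetch, parse and persist in one call

@pytest.mark.integration  # custom marker: run these separately from the unit tests
def test_scrape_end_to_end(tmp_path):
    db_path = tmp_path / "products.db"
    scrape("https://example.com/products", db_path)

    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT name, price FROM products").fetchall()
    conn.close()

    # Completeness: at least some records were written.
    assert rows
    # Schema consistency: every record has a name and a numeric price.
    for name, price in rows:
        assert name
        assert isinstance(price, float)
```

Because a test like this touches the live site, it’s usually run on a schedule or on demand rather than on every commit.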

Performance/Stress Tests

Performance tests are used to measure the scraper’s speed and efficiency and should address questions like:

  • How fast does it fetch and process data?
  • Can it handle large datasets efficiently?
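A crude but effective performance check is to time the critical path against saved pages and assert a budget. The sample_pages fixture and the 10 ms budget below are assumptions:

```python
import time

def test_parsing_throughput(sample_pages):
    # sample_pages: hypothetical fixture returning a list of saved HTML documents.
    start = time.perf_counter()
    for html in sample_pages:
        parse_product(html)
    elapsed = time.perf_counter() - start

    # Assumed budget: parsing should average under 10 ms per page.
    assert elapsed / len(sample_pages) < 0.01
```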

Stress tests assess whether both the scraper and the website can tolerate requests being sent at a specific rate. Will the scraper scale to a high request rate? There are two components to these tests:

  • Will the website permit a high volume of requests? As the request rate increases, the likelihood that you will run into anti-bot mechanisms rises rapidly.
  • Is your local infrastructure able to handle the processing of a high volume of requests?
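One way to probe both components is with asyncio and aiohttp, using a semaphore to cap the number of requests in flight. A sketch (the URL, request counts and 5% failure threshold are all assumptions):

```python
import asyncio

import aiohttp

async def stress_test(url: str, total: int = 200, concurrency: int = 20) -> float:
    """Send `total` requests with at most `concurrency` in flight; return the failure rate."""
    semaphore = asyncio.Semaphore(concurrency)
    failures = 0

    async def fetch(session: aiohttp.ClientSession) -> None:
        nonlocal failures
        async with semaphore:
            try:
                async with session.get(url) as response:
                    # 403 or 429 responses often signal anti-bot throttling.
                    if response.status != 200:
                        failures += 1
            except aiohttp.ClientError:
                failures += 1

    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session) for _ in range(total)))
    return failures / total

# Fail the test if more than 5% of requests are rejected at this rate.
# assert asyncio.run(stress_test("https://example.com/products")) < 0.05
```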

Security Tests

Security tests ensure that a web scraper doesn’t expose sensitive data and complies with legal or ethical requirements. Typically these tests should check that the scraper

  • respects the robots.txt file;
  • doesn’t violate the Terms and Conditions of the site;
  • securely stores any secrets or credentials used to authenticate;
  • doesn’t expose anything sensitive in logs;
  • doesn’t unintentionally store sensitive user data and is compliant with GDPR;
  • applies SSL/TLS verification;
  • implements data sanitization; and
  • guards against attacks like SQL injection or malicious JavaScript.
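Some of these checks are easy to automate. For example, the standard library’s urllib.robotparser can verify that the scraper’s target URLs (hypothetical here) are permitted by robots.txt:

```python
from urllib.robotparser import RobotFileParser

def test_respects_robots_txt():
    # Hypothetical list of URLs that the scraper is configured to visit.
    target_urls = ["https://example.com/products", "https://example.com/reviews"]
    user_agent = "my-scraper"

    parser = RobotFileParser()
    parser.set_url("https://example.com/robots.txt")
    parser.read()  # Fetch and parse the robots.txt file.

    for url in target_urls:
        assert parser.can_fetch(user_agent, url)
```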

Conclusion

Tests are often an afterthought. The fun bit is writing the scraper itself. Testing might not be that much fun. However, having a collection of tests to ensure that a scraper continues to do its job is vital. I’d much rather hear 🔔 warning bells 📢 and 🚨 sirens ⚠️ now than learn in the future that there’s an enormous gap in my data because a scraper has been silently failing. That would not be fun at all.

Take a look at the upcoming posts in this series to see how these tests can be applied in practice.