Web Scraper Testing

Site evolution. DOM drift. Selector decay. XPath extinction. Scraper rot. CAPTCHA catastrophe. Anti-bot apocalypse.

Inevitably, even a carefully crafted web scraper will fail because the target site has changed in some way. Regular, systematic testing is vital to ensure that you don’t lose valuable data.

The various types of testing are broken down below. In upcoming posts I’ll be digging into each of these and showing simple examples of how they can be applied in your workflows.

Unit Tests

Unit tests ensure that individual components (like request handling, parsing logic and data extraction) work as expected, independent of external factors. In the context of web scraping those “external factors” are primarily the network and (sometimes) unpredictable websites.

These tests will generally use a package like unittest.mock, responses or vcr to simulate HTTP requests. A static, controlled HTML response allows the test to focus exclusively on things that you can control (as opposed to the network and website, which are almost completely out of your control!). The goal of the tests is to ensure the scraper parses and transforms known content correctly without relying on the live site.
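For illustration, here’s a minimal sketch using the responses package to serve a canned HTML response. The URL, the parse_product() function and the HTML snippet are all hypothetical:

```python
import responses
import requests
from bs4 import BeautifulSoup

# Hypothetical parsing logic under test.
def parse_product(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    return {
        "name": soup.select_one("h1.product-name").get_text(strip=True),
        "price": float(soup.select_one("span.price").get_text(strip=True).lstrip("$")),
    }

@responses.activate
def test_parse_product():
    # Serve a static, controlled response instead of hitting the live site.
    responses.add(
        responses.GET,
        "https://example.com/product/1",
        body='<h1 class="product-name">Widget</h1><span class="price">$19.99</span>',
        status=200,
        content_type="text/html",
    )
    html = requests.get("https://example.com/product/1").text
    assert parse_product(html) == {"name": "Widget", "price": 19.99}
```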

As their name implies, unit tests are intended to validate the smallest individual components of the web scraper. Some things that should be considered in unit tests:

  • correct selection and parsing of elements;
  • handling of missing or unexpected elements; and
  • consistency of data types (for example, prices should be floats, dates should parse).
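Tests along these lines might look like the following, reusing the hypothetical parse_product() function from the sketch above (failing loudly on a missing element is an assumption about the desired behaviour):

```python
import pytest

def test_price_is_float():
    html = '<h1 class="product-name">Widget</h1><span class="price">$19.99</span>'
    assert isinstance(parse_product(html)["price"], float)

def test_missing_price_fails_loudly():
    # No price element at all: select_one() returns None and the parser blows up.
    html = '<h1 class="product-name">Widget</h1>'
    with pytest.raises(AttributeError):
        parse_product(html)
```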

Unit tests should be run whenever the scraper is modified. The more complete the suite of unit tests, the more likely it is to catch any regressions in the code.

Integration Tests

Unit tests operate in isolation and exercise the individual components of the web scraper. Integration tests, by contrast, ensure that all of those components work correctly together, taking into account the interaction between the scraper and the actual website. They should also ensure that the scraped data are being stored correctly, either in files or a database.

Typically integration tests include coverage of:

  • network requests (Can the scraper correctly fetch a webpage without getting blocked?);
  • parsing (Are data extracted correctly?);
  • consistency and completeness (Do the extracted data match a defined schema? Are there missing data?);
  • error handling (What happens if the site structure changes or access is denied?); and
  • persistence (Is the scraper able to write data to disk or a database?).
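Here’s what an end-to-end sketch might look like with pytest, assuming a hypothetical scrape() entry point that fetches, parses and persists results to a SQLite database:

```python
import sqlite3

import pytest

from myscraper import scrape  # hypothetical: fetch, parse and persist in one call

@pytest.mark.integration  # custom marker: run these separately from the unit tests
def test_scrape_end_to_end(tmp_path):
    db_path = tmp_path / "products.db"
    scrape("https://example.com/products", db_path)

    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT name, price FROM products").fetchall()
    conn.close()

    # Completeness: at least some records were written.
    assert rows
    # Schema consistency: every record has a name and a numeric price.
    for name, price in rows:
        assert name
        assert isinstance(price, float)
```

Because a test like this touches the live site, it’s usually run on a schedule or on demand rather than on every commit.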

Performance/Stress Tests

Performance tests are used to measure the scraper’s speed and efficiency and should address questions like:

  • How fast does it fetch and process data?
  • Can it handle large datasets efficiently?
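A crude but effective performance check is to time the critical path against saved pages and assert a budget. The sample_pages fixture and the 10 ms budget below are assumptions:

```python
import time

def test_parsing_throughput(sample_pages):
    # sample_pages: hypothetical fixture returning a list of saved HTML documents.
    start = time.perf_counter()
    for html in sample_pages:
        parse_product(html)
    elapsed = time.perf_counter() - start

    # Assumed budget: parsing should average under 10 ms per page.
    assert elapsed / len(sample_pages) < 0.01
```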

Stress tests assess whether both the scraper and the website can tolerate requests being sent at a specific rate. Will the scraper scale to a high request rate? There are two components to these tests:

  • Will the website permit a high volume of requests? As the request rate increases, the likelihood that you will run into anti-bot mechanisms rises rapidly.
  • Is your local infrastructure able to handle the processing of a high volume of requests?
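One way to probe both components is with asyncio and aiohttp, using a semaphore to cap the number of requests in flight. A sketch (the URL, request counts and 5% failure threshold are all assumptions):

```python
import asyncio

import aiohttp

async def stress_test(url: str, total: int = 200, concurrency: int = 20) -> float:
    """Send `total` requests with at most `concurrency` in flight; return the failure rate."""
    semaphore = asyncio.Semaphore(concurrency)
    failures = 0

    async def fetch(session: aiohttp.ClientSession) -> None:
        nonlocal failures
        async with semaphore:
            try:
                async with session.get(url) as response:
                    # 403 or 429 responses often signal anti-bot throttling.
                    if response.status != 200:
                        failures += 1
            except aiohttp.ClientError:
                failures += 1

    async with aiohttp.ClientSession() as session:
        await asyncio.gather(*(fetch(session) for _ in range(total)))
    return failures / total

# Fail the test if more than 5% of requests are rejected at this rate.
# assert asyncio.run(stress_test("https://example.com/products")) < 0.05
```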

Security Tests

Security tests ensure that a web scraper doesn’t expose sensitive data and complies with legal or ethical requirements. Typically these tests should check that the scraper

  • respects the robots.txt file;
  • doesn’t violate the Terms and Conditions of the site;
  • securely stores any secrets or credentials used to authenticate;
  • doesn’t expose anything sensitive in logs;
  • doesn’t unintentionally store sensitive user data and is compliant with GDPR;
  • applies SSL/TLS verification;
  • implements data sanitization; and
  • guards against attacks like SQL injection or malicious JavaScript.
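Some of these checks are easy to automate. For example, the standard library’s urllib.robotparser can verify that the scraper’s target URLs (hypothetical here) are permitted by robots.txt:

```python
from urllib.robotparser import RobotFileParser

def test_respects_robots_txt():
    # Hypothetical list of URLs that the scraper is configured to visit.
    target_urls = ["https://example.com/products", "https://example.com/reviews"]
    user_agent = "my-scraper"

    parser = RobotFileParser()
    parser.set_url("https://example.com/robots.txt")
    parser.read()  # Fetch and parse the robots.txt file.

    for url in target_urls:
        assert parser.can_fetch(user_agent, url)
```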

Conclusion

Tests are often an afterthought. The fun bit is writing the scraper itself. Testing might not be that much fun. However, having a collection of tests to ensure that a scraper continues to do its job is vital. I’d much rather hear 🔔 warning bells 📢 and 🚨 sirens ⚠️ now than learn in the future that there’s an enormous gap in my data because a scraper has been silently failing. That would not be fun at all.

Take a look at the upcoming posts in this series to see how these tests can be applied in practice.