Test a Playwright Web Scraper

In the previous post we considered a few approaches to testing a Selenium web scraper. Now we’ll do the same for web scrapers using Playwright.

Intercepting Requests

We needed to use Selenium Wire to intercept requests issued via Selenium. Playwright, however, has an integrated routing capability for managing requests and responses.

The route() method on a Page object is used to attach handlers to specific requests. For example, the code below shows two ways to attach the handler function to requests for PNG and JPG images.

import re

# Use a glob pattern to match the route.
page.route("**/*.{png,jpg}", handler)
# Use a regular expression to match the route.
page.route(re.compile(r"(\.png$)|(\.jpg$)"), handler)

You can also attach a handler that will be applied to all requests.

page.route("**/*", handler)
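
A handler can also be registered on a browser context, in which case it applies to every page created from that context:

# Context-level routing applies to every page in the context.
context.route("**/*", handler)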

The handler function will generally use one or more of the following routing methods to determine the action applied to a specific route:

  • .abort() — aborts the request;
  • .continue_() — sends the request to the network, optionally with modifications;
  • .fallback() — defers to the next matching handler (see the sketch after this list);
  • .fetch() — executes the request and returns the response to the handler without fulfilling the route (used in conjunction with .fulfill()); and
  • .fulfill() — fulfills the route with a given response.
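
To illustrate .fallback(), here's a minimal sketch with two handlers (ads.example.com is a hypothetical domain). Handlers run in the reverse order of their registration, and .fallback() passes control to the next matching handler:

def log_requests(route, request):
  print(request.url)
  route.continue_()

def block_ads(route, request):
  # Hypothetical ad domain; defer to the next handler for everything else.
  if "ads.example.com" in request.url:
    route.abort()
  else:
    route.fallback()

# Registered last, so block_ads runs first.
page.route("**/*", log_requests)
page.route("**/*", block_ads)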

Each of these methods accepts parameters that customise its behaviour, for example, by manipulating characteristics of the request such as its headers, payload or URL.
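
Here's a minimal sketch that adds a custom header to each outgoing request before passing it to the network (the x-test header is purely illustrative):

def handler(route, request):
  # Merge a custom header into the original request headers.
  headers = {**request.headers, "x-test": "1"}
  route.continue_(headers=headers)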

Example: Logging Requests

The following handler simply logs the request URL and then allows the request to proceed.

import logging

def handler(route, request):
  """Log and process request."""
  logging.debug(request.url)
  route.continue_()

# Attach the handler to all routes.
page.route("**/*", handler)

page.goto("https://example.com")

Example: Aborting Requests

Normally the action applied to a route will depend on the characteristics of the request. For example, abort requests to https://fonts.gstatic.com but process all others.

def handler(route, request):
  if "fonts.gstatic.com" in request.url:
    route.abort()
  else:
    route.continue_()
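
In practice it's often easier to filter on the request's resource type than on its URL. The following sketch aborts requests for images, fonts and stylesheets and processes everything else:

def handler(route, request):
  # resource_type values include "image", "font", "stylesheet", "script", ...
  if request.resource_type in {"image", "font", "stylesheet"}:
    route.abort()
  else:
    route.continue_()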

Example: Mocking a Response

Provide a mocked response for requests to https://api.ipify.org/?format=json (the browser normalises the URL to include the trailing slash, which is why the code below compares against that form).

import json

def handler(route, request):
  if request.url == "https://api.ipify.org/?format=json":
    body = {"ip": "127.0.0.1"}
    route.fulfill(
      status=200,
      content_type="application/json",
      body=json.dumps(body),
    )
  else:
    route.continue_()

That will return {"ip": "127.0.0.1"} regardless of the actual IP, and no request will be sent to https://api.ipify.org/.
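
To see the mock in action, attach the handler and then visit the API directly; the page body should contain the mocked payload:

page.route("**/*", handler)
page.goto("https://api.ipify.org/?format=json")
print(page.inner_text("body"))  # {"ip": "127.0.0.1"}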

Example: Modifying a Response

Using .fetch() and .fulfill() you can execute the original request but modify the response.

from datetime import datetime

def handler(route, request):
  if request.url == "https://api.ipify.org/?format=json":
    response = route.fetch()
    payload = response.json()
    payload["timestamp"] = datetime.now().isoformat()
    route.fulfill(response=response, json=payload)
  else:
    route.continue_()

That adds a timestamp field to the response from https://api.ipify.org/.

The Scraper

Enough preliminaries. Let’s get on to scraping. Here’s another version of the Books to Scrape scraper, this time using Playwright.

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import polars as pl


class BooksScraper:
    def __init__(self):
        self.playwright = sync_playwright().start()
        self.browser = self.playwright.chromium.launch(headless=True)
        self.context = self.browser.new_context()
        self.page = self.context.new_page()

    def __del__(self):
        self.page.close()
        self.context.close()
        self.browser.close()
        self.playwright.stop()

    def download(self, url: str) -> str:
        self.page.goto(url)
        self.page.wait_for_selector("ul.breadcrumb", timeout=10000)
        return self.page.content()

    def parse(self, html: str) -> list:
        soup = BeautifulSoup(html, "html.parser")
        # The href=True idiom belongs to find_all(); select() takes a CSS
        # attribute selector instead.
        return [a.get("href") for a in soup.select("h3 > a[href]")]

    def transform(self, links: list) -> pl.DataFrame:
        return pl.DataFrame({"url": links})


if __name__ == "__main__":
    scraper = BooksScraper()

    url = "https://books.toscrape.com/"

    html = scraper.download(url)
    links = scraper.parse(html)

    for link in links:
        print(link)

    df = scraper.transform(links)

    with pl.Config(fmt_str_lengths=120, tbl_formatting="MARKDOWN"):
        print(df.head())

This is very similar to the previous implementation of the scraper using Selenium.

Testing the Scraper

It’s possible to test this scraper using mocking and patching, following an approach similar to the one in the previous post; both of those techniques completely circumvent Playwright. If, however, we want to include Playwright in the tests then we’d use its integrated request routing.
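
For completeness, here's a sketch of the mocking approach. It assumes that the scraper class lives in a module named scraper (as the import in the test below suggests); patching sync_playwright means that no browser is ever launched:

from unittest.mock import patch

from scraper import BooksScraper

def test_parse_mocked():
    # sync_playwright is replaced by a MagicMock, so no real browser starts.
    with patch("scraper.sync_playwright"):
        scraper = BooksScraper()
        links = scraper.parse('<h3><a href="one.html">One</a></h3>')
    assert links == ["one.html"]

And here's the routing-based test.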

import pytest
import polars as pl
from scraper import BooksScraper

URL = "https://books.toscrape.com/"

LINKS = pl.read_csv("books-to-scrape.csv")


@pytest.fixture(scope="function")
def scraper():
    scraper = BooksScraper()

    with open("books-to-scrape.html") as f:
        MOCK_RESPONSE = f.read()

    def handler(route, request):
        if request.url == URL:
            route.fulfill(
                status=200, headers={"Content-Type": "text/html"}, body=MOCK_RESPONSE
            )
        else:
            route.abort()

    scraper.page.route("**/*", handler)

    yield scraper


def test_scraper(scraper):
    html = scraper.download(URL)

    links = scraper.parse(html)
    assert links == LINKS["url"].to_list()

    df = scraper.transform(links)
    assert df.equals(LINKS)

In the scraper fixture we create an instance of the BooksScraper class, then attach a handler to its page attribute. The handler fulfills requests to https://books.toscrape.com/ with a response loaded from a file and simply aborts all other requests.

🚨 It’s important to use the Page object already attached to the BooksScraper object. If you attempt to launch Playwright in the test and create another Page object then you’ll get an error:

It looks like you are using Playwright Sync API inside the asyncio loop.

Real World Test

The test for the Books to Scrape scraper illustrates the general principles. But the target site is simple and specifically set up for scraping.

What about a “real world” target site? We’ll return to the same page on the Avantor site considered previously. Here’s the test for the Playwright version of the scraper.

import json
import pytest
import polars as pl
from urllib.parse import urlparse
from pathlib import Path
from avantor import AvantorScraper

with open("products.json") as f:
    PRODUCTS_DICT = json.load(f)

PRODUCTS_TABLE = pl.read_csv("products.csv")

urls = {
    "/store/product/725074/citric-acid-anhydrous-99-5-acs": (
        "product.html",
        "text/html",
    ),
    "/cdn-cgi/scripts/7d0fa10a/cloudflare-static/rocket-loader.min.js": (
        "rocket-loader.min.js",
        "application/javascript",
    ),
    "/responsive/js/vendor/jquery-3.6.1.min.js": (
        "jquery.min.js",
        "application/javascript",
    ),
    "/responsive/js/unified-responsive-2022.min.js": (
        "unified-responsive.min.js",
        "application/javascript",
    ),
    "/store/services/catalog/json/stiboOrderTableRender.jsp": (
        "product-table.html",
        "text/html",
    ),
    "/store/services/pricing/json/skuPricing.jsp": (
        "sku-pricing.json",
        "application/json",
    ),
}


@pytest.fixture(scope="function")
def scraper():
    scraper = AvantorScraper("725074/citric-acid-anhydrous-99-5-acs")

    def handle_route(route, request):
        path = urlparse(request.url).path
        try:
            file_name, mime_type = urls[path]
            body = Path(file_name).read_bytes()
            route.fulfill(status=200, headers={"Content-Type": mime_type}, body=body)
        except KeyError:
            route.abort()

    scraper.page.route("**/*", handle_route)

    yield scraper


def test_scraper(scraper):
    html = scraper.download()

    products = scraper.parse(html)
    assert products == PRODUCTS_DICT

    df = scraper.transform(products)
    assert df.equals(PRODUCTS_TABLE)

Again we need to handle a set of six specific requests, while the rest can be aborted. The content required to fulfill each of these requests is loaded from a file, so no actual network requests are made.
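
Where do the fixture files come from? One approach (a sketch; the file naming scheme is just for illustration) is to record the live responses once with a response listener and save them for later test runs:

from pathlib import Path
from urllib.parse import urlparse

def save_response(response):
    # Derive a flat file name from the URL path and persist the body.
    # Note: redirect responses have no body and would need special handling.
    name = urlparse(response.url).path.strip("/").replace("/", "_") or "index.html"
    Path(name).write_bytes(response.body())

scraper.page.on("response", save_response)
html = scraper.download()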

Conclusion

Playwright’s integrated facility for intercepting requests makes it possible to write detailed tests. For many target sites you’ll need to intercept a few requests, while the majority (certainly those for images, fonts and CSS) can simply be aborted.