
In the previous post we considered a few approaches to testing a Selenium web scraper. Now we’ll do the same for web scrapers using Playwright.
Intercepting Requests
We needed to use Selenium Wire to intercept requests issued via Selenium. Playwright, however, has an integrated routing capability for managing requests and responses.
The route() method on a Page object is used to attach handlers to specific requests. For example, the code below shows two ways to attach the handler function to requests for PNG and JPG images.
import re

# Using globbing to match the route.
page.route("**/*.{png,jpg}", handler)

# Using a regular expression to match the route.
page.route(re.compile(r"(\.png$)|(\.jpg$)"), handler)
You can also attach a handler that will be applied to all requests.
page.route("**/*", handler)
The handler function will generally use one or more of the following routing methods to determine the action applied to a specific route:

- .abort() — aborts the request;
- .continue_() — passes the request on to the network;
- .fallback() — allows other handlers to be applied;
- .fetch() — executes the request but doesn’t return the response (used in conjunction with .fulfill()); and
- .fulfill() — returns a response.
Each of these methods provides a selection of parameters that can be used to customise its behaviour, for example by manipulating characteristics of the request such as its headers, payload or URL.
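For instance, a handler could use continue_() to forward the request with an extra header. This is just an illustration (the header name below is made up):

def handler(route, request):
    # Forward the request with an additional (illustrative) header.
    headers = {**request.headers, "x-example-header": "testing"}
    route.continue_(headers=headers)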
Example: Logging Requests
The following handler simply logs the request URL and then lets the request proceed as normal.
import logging

def handler(route, request):
    """Log and process request."""
    logging.debug(request.url)
    route.continue_()

# Attach the handler to all routes.
page.route("**/*", handler)

page.goto("https://example.com")
Example: Aborting Requests
Specific actions will normally be applied to route depending on the characteristics of request. For example, abort requests to https://fonts.gstatic.com but process all others.
def handler(route, request):
    if "fonts.gstatic.com" in request.url:
        route.abort()
    else:
        route.continue_()
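Requests can also be filtered by type rather than URL. A handler along these lines blocks images, fonts and stylesheets, which is often all a scraper needs to skip:

BLOCKED_TYPES = {"image", "font", "stylesheet"}

def handler(route, request):
    # Abort static assets and pass everything else through to the network.
    if request.resource_type in BLOCKED_TYPES:
        route.abort()
    else:
        route.continue_()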
Example: Mock Response
Provide a mocked response for requests to https://api.ipify.org?format=json.
import json

def handler(route, request):
    if request.url == "https://api.ipify.org/?format=json":
        body = {"ip": "127.0.0.1"}
        route.fulfill(
            status=200,
            content_type="application/json",
            body=json.dumps(body),
        )
    else:
        route.continue_()
That will return {"ip": "127.0.0.1"} regardless of the actual IP, and no request will be sent to https://api.ipify.org/.
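With the handler attached via page.route("**/*", handler), the mock is easy to check because goto() returns the (fulfilled) response:

page.route("**/*", handler)

response = page.goto("https://api.ipify.org/?format=json")
print(response.json())  # {'ip': '127.0.0.1'}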
Example: Modify Response
Using .fetch() and .fulfill() you can execute the original request but modify the response.
from datetime import datetime

def handler(route, request):
    if request.url == "https://api.ipify.org/?format=json":
        # Execute the real request, add an extra field, then fulfill with the result.
        response = route.fetch()
        data = response.json()
        data["timestamp"] = datetime.now().isoformat()
        route.fulfill(response=response, json=data)
    else:
        route.continue_()
That adds a timestamp field to the response from https://api.ipify.org/.
The Scraper
Enough preliminaries. Let’s get onto scraping. Here’s another version of the Books to Scrape scraper using Playwright.
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
import polars as pl


class BooksScraper:
    def __init__(self):
        self.playwright = sync_playwright().start()
        self.browser = self.playwright.chromium.launch(headless=True)
        self.context = self.browser.new_context()
        self.page = self.context.new_page()

    def __del__(self):
        self.page.close()
        self.context.close()
        self.browser.close()
        self.playwright.stop()

    def download(self, url: str) -> str:
        self.page.goto(url)
        self.page.wait_for_selector("ul.breadcrumb", timeout=10000)
        return self.page.content()

    def parse(self, html: str) -> list:
        soup = BeautifulSoup(html, "html.parser")
        return [a.get("href") for a in soup.select("h3 > a[href]")]

    def transform(self, links: list) -> pl.DataFrame:
        return pl.DataFrame({"url": links})


if __name__ == "__main__":
    scraper = BooksScraper()

    url = "https://books.toscrape.com/"

    html = scraper.download(url)
    links = scraper.parse(html)
    for link in links:
        print(link)

    df = scraper.transform(links)
    with pl.Config(fmt_str_lengths=120, tbl_formatting="MARKDOWN"):
        print(df.head())
This is very similar to the previous implementation of the scraper using Selenium.
Testing the Scraper
It’s possible to test this scraper using mocking and patching, taking an approach similar to that used in the previous post; both of those techniques completely circumvent Playwright. If, however, we want to include Playwright in the tests then we’d use its integrated request routing.
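For reference, the mock-and-patch version might look something like the sketch below. The patched import path and the books-to-scrape.html fixture file are assumptions, and no browser is ever launched:

from unittest.mock import MagicMock, patch

from scraper import BooksScraper


def test_parse_without_playwright():
    # Patch sync_playwright in the scraper module so __init__ never launches a browser.
    with patch("scraper.sync_playwright", MagicMock()):
        scraper = BooksScraper()

        with open("books-to-scrape.html") as f:
            html = f.read()

        links = scraper.parse(html)
        # The landing page lists twenty books (assuming the fixture is a snapshot of it).
        assert len(links) == 20

The fixture and test below keep Playwright in the loop instead, serving the same snapshot through request routing.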
import pytest
import polars as pl

from scraper import BooksScraper

URL = "https://books.toscrape.com/"

LINKS = pl.read_csv("books-to-scrape.csv")


@pytest.fixture(scope="function")
def scraper():
    scraper = BooksScraper()

    with open("books-to-scrape.html") as f:
        MOCK_RESPONSE = f.read()

    def handler(route, request):
        if request.url == URL:
            route.fulfill(
                status=200, headers={"Content-Type": "text/html"}, body=MOCK_RESPONSE
            )
        else:
            route.abort()

    scraper.page.route("**/*", handler)

    yield scraper


def test_scraper(scraper):
    html = scraper.download(URL)

    links = scraper.parse(html)
    assert links == LINKS["url"].to_list()

    df = scraper.transform(links)
    assert df.equals(LINKS)
In the scraper fixture we create an instance of the BooksScraper class, then attach a handler to its page property. The handler returns a response from a file for requests to https://books.toscrape.com/ and simply aborts all other requests.
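As an aside, the books-to-scrape.html fixture can be captured once with a small throwaway script along these lines (a sketch, run separately rather than inside the pytest session):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://books.toscrape.com/")
    # Persist the rendered HTML so the tests never need the network.
    with open("books-to-scrape.html", "w") as f:
        f.write(page.content())
    browser.close()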
🚨 It’s important to use the Page object already attached to the BooksScraper object. If you attempt to launch Playwright in the test and create another Page object then you’ll get an error:
It looks like you are using Playwright Sync API inside the asyncio loop.
Real World Test
The test for the Books to Scrape scraper illustrates the general principles. But the target site is simple and specifically set up for scraping.
What about a “real world” target site? We’ll return to the same page on the Avantor site considered previously. Here’s the test for the Playwright version of the scraper.
import json
import pytest
import polars as pl

from urllib.parse import urlparse
from pathlib import Path

from avantor import AvantorScraper

PRODUCTS_DICT = json.load(open("products.json"))
PRODUCTS_TABLE = pl.read_csv("products.csv")

urls = {
    "/store/product/725074/citric-acid-anhydrous-99-5-acs": (
        "product.html",
        "text/html",
    ),
    "/cdn-cgi/scripts/7d0fa10a/cloudflare-static/rocket-loader.min.js": (
        "rocket-loader.min.js",
        "application/javascript",
    ),
    "/responsive/js/vendor/jquery-3.6.1.min.js": (
        "jquery.min.js",
        "application/javascript",
    ),
    "/responsive/js/unified-responsive-2022.min.js": (
        "unified-responsive.min.js",
        "application/javascript",
    ),
    "/store/services/catalog/json/stiboOrderTableRender.jsp": (
        "product-table.html",
        "text/html",
    ),
    "/store/services/pricing/json/skuPricing.jsp": (
        "sku-pricing.json",
        "application/json",
    ),
}


@pytest.fixture(scope="function")
def scraper():
    scraper = AvantorScraper("725074/citric-acid-anhydrous-99-5-acs")

    def handle_route(route, request):
        path = urlparse(request.url).path
        try:
            file_name, mime_type = urls[path]
            body = Path(file_name).read_bytes()
            route.fulfill(status=200, headers={"Content-Type": mime_type}, body=body)
        except KeyError:
            route.abort()

    scraper.page.route("**/*", handle_route)

    yield scraper


def test_scraper(scraper):
    html = scraper.download()

    products = scraper.parse(html)
    assert products == PRODUCTS_DICT

    df = scraper.transform(products)
    assert df.equals(PRODUCTS_TABLE)
Again we need to handle a set of six specific requests, while the rest can be aborted. The content required to fulfill each of these requests is loaded from a file, so no actual network requests are required.
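Incidentally, one way those fixture files might be captured is with a one-off recording run that reuses the same urls mapping: let each request of interest through with fetch(), save its body, then fulfill with the real response. This is just a sketch, not part of the test suite above:

def recorder(route, request):
    path = urlparse(request.url).path
    if path in urls:
        # Execute the real request, save the body as a fixture file, then fulfill.
        response = route.fetch()
        file_name, _ = urls[path]
        Path(file_name).write_bytes(response.body())
        route.fulfill(response=response)
    else:
        route.continue_()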
Conclusion
Playwright’s integrated facility for intercepting requests makes it possible to write detailed tests. For many target sites you’ll need to intercept a few requests, while the majority (certainly those for images, fonts and CSS) can simply be aborted.
