Andrew B. Collier / @datawookie



Public datasets:


Hasler Statistics

An image of three kayak paddlers in triangle formation viewed from in front of the foremost paddler.

The distances of Hasler kayak races for various divisions are nominally 4, 8 and 12 miles. However, the actual distances vary to some degree from one venue to another, which makes it difficult to compare times across races. Using data from Paddle UK, I attempt to estimate the actual distances.

Read More →

Headless Browser Hacks

Sometimes a site will work fine with Selenium or Playwright until you try headless mode. Then it might fling up some anti-bot mechanism, or just stop responding altogether. Fortunately there are some simple things you can do to work around this.

These are the approaches that I usually take.
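For instance, a few Chrome flags go a long way. Here's a minimal sketch; the flag list and user agent string are illustrative, and in practice these would be passed to Selenium via `Options.add_argument()`:

```python
def stealth_chrome_flags(user_agent: str) -> list[str]:
    """Chrome flags that make a headless session look more like a real browser."""
    return [
        "--headless=new",                    # newer headless mode renders like a regular browser
        "--window-size=1920,1080",           # headless defaults to a small viewport
        f"--user-agent={user_agent}",        # replace the telltale "HeadlessChrome" UA token
        "--disable-blink-features=AutomationControlled",  # mask automation signals
    ]
```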

Read More →

Iterating over a Paginated List of Links

A grader working on a construction site.

A common web crawler requirement is to iterate over a paginated list of links, following each link to retrieve detailed data. For example:

  • iterating over a list of profiles on LinkedIn and then retrieving employment history from each profile;
  • traversing a list of products on Amazon and then retrieving reviews for each product; or
  • navigating the list of items on BBC News and retrieving the full articles.
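The core loop can be sketched generically. This is a minimal sketch in which `fetch_page` and `fetch_detail` are hypothetical callables that would wrap requests, Selenium or Playwright in a real crawler:

```python
from typing import Callable, Iterator


def crawl_paginated(
    fetch_page: Callable[[int], list[str]],
    fetch_detail: Callable[[str], dict],
) -> Iterator[dict]:
    """Walk numbered pages until one comes back empty, following each link."""
    page = 1
    while True:
        links = fetch_page(page)
        if not links:
            break  # an empty page signals the end of the listing
        for link in links:
            yield fetch_detail(link)
        page += 1
```

Yielding records lazily means you can persist each one as it arrives rather than holding the whole crawl in memory.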
Read More →

Handling HTML Entities and Unicode

A robot hand using a paint scraper to remove paint from a wall.

What if your text data is contaminated with Unicode characters and HTML entities? Ideally you want your persisted data to be pristine. Metaphorically it should be prêt à manger (ready to eat). In principle I also want my text to be as simple as possible: ASCII characters, nothing else. This is sometimes achievable without the loss of too much information.
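With Python's standard library this boils down to two steps: decode the entities, then strip the text down to ASCII. A minimal sketch (the function name is illustrative):

```python
import html
import unicodedata


def to_plain_ascii(text: str) -> str:
    """Decode HTML entities and reduce the result to plain ASCII."""
    # Turn entities like &amp; and &eacute; into real characters.
    text = html.unescape(text)
    # Decompose accented characters (é becomes e + combining accent), then
    # drop anything outside ASCII. The accents are lost, which is the point.
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode("ascii")
```

So `"pr&ecirc;t &agrave; manger"` comes out as `"pret a manger"`: some information is lost, but the text is as simple as it gets.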

Read More →

Scraping JSON-LD Data

A robot hand using a paint scraper to remove paint from a wall.

JSON-LD (JavaScript Object Notation for Linked Data) is a lightweight, flexible and standardised format intended to provide context and meaning to the data on a webpage. It’s easy and convenient for both humans and machines to read and write.
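Since JSON-LD lives in `<script type="application/ld+json">` tags, extraction is possible even with just the standard library. A minimal sketch (a real scraper might reach for BeautifulSoup instead):

```python
import json
from html.parser import HTMLParser


class JSONLDExtractor(HTMLParser):
    """Collect parsed records from <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_jsonld = False

    def handle_data(self, data):
        if self._in_jsonld and data.strip():
            self.records.append(json.loads(data))
```

Feed it a page with `parser.feed(html_doc)` and the structured data lands in `parser.records`.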

Read More →

Web Scraper Testing

A crash test dummy sitting at a computer in a cosy study.

Site evolution. DOM drift. Selector decay. XPath extinction. Scraper rot. CAPTCHA catastrophe. Anti-bot apocalypse.

Inevitably, even a carefully crafted web scraper will fail because the target site has changed in some way. Regular systematic testing is vital to ensure that you don't lose valuable data.
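A simple first line of defence is a smoke test on freshly scraped records. This is a minimal sketch; `check_scrape` and the field names are hypothetical, and in practice the check would run against a known stable page:

```python
def check_scrape(record: dict, required: tuple = ("title", "price")) -> list[str]:
    """Return a list of problems; an empty list means the scrape looks healthy."""
    problems = []
    for field in required:
        value = record.get(field)
        # A missing or blank field usually means a selector has decayed.
        if value is None or (isinstance(value, str) and not value.strip()):
            problems.append(f"missing field: {field}")
    return problems
```

Run something like this on a schedule and alert on any non-empty result, so selector decay shows up before the gaps in your data do.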

Read More →