Andrew B. Collier / @datawookie



Playwright Browser Footprint

An elephant footprint in muddy ground with footprints of smaller animals.

Playwright launches a browser. And browsers can be resource-hungry beasts.

I often run Playwright on small, resource-constrained virtual machines or in a serverless environment. These normally don’t have a lot of memory or disk space. Running out of either of these resources will cause Playwright (and potentially other processes) to fall over.

Is it possible to prune Playwright so that it plays better in a resource-constrained environment? Let’s see.

Read More →

Hasler Statistics

An image of three kayak paddlers in triangle formation viewed from in front of the foremost paddler.

The distances of Hasler kayak races for various divisions are nominally 4, 8 and 12 miles. However, the actual distances vary to some degree from one race venue to another. This makes it difficult to compare race times across different races. Using data from Paddle UK I attempt to estimate the actual distances.

Read More →

Headless Browser Hacks

Sometimes a site will work fine with Selenium or Playwright until you try headless mode. Then it might fling up some anti-bot mechanism. Or just stop responding altogether. Fortunately there are some simple things that you can do to work around this.

These are the approaches that I usually take.

Read More →

Iterating over a Paginated List of Links

A grader working on a construction site.

A common web crawler requirement is to iterate over a paginated list of links, following each link to retrieve detailed data. For example:

  • iterating over a list of profiles on LinkedIn and then retrieving employment history from each profile;
  • traversing a list of products on Amazon and then retrieving reviews for each product; or
  • navigating the list of items on BBC News and retrieving the full articles.
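
The pattern above can be sketched without touching a real site. This is a minimal illustration, not the approach from the full post: the `PAGES` and `DETAILS` dictionaries are a made-up in-memory "site", and `crawl`, `get_page` and `get_detail` are hypothetical names. In practice the two callables would wrap real HTTP requests or browser navigation.

```python
from typing import Callable, Iterator

# Hypothetical in-memory "site": each listing page has item links and an optional next page.
PAGES = {
    "/list?page=1": {"links": ["/item/a", "/item/b"], "next": "/list?page=2"},
    "/list?page=2": {"links": ["/item/c"], "next": None},
}
DETAILS = {"/item/a": "Detail A", "/item/b": "Detail B", "/item/c": "Detail C"}


def crawl(
    start: str,
    get_page: Callable[[str], dict],
    get_detail: Callable[[str], str],
) -> Iterator[tuple[str, str]]:
    """Walk the paginated list, yielding (link, detail) for every item."""
    url = start
    seen = set()
    while url is not None:
        page = get_page(url)
        for link in page["links"]:
            if link in seen:  # Skip links repeated across pages.
                continue
            seen.add(link)
            yield link, get_detail(link)
        url = page["next"]  # None on the last page ends the loop.


results = list(crawl("/list?page=1", PAGES.__getitem__, DETAILS.__getitem__))
```

Keeping the traversal separate from the fetching means the same loop can be exercised against stubbed pages like these before it is pointed at a live site.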

Read More →

Handling HTML Entities and Unicode

A robot hand using a paint scraper to remove paint from a wall.

What if your text data is contaminated with Unicode characters and HTML entities? Ideally you want your persisted data to be pristine. Metaphorically it should be prêt à manger (ready to eat). In principle I also want my text to be as simple as possible: ASCII characters, nothing else. This is sometimes achievable without the loss of too much information.
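
One way to approach this with only the Python standard library is sketched below. It is an illustration under assumptions rather than the method from the full post: `to_ascii` is a hypothetical helper that decodes HTML entities, decomposes accented characters and then discards whatever is still not ASCII.

```python
import html
import unicodedata


def to_ascii(text: str) -> str:
    """Decode HTML entities, then strip accents and drop any remaining non-ASCII."""
    decoded = html.unescape(text)
    # NFKD splits accented characters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", decoded)
    # Encoding with "ignore" discards the combining marks and anything else outside ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("pr&ecirc;t &agrave; manger"))  # pret a manger
```

Characters that have no ASCII decomposition are silently dropped by the final step, which is where information can be lost.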

Read More →

Scraping JSON-LD Data

A robot hand using a paint scraper to remove paint from a wall.

JSON-LD (JavaScript Object Notation for Linked Data) is a lightweight, flexible and standardised format intended to provide context and meaning to the data on a webpage. It’s easy and convenient for both humans and machines to read and write.
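
JSON-LD is usually embedded in a page inside `<script type="application/ld+json">` tags. The sketch below pulls it out using only the Python standard library. The `PAGE` content and the `JsonLdParser` class are made up for illustration; a real scraper would more likely use a dedicated HTML parsing package.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.chunks = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True
            self.chunks = []

    def handle_data(self, data):
        if self.in_jsonld:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            # Parse the accumulated script body as JSON.
            self.items.append(json.loads("".join(self.chunks)))
            self.in_jsonld = False


# Made-up page for illustration.
PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Kayak paddle"}
</script>
</head><body></body></html>
"""

parser = JsonLdParser()
parser.feed(PAGE)
print(parser.items[0]["name"])  # Kayak paddle
```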

Read More →

Web Scraper Testing

A crash test dummy sitting at a computer in a cosy study.

Site evolution. DOM drift. Selector decay. XPath extinction. Scraper rot. CAPTCHA catastrophe. Anti-bot apocalypse.

Inevitably even a carefully crafted web scraper will fail because the target site has changed in some way. Regular systematic testing is vital to ensure that you don’t lose valuable data.
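
A very simple form of testing is a health check on each scraped record. The sketch below assumes a hypothetical schema (`REQUIRED_FIELDS`) and a hypothetical `validate` helper; it flags records where a field has gone missing or empty, which is often the first symptom of a selector that no longer matches.

```python
REQUIRED_FIELDS = ("title", "price", "url")  # Hypothetical schema for a scraped record.


def validate(record: dict) -> list[str]:
    """Return a list of problems found in a scraped record (empty if it looks healthy)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"missing {field}")
    return problems


good = {"title": "Kayak paddle", "price": "49.99", "url": "https://example.com/paddle"}
# What a record might look like after a selector silently stops matching.
broken = {"title": "Kayak paddle", "price": "", "url": "https://example.com/paddle"}

print(validate(good))    # []
print(validate(broken))  # ['missing price']
```

Run on a schedule against a handful of known pages, a check like this turns a silent failure into a visible one.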

Read More →

Optimisation with Pyomo

Pyomo is another flexible Open Source optimisation modelling language for Python. It can be used to define, solve, and analyse a wide range of optimisation problems, including Linear Programming (LP), Mixed-Integer Programming (MIP), Nonlinear Programming (NLP), and problems involving differential equations.

📢 The book Hands-On Mathematical Optimization with Python (available free online) is an excellent resource on optimisation with Python and Pyomo.

Read More →

Optimisation with CVXPY

CVXPY is a powerful, Open Source optimisation modelling library for Python. It provides an interface for defining, solving, and analysing a wide range of convex optimisation problems, including Linear Programming (LP), Quadratic Programming (QP), Second-Order Cone Programming (SOCP), and Semidefinite Programming (SDP).

Read More →

Global versus Sequential Optimisation

We will be considering two types of optimisation problems: sequential optimisation and global optimisation. These approaches can be applied to the same problem but will generally yield distinctly different results. Depending on your objective one or the other might be the best fit for your problem.

Read More →

Optimisation Reference Problem

A concrete water tank surrounded by arid scenery typical of the Karoo.

I’m evaluating optimisation systems for application to a large scale solar energy optimisation project. My primary concerns are with efficiency, flexibility and usability. Ideally I’d like to evaluate all of them on a single, well defined problem. And, furthermore, that problem should at least resemble the solar energy project.

Read More →

What is a Proxy?

A reception desk at an old world hotel.

A proxy is a server or software that acts as an intermediary between a client (often a web browser) and one or more servers, typically on the internet. Proxies are used for a variety of purposes, including improving security, enhancing privacy, managing network traffic, and bypassing restrictions.
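
From the client's side, using a proxy mostly means telling the HTTP library where the intermediary lives. A minimal sketch with Python's standard library follows. The proxy address is a placeholder and `make_opener` is a hypothetical helper; no request is actually sent.

```python
import urllib.request

# Hypothetical proxy address; replace with a real one.
PROXY = "http://proxy.example.com:8080"


def make_opener(proxy: str) -> urllib.request.OpenerDirector:
    """Build an opener that sends HTTP and HTTPS requests via the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)


opener = make_opener(PROXY)
# opener.open("https://example.com/") would now be routed through the proxy.
```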

Read More →

Migrating from GitLab Pages to Vercel

A bird with a white body and dark cap against a sky with scattered clouds.

I recently migrated this blog from GitLab Pages to Vercel. There were two main reasons for the move:

  1. The blog was taking too long to build on GitLab Pages, which hindered efficient updates and added unnecessary delays to my workflow. Admittedly, this was partially my own doing since my build process was far too complicated.
  2. I wanted greater control over redirects (specifically the ability to redirect URLs that didn’t end in a slash to ones that did, which was apparently important for SEO purposes).

Read More →

Caching & Avoiding Duplication

An image of a computer on a wooden desk. On either side of the computer is a table lamp. The computer screen has an image of two cows.

Avoiding data duplication is a persistent challenge with acquiring data from websites or APIs. You can try to brute force it: pull the data again and then compare it locally to establish whether it’s fresh or stale. But there are other approaches that, if supported, can make this a lot simpler.
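
The teaser doesn't say which approaches it has in mind. One common candidate, assumed here, is the HTTP conditional request: send back the `ETag` or `Last-Modified` value from the previous response and let the server reply with `304 Not Modified` if nothing has changed. The helper names below are hypothetical.

```python
from typing import Optional


def conditional_headers(
    etag: Optional[str] = None, last_modified: Optional[str] = None
) -> dict:
    """Build request headers from validators saved on a previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def is_unchanged(status: int) -> bool:
    """A 304 Not Modified response means the cached copy is still current."""
    return status == 304


headers = conditional_headers(etag='"abc123"')
print(headers)  # {'If-None-Match': '"abc123"'}
```

This only helps when the server actually supports these validators; otherwise it will just return the full response every time.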

Read More →

Bypassing Cloudflare with Cloudscraper

An image of blue sky with clouds.

Cloudflare is a service that aims to improve the performance and security of websites. It operates as a content delivery network (CDN) to ensure faster load times and consequently better user experience. However, it also protects against online threats by filtering “malicious” traffic.

Web scraping requests are often deemed to be malicious (certainly by Cloudflare!) and thus blocked. There are various approaches to circumventing this, most of which involve running a live browser instance. For some applications though, this is a big hammer for a small nail. The cloudscraper package provides a lightweight option for dealing with Cloudflare and has an API similar to the requests package.

Read More →

Unpacking cURL Commands

An abstract pattern, reminiscent of Paisley.

cURL is the ultimate Swiss Army Knife for interacting with network protocols. But to be honest, I really only scratch the surface of what’s possible. Usually my workflow is something like this:

  1. Copy a cURL command from my browser’s Developer Tools.
  2. Test out the cURL command in a terminal.
  3. Convert the cURL command into a programming language (normally Python or R).
  4. Prosper.

I’m going to take a look at my favourite online tool for converting a cURL command to code and then see what other tools there are, focusing on Python and R as target languages.
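
To give a feel for what such a converter has to do, here is a deliberately tiny sketch that unpacks a cURL command into its parts. It is not one of the tools discussed in the full post: `parse_curl` is a hypothetical helper that understands only a handful of options and assumes any other flag takes no value.

```python
import shlex


def parse_curl(command: str) -> dict:
    """Extract the method, URL, headers and body from a simple cURL command."""
    tokens = shlex.split(command)
    if not tokens or tokens[0] != "curl":
        raise ValueError("not a cURL command")
    request = {"method": None, "url": None, "headers": {}, "data": None}
    i = 1
    while i < len(tokens):
        token = tokens[i]
        if token in ("-X", "--request"):
            request["method"] = tokens[i + 1]
            i += 2
        elif token in ("-H", "--header"):
            name, _, value = tokens[i + 1].partition(":")
            request["headers"][name.strip()] = value.strip()
            i += 2
        elif token in ("-d", "--data", "--data-raw"):
            request["data"] = tokens[i + 1]
            i += 2
        elif token.startswith("-"):
            i += 1  # Ignore flags this sketch doesn't understand (assumed to take no value).
        else:
            request["url"] = token
            i += 1
    if request["method"] is None:
        # cURL defaults to POST when a body is supplied, otherwise GET.
        request["method"] = "POST" if request["data"] is not None else "GET"
    return request


parsed = parse_curl(
    "curl 'https://example.com/api' -H 'Accept: application/json' --compressed"
)
print(parsed["method"], parsed["url"])  # GET https://example.com/api
```

The resulting dictionary maps fairly directly onto the arguments of an HTTP client call in Python or R.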

Read More →

Updates to the Big Book of R

Title image for the "Big Book of R".

The Big Book of R provides a comprehensive and ever-growing overview of a broad selection of R programming books. It was created and is maintained by Oscar Baruffa. The collection began with approximately 100 books and, with the help of contributions from the R community, has subsequently expanded to over 400. The books are grouped into topics such as geospatial, machine learning, statistics, text analysis, and many more. The Big Book of R is an excellent resource for anyone learning R programming, whether they are a beginner or an advanced user.

Read More →

Desert Island Docker: Python Edition

Over the years that I’ve been dabbling in public speaking I’ve generally developed a talk, presented it once and then moved on. However, I’ve noticed other speakers who give the same (or similar) talk at different events, where the talk evolves and improves over time.

Read More →