Andrew B. Collier / @datawookie


Iterating over a Paginated List of Links

A grader working on a construction site.

A common web crawler requirement is to iterate over a paginated list of links, following each link to retrieve detailed data. For example:

  • iterating over a list of profiles on LinkedIn and then retrieving employment history from each profile;
  • traversing a list of products on Amazon and then retrieving reviews for each product; or
  • navigating the list of items on BBC News and retrieving the full articles.
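
The pattern common to these examples can be sketched as two nested loops: an outer loop that follows "next page" links through the listing, and an inner step that follows each link to its detail page. This is only a minimal illustration under stated assumptions, not the approach taken in the full post. The `fetch` function, the `PAGES` and `DETAILS` dictionaries and the `links`/`next` keys are all hypothetical stand-ins for real HTTP requests and HTML parsing.

```python
from typing import Callable, Iterator, Optional

# Hypothetical in-memory "site": listing pages keyed by URL. Each listing page
# holds links to detail pages and the URL of the next listing page (or None).
PAGES = {
    "/list?page=1": {"links": ["/item/a", "/item/b"], "next": "/list?page=2"},
    "/list?page=2": {"links": ["/item/c"], "next": None},
}
DETAILS = {
    "/item/a": {"title": "A"},
    "/item/b": {"title": "B"},
    "/item/c": {"title": "C"},
}


def fetch(url: str) -> dict:
    """Stand-in for an HTTP request plus parsing (an assumption of this sketch)."""
    return PAGES[url] if url in PAGES else DETAILS[url]


def iter_links(start: str, fetch: Callable[[str], dict]) -> Iterator[str]:
    """Yield every detail link, following 'next' until the listing runs out."""
    url: Optional[str] = start
    seen = set()  # guard against pagination that loops back on itself
    while url is not None and url not in seen:
        seen.add(url)
        page = fetch(url)
        yield from page["links"]
        url = page["next"]


def crawl(start: str, fetch: Callable[[str], dict]) -> list:
    """Follow each link found in the paginated listing and collect the details."""
    return [fetch(link) for link in iter_links(start, fetch)]


results = crawl("/list?page=1", fetch)
print([r["title"] for r in results])  # ['A', 'B', 'C']
```

Keeping the pagination logic in a generator means the detail requests can be throttled, filtered or parallelised without touching the code that walks the listing.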

Read More →

Handling HTML Entities and Unicode

A robot hand using a paint scraper to remove paint from a wall.

What if your text data is contaminated with Unicode characters and HTML entities? Ideally you want your persisted data to be pristine. Metaphorically it should be prêt à manger (ready to eat). In principle I also want my text to be as simple as possible: ASCII characters, nothing else. This is sometimes achievable without the loss of too much information.
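
One way to approach this with only the Python standard library is sketched below. It is an illustration under assumptions rather than the method from the full post: the helper name `to_ascii` is made up, and the strategy (decode entities, decompose with NFKD, discard whatever is still not ASCII) is just one of several possible choices.

```python
import html
import unicodedata


def to_ascii(text: str) -> str:
    """Decode HTML entities, then reduce the text to plain ASCII where possible."""
    # 1. Turn entities such as &amp; or &eacute; into the characters they encode.
    text = html.unescape(text)
    # 2. Decompose accented characters into base letter + combining mark (NFKD).
    text = unicodedata.normalize("NFKD", text)
    # 3. Drop anything that is still not ASCII (combining marks, emoji, ...).
    return text.encode("ascii", "ignore").decode("ascii")


print(to_ascii("pr&ecirc;t &agrave; manger &amp; caf\u00e9"))  # pret a manger & cafe
```

The final step is where information can be lost: characters with no ASCII decomposition (for example `ø` or `ß`) are silently dropped rather than transliterated.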

Read More →

Scraping JSON-LD Data

A robot hand using a paint scraper to remove paint from a wall.

JSON-LD (JavaScript Object Notation for Linked Data) is a lightweight, flexible and standardised format intended to provide context and meaning to the data on a webpage. It’s easy and convenient for both humans and machines to read and write.
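
JSON-LD is normally embedded in a page inside a `<script type="application/ld+json">` element, so extracting it amounts to finding those elements and parsing their contents as JSON. The sketch below uses only the standard library. The `JSONLDParser` class and the `PAGE` fragment are made up for illustration and are not taken from the full post.

```python
import json
from html.parser import HTMLParser


class JSONLDParser(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self.in_jsonld = False
        self.chunks = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self.in_jsonld = True
            self.chunks = []

    def handle_data(self, data):
        # Script contents may arrive in several pieces, so accumulate them.
        if self.in_jsonld:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self.in_jsonld:
            self.in_jsonld = False
            self.items.append(json.loads("".join(self.chunks)))


# Hypothetical page fragment, made up for illustration.
PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product", "name": "Widget"}
</script>
</head><body></body></html>
"""

parser = JSONLDParser()
parser.feed(PAGE)
print(parser.items[0]["@type"], parser.items[0]["name"])  # Product Widget
```

A page can carry several JSON-LD blocks, which is why the parser collects a list rather than a single object.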

Read More →

Web Scraper Testing

A crash test dummy sitting at a computer in a cosy study.

Site evolution. DOM drift. Selector decay. XPath extinction. Scraper rot. CAPTCHA catastrophe. Anti-bot apocalypse.

Inevitably even a carefully crafted web scraper will fail because the target site has changed in some way. Regular systematic testing is vital to ensure that you don’t lose valuable data.

Read More →

Optimisation with Pyomo

Pyomo is another flexible Open Source optimisation modelling language for Python. It can be used to define, solve, and analyse a wide range of optimisation problems, including Linear Programming (LP), Mixed-Integer Programming (MIP), and Nonlinear Programming (NLP), as well as problems involving differential equations.

📢 The book Hands-On Mathematical Optimization with Python (available free online) is an excellent resource on optimisation with Python and Pyomo.

Read More →

Optimisation with CVXPY

CVXPY is a powerful, Open Source optimisation modelling library for Python. It provides an interface for defining, solving, and analysing a wide range of convex optimisation problems, including Linear Programming (LP), Quadratic Programming (QP), Second-Order Cone Programming (SOCP), and Semidefinite Programming (SDP).

Read More →

Global versus Sequential Optimisation

We will be considering two approaches to optimisation: sequential and global. Both can be applied to the same problem but will generally yield distinctly different results. Depending on your objective, one or the other might be the better fit for your problem.
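
The teaser does not define the two terms, so the sketch below rests on an assumed reading: "sequential" means making the best available choice at each step in turn, and "global" means evaluating all steps jointly. The toy problem (a maximum-sum path down a small triangle of numbers) and the function names are invented for illustration and are not from the full post.

```python
from itertools import product

# Toy problem: start at the top and, at each row, move to the value directly
# below or below-right. The goal is to maximise the sum along the path.
TRIANGLE = [
    [1],
    [2, 1],
    [1, 1, 9],
]


def sequential(triangle):
    """Pick the best option at each step, given the choices already made."""
    pos, total = 0, triangle[0][0]
    for row in triangle[1:]:
        # Only the two neighbours below the current position are reachable.
        pos = max((pos, pos + 1), key=lambda i: row[i])
        total += row[pos]
    return total


def global_best(triangle):
    """Evaluate every complete path and keep the best total."""
    best = None
    for moves in product((0, 1), repeat=len(triangle) - 1):
        pos, total = 0, triangle[0][0]
        for row, move in zip(triangle[1:], moves):
            pos += move
            total += row[pos]
        best = total if best is None else max(best, total)
    return best


print(sequential(TRIANGLE), global_best(TRIANGLE))  # 4 11
```

The step-by-step strategy takes the 2 in the second row and can then no longer reach the 9, whereas the exhaustive search accepts a worse second step to get a better overall result. Exhaustive enumeration is only practical for tiny problems like this one.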

Read More →

Optimisation Reference Problem

A concrete water tank surrounded by arid scenery typical of the Karoo.

I’m evaluating optimisation systems for application to a large-scale solar energy optimisation project. My primary concerns are efficiency, flexibility and usability. Ideally I’d like to evaluate all of them on a single, well-defined problem that at least resembles the solar energy project.

Read More →