Andrew B. Collier / @datawookie





Minecraft Worlds

In Minecraft, worlds are self-contained game environments that players can build, explore and modify. Each world is generated from a seed, which is a number (or string) that the game uses to generate the world.

Read More →

Iterating over a Paginated List of Links

A grader working on a construction site.

A common web crawler requirement is to iterate over a paginated list of links, following each link to retrieve detailed data. For example:

  • iterating over a list of profiles on LinkedIn and then retrieving employment history from each profile;
  • traversing a list of products on Amazon and then retrieving reviews for each product; or
  • navigating the list of items on BBC News and retrieving the full articles.
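
The pattern common to all three examples can be sketched roughly as follows. This is a minimal illustration rather than a complete crawler: `crawl`, `fetch_listing` and `fetch_detail` are hypothetical names, and the in-memory `FAKE_LISTINGS` dictionary stands in for real HTTP requests so that the loop can be run without touching a live site.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

# Assumed shape of a parsed listing page: the item links it contains,
# plus the URL of the next page (None on the last page).
Listing = Tuple[List[str], Optional[str]]


def crawl(
    start_url: str,
    fetch_listing: Callable[[str], Listing],
    fetch_detail: Callable[[str], Dict],
) -> Iterator[Dict]:
    """Walk the paginated list, following every item link on every page."""
    url: Optional[str] = start_url
    seen = set()
    # Stop on the last page, or if pagination loops back on itself.
    while url is not None and url not in seen:
        seen.add(url)
        links, next_url = fetch_listing(url)
        for link in links:
            yield fetch_detail(link)
        url = next_url


# Hypothetical stand-in for a site with two listing pages.
FAKE_LISTINGS: Dict[str, Listing] = {
    "/items?page=1": (["/item/a", "/item/b"], "/items?page=2"),
    "/items?page=2": (["/item/c"], None),
}

results = list(
    crawl(
        "/items?page=1",
        fetch_listing=lambda url: FAKE_LISTINGS[url],
        fetch_detail=lambda link: {"url": link},
    )
)
print(results)
```

Passing the two fetch functions in as arguments keeps the pagination logic separate from the site-specific parsing, which would differ for each of the sites mentioned above.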
Read More →

Handling HTML Entities and Unicode

A robot hand using a paint scraper to remove paint from a wall.

What if your text data is contaminated with Unicode characters and HTML entities? Ideally you want your persisted data to be pristine. Metaphorically it should be prêt à manger (ready to eat). In principle I also want my text to be as simple as possible: ASCII characters, nothing else. This is sometimes achievable without the loss of too much information.
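
One possible approach, using only the Python standard library, is sketched below. The function name `to_plain_ascii` is an assumption made for illustration. Note that the final step is lossy: any character without an ASCII base form is silently dropped.

```python
import html
import unicodedata


def to_plain_ascii(text: str) -> str:
    """Decode HTML entities, then reduce the result to ASCII where possible."""
    # Turn entities such as &amp; or &ecirc; into the characters they stand for.
    decoded = html.unescape(text)
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", decoded)
    # Discard everything outside ASCII, including the combining marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_plain_ascii("pr&ecirc;t &agrave; manger"))
# → pret a manger
```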

Read More →

Scraping JSON-LD Data

A robot hand using a paint scraper to remove paint from a wall.

JSON-LD (JavaScript Object Notation for Linked Data) is a lightweight, flexible and standardised format intended to provide context and meaning to the data on a webpage. It’s easy and convenient for both humans and machines to read and write.
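
JSON-LD is normally embedded in a page inside `<script type="application/ld+json">` tags, so extracting it amounts to finding those tags and parsing their contents as JSON. A minimal sketch using only the standard library follows. The class name `JSONLDExtractor`, the helper `extract_jsonld` and the `SAMPLE_PAGE` markup are all made up for illustration, and a real page might need error handling for malformed JSON.

```python
import json
from html.parser import HTMLParser


class JSONLDExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD script tag."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script contents arrive as raw text, possibly in several chunks.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


def extract_jsonld(page: str) -> list:
    parser = JSONLDExtractor()
    parser.feed(page)
    return parser.items


# Made-up page used only to exercise the extractor.
SAMPLE_PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article", "headline": "Example"}
</script>
<script>console.log("not JSON-LD");</script>
</head><body></body></html>
"""

items = extract_jsonld(SAMPLE_PAGE)
print(items[0]["@type"], items[0]["headline"])
```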

Read More →

Web Scraper Testing

A crash test dummy sitting at a computer in a cosy study.

Site evolution. DOM drift. Selector decay. XPath extinction. Scraper rot. CAPTCHA catastrophe. Anti-bot apocalypse.

Inevitably even a carefully crafted web scraper will fail because the target site has changed in some way. Regular systematic testing is vital to ensure that you don’t lose valuable data.
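
As a rough illustration of what such a test could look like, the sketch below checks a parsing function against two small HTML fixtures: one with the expected layout and one where the markup has drifted. The parser, the `h1.title` selector and both fixtures are hypothetical. In practice the fixtures would be saved copies of real pages and the test functions would be run by a test runner such as pytest.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Pull the text out of <h1 class="title">, the selector assumed here."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and dict(attrs).get("class") == "title":
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self.title = (self.title or "") + data

    def handle_endtag(self, tag):
        if tag == "h1":
            self._in_title = False


def parse_title(page: str):
    parser = TitleParser()
    parser.feed(page)
    return parser.title.strip() if parser.title else None


# Hypothetical fixtures: the layout the scraper expects, and a changed layout.
FIXTURE = '<html><body><h1 class="title"> Example product </h1></body></html>'
DRIFTED = '<html><body><h2 class="heading">Example product</h2></body></html>'


def test_parse_title():
    assert parse_title(FIXTURE) == "Example product"


def test_drift_is_detected():
    # A missing title should surface as None rather than as bad data.
    assert parse_title(DRIFTED) is None
```

Running the same assertions on a schedule against freshly downloaded pages would turn a silent selector failure into a visible test failure.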

Read More →

Optimisation with Pyomo

Pyomo is another flexible Open Source optimisation modelling language for Python. It can be used to define, solve and analyse a wide range of optimisation problems, including Linear Programming (LP), Mixed-Integer Programming (MIP), Nonlinear Programming (NLP) and problems involving differential equations.

📢 The book Hands-On Mathematical Optimization with Python (available free online) is an excellent resource on optimisation with Python and Pyomo.

Read More →