Training for London to Brighton 2025

I’m running the London to Brighton Challenge (2025) to raise funds for the Alzheimer’s Society. I’ll be tracking my high-level training progress here.
Read More →
A common web crawler requirement is to iterate over a paginated list of links, following each link to retrieve detailed data. For example:
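As a minimal sketch of that pattern, using only the standard library's html.parser: the page markup, the "item"/"next" link classes, and the in-memory PAGES dict (standing in for real HTTP fetches) are all illustrative assumptions.

```python
from html.parser import HTMLParser

# Hypothetical paginated listing: each page links to detail pages and
# (optionally) a "next" page. In a real crawler these would be HTTP fetches.
PAGES = {
    "/list?page=1": '<a class="item" href="/item/1">1</a>'
                    '<a class="item" href="/item/2">2</a>'
                    '<a class="next" href="/list?page=2">Next</a>',
    "/list?page=2": '<a class="item" href="/item/3">3</a>',
}

class LinkExtractor(HTMLParser):
    """Collect item links and the next-page link from a listing page."""
    def __init__(self):
        super().__init__()
        self.items, self.next_page = [], None

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if attrs.get("class") == "item":
            self.items.append(attrs["href"])
        elif attrs.get("class") == "next":
            self.next_page = attrs["href"]

def crawl(start):
    """Follow the pagination chain, yielding every item link."""
    url = start
    while url is not None:
        parser = LinkExtractor()
        parser.feed(PAGES[url])  # swap in a real HTTP GET here
        yield from parser.items
        url = parser.next_page

print(list(crawl("/list?page=1")))  # ['/item/1', '/item/2', '/item/3']
```

The loop terminates when a page has no "next" link, which is the usual signal that the final page has been reached.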
What if your text data is contaminated with Unicode characters and HTML entities? Ideally you want your persisted data to be pristine. Metaphorically it should be prêt à manger (ready to eat). In principle I also want my text to be as simple as possible: ASCII characters, nothing else. This is sometimes achievable without the loss of too much information.
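The standard library can get most of the way there: html.unescape decodes HTML entities, and NFKD normalisation splits accented characters so that the accents can be dropped. The helper name below is mine, not from the post.

```python
import html
import unicodedata

def to_plain_ascii(text: str) -> str:
    """Decode HTML entities, strip accents, and drop anything non-ASCII."""
    text = html.unescape(text)                  # "&ecirc;" -> "ê"
    text = unicodedata.normalize("NFKD", text)  # "ê" -> "e" + combining accent
    return text.encode("ascii", "ignore").decode("ascii")

print(to_plain_ascii("pr&ecirc;t &agrave; manger"))  # pret a manger
```

Note that this is lossy by design: accents and any truly non-Latin characters are discarded, which is acceptable only when, as above, not too much information is lost.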
Read More →

JSON-LD (JavaScript Object Notation for Linked Data) is a lightweight, flexible and standardised format intended to provide context and meaning to the data on a webpage. It’s easy and convenient for both humans and machines to read and write.
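Because JSON-LD is embedded in an ordinary script tag, it can be extracted and parsed with the standard library alone. The page fragment below is illustrative, not taken from a real site.

```python
import json
import re

# A minimal page fragment with embedded JSON-LD (structure is illustrative).
HTML = '''
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Article",
 "headline": "JSON-LD basics", "author": {"@type": "Person", "name": "Alice"}}
</script>
'''

# Pull out the contents of the JSON-LD script block and parse it as JSON.
match = re.search(
    r'<script type="application/ld\+json">(.*?)</script>', HTML, re.DOTALL
)
data = json.loads(match.group(1))

print(data["@type"])            # Article
print(data["author"]["name"])   # Alice
```

Once parsed, the data is a plain Python dictionary, so no HTML traversal is needed to get at the structured fields.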
Read More →

The previous post in this series considered the mocking capabilities in the unittest package. Now we’ll look at what it offers for patching.
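As a small illustration of the idea, unittest.mock.patch temporarily replaces a named attribute and restores it when the context exits; the roll() function here is a contrived stand-in for non-deterministic code.

```python
import random
from unittest.mock import patch

def roll() -> int:
    """A function whose behaviour depends on something non-deterministic."""
    return random.randint(1, 6)

# patch() swaps random.randint for a mock while the context is active,
# then restores the original on exit.
with patch("random.randint", return_value=4) as fake:
    assert roll() == 4
    fake.assert_called_once_with(1, 6)

# Outside the context the real randint is back.
assert 1 <= roll() <= 6
```

The target is given as a string naming where the attribute is looked up, which is what makes patching work without touching the code under test.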
Previous posts in this series used the responses and vcr packages to mock HTTP responses. Now we’re going to look at the capabilities for mocking in the unittest package, which is part of the Python Standard Library. Relative to responses and vcr this functionality is rather low-level. There’s more work required, but as a result there’s potential for greater control.
In the previous post I used the responses package to mock HTTP responses, producing tests that were quick and stable. Now I’ll look at an alternative approach to mocking using VCR.py.
As mentioned in the introduction to web scraper testing, unit tests should be self-contained and not involve direct access to the target website. The responses package allows you to easily mock the responses returned by a website, so it’s well suited to the job. The package is stable and well documented.
Site evolution. DOM drift. Selector decay. XPath extinction. Scraper rot. CAPTCHA catastrophe. Anti-bot apocalypse.
Inevitably even a carefully crafted web scraper will fail because the target site has changed in some way. Regular systematic testing is vital to ensure that you don’t lose valuable data.
Read More →

The Zyte API implements session management, which makes it possible to emulate a browser session when interacting with a site via the API.
Read More →

In a previous post I looked at various ways to use the Zyte API to retrieve web content. Now I’m going to delve into options for managing cookies via the Zyte API.
Read More →

Zyte is a data extraction platform, useful for web scraping and data processing at scale. It’s intended to simplify data collection and, based on my experience, it certainly does!
Read More →

Quick notes on the process of installing the CPLEX optimiser.
Read More →

Quick notes on the process of installing the MOSEK optimiser.
Read More →

Pyomo is another flexible Open Source optimisation modelling language for Python. It can be used to define, solve, and analyse a wide range of optimisation problems, including Linear Programming (LP), Mixed-Integer Programming (MIP), Nonlinear Programming (NLP), and differential equations.
📢 The book Hands-On Mathematical Optimization with Python (available free online) is an excellent resource on optimisation with Python and Pyomo.
Read More →

CVXPY is a powerful, Open Source optimisation modelling library for Python. It provides an interface for defining, solving, and analysing a wide range of convex optimisation problems, including Linear Programming (LP), Quadratic Programming (QP), Second-Order Cone Programming (SOCP), and Semidefinite Programming (SDP).
Read More →

SciPy is a general-purpose scientific computing library for Python, with an optimize module for optimisation.
We will be considering two types of optimisation problems: sequential optimisation and global optimisation. These approaches can be applied to the same problem but will generally yield distinctly different results. Depending on your objective, one or the other might be the best fit for your problem.
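The difference is easy to see on a contrived objective with one shallow and one deep minimum. This sketch assumes SciPy is installed; the function itself is made up for illustration, with its global minimum placed exactly at x = 1.

```python
from scipy.optimize import differential_evolution, minimize

def objective(x):
    # Contrived function with a shallow local minimum near x = -0.8
    # and the global minimum exactly at x = 1.
    return (x[0] ** 2 - 1) ** 2 + 0.3 * (x[0] - 1) ** 2

# A local optimiser started at x0 = -1.5 converges to the nearest basin,
# so it finds the shallow left-hand minimum.
local = minimize(objective, x0=[-1.5])

# A global optimiser searches the whole interval and finds x = 1.
best = differential_evolution(objective, bounds=[(-2, 2)])

print(local.x[0] < 0, abs(best.x[0] - 1) < 0.05)  # True True
```

This is the sense in which the two approaches yield distinctly different results on the same problem: the answer from the local method depends entirely on the starting point.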
Read More →

I’m evaluating optimisation systems for application to a large-scale solar energy optimisation project. My primary concerns are efficiency, flexibility and usability. Ideally I’d like to evaluate all of them on a single, well-defined problem. Furthermore, that problem should at least resemble the solar energy project.
Read More →

In a previous post I looked at the HTTP request headers used to manage browser caching. In this post I’ll look at a real-world example. It’s a rather deep dive into something that’s actually quite simple. However, I find it helpful for my understanding to pick things apart and understand how all of the components fit together.
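To show the kind of picking apart involved: the header values below are made up for illustration, and parse_cache_control is my own small helper, but the parsing uses only the standard library.

```python
from email.utils import parsedate_to_datetime

# Illustrative response headers (not captured from a real site).
headers = {
    "Date": "Tue, 01 Oct 2024 10:00:00 GMT",
    "Cache-Control": "public, max-age=3600",
    "Last-Modified": "Tue, 01 Oct 2024 09:00:00 GMT",
}

def parse_cache_control(value):
    """Split a Cache-Control header into a {directive: value} dict."""
    directives = {}
    for part in value.split(","):
        name, _, arg = part.strip().partition("=")
        directives[name] = int(arg) if arg.isdigit() else None
    return directives

cc = parse_cache_control(headers["Cache-Control"])
date = parsedate_to_datetime(headers["Date"])
modified = parsedate_to_datetime(headers["Last-Modified"])

print(cc["max-age"])                      # 3600
print((date - modified).total_seconds())  # 3600.0
```

With the directives in a dictionary and the dates as datetime objects, questions like "is this cached copy still fresh?" reduce to simple arithmetic.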
Read More →