Camoufox in Docker
My scrapers often run in a serverless environment. If I’m using Camoufox then that needs to be baked into my Docker image too.
Read More →Playwright launches a browser. And browsers can be resource-hungry beasts.
I often run Playwright on small, resource-constrained virtual machines or in a serverless environment. These normally don’t have a lot of memory or disk space. Running out of either of these resources will cause Playwright (and potentially other processes) to fall over.
Is it possible to prune Playwright so that it plays better in a resource-constrained environment? Let’s see.
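To give a flavour, here’s a rough sketch of one of the knobs worth turning: how the browser is launched. The flags below are illustrative choices rather than a definitive recipe.

```python
from playwright.sync_api import sync_playwright

# A minimal sketch: launch Chromium with flags that reduce memory pressure.
# The specific flags are illustrative, not a prescribed set.
with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        args=[
            "--disable-dev-shm-usage",  # avoid the small /dev/shm in many containers
            "--disable-gpu",
            "--no-sandbox",
        ],
    )
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```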
Read More →The distances of Hasler kayak races for various divisions are nominally 4, 8 and 12 miles. However, the actual distances vary to some degree from one race venue to another. This makes it difficult to compare race times across different races. Using data from Paddle UK I attempt to estimate the actual distances.
Read More →Sometimes you’ll want to initiate a Selenium or Playwright session with an existing set of cookies. My approach to this is to retrieve those cookies using a browser and save them to a file so that I can easily load them into my script.
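With Playwright, for example, the saved cookies can be loaded into a fresh browser context. A minimal sketch (cookies.json is a hypothetical file name):

```python
import json
from playwright.sync_api import sync_playwright

# Load cookies previously exported to a JSON file into a new browser context.
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    with open("cookies.json") as f:
        cookies = json.load(f)
    # Each cookie dict needs at least name, value and either url or domain/path.
    context.add_cookies(cookies)
    page = context.new_page()
    page.goto("https://example.com")
    browser.close()
```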
Read More →Sometimes a site will work fine with Selenium or Playwright until you try headless mode. Then it might fling up some anti-bot mechanism. Or just stop responding altogether. Fortunately, there are some simple things that you can do to work around this.
These are the approaches that I usually take.
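As a taste, here’s a hedged sketch of one such tweak with Selenium: give headless Chrome a realistic user agent and window size. The values are illustrative, not a guaranteed fix.

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Illustrative tweaks only: make headless Chrome look more like a normal
# desktop browser by setting a plausible user agent and window size.
options = Options()
options.add_argument("--headless=new")
options.add_argument("--window-size=1920,1080")
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
print(driver.title)
driver.quit()
```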
Read More →Notes to self on using pytest.
I previously looked at the NetNut proxies. This post reviews the Webshare proxy service.
Read More →In the previous post we considered a few approaches to testing a Selenium web scraper. Now we’ll do the same for web scrapers using Playwright.
Read More →In previous posts we considered a few approaches for testing scrapers targeting static sites. Sometimes you won’t be able to get away with these static tools and you’ll be forced to use browser automation. In this post I’ll look at some options for testing a Selenium web scraper.
Read More →A common web crawler requirement is to iterate over a paginated list of links, following each link to retrieve detailed data. For example:
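Something along these lines (a rough sketch only; the URL and selectors are hypothetical):

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com/catalogue"  # hypothetical site

# Walk the paginated listing, following each detail link on every page.
page = 1
while True:
    listing = requests.get(f"{BASE_URL}/page-{page}.html")
    if listing.status_code == 404:
        break  # no more pages
    soup = BeautifulSoup(listing.text, "html.parser")
    for link in soup.select("article a"):  # selector is illustrative
        detail = requests.get(requests.compat.urljoin(listing.url, link["href"]))
        # ... parse the detail page here ...
    page += 1
```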
What if your text data is contaminated with Unicode characters and HTML entities? Ideally you want your persisted data to be pristine. Metaphorically it should be prêt à manger (ready to eat). In principle I also want my text to be as simple as possible: ASCII characters, nothing else. This is sometimes achievable without the loss of too much information.
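A minimal sketch of the sort of clean-up I have in mind, using only the Python Standard Library:

```python
import html
import unicodedata

def to_plain_ascii(text: str) -> str:
    """Unescape HTML entities, then strip down to ASCII where possible."""
    text = html.unescape(text)                      # "&amp;" -> "&", "&eacute;" -> "é"
    text = unicodedata.normalize("NFKD", text)      # decompose accented characters
    return text.encode("ascii", "ignore").decode()  # drop anything non-ASCII

print(to_plain_ascii("pr&ecirc;t &agrave; manger"))  # 'pret a manger'
```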
Read More →JSON-LD (JavaScript Object Notation for Linked Data) is a lightweight, flexible and standardised format intended to provide context and meaning to the data on a webpage. It’s easy and convenient for both humans and machines to read and write.
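Pulling JSON-LD out of a page is straightforward. A quick sketch (the URL is a placeholder):

```python
import json
import requests
from bs4 import BeautifulSoup

# Extract all JSON-LD blocks from a page.
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

for script in soup.find_all("script", type="application/ld+json"):
    data = json.loads(script.string)  # assumes well-formed JSON in each block
    print(data.get("@type"), data)
```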
Read More →The previous post in this series considered the mocking capabilities in the unittest package. Now we’ll look at what it offers for patching.
Previous posts in this series used the responses and vcr packages to mock HTTP responses. Now we’re going to look at the capabilities for mocking in the unittest package, which is part of the Python Standard Library. Relative to responses and vcr this functionality is rather low-level. There’s more work required, but as a result there’s potential for greater control.
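A small sketch of what that looks like in practice (get_title() is a hypothetical function under test):

```python
from unittest import TestCase, mock

import requests

def get_title(url: str) -> str:
    return requests.get(url).json()["title"]

class TestGetTitle(TestCase):
    @mock.patch("requests.get")
    def test_get_title(self, mock_get):
        # Build the fake response by hand: more work than responses or vcr,
        # but complete control over what the "server" returns.
        mock_get.return_value.json.return_value = {"title": "Mocked!"}
        self.assertEqual(get_title("https://example.com"), "Mocked!")
        mock_get.assert_called_once_with("https://example.com")
```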
In the previous post I used the responses package to mock HTTP responses, producing tests that were quick and stable. Now I’ll look at an alternative approach to mocking using VCR.py.
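A minimal sketch of VCR.py in action (the cassette path is arbitrary):

```python
import requests
import vcr

# The first run records the real HTTP interaction to a cassette file;
# subsequent runs replay it, so the test no longer touches the network.
@vcr.use_cassette("fixtures/vcr_cassettes/example.yaml")
def test_get_example():
    response = requests.get("https://example.com")
    assert response.status_code == 200
```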
As mentioned in the introduction to web scraper testing, unit tests should be self-contained and not involve direct access to the target website. The responses package allows you to easily mock the responses returned by a website, so it’s well suited to the job. The package is stable and well documented.
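A minimal sketch (the endpoint and payload are hypothetical):

```python
import requests
import responses

@responses.activate
def test_get_user():
    # Register the canned response; no real request leaves the machine.
    responses.add(
        responses.GET,
        "https://api.example.com/user/1",
        json={"id": 1, "name": "Alice"},
        status=200,
    )
    user = requests.get("https://api.example.com/user/1").json()
    assert user["name"] == "Alice"
```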
Site evolution. DOM drift. Selector decay. XPath extinction. Scraper rot. CAPTCHA catastrophe. Anti-bot apocalypse.
Inevitably even a carefully crafted web scraper will fail because the target site has changed in some way. Regular systematic testing is vital to ensure that you don’t lose valuable data.
Read More →The Zyte API implements session management, which makes it possible to emulate a browser session when interacting with a site via the API.
Read More →In a previous post I looked at various ways to use the Zyte API to retrieve web content. Now I’m going to delve into options for managing cookies via the Zyte API.
Read More →Zyte is a data extraction platform, useful for web scraping and data processing at scale. It’s intended to simplify data collection and, based on my experience, it certainly does!
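To give a flavour, here’s a bare-bones request sketched from the public documentation (the API key and URL are placeholders):

```python
from base64 import b64decode

import requests

API_KEY = "your-zyte-api-key"  # placeholder

# A minimal Zyte API extraction request; the response body is base64 encoded.
response = requests.post(
    "https://api.zyte.com/v1/extract",
    auth=(API_KEY, ""),
    json={"url": "https://example.com", "httpResponseBody": True},
)
html = b64decode(response.json()["httpResponseBody"]).decode()
print(html[:200])
```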
Read More →Quick notes on the process of installing the CPLEX optimiser.
Read More →Quick notes on the process of installing the MOSEK optimiser.
Read More →Pyomo is another flexible Open Source optimisation modelling language for Python. It can be used to define, solve, and analyse a wide range of optimisation problems, including Linear Programming (LP), Mixed-Integer Programming (MIP), Nonlinear Programming (NLP), and differential equations.
📢 The book Hands-On Mathematical Optimization with Python (available free online) is an excellent resource on optimisation with Python and Pyomo.
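A toy Pyomo model, just to illustrate the modelling style (it assumes a solver such as GLPK is installed and on the PATH):

```python
from pyomo.environ import (
    ConcreteModel, Var, Objective, Constraint,
    SolverFactory, NonNegativeReals, maximize,
)

# A tiny LP: maximise 3x + 2y subject to x + y <= 10, x, y >= 0.
model = ConcreteModel()
model.x = Var(domain=NonNegativeReals)
model.y = Var(domain=NonNegativeReals)
model.profit = Objective(expr=3 * model.x + 2 * model.y, sense=maximize)
model.capacity = Constraint(expr=model.x + model.y <= 10)

SolverFactory("glpk").solve(model)
print(model.x(), model.y())
```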
Read More →CVXPY is a powerful, Open Source optimization modelling library for Python. It provides an interface for defining, solving, and analysing a wide range of convex optimization problems, including Linear Programming (LP), Quadratic Programming (QP), Second-Order Cone Programming (SOCP), and Semidefinite Programming (SDP).
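A toy CVXPY problem to illustrate the modelling style:

```python
import cvxpy as cp

# The same tiny LP: maximise 3x + 2y subject to x + y <= 10, x, y >= 0.
x = cp.Variable(nonneg=True)
y = cp.Variable(nonneg=True)

problem = cp.Problem(
    cp.Maximize(3 * x + 2 * y),
    [x + y <= 10],
)
problem.solve()
print(problem.value, x.value, y.value)
```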
Read More →SciPy is a general-purpose scientific computing library for Python, with an optimize module for optimisation.
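A minimal example of the optimize module at work:

```python
from scipy.optimize import minimize

# Minimise a simple quadratic with a single minimum at (1, 2.5).
def objective(v):
    x, y = v
    return (x - 1) ** 2 + (y - 2.5) ** 2

result = minimize(objective, x0=[0.0, 0.0])
print(result.x)  # approximately [1.0, 2.5]
```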
We will be considering two types of optimisation problems: sequential optimisation and global optimisation. These approaches can be applied to the same problem but will generally yield distinctly different results. Depending on your objective one or the other might be the best fit for your problem.
Read More →I’m evaluating optimisation systems for application to a large scale solar energy optimisation project. My primary concerns are with efficiency, flexibility and usability. Ideally I’d like to evaluate all of them on a single, well defined problem. And, furthermore, that problem should at least resemble the solar energy project.
Read More →In a previous post I looked at the HTTP request headers used to manage browser caching. In this post I’ll look at a real world example. It’s a rather deep dive into something that’s actually quite simple. However, I find it helpful for my understanding to pick things apart and understand how all of the components fit together.
Read More →In this post I’ll be testing the proxy service provided by NetNut. For a bit of context take a look at my What is a Proxy? post.
Read More →A proxy is a server or software that acts as an intermediary between a client (often a web browser) and one or more servers, typically on the internet. Proxies are used for a variety of purposes, including improving security, enhancing privacy, managing network traffic, and bypassing restrictions.
Read More →I recently migrated this blog from GitLab Pages to Vercel. There were two main reasons for the move:
For a side project I needed to scrape data for the NYSE Composite Index going back as far as possible.
Read More →In a previous post I looked at retrieving a list of assets from the Alpaca API using the {alpacar} R package. Now we’ll explore how to retrieve historical and current price data.
How to list assets available to trade via the Alpaca API using the {alpacar} R package.
The {alpacar} package for R is a wrapper around the Alpaca API. API documentation can be found here. In this introductory post I show how to install and load the package, then authenticate with the API and retrieve account information.
A few days ago I wrote about a scraper for gathering economic calendar data. Well, I’m back again to write about another aspect of the same project: acquiring earnings calendar data.
Read More →Avoiding data duplication is a persistent challenge with acquiring data from websites or APIs. You can try to brute force it: pull the data again and then compare it locally to establish whether it’s fresh or stale. But there are other approaches that, if supported, can make this a lot simpler.
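One of the simpler approaches, where the server supports it, is a conditional GET using ETags. A sketch (the URL is a placeholder):

```python
import requests

URL = "https://example.com/data.json"  # placeholder

# If the server supports ETags it can reply 304 Not Modified, so unchanged
# data never needs to be downloaded (or deduplicated) again.
first = requests.get(URL)
etag = first.headers.get("ETag")

second = requests.get(URL, headers={"If-None-Match": etag} if etag else {})
if second.status_code == 304:
    print("Data unchanged, nothing to do.")
else:
    print("Fresh data retrieved.")
```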
Read More →If you use Selenium for browser automation then at some stage you are likely to need to download a file by clicking a button or link on a website. Sometimes this just works. Other times it doesn’t.
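One tweak that often helps with Chrome is setting explicit download preferences. A hedged sketch (the download directory is a placeholder):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Tell Chrome exactly where to save downloads and not to prompt.
options = Options()
options.add_experimental_option(
    "prefs",
    {
        "download.default_directory": "/tmp/downloads",
        "download.prompt_for_download": False,
    },
)

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
# driver.find_element(...).click()  # click the download link or button here
driver.quit()
```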
Read More →I needed an offline copy of an economic calendar with all of the major international economic events. After grubbing around the internet I found the Economic Calendar on Myfxbook which had everything that I needed.
Read More →A few months ago I listened to an episode on the Founder’s Journal podcast that reviewed an essay, The Opportunity Cost of Everything, by Jack Raines. If you haven’t read it, then I suggest you invest 10 minutes in doing so. It will be time well spent.
Read More →Cloudflare is a service that aims to improve the performance and security of websites. It operates as a content delivery network (CDN) to ensure faster load times and consequently a better user experience. However, it also protects against online threats by filtering “malicious” traffic.
Web scraping requests are often deemed to be malicious (certainly by Cloudflare!) and thus blocked. There are various approaches to circumventing this, most of which involve running a live browser instance. For some applications, though, this is a big hammer for a small nail. The cloudscraper package provides a lightweight option for dealing with Cloudflare and has an API similar to the requests package.
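Usage is about as simple as it gets (the URL is a placeholder):

```python
import cloudscraper

# The interface mirrors requests: create a scraper session, then use get/post.
scraper = cloudscraper.create_scraper()
response = scraper.get("https://example.com")
print(response.status_code)
```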
cURL is the ultimate Swiss Army Knife for interacting with network protocols. But to be honest, I really only scratch the surface of what’s possible. Usually my workflow is something like this:
I’m going to take a look at my favourite online tool for converting a cURL command to code and then see what other tools are out there, focusing on Python and R as target languages.
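For example, a simple cURL command and the kind of Python that such a converter might produce (the endpoint is hypothetical):

```python
# curl 'https://api.example.com/items?page=1' -H 'Accept: application/json'
# converted (by hand or by one of these tools) to Python requests:
import requests

response = requests.get(
    "https://api.example.com/items",
    params={"page": "1"},
    headers={"Accept": "application/json"},
)
print(response.json())
```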
Read More →The Big Book of R provides a comprehensive and ever-growing overview of a broad selection of R programming books. It was created and is maintained by Oscar Baruffa. The collection began with approximately 100 books and, with the help of contributions from the R community, has subsequently expanded to over 400. The books are grouped into topics such as geospatial, machine learning, statistics, text analysis, and many more. The Big Book of R is an excellent resource for anyone learning R programming, whether they are a beginner or advanced user.
Read More →The ability to specify a message ID in emails sent from the {emayili} package makes it possible to create email threads.
A new minor version of the openai-python package was released late on Friday 7 June 2024, only a couple of days after the last minor release. This release adds a chunking_strategy argument to the methods for adding files to vector stores.
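A sketch based on my reading of the release (the IDs are placeholders, and the exact parameter shape may vary between versions):

```python
from openai import OpenAI

client = OpenAI()

# Attach a file to a vector store with an explicit static chunking strategy.
# vector_store_id and file_id are placeholders.
client.beta.vector_stores.files.create(
    vector_store_id="vs_123",
    file_id="file_123",
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 800,
            "chunk_overlap_tokens": 400,
        },
    },
)
```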
Quick notes on installing Docker on various platforms:
Read More →This question on Stack Overflow was a fun challenge: extract the markers off an embedded Google Map.
Read More →The R version of my Desert Island Docker talk. Similar idea to Desert Island Docker: Python Edition.
Read More →Over the years that I’ve been dabbling in public speaking I’ve generally developed a talk, presented it once and then moved on. However, I’ve noticed other speakers who give the same (or similar) talk at different events, where the talk evolves and improves over time.
Read More →From time to time I want to extract the table of contents from a PDF. Here’s how I do that using simple shell tools.
Read More →