Blog Posts by Andrew B. Collier / @datawookie


Undetected ChromeDriver: Stay Below the Radar

There’s one major problem with ChromeDriver: anti-bot services are able to detect that a browser session is being automated (as opposed to being used by a regular meat sack) and will often impose restrictions or deny connections altogether. The Undetected ChromeDriver (undetected-chromedriver) Python package is a patched version of ChromeDriver which avoids triggering a selection of anti-bot services, allowing it to glide under the anti-bot radar.

Read More →

{pagedown} Page Size & Margins

At Fathom Data we have been doing a lot of automated documentation and automated reporting. Although many of these documents are rendered to HTML, there’s an increasing demand for PDF documents. So we’ve had to raise our game in that department. The {pagedown} package has become invaluable. This is a short note showing how we tweak the page size and margins for PDF documents.

Read More →

Scaling Density Plots

I’m a density plot devotee. And, using geom_density() from {ggplot2}, these plots are effortless to produce. However, sometimes the results of geom_density() are not exactly what I’m after. Here’s how I tweak them to give me precisely what I need.

Read More →

Vertically Align Image & Text

I’m not a web developer. However, I do regularly use HTML and CSS to lay out pages which are then transformed into PDF documents. One of the requirements that I encounter fairly often is vertically aligning images and text. Fairly often, but not often enough to remember the solution. So I inevitably have to rediscover the solution (thanks StackOverflow!). Jotting this down for posterity.

Read More →

Using Shiny Server in Docker

A quick note on how to use the Shiny Server Docker image, rocker/shiny.

I’m a big believer in starting with the simplest possible setup, getting that to work and then adding complexity in layers. We’ll start with a simple Shiny application in app.R.

Read More →

Linux Packages for R

Getting R set up on Linux can be somewhat frustrating. Many of the fundamental packages (like {devtools} or {remotes}) have implicit system dependencies. So installing these packages can involve numerous iterations back and forth between R and the shell while you figure out what those dependencies are and get them all installed.

I’ve been through this process many times now and finally wrote a short script that gets most of it done quickly and easily.

Read More →

Historical Weather Data

I’m building a model which requires historical weather data from a selection of locations in South Africa. In this post I demonstrate the process of acquiring the data and doing some simple processing.

Read More →

Unravelling Transparency in Coverage Data

I have a challenge: extracting data from an enormous JSON file. The structure of the file is not ideal: it’s a mapping at the top level, which means that for most standard approaches the entire document needs to be loaded before it can be processed. It would have been so much easier if the top-level structure were an array. But, alas. It’s almost as if the purveyors of the data have made it intentionally inaccessible.

Read More →

Persisting Data with Pickle & S3

I occasionally write scripts where I need to persist some information between runs. These scripts are often wrapped in a Docker image and deployed on Amazon ECS. This means that there is no persistent storage. I could use a database, but this would be overkill for the volume of data involved. This post describes a simple approach to storing these data on S3 using a pickle file.
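
As a rough illustration of the idea (not necessarily the approach taken in the full post), the sketch below pickles a Python object and stores it as a single S3 object. The helper names load_state and save_state, and the bucket and key in the usage note, are hypothetical. The client argument is assumed to be a boto3 S3 client, whose get_object and put_object calls are used here.

```python
import pickle


def load_state(client, bucket, key, default=None):
    """Return the object pickled at s3://bucket/key, or default if the key is absent."""
    try:
        response = client.get_object(Bucket=bucket, Key=key)
    except client.exceptions.NoSuchKey:
        # Nothing has been persisted yet (for example, on the first run).
        return default
    # The response body is a stream; read it fully and unpickle.
    return pickle.loads(response["Body"].read())


def save_state(client, bucket, key, state):
    """Pickle state and upload it to s3://bucket/key, replacing any earlier copy."""
    client.put_object(Bucket=bucket, Key=key, Body=pickle.dumps(state))
```

A script might create the client with boto3.client("s3"), call load_state(client, "my-bucket", "state.pkl", default={}) at startup and save_state(...) just before exiting. Only unpickle objects from a bucket you control, since unpickling untrusted data can execute arbitrary code.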

Read More →

Great Britain Railway Network

Introducing the nascent R package {blimey} (repository). At this stage it contains only the following data:

  • railways — latitude and longitude segments along railway lines (wide format);
  • railways_pivot — latitude and longitude segments along railway lines (long format); and
  • railway_stations — codes, names and locations of railway stations.

Read More →