Andrew B. Collier / @datawookie

Public datasets:

British Canoeing Results

An Environment for Reliably Rendering Figures in R

2021-03-23 Docker R

Fathom Data is working on a project to reproduce the figures from the CORE textbook The Economy using R and {ggplot2}. There’s a strict style guide which specifies the figure aesthetics including colours and font. We’re a team of seven people working on as many different setups. The principal challenges have been package versions and fonts.

Flexible Environment Variables for a Docker Image

2021-03-22 Docker CI GitLab

I’ve been following an excellent tutorial for deploying a Docker image on an EC2 instance via GitLab CI/CD. It covers every step in the process in great detail. If you follow the steps then you’ll definitely end up with a working pipeline.

However, I still wasn’t quite sure how to handle the environment variables and credentials that I wanted to bake into the image, and which varied between my local development environment and the final deployed image.

Install GitLab Runner with Docker

2021-03-21 Docker CI GitLab

📢 An updated version of this post reflecting recent changes in GitLab Runner can be found here.

I’ve got a project which takes a long time to build. And I rebuild it regularly. I’ve been using the shared runners on GitLab. However, the total time constraint has become a limitation. I’m going to install GitLab Runner as a Docker service on an underutilised EC2 instance.

{emayili} UTF-8 Filenames & Setting Sender

2021-03-08 {emayili} R

Two new features in the {emayili} (0.4.6) package for easily sending emails from R.

Package Setup

If you have not already installed the package, then grab it from CRAN or GitHub.

# From CRAN.
install.packages("emayili")
# From GitHub.
remotes::install_github("datawookie/emayili")

Load the package.

library(emayili)

Check that you have the current version.

packageVersion("emayili")

[1] ‘0.4.6’

Let’s quickly set up an SMTP server. We’ll use SMTP Bucket, which is incredibly convenient for testing.

SMTP_SERVER   = "mail.smtpbucket.com"
SMTP_PORT     = 8025

smtp <- server(host = SMTP_SERVER, port = SMTP_PORT)

UTF-8 Characters in Attachment Filenames

It’s now possible to attach files with names that include non-ASCII characters. Suppose I wanted to send this image (source) of Wenceslao Moreno.

Resurrecting MySQL into PostgreSQL with PGLoader

2021-03-02 MySQL PostgreSQL Docker PGLoader

I’ve been hosting a MySQL database on a DigitalOcean server for a few years. The project has been on hold for a while. Entropy kicked in and the server became unreachable. Fortunately I was still able to access the server via a recovery console to export the database using mysqldump and download the resulting SQL dump file.

Now I want to resurrect the database locally but I also want to migrate it to PostgreSQL.

{blogdown}: Optimise PNG Image Size

2021-02-21 {blogdown} R

Inspired by the informative post from Jumping Rivers about selecting the correct image file type, I decided to optimise PNG file size as part of this blog’s CI pipeline.

{emayili} Sending Birthday Messages

2021-02-18 {emayili} R

Suppose that you want to use {emayili} to send birthday messages (this post was motivated by issue #61).

Setting up postref Shortcode for Remote Blog

2021-02-10 {blogdown} Hugo R

I ran into a bit of a snag when updating a {blogdown} site. Suddenly, inexplicably, the images were no longer present.

Launching Selenium with JavaScript Disabled

2021-02-03 Python Selenium web scraping

I have a rather obscure situation where I want to launch Selenium… but with JavaScript disabled.

Levies, Tax and the Fuel Price in South Africa

2021-02-01 {saffer} R

According to the Automobile Association (AA) the fuel price is the sum of four main components:

the basic fuel price
the general fuel levy
the Road Accident Fund (RAF) levy and
wholesale and retail margins, distribution and transport costs.

This article suggests that almost 70% of the fuel price in South Africa is due to taxes and levies.

How much of South Africa's petrol price goes to taxes https://t.co/m96j0TaKlS
— BusinessTech (@BusinessTechSA) January 29, 2021

I used data from {saffer} to examine this assertion.

This is not Rain: It’s a Trickle

2021-01-30 Selenium web scraping R

I started using rain (a South African ISP) back in March 2019. The coverage was good (the only place I couldn’t get a signal was at Lanseria Airport), while the bandwidth was consistently high. I loved the fact that it was affordable, reliable and portable.

Much has changed.

Persistent Selenium Sessions

2021-01-28 Python Selenium

I have a project where I need to have a persistent Selenium session. There’s a script which will leave a browser window open when it exits. When the script runs again it should connect to the same window.

Cyril’s Speeches

2021-01-14 {saffer} R

The transcripts for the South African President’s speeches are available here. I’ve just added these data to the {saffer} package.

Topographic Maps for South Africa

2021-01-12 R {saffer} spatial GDAL Docker

I’ve been adding topographic maps of South Africa to the saffer-data-map repository. The maps are originally in MrSID format (.sid files), which is a proprietary file format developed by LizardTech (a company which has been consumed by Extensis).

Price of Fuel in South Africa

2021-01-10 {saffer} R

The {saffer} package is a nascent collection of things relating to South Africa. I’ve just added fuel price data.

SQLAlchemy: Efficient Counting

2021-01-09 SQLAlchemy SQL

I have a SQLAlchemy count() query which is being called fairly frequently in my API. The query itself is not terribly inefficient, but it’s being called with sufficient frequency that it has a performance impact.

Retail Pricing: Latex Gloves

2021-01-06 {trundler} R

A couple of days ago I posted an analysis of the price of nitrile gloves at Dischem.

Running History: Garmin Connect

2021-01-05 running R

I had a suspicion that there was more data beyond the history that I got from Strava (see previous post). And indeed my hunch was confirmed by downloading my history from Garmin Connect.

Retail Pricing: Nitrile Gloves

2021-01-04 {trundler} R

My colleague, Matt, noted that nitrile gloves are getting really expensive.

Running History: Strava

2021-01-02 running R

I’ve been itching to do some analytics on my running data. Today seemed like a good time to actually do it.

I’ll be generating some plots using the {strava} package developed by Marcus Volz.

GitLab CI: Services

2020-12-30 GitLab

I needed to have a Redis server available as part of the GitLab CI pipeline for this blog (simply because I wanted to use the {rredis} package). After fiddling around for some time trying to install the redis-server package using apt I discovered that GitLab CI actually provides Redis as a service, which makes the process remarkably easy.

Some details of the “standard” services (Redis, PostgreSQL and MySQL) supported by GitLab CI can be found here:

Rendering an R Markdown Presentation to GitLab Pages

2020-09-23 R GitLab

I’m busy preparing slides for the Why R? conference using the brilliant {xaringan} package along with {xaringanthemer} to tweak the styles. There are plots rendered into the document as well as static images.

Scrapy Ban Policies with Rotating Proxies

2020-09-17 Scrapy

The scrapy-rotating-proxies package makes it simple to use rotating proxies with Scrapy.

One issue that I’ve run into though is that pages which return a 404 error are retried (and the corresponding proxy is marked as dead). This does not make sense to me since if a server returns a 404 error this generally means that the requested page is just not available. It’s not a proxy problem; it’s a URL problem.
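One way to express "a 404 is a URL problem, not a proxy problem" is a custom ban policy. The sketch below is a minimal, standalone illustration rather than the package's own code: it assumes that scrapy-rotating-proxies accepts any class with `response_is_ban` and `exception_is_ban` methods via the `ROTATING_PROXY_BAN_POLICY` setting, and the class name, the module path in the comment and the list of accepted status codes are all hypothetical. In a real project it may be preferable to subclass `rotating_proxies.policy.BanDetectionPolicy` so that the default exception handling is retained.

```python
from types import SimpleNamespace


class Allow404BanPolicy:
    """Ban policy sketch that does not treat 404 responses as bans."""

    # Status codes that should NOT mark the proxy as dead (assumed list).
    NOT_BAN_STATUSES = {200, 301, 302, 404}

    def response_is_ban(self, request, response):
        if response.status not in self.NOT_BAN_STATUSES:
            return True
        # An empty 200 response is suspicious, so treat it as a ban.
        if response.status == 200 and not response.body:
            return True
        return False

    def exception_is_ban(self, request, exception):
        # Assumption: any download exception counts as a proxy failure.
        return True


# In settings.py (hypothetical module path):
# ROTATING_PROXY_BAN_POLICY = "myproject.policy.Allow404BanPolicy"

# Quick check with stand-in objects instead of real Scrapy responses.
policy = Allow404BanPolicy()
not_found = SimpleNamespace(status=404, body=b"Not Found")
forbidden = SimpleNamespace(status=403, body=b"")
print(policy.response_is_ban(None, not_found))
print(policy.response_is_ban(None, forbidden))
```

With a policy like this a 404 page is simply passed on to the spider, while other error codes still cause the proxy to be rotated.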

Uploading CSV to MySQL

2020-09-01 MySQL

I occasionally need to upload the contents of a CSV file to a MySQL database. It happens sufficiently infrequently that I need to remind myself how it works each time. Hopefully this will make it easier next time around.

The Data.

Suppose that you have a file, prices.csv, that looks like this:

"time","product","price"
2020-06-03 22:33:39,"Basic T-Shirt",299
2020-07-22 21:32:21,"Pique Polo",429
2020-04-07 05:38:17,"COUNTRY ROAD Slub Frill T-Shirt",299
2020-04-23 03:54:09,"Caribbean Tan Mousse Gradual A 150ml",95.95
2020-04-01 05:01:29,"Pulled Pork Shoulder 500g",79.99
2020-05-15 12:26:48,"Back To Work Blazer",2299
2020-07-13 06:28:27,"Funnel Neck Cardigan",1499
2020-06-03 17:07:50,"Extra Depth 180TC Cotton Blend Fitted Sheet",279
2020-07-28 02:00:29,"Clover Seal Full Cream Fresh Milk 1l",17.99

Create Table

First we need to create a table.
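As a rough illustration of what the table might look like, here is a small Python sketch that inspects a couple of rows from the sample above and drafts a CREATE TABLE statement. It is not taken from the post: the helper names, the table name `prices` and the chosen column types (`DATETIME`, `VARCHAR(255)`, `DECIMAL(10,2)`) are all assumptions.

```python
import csv
import io
from datetime import datetime

# Two rows copied from the sample file shown above.
SAMPLE = '''"time","product","price"
2020-06-03 22:33:39,"Basic T-Shirt",299
2020-04-23 03:54:09,"Caribbean Tan Mousse Gradual A 150ml",95.95
'''


def infer_type(values):
    """Pick a MySQL column type (assumed choices) that fits every value."""
    try:
        for value in values:
            datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        return "DATETIME"
    except ValueError:
        pass
    try:
        for value in values:
            float(value)
        return "DECIMAL(10,2)"
    except ValueError:
        return "VARCHAR(255)"


def create_table_sql(table, csv_text):
    """Draft a CREATE TABLE statement from the header and data rows."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    columns = [
        f"  `{name}` {infer_type([row[i] for row in data])}"
        for i, name in enumerate(header)
    ]
    return f"CREATE TABLE `{table}` (\n" + ",\n".join(columns) + "\n);"


print(create_table_sql("prices", SAMPLE))
```

The generated statement is only a starting point. Column widths, keys and indexes would still need to be chosen by hand.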

Configuring a Development Database

2020-08-30 Docker

What’s the quickest way to spin up a local development database? Using Docker, of course!

Resizing a Volume on an EC2 Linux Instance

2020-08-04 AWS

From time to time you might need to resize one of the volumes attached to an EC2 instance. Perhaps it’s too big and you’re wanting to downsize? Or maybe it’s too small and you’re wanting to upscale? You only have the option of increasing the size of a volume. If you want a smaller one, then you’ll need to create a new volume and migrate the data across. However, if you’re making it bigger, then everything you need to know is in this post.

Shiny App in Docker with HTTP Authentication

2020-06-29 Docker Shiny NGINX R

Suppose you have an app running on a Shiny server and you want to add HTTP authentication so that it’s only accessible via a username and password. This can be done using NGINX.

Test Shiny Server

The Shiny server should be accessible at http://localhost:3838/ (assuming you’re running Shiny server on localhost). 📢 Substitute another IP address or DNS entry if you’re running on another machine.

Retail Data: R Package

2020-03-15 {trundler} R

Have you ever noticed how things seem to get really expensive at specific times of the year? Like Mother’s Day and Valentine’s Day? Have you ever felt a bit ripped off when buying an over-priced bouquet of flowers or box of chocolates? Have you ever wondered just how much those prices have been inflated?

Of course you have!

But it’s always been a niggling suspicion, never a fact. Where’s the evidence?

Retail Data: Scraping & API

2020-03-15 web scraping Python

I’ve been wanting to gather data on retail prices for quite some time. Finally, just before Christmas 2019, I had some time on my hands, so I started to put something together.

R Package for @racently

2019-12-06 R running

An R wrapper for the @racently API.

Durban EDGE DataQuest

2019-11-13 R

A couple of quick R starter scripts for the Durban EDGE DataQuest.

An API for @racently

2019-11-12 R running

Retrieving running data using the @racently API.

Scraping Machinery Parts

2019-11-11 R web scraping

Scraping prices from a supplier of replacement parts for heavy machinery.

Installing Prophet on CentOS

2019-11-04 R Linux

How to install the Prophet package for R on RHEL or CentOS.

Private Security and the Pareto Principle

2019-10-16 Data Science R

Private security is a big industry in South Africa. Most private security companies promise to provide a rapid response to every callout generated by any of their customers. There is a delicate balance between the number of response vehicles and the number of customers (and the frequency of their callouts!), which determines whether or not they are able to honour this promise.

On the one hand, more response vehicles result in lower response times. However, these vehicles are expensive to maintain and staff. Fewer vehicles are more cost effective, but make it difficult to maintain a high level of service.

Tweaking Linux for Pernickety Projectors

2019-10-12 Linux speaking

Linux has really come a long way. I used to arrive at the podium and hook up my (Linux) laptop with the resigned expectation that there would be some tweaking involved to get it to speak to the projector. However, the support for video hardware has evolved massively and nowadays I don’t ever think about this: it just works.

Until it doesn’t.

This week I was speaking at a conference where the video setup was extremely pernickety. It required a resolution of 1280 by 720 at a frequency of 50 Hz. Try to set that up using the desktop display configuration tools in Ubuntu… it just doesn’t seem to be possible.

MySQL Backups

2019-09-17 MySQL

Your data are valuable. If, God forbid, some disaster befalls your database then you should have a plan in place for how to recover your data. In this post I describe a simple strategy for backing up a MySQL database. This might not be the best approach, but it has worked for me.

R, Docker and Checkpoint: A Route to Reproducibility

2019-08-28 R Docker

I need to deploy Shiny on a Windows machine. I also need to use {checkpoint} for package management. Using Docker seems to be the only reasonable approach to Shiny on Windows. But how easy would it be to also factor {checkpoint} into this setup?

Only one reasonable way to find out: give it a try.

Below is the simple Dockerfile I used. Here are the fundamental components of what it does:

All Roads Lead to Rome

2019-07-28 R OSRM

I was inspired by this visualisation, showing the optimal routes (by car) from the geographic centre of the USA to all counties.

Using Shared Memory with OSRM

2019-07-26 OSRM Linux

If you have multiple applications accessing OSRM data then it does not make sense for each of those to have a separate copy of the data resident in memory. This is especially true if you’re using a relatively large map, in which case memory consumed by multiple processes might be enormous.

An alternative is to store the map data in shared memory, allowing multiple processes to access a single copy of the data. The official OSRM documentation for using shared memory can be found here. This post gives further details.

Recreating ‘Unknown Pleasures’ graphic

2019-07-15 R

For some time I’ve wanted to recreate the cover art from Joy Division’s Unknown Pleasures album. The visualisation depicts successive pulses from the pulsar PSR B1919+21, discovered by Jocelyn Bell in 1967.

Comrades Marathon (2019) Splits

2019-07-01 R running

I’m looking at ways to effectively visualise the splits data for the 2019 edition of the Comrades Marathon. My objectives are to provide:

an overall view of the splits across the entire field and
a detailed view for individual runners (relative to the rest of the field).

Medal Breakdown at Comrades Marathon (2019)

2019-06-30 R running

A quick breakdown of the medal distribution at the 2019 Comrades Marathon.

Comrades Marathon (2019) Start Delay

2019-06-15 R running

How long does it take to cross the start line at the Comrades Marathon? If you’re lucky enough to be starting in one of the batches which is close to the front then this might be a matter of seconds to a couple of minutes. But if you’re in a batch closer to the back then this could be anything up to ten or eleven minutes. This is an agonising wait when all you want to do is start running.

A Shiny Comrades Marathon Pacing App

2019-06-04 R Shiny running

The Comrades Marathon is an epic ultramarathon run each year between Durban and Pietermaritzburg (South Africa).

A few years ago I put together a simple spreadsheet for generating a Comrades Marathon pacing strategy. But the spreadsheet was clunky to use and laborious to maintain. Plus I was frustrated by the crude plots (largely due to my limited spreadsheet proficiency). It seemed like an excellent opportunity to create a Shiny app.

{emayili} Sending Email from R

2019-05-27 {emayili} R

At Fathom Data we do a lot of automated reporting with R. Being able to easily and reliably send emails is a high priority.

There is already a selection of packages for sending email from R:

We’ve had the most experience with the first two, both of which are really solid packages. However, {gmailr} uses the Google Mail API so it doesn’t work with all SMTP servers and {mailR} has a dependency on {rJava} which can be a bit of a hurdle for deploying in some environments.

Setting up an R Admin Group

2019-04-11 R

When I set up an R server for clients they often want to be able to install packages so that all users on the machine have access to them. This requires them to be able to install the packages onto the root filesystem rather than under their individual home directories.

It would be easy enough to give them su access, but this is a risky approach. There are so many other things on the system that they could break with this level of power.

Sliding Puzzle Solvable?

2019-04-10 Python

I’m helping develop a new game concept, which is based on the sliding puzzle game. The idea is to randomise the initial configuration of the puzzle. However, I quickly discovered that half of the resulting configurations were not solvable. Not good! Here are two approaches to getting a solvable puzzle:

build it (by randomly moving tiles from a known solvable configuration) or
generate random configurations and check whether solvable.

The first option is obviously more robust. It’s also a bit more work. The second option might require a few iterations, but it’s easy to implement.
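The second option can be sketched with the classic inversion-count parity test. This is a minimal illustration rather than the code from the post: the function names are hypothetical, the puzzle is assumed to be square and stored as a flat row-major list with 0 for the blank, and the solved state is assumed to have the tiles in ascending order with the blank in the last cell.

```python
import random


def count_inversions(tiles):
    """Count pairs of tiles that are out of order, ignoring the blank (0)."""
    flat = [t for t in tiles if t != 0]
    return sum(
        1
        for i in range(len(flat))
        for j in range(i + 1, len(flat))
        if flat[i] > flat[j]
    )


def is_solvable(tiles, width):
    """Check whether a width x width puzzle can reach the solved state."""
    inversions = count_inversions(tiles)
    if width % 2 == 1:
        # Odd width: solvable when the number of inversions is even.
        return inversions % 2 == 0
    # Even width: the row of the blank, counted from the bottom starting
    # at 1, matters too. Solvable when the sum of the two is odd.
    blank_row_from_bottom = width - tiles.index(0) // width
    return (inversions + blank_row_from_bottom) % 2 == 1


def random_solvable(width, rng=random):
    """Shuffle repeatedly until a solvable configuration turns up."""
    tiles = list(range(width * width))
    while True:
        rng.shuffle(tiles)
        if is_solvable(tiles, width):
            return tiles


print(random_solvable(3))
```

Since half of all random configurations are solvable, the loop in `random_solvable` needs two shuffles on average.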

Integrating Qlik Sense and R

2019-03-26 R Docker

Qlik Sense is a tool for exploratory data analysis and visualisation. It’s powerful and versatile. It can, however, be significantly enhanced by interfacing with R.

satRday (Paris) 2019

2019-02-25 conference

21 February 2019

Arrived in Paris rather late after catching the Eurostar from London. Trip nearly started on a bad note when I underestimated the time required to check-in, get through passport control and security. Sat down on the train literally as it departed.

22 February 2019

Early start, working on my tutorial for satRday. When the Sun came up I went out for a trot, primarily to get acquainted with the neighbourhood but also to locate the grave of Jim Morrison. Arrived at Père Lachaise Cemetery to find that it only opened at 08:00. Mildly disappointed. The breakfast that I had back at the hotel made up for that though.
