Andrew B. Collier / @datawookie




{hagr} Linnaean Classification

I’ve taken another look at the {hagr} data, which I wrote about previously. This time I’m focusing on the hierarchy of creatures.

Taxonomic Rank

The Linnaean Taxonomy is a hierarchical classification system for organisms devised by Carl Linnaeus. An organism is assigned to the following levels in the hierarchy (in order of increasing granularity):

  • domain
  • kingdom
  • phylum
  • class
  • order
  • family
  • genus and
  • species.

The relative level of a group of organisms in this hierarchy determines its taxonomic rank.
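
As a rough illustration (in Python rather than R, and not taken from {hagr}), the ranks can be modelled as an ordered enumeration, so that comparing two ranks becomes an integer comparison. The names `Rank` and `is_more_specific` are hypothetical and exist only for this sketch.

```python
from enum import IntEnum


class Rank(IntEnum):
    """Linnaean ranks, ordered from least to most granular."""
    DOMAIN = 1
    KINGDOM = 2
    PHYLUM = 3
    CLASS = 4
    ORDER = 5
    FAMILY = 6
    GENUS = 7
    SPECIES = 8


def is_more_specific(a: Rank, b: Rank) -> bool:
    """Return True if rank `a` sits below rank `b` in the hierarchy."""
    return a > b


# Classification of humans, one entry per rank.
human = {
    Rank.DOMAIN: "Eukaryota",
    Rank.KINGDOM: "Animalia",
    Rank.PHYLUM: "Chordata",
    Rank.CLASS: "Mammalia",
    Rank.ORDER: "Primates",
    Rank.FAMILY: "Hominidae",
    Rank.GENUS: "Homo",
    Rank.SPECIES: "Homo sapiens",
}

# Walk the hierarchy from the broadest to the narrowest group.
for rank in sorted(human):
    print(f"{rank.name.lower():<8} {human[rank]}")
```

Using an `IntEnum` means the taxonomic rank of a group is just its position in the ordering, which matches the definition above.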

Read More →

{hagr} Database of Animal Ageing and Longevity

I came across the Human Ageing Genomic Resources. They are doing some fascinating work and expose some engrossing data. I wanted to make the data easier for me to work with, and an R package seemed to be the natural vehicle to do this.

For more information on these data, take a look at this article: Tacutu, Craig, Budovsky, Wuttke, Lehmann, Taranukha, Costa, Fraifeld and de Magalhaes, “Human Ageing Genomic Resources: Integrated databases and tools for the biology and genetics of ageing,” Nucleic Acids Research 41(D1):D1027-D1033, 2013.

Read More →

Making the Most of Mobility

The Google Mobility Data (or Community Mobility Reports) refers to the datasets provided by Google which track how people move and congregate in various locations during specific time periods. The data is based on anonymised location information from users who have opted into Location History on their Google accounts.

Read More →

Flexible Environment Variables for a Docker Image

I’ve been following an excellent tutorial for deploying a Docker image on an EC2 instance via GitLab CI/CD. It covers every step in the process in great detail. If you follow the steps then you’ll definitely end up with a working pipeline.

However, I still wasn’t quite sure how to handle the environment variables and credentials that I wanted to bake into the image, and which varied between my local development environment and the final deployed image.

Read More →

Install GitLab Runner with Docker

📢 An updated version of this post reflecting recent changes in GitLab Runner can be found here.

I’ve got a project which takes a long time to build. And I rebuild it regularly. I’ve been using the shared runners on GitLab. However, the total time constraint has become a limitation. I’m going to install GitLab Runner as a Docker service on an underutilised EC2 instance.

Read More →

{emayili} UTF-8 Filenames & Setting Sender

Two new features in the {emayili} (0.4.6) package for easily sending emails from R.

Package Setup

If you have not already installed the package, then grab it from CRAN or GitHub.

# From CRAN.
install.packages("emayili")
# From GitHub.
remotes::install_github("datawookie/emayili")

Load the package.

library(emayili)

Check that you have the current version.

packageVersion("emayili")
[1] ‘0.4.6’

Let’s quickly set up an SMTP server. We’ll use SMTP Bucket, which is incredibly convenient for testing.

SMTP_SERVER   = "mail.smtpbucket.com"
SMTP_PORT     = 8025

smtp <- server(host = SMTP_SERVER, port = SMTP_PORT)

UTF-8 Characters in Attachment Filenames

It’s now possible to attach files with names that include non-ASCII characters. Suppose I wanted to send this image (source) of Wenceslao Moreno.

Read More →

Resurrecting MySQL into PostgreSQL with PGLoader

I’ve been hosting a MySQL database on a DigitalOcean server for a few years. The project has been on hold for a while. Entropy kicked in and the server became unreachable. Fortunately I was still able to access the server via a recovery console to export the database using mysqldump and download the resulting SQL dump file.

Now I want to resurrect the database locally but I also want to migrate it to PostgreSQL.

Read More →

Levies, Tax and the Fuel Price in South Africa

According to the Automobile Association (AA) the fuel price is the sum of four main components:

  • the basic fuel price
  • the general fuel levy
  • the Road Accident Fund (RAF) levy and
  • wholesale and retail margins, distribution and transport costs.

This article suggests that almost 70% of the fuel price in South Africa is due to taxes and levies.

I used data from {saffer} to examine this assertion.
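
The check itself is simple arithmetic. Here is a minimal sketch in Python, assuming that "taxes and levies" means the general fuel levy plus the RAF levy from the list above. The function name `levy_share` is hypothetical, and the numbers are arbitrary placeholders, not values from {saffer}.

```python
def levy_share(basic_price, general_levy, raf_levy, margins_and_costs):
    """Fraction of the pump price made up of the two levies."""
    total = basic_price + general_levy + raf_levy + margins_and_costs
    return (general_levy + raf_levy) / total


# Arbitrary placeholder values (cents per litre), NOT real data.
share = levy_share(
    basic_price=600,
    general_levy=300,
    raf_levy=200,
    margins_and_costs=400,
)
print(f"Levies make up {share:.1%} of the price")
```

Substituting the actual components for a given month gives the share to compare against the 70% claim.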

Read More →

Persistent Selenium Sessions

I have a project where I need to have a persistent Selenium session. There’s a script which will leave a browser window open when it exits. When the script runs again it should connect to the same window.

Read More →

SQLAlchemy: Efficient Counting

I have a SQLAlchemy count() query which is being called fairly frequently in my API. The query itself is not terribly inefficient, but it’s being called with sufficient frequency that it has a performance impact.

Read More →

GitLab CI: Services

I needed to have a Redis server available as part of the GitLab CI pipeline for this blog (simply because I wanted to use the {rredis} package). After fiddling around for some time trying to install the redis-server package using apt I discovered that GitLab CI actually provides Redis as a service, which makes the process remarkably easy.

Some details of the “standard” services (Redis, PostgreSQL and MySQL) supported by GitLab CI can be found here:

Read More →

Scrapy Ban Policies with Rotating Proxies

The scrapy-rotating-proxies package makes it simple to use rotating proxies with Scrapy.

One issue that I’ve run into though is that pages which return a 404 error are retried (and the corresponding proxy is marked as dead). This does not make sense to me since if a server returns a 404 error this generally means that the requested page is just not available. It’s not a proxy problem; it’s a URL problem.
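
One possible way around this is a custom ban policy. The sketch below is an assumption based on my reading of the package's README: the middleware calls `response_is_ban()` and `exception_is_ban()` on whatever class the `ROTATING_PROXY_BAN_POLICY` setting points to. The class name `Allow404BanPolicy` is hypothetical, and the exception handling is simplified compared to the package default.

```python
class Allow404BanPolicy:
    """Ban policy sketch that does not treat a 404 as a proxy ban."""

    # Statuses that should NOT mark the proxy as dead (404 added here).
    NOT_BAN_STATUSES = {200, 301, 302, 404}

    def response_is_ban(self, request, response):
        # Any status outside the allowed set is treated as a ban.
        if response.status not in self.NOT_BAN_STATUSES:
            return True
        # An empty 200 response is still treated as suspicious.
        if response.status == 200 and not len(response.body):
            return True
        return False

    def exception_is_ban(self, request, exception):
        # Simplification: treat every download exception as a bad proxy.
        return True
```

To try it, point the setting at the class in settings.py, for example `ROTATING_PROXY_BAN_POLICY = "myproject.policy.Allow404BanPolicy"` (the module path is a placeholder).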

Read More →

Uploading CSV to MySQL

I occasionally need to upload the contents of a CSV file to a MySQL database. It happens sufficiently infrequently that I need to remind myself how it works each time. Hopefully this will make it easier next time around.

The Data

Suppose that you have a file, prices.csv, that looks like this:

"time","product","price"
2020-06-03 22:33:39,"Basic T-Shirt",299
2020-07-22 21:32:21,"Pique Polo",429
2020-04-07 05:38:17,"COUNTRY ROAD Slub Frill T-Shirt",299
2020-04-23 03:54:09,"Caribbean Tan Mousse Gradual A 150ml",95.95
2020-04-01 05:01:29,"Pulled Pork Shoulder 500g",79.99
2020-05-15 12:26:48,"Back To Work Blazer",2299
2020-07-13 06:28:27,"Funnel Neck Cardigan",1499
2020-06-03 17:07:50,"Extra Depth 180TC Cotton Blend Fitted Sheet",279
2020-07-28 02:00:29,"Clover Seal Full Cream Fresh Milk 1l",17.99

Create Table

First we need to create a table.

Read More →

Resizing a Volume on an EC2 Linux Instance

From time to time you might need to resize one of the volumes attached to an EC2 instance. Perhaps it’s too big and you want to downsize? Or maybe it’s too small and you want to upscale? Unfortunately, you only have the option of increasing the size of a volume. If you want a smaller one, then you’ll need to create a new volume and migrate the data across. However, if you’re making it bigger, then everything you need to know is in this post.

Read More →

Shiny App in Docker with HTTP Authentication

Suppose you have an app running on a Shiny server and you want to add HTTP authentication so that it’s only accessible via a username and password. This can be done using NGINX.

Test Shiny Server

The Shiny server should be accessible at http://localhost:3838/ (assuming you’re running Shiny server on localhost). 📢 Substitute another IP address or DNS entry if you’re running on another machine.

Read More →

Retail Data: R Package

Have you ever noticed how things seem to get really expensive at specific times of the year? Like Mother’s Day and Valentine’s Day? Have you ever felt a bit ripped off when buying an over-priced bouquet of flowers or box of chocolates? Have you ever wondered just how much those prices have been inflated?

Of course you have!

But it’s always been a niggling suspicion, never a fact. Where’s the evidence?

Read More →

Private Security and the Pareto Principle

Private Security is a big industry in South Africa. Most Private Security companies promise to provide a rapid response to every callout generated by any of their customers. There is a delicate balance between the number of response vehicles and the number of customers (and the frequency of their callouts!), which determines whether or not they are able to honour this promise.

On the one hand, more response vehicles result in lower response times. However, these vehicles are expensive to maintain and staff. Fewer vehicles are more cost effective, but make it difficult to maintain a high level of service.

Read More →

Tweaking Linux for Pernickety Projectors

Linux has really come a long way. I used to arrive at the podium and hook up my (Linux) laptop with the resigned expectation that there would be some tweaking involved to get it to speak to the projector. However, the support for video hardware has evolved massively and nowadays I don’t ever think about this: it just works.

Until it doesn’t.

This week I was speaking at a conference where the video setup was extremely pernickety. It required a resolution of 1280 by 720 at a frequency of 50 Hz. Try to set that up using the desktop display configuration tools in Ubuntu… it just doesn’t seem to be possible.

Read More →

MySQL Backups

Your data are valuable. If, God forbid, some disaster befalls your database then you should have a plan in place for how to recover your data. In this post I describe a simple strategy for backing up a MySQL database. This might not be the best approach, but it has worked for me.

Read More →

R, Docker and Checkpoint: A Route to Reproducibility

I need to deploy Shiny on a Windows machine. I also need to use {checkpoint} for package management. Using Docker seems to be the only reasonable approach to Shiny on Windows. But how easy would it be to also factor {checkpoint} into this setup?

Only one reasonable way to find out: give it a try.

Below is the simple Dockerfile I used. Here are the fundamental components of what it does:

Read More →

Using Shared Memory with OSRM

If you have multiple applications accessing OSRM data then it does not make sense for each of those to have a separate copy of the data resident in memory. This is especially true if you’re using a relatively large map, in which case memory consumed by multiple processes might be enormous.

An alternative is to store the map data in shared memory, allowing multiple processes to access a single copy of the data. The official OSRM documentation for using shared memory can be found here. This post gives further details.

Read More →