Andrew B. Collier / @datawookie




Comrades Marathon (2019) Start Delay

How long does it take to cross the start line at the Comrades Marathon? If you’re lucky enough to be starting in one of the batches which is close to the front then this might be a matter of seconds to a couple of minutes. But if you’re in a batch closer to the back then this could be anything up to ten or eleven minutes. This is an agonising wait when all you want to do is start running.

Read More →

A Shiny Comrades Marathon Pacing App

The Comrades Marathon is an epic ultramarathon run each year between Durban and Pietermaritzburg (South Africa).

A few years ago I put together a simple spreadsheet for generating a Comrades Marathon pacing strategy. But the spreadsheet was clunky to use and laborious to maintain. Plus I was frustrated by the crude plots (largely due to my limited spreadsheet proficiency). It seemed like an excellent opportunity to create a Shiny app.

Read More →

{emayili} Sending Email from R

Banner for the {emayili} package.

At Fathom Data we do a lot of automated reporting with R. Being able to easily and reliably send emails is a high priority.

There is already a selection of packages for sending email from R:

We’ve had the most experience with the first two, both of which are really solid packages. However, {gmailr} uses the Google Mail API so it doesn’t work with all SMTP servers and {mailR} has a dependency on {rJava} which can be a bit of a hurdle for deploying in some environments.

Read More →

Setting up an R Admin Group

When I set up an R server for clients they often want to be able to install packages so that all users on the machine have access to them. This requires them to be able to install the packages onto the root filesystem rather than under their individual home directories.

It would be easy enough to give them su access, but this is a risky approach. There are so many other things on the system that they could break with this level of power.

Read More →

Sliding Puzzle Solvable?

I’m helping develop a new game concept, which is based on the sliding puzzle game. The idea is to randomise the initial configuration of the puzzle. However, I quickly discovered that half of the resulting configurations were not solvable. Not good! Here are two approaches to getting a solvable puzzle:

  • build it (by randomly moving tiles from a known solvable configuration) or
  • generate random configurations and check whether solvable.

The first option is obviously more robust. It’s also a bit more work. The second option might require a few iterations, but it’s easy to implement.
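The second option can be sketched with the standard parity test on inversions. This is a minimal illustration, not necessarily the check used in the game: it assumes a square board stored row by row as a flat list, with 0 standing for the blank and the solved state being 1, 2, 3, ... followed by the blank. The function names are hypothetical.

```python
import random


def count_inversions(tiles):
    """Count pairs of tiles that appear in the wrong order (blank ignored)."""
    values = [t for t in tiles if t != 0]
    return sum(
        1
        for i in range(len(values))
        for j in range(i + 1, len(values))
        if values[i] > values[j]
    )


def is_solvable(tiles, width):
    """Parity test for a width x width sliding puzzle (assumed square)."""
    inversions = count_inversions(tiles)
    if width % 2 == 1:
        # Odd width: solvable exactly when the inversion count is even.
        return inversions % 2 == 0
    # Even width: the row of the blank, counted from the bottom (1-based),
    # also matters. Solvable when inversions + that row number is odd.
    blank_row_from_bottom = width - tiles.index(0) // width
    return (inversions + blank_row_from_bottom) % 2 == 1


def random_solvable(width, rng=random):
    """Shuffle until the configuration passes the solvability test."""
    tiles = list(range(width * width))
    while True:
        rng.shuffle(tiles)
        if is_solvable(tiles, width):
            return list(tiles)
```

Since roughly half of all shuffles pass the test, the loop in random_solvable() should need only a couple of attempts on average.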

Read More →

satRday (Paris) 2019

21 February 2019

Arrived in Paris rather late after catching the Eurostar from London. Trip nearly started on a bad note when I underestimated the time required to check in and get through passport control and security. Sat down on the train literally as it departed.

22 February 2019

Early start, working on my tutorial for satRday. When the Sun came up I went out for a trot, primarily to get acquainted with the neighbourhood but also to locate the grave of Jim Morrison. Arrived at Père Lachaise Cemetery to find that it only opened at 08:00. Mildly disappointed. The breakfast that I had back at the hotel made up for that though.

Read More →

Docker Images for R: r-base versus r-apt

I need to deploy a Plumber API in a Docker container. The API has some R package dependencies which need to be baked into the Docker image. A few options for the base image:

The first option, r-base, would require building the dependencies from source, a somewhat time-consuming operation. The last option, r-apt, makes it possible to install most packages using apt, which is likely to be much quicker. I’ll immediately eliminate the other option, tidyverse, because although it already contains a load of packages, many of those are not required and, in addition, it incorporates RStudio Server, which is definitely not necessary for this project.

Read More →

Where does .Renviron live on Citrix?

Logos for Citrix and R.

At one of my clients I run RStudio under Citrix in order to have access to their data.

For the most part this works fine. However, every time I visit them I spend the first few minutes of my day installing packages because my environment does not seem to be persisted from one session to the next.

I finally had a gap and decided to fix the problem.

Where are the packages being installed?

Installed packages just spontaneously disappear… That’s weird. Where are they being installed?

Read More →

Survey Raking: An Illustration

Banner for Survey Raking post.

Analysing survey data can be tricky. There’s often a mismatch between the characteristics of the survey respondents and those of the general population. If the discrepancies are not accounted for then the survey results can (and generally will!) be misleading.
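Raking is commonly done by iterative proportional fitting: the cross-tabulated sample counts are rescaled, one margin at a time, until the row and column totals agree with the known population totals. The sketch below illustrates that idea for a two-way table. The function name, the sample counts and the target margins are all made up for the example, and it assumes every row and column of the sample has a positive total.

```python
def rake(table, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Rescale a two-way table until its margins match the targets."""
    cells = [[float(v) for v in row] for row in table]
    for _ in range(max_iter):
        # Scale each row so that it sums to its target.
        for i, target in enumerate(row_targets):
            total = sum(cells[i])
            cells[i] = [v * target / total for v in cells[i]]
        # Scale each column so that it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in cells)
            for row in cells:
                row[j] *= target / total
        # Columns now match exactly, so only the rows need checking.
        error = max(
            abs(sum(cells[i]) - target) for i, target in enumerate(row_targets)
        )
        if error < tol:
            break
    return cells


# Hypothetical sample counts (rows: two genders, columns: two age bands).
sample = [[30, 20], [10, 40]]
# Hypothetical population margins. Both sets must add up to the same total.
raked = rake(sample, row_targets=[48, 52], col_targets=[35, 65])
```

Dividing each raked cell by the matching sample count gives the weight to apply to respondents in that cell.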

Read More →

Scraping the Turkey Accordion

Banner for Scraping the Turkey Accordion post.

One of the things I like most about web scraping is that almost every site comes with a new set of challenges.

The Accordion Concept

I recently had to scrape a few product pages from the site of a large retailer. I discovered that these pages use an “accordion” to present the product attributes. Only a single panel of the accordion is visible at any one time. For example, you toggle the Details panel open to see the associated content.

Read More →

Installing RStudio & Shiny Servers

I did a remote install of Ubuntu Server today. This was somewhat novel because it’s the first time that I have not had physical access to the machine I was installing on. The server install went very smoothly indeed.

The next tasks were to install RStudio Server and Shiny Server. The installation process for each of these is well documented on the RStudio web site:

These are my notes. Essentially the same, with some small variations.

Read More →

Embedding Dependencies into a HTML File

I use HTML to generate slide decks. Usually my HTML will reference a host of other files on my machine (CSS, JavaScript and images). If I want to distribute my deck then I have a couple of options:

  • just send the HTML file without all of the dependencies or
  • send the HTML file and dependencies (normally wrapped up in some sort of archive).

Both of these have problems. In the former case the HTML just ends up looking like ass because it relies on all of those dependencies to sort out the aesthetics. In the latter case I need to take care of the directory structure and, if those dependencies are distributed across my file system (which they generally are!) then this can be a challenge.
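One way around both problems is to fold the dependencies into the HTML file itself. The sketch below shows the idea for images only, by rewriting local img sources as base64 data URIs. It is an illustration under simple assumptions (double-quoted src attributes, paths relative to a single base directory), not a description of any particular tool, and inline_images() is a hypothetical name.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches the src attribute of an <img> tag (double-quoted values only).
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def inline_images(html, base_dir):
    """Replace local image references with embedded base64 data URIs."""

    def replace(match):
        src = match.group(2)
        path = Path(base_dir) / src
        # Leave remote images, existing data URIs and missing files alone.
        if src.startswith(("http:", "https:", "data:")) or not path.is_file():
            return match.group(0)
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        payload = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"{match.group(1)}data:{mime};base64,{payload}{match.group(3)}"

    return IMG_SRC.sub(replace, html)
```

CSS and JavaScript could be handled in a similar way by replacing link and script references with inline style and script elements.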

Read More →

DNS on Ubuntu

For years it’s been simple to set up DNS on a Linux machine. Just add a couple of entries to /etc/resolv.conf and you’re done.

Read More →

@pyconza (2018): Data Science and Bayes with Python

I’ve just returned from PyConZA (2018), held at the Birchwood Hotel in Boksburg North (Johannesburg) on 11-12 October. A great conference with a super selection of talks and great catering.

Obviously when the PyCon call for papers came out I was feeling ambitious because I submitted a Workshop and a Talk. They were both accepted, so that put the pressure on a bit.

Workshop

I gave a full-day pre-conference workshop on 10 October entitled “Introduction to Python for Data Science”. In retrospect it would have been a better idea to call it “Introduction to Data Science using Python”.

Read More →