Andrew B. Collier / @datawookie


Social links and a link to my CV.

Public datasets:


Machine Learning with R Cookbook

Cover of 'Machine Learning with R Cookbook'.

“Machine Learning with R Cookbook” by Chiu Yu-Wei is nothing more or less than it purports to be: a collection of 110 recipes for applying Data Analysis and Machine Learning techniques in R. I was asked by the publishers to review this book and found it to be an interesting and informative read. It will not help you understand how Machine Learning works (that’s not the goal!) but it will help you quickly learn how to apply Machine Learning techniques to you own problems.

Read More →

Disney: Quality over Quantity

I read an interesting article in the 10 June 2015 edition of The Wall Street Journal. The graphic below concisely illustrates the focused strategy being adopted by Disney: a steady decline in the number of movies released per year accompanied by a steady ascent in operating margin.

Read More →

Hosting Shiny on Amazon EC2

I recently finished some work on a Shiny application which incorporated a Random Forest model. The model was stored in a .rda file and loaded by server.R during initialisation. This worked fine when tested locally but when I tried to deploy the application on shinyapps.io I ran into a problem: evidently you can only upload server.R and ui.R files. Nothing else.

Read More →

Comrades Marathon Medal Predictions

With only a few days to go until race day, most Comrades Marathon athletes will focusing on resting, getting enough sleep, hydrating, eating and giving a wide berth to anybody who looks even remotely ill.

They will probably also be thinking a lot about Sunday’s race. What will the weather be like? Will it be cold at the start? (Unlikely since it’s been so warm in Durban.) How will they feel on the day? Will they manage to find their seconds along the route?

Read More →

R Recipe: Aligning Axes in ggplot2

Faceted plots in ggplot2 are phenomenal. They give you a simple way to break down an array of plots according to the values of one or more categorical variables. But what if you want to stack plots of different variables? Not quite so simple. But certainly possible. I gathered together this solution from a variety of sources on Stack Overflow, notably this one and this other one. A similar issue for vertical alignment is addressed here.

Read More →

Comrades Marathon Finish Predictions

Predicting Comrades Marathon finishing times is challenging and people have tried a few different approaches. Lindsey Parry, for example, suggests that you use two and a half times your recent marathon time. Sports Digest provides a calculator which predicts finishing time using recent times over three distances. I understand that this calculator is based on the work of Norrie Williamson.

Read More →

The Price of Fuel: How Bad Could It Get?

The cost of fuel in South Africa (and I imagine pretty much everywhere else) is a contentious topic. It varies from month to month and, although it is clearly related to the price of crude oil and the exchange rate, various other forces play an influential role.

Read More →

Dealing with a Byte Order Mark (BOM)

I have just been trying to import some data into R. The data were exported from a SQL Server client in tab-separated value (TSV) format. However, reading the data into R the “usual” way produced unexpected results:

Read More →

R for Business Analytics

The book R for Business Analytics by Ajay Ohri sets out to look at “some of the most common tasks performed by business analysts and helps the user navigate the wealth of information in R and its 4000 packages.” In my opinion it succeeds in covering an extensive range of topics but fails to provide anything of substantial use to its intended audience. At least, not anything that could not be uncovered by a brief internet search.

Read More →

Simulating Intricate Branching Patterns with DLA

Manfred Schroeder’s book Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise is a fruitful source of interesting topics and projects. He gives a thorough description of Diffusion-Limited Aggregation (DLA) as a technique for simulating physical processes which produce intricate branching structures. Examples, as illustrated below, include Lichtenberg Figures, dielectric breakdown, electrodeposition and Hele-Shaw flow.

Read More →

Creating More Effective Graphs

A few years ago I ordered a copy of the 2005 edition of Creating More Effective Graphs by Naomi Robbins. Somewhat shamefully I admit that the book got buried beneath a deluge of papers and other books and never received the attention it was due. Having recently discovered the R Graph Catalog, which implements many of the plots from the book using ggplot2, I had to dig it out and give it some serious attention.

Read More →

Standard Bank: Striving for Mediocrity

Recently I was in my local Standard Bank branch. After finally reaching the front of the queue and being helped by a reasonably courteous young man, I was asked if I would mind filling out a survey. Sure. No problem. I had been in the bank for 30 minutes, I could probably afford another 30 seconds.

Read More →

Plotting Flows with {riverplot}

I have been looking for an intuitive way to plot flows or connections between states in a process. An obvious choice is a Sankey Plot, but I could not find a satisfactory implementation in R… until I read the post by January Weiner. His {riverplot} package does precisely what I am need.

Read More →

Comrades Marathon: A Race for Geriatrics?

It has been suggested that the average Comrades Marathon runner is gradually getting older. As an “average runner” myself, I will not deny that I am personally getting older. But, what I really mean is that the average age of all runners taking part in this great event is gradually increasing. This is not just an idle hypothesis: it is supported by the data. If you’re interested in the technical details of the analysis, these are included at the end, otherwise read on for the results.

Read More →

Where to Put EAs and Indicators in New MT4 Builds

If you are creating an EA or indicator from scratch, then the MetaTrader editor places the files in the correct location and the terminal is automatically able to find them. However, if the files originate from a third party then you will need to know where to insert them so that they show up in the terminal. For older builds of MetaTrader 4 the directory structure was fairly simple.

Read More →

Twins, Tripods and Phantoms at the Comrades Marathon

Having picked up a viral infection days before this year’s Comrades Marathon, on 1 June I was left with time on my hands and somewhat desperate for any distraction. I spent some time looking at my archive of Comrades data and considering some new questions. For example, what are the chances of two runners passing through halfway and the finish line at exactly the same time? How likely is it that three runners achieve the same feat?

Read More →

Concatenating a list of data frames

It’s something that I do surprisingly often: concatenating a list of data frames into a single (possibly quite enormous) data frame. Until now my naive solution worked pretty well. However, today I needed to deal with a list of over 6 million elements. The result was hours of page thrashing before my R session finally surrendered. I suppose I should be happy that my hard disk survived.

Read More →

Comrades Marathon Pacing Chart: Down Run

Although I have been thinking vaguely about my Plan A goal of a Bill Rowan medal at the Comrades Marathon this year, I have not really put a rigorous pacing plan in place. I know from previous experience that I am likely to be quite a bit slower towards the end of the race. I also know that I am going to lose a few minutes at the start. How fast does this mean I need to run in order to get from Pietermaritzburg to Durban in under 9 hours?

Read More →

Race Statistics for Comrades Novices: Corrigendum

There was some significant bias in the histogram from my previous post: the data from all years were lumped together. This is important because as of 2003 (when the Vic Clapham medal was introduced) the final cutoff for the Comrades Marathon was extended from 11:00 to 12:00. In 2000 they also applied an extended cutoff.

Read More →

Comrades Marathon: Negative Splits and Cheating

With this year’s Comrades Marathon just less than a month away, I was reminded of a story from earlier in the year. Mark Dowdeswell, a statistician at Wits University, found evidence of cheating by some middle and back of the pack Comrades runners. He identified a group of 20 athletes who had suspicious negative splits: they ran much faster in the second half of the race. There was one runner in particular whose splits were just too good to be true. When the story was publicised, this particular runner claimed that it was a conspiracy.

Read More →

Earthquakes: Magnitude / Depth Chart

I am working on a project related to secondary effects of earthquakes. To guide me in the analysis I need a chart showing the location, magnitude and depth of recent earthquakes. A host of such charts are already available, but since I had the required data on hand, it seemed like a good idea to take a stab at it myself.

Read More →