Andrew B. Collier / @datawookie

Updated Comrades Winners Chart

2013-12-14 running

On Friday I received my copy of The Official Results Brochure for the 2013 Comrades Marathon. Always makes for a diverting half an hour’s reading. And the tables at the front provide some very interesting statistics. Seemed like a good opportunity to update my Chart of Comrades Winners.

Contour and Density Layers with ggmap

2013-12-14 R

I am busy working on a project which uses data from the World Wide Lightning Location Network (WWLLN). Specifically, I am trying to reproduce some of the results from Orville, Richard E, Gary R. Huffines, John Nielsen-Gammon, Renyi Zhang, Brandon Ely, Scott Steiger, Stephen Phillips, Steve Allen, and William Read. 2001. “Enhancement of Cloud-to-Ground Lightning over Houston, Texas”. Geophysical Research Letters 28 (13): 2597–2600.

Amy Cuddy: Your body language shapes who you are

2013-11-27 TED Talk

Amy Cuddy gives a great talk. Provided me with lots to think about and I will happily confess that I have struck a few power poses (but only after ensuring that I am quite alone)!

Deriving a Priority Queue from a Plain Vanilla Queue

2013-11-26 R

Following up on my recent post about implementing a queue as a reference class, I am going to derive a Priority Queue class.
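As a rough sketch of what such a derivation might look like (not necessarily the code from the post), a Priority Queue can inherit from a Queue reference class and override push() so that items are inserted in priority order. The base class is abbreviated here, and the field and method names (items, priorities, push, pop, size) are assumptions made for illustration.

```r
library(methods)

# Abbreviated base class (hypothetical stand-in for the Queue from the earlier post).
Queue <- setRefClass(
  "Queue",
  fields = list(items = "list"),
  methods = list(
    push = function(x) { items[[length(items) + 1]] <<- x },
    pop = function() {
      front <- items[[1]]
      items[[1]] <<- NULL
      front
    },
    size = function() length(items)
  )
)

# Derived class: a parallel vector of priorities, kept in decreasing order.
PriorityQueue <- setRefClass(
  "PriorityQueue",
  contains = "Queue",
  fields = list(priorities = "numeric"),
  methods = list(
    push = function(x, priority) {
      # Insert after every item whose priority is at least as high (stable for ties).
      pos <- sum(priorities >= priority)
      items <<- append(items, list(x), after = pos)
      priorities <<- append(priorities, priority, after = pos)
    },
    pop = function() {
      # Drop the leading priority, then let the base class remove the front item.
      priorities <<- priorities[-1]
      callSuper()
    }
  )
)

pq <- PriorityQueue$new()
pq$push("low", 1)
pq$push("high", 5)
pq$push("mid", 3)
first <- pq$pop()
```

Keeping the items sorted on insertion means that pop() stays as simple as it is in the base class; the cost is a linear scan on every push().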

Implementing a Queue as a Reference Class

2013-11-24 R

I am working on a simulation for an Automatic Repeat-reQuest (ARQ) algorithm. After trying various options, I concluded that I would need an implementation of a queue to make this problem tractable. R does not have a native queue data structure, so this seemed like a good opportunity to implement one and learn something about Reference Classes in the process.

The Implementation

We use setRefClass() to create a generator function which will create objects of the Queue class.
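A minimal sketch of such a generator is given below. The field name (items) and the method names (push, pop, size, empty) are assumptions chosen for illustration and may differ from those in the full post.

```r
library(methods)

# Hypothetical Queue generator: items are stored in a list, front first.
Queue <- setRefClass(
  "Queue",
  fields = list(items = "list"),
  methods = list(
    push = function(x) {
      # Append to the back of the queue.
      items[[length(items) + 1]] <<- x
    },
    pop = function() {
      # Remove and return the item at the front of the queue.
      if (length(items) == 0) stop("queue is empty")
      front <- items[[1]]
      items[[1]] <<- NULL
      front
    },
    size = function() length(items),
    empty = function() length(items) == 0
  )
)

q <- Queue$new()
q$push("a")
q$push("b")
first <- q$pop()
```

Because reference class objects are mutable, q is modified in place by push() and pop(), which is what makes this structure convenient inside a simulation loop.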

Iterators in R

2013-11-14 R

According to Wikipedia, an iterator is “an object that enables a programmer to traverse a container”. A collection of items (stashed in a container) can be thought of as being “iterable” if there is a logical progression from one element to the next (so a list is iterable, while a set is not). An iterator is then an object for moving through the container, one item at a time.

Iterators are a fundamental part of contemporary Python programming, where they form the basis for loops, list comprehensions and generator expressions. Iterators facilitate navigation of container classes in the C++ Standard Template Library (STL). They are also to be found in C#, Java, Scala, Matlab, PHP and Ruby.
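To make the idea concrete in R, here is a small closure-based sketch. It does not use any iterator package; the names make_iterator, has_next and next_value are hypothetical.

```r
# Hypothetical closure-based iterator over a vector or list.
make_iterator <- function(container) {
  i <- 0
  list(
    # Is there at least one item left to visit?
    has_next = function() i < length(container),
    # Advance by one position and return the item found there.
    next_value = function() {
      if (i >= length(container)) stop("iterator exhausted")
      i <<- i + 1
      container[[i]]
    }
  )
}

it <- make_iterator(c(2, 3, 5))
total <- 0
while (it$has_next()) {
  total <- total + it$next_value()
}
```

The position i lives in the enclosing environment of the two functions, so the iterator remembers where it is between calls without exposing that state to the caller.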

Introduction to Fractals

2013-11-04 R

A short while ago I was contracted to write a short piece entitled “Introduction to Fractals”. Admittedly it is hard to do justice to the topic in less than 1000 words.

Percolation Threshold: Including Next-Nearest Neighbours

2013-11-01 R

Percolation through a larger lattice at the percolation threshold.

In my previous post about estimating the Percolation Threshold on a square lattice, I only considered flow from a given cell to its four nearest neighbours. It is a relatively simple matter to extend the recursive flow algorithm to include other configurations as well.

Malarz and Galam (2005) considered the problem of percolation on a square lattice for various ranges of neighbour links. Below is their illustration of (a) nearest neighbour “NN” and (b) next-nearest neighbour “NNN” links.
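The effect of the neighbourhood can be sketched with a small flood fill. This version uses an explicit stack rather than recursion, and it assumes that the NNN links are the four diagonal cells; the function name percolates and the offset matrices are hypothetical.

```r
# Neighbour offsets (row, column): four nearest neighbours and the four
# diagonal next-nearest neighbours.
NN  <- rbind(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))
NNN <- rbind(c(-1, -1), c(-1, 1), c(1, -1), c(1, 1))

# Does an open path connect the top row to the bottom row?
# `lattice` is a logical matrix in which TRUE marks an open cell.
percolates <- function(lattice, offsets = NN) {
  n <- nrow(lattice)
  m <- ncol(lattice)
  visited <- matrix(FALSE, n, m)
  # Seed the flow with every open cell in the top row.
  stack <- lapply(which(lattice[1, ]), function(j) c(1, j))
  while (length(stack) > 0) {
    cell <- stack[[length(stack)]]
    stack[[length(stack)]] <- NULL
    if (visited[cell[1], cell[2]]) next
    visited[cell[1], cell[2]] <- TRUE
    if (cell[1] == n) return(TRUE)
    for (k in seq_len(nrow(offsets))) {
      rr <- cell[1] + offsets[k, 1]
      cc <- cell[2] + offsets[k, 2]
      if (rr >= 1 && rr <= n && cc >= 1 && cc <= m &&
          lattice[rr, cc] && !visited[rr, cc]) {
        stack[[length(stack) + 1]] <- c(rr, cc)
      }
    }
  }
  FALSE
}

# Open cells only on the diagonal: blocked for NN links, open once NNN links are added.
lattice <- diag(3) == 1
nn.only <- percolates(lattice, NN)
with.nnn <- percolates(lattice, rbind(NN, NNN))
```

A lattice with open cells only on its diagonal is the simplest case in which the two neighbourhoods disagree.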

Percolation Threshold on a Square Lattice

2013-10-30 R

Manfred Schroeder touches on the topic of percolation a number of times in his encyclopaedic book on fractals (Schroeder, M. (1991) Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise). Percolation has numerous practical applications, the most interesting of which (from my perspective) is the flow of hot water through ground coffee!

Plotting Times of Discrete Events

2013-10-19 R

I recently enjoyed reading O’Hara, R. B., & Kotze, D. J. (2010). Do not log-transform count data. Methods in Ecology and Evolution, 1(2), 118–122. doi:10.1111/j.2041-210X.2010.00021.x.

Mounting a sshfs volume via the crontab

2013-10-06 Linux

I need to mount a directory from my laptop on my desktop machine using sshfs. At first I was not making the mount terribly regularly, so I did it manually each time that I needed it. However, the frequency increased over time and I was eventually mounting it every day (or multiple times during the course of a day!). This was a perfect opportunity to employ some automation.

Top 250 Movies at IMDb

2013-10-03 R web scraping

Some years ago I allowed myself to accept a challenge to read the Top 100 Novels of All Time (complete list here). This list was put together by Richard Lacayo and Lev Grossman at Time Magazine.

To start with I could tick off a number of books that I had already read. That left me with around 75 books outstanding. I knuckled down. The Lord of the Rings had been on my reading list for a number of years, so this was my first project. A little unfair for this trilogy to count as just one book… but I consumed it with gusto! One down. Other books followed. They were also great reads. And then I hit a couple of books that were just, well, to put it plainly, heavy going. I am sure that they were great books and my lack of enjoyment was entirely a reflection on me and not the quality of the books. No doubt I learned a lot from reading them. But it was hard work! At this stage it occurred to me that the book list was constructed from a rather specific perspective of what constituted a great book. A perspective which is quite different from my own. I had to admit defeat: my literary tastes will have to mature a bit before I attack this list again!

Flushing Live MetaTrader Logs to Disk

2013-09-18

The logs generated by expert advisors and indicators when running live on MetaTrader are displayed in the Experts tab at the bottom of the terminal window. Sometimes it is more convenient to analyse these logs offline (especially since the order of the records in the terminal runs in a rather counter-intuitive bottom-to-top order!). However, because writing to the log files is buffered, there can be a delay before what you see in the terminal is actually written to disk.

Clustering the Words of William Shakespeare

2013-09-10 R

In my previous post I used the tm package to do some simple text mining on the Complete Works of William Shakespeare. Today I am taking some of those results and using them to generate word clusters.

MetaTrader Time Zones

2013-09-09

Time zones on MetaTrader can be slightly confusing. Two specific time zones are important:

the time zone of the broker’s server and
your local time zone.

And these need not be the same.

Text Mining the Complete Works of William Shakespeare

2013-09-05 R

I am starting a new project that will require some serious text mining. In the interests of bringing myself up to speed on the {tm} package, I thought I would apply it to the Complete Works of William Shakespeare and just see what falls out.

The first order of business was getting my hands on all that text. Fortunately it is available from a number of sources. I chose to use Project Gutenberg.

Presenting Conformance Statistics

2013-08-27 R

A client came to me with some conformance data. She was having a hard time making sense of it in a spreadsheet. I had a look at a couple of ways of presenting it that would bring out the important points.

The Data

The data came as a spreadsheet with multiple sheets. Each of the sheets had a slightly different format, so the easiest thing to do was to save each one as a CSV file and then import them individually into R.

The Wonders of {foreach}

2013-08-25 R {foreach}

Writing code from scratch to do parallel computations can be rather tricky. However, the packages providing parallel facilities in R make it remarkably easy. One such package is foreach. I am going to document my trail of discovery with foreach, which began some time ago, but has really come to fruition over the last few weeks.

First we need a reproducible example. Preferably something which is numerically intensive.

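max.eig <- function(N, sigma) {
  # N x N matrix of normally distributed random numbers
  d <- matrix(rnorm(N^2, sd = sigma), nrow = N)
  # eigenvalues (possibly complex), returned in decreasing order of modulus
  E <- eigen(d)$values
  # modulus of the leading eigenvalue
  abs(E)[[1]]
}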

This function generates a square matrix of normally distributed random numbers, finds the corresponding (complex) eigenvalues and then selects the eigenvalue with the largest modulus. The dimensions of the matrix and the standard deviation of the random numbers are given as input parameters.
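For reference, a sequential run over a few matrix sizes might look like the sketch below. It uses only base R (vapply), so it is the kind of baseline that a parallel loop would later be compared against; the sizes and the seed are arbitrary choices.

```r
max.eig <- function(N, sigma) {
  d <- matrix(rnorm(N^2, sd = sigma), nrow = N)
  E <- eigen(d)$values
  abs(E)[[1]]
}

set.seed(1)
sizes <- c(5, 10, 20)
# One call per matrix size; sigma is passed through to max.eig().
largest <- vapply(sizes, max.eig, numeric(1), sigma = 1)
```

Each call is independent of the others, which is what makes this problem a natural candidate for parallel execution.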

Fitting a Model by Maximum Likelihood

2013-08-18 R

Maximum-Likelihood Estimation (MLE) is a statistical technique for estimating model parameters. It basically sets out to answer the question: what model parameters are most likely to characterise a given set of data? First you need to select a model for the data. And the model must have one or more (unknown) parameters. As the name implies, MLE proceeds to maximise a likelihood function, which in turn maximises the agreement between the model and the data.
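As a minimal illustration of the idea, and not the model fitted in the post, the sketch below estimates the mean and standard deviation of a normal distribution from synthetic data by minimising the negative log-likelihood with optim(). The data, the starting values and the name nll are all assumptions.

```r
set.seed(42)
x <- rnorm(500, mean = 3, sd = 2)   # synthetic data, for illustration only

# Negative log-likelihood for a normal model. The standard deviation is
# parameterised on the log scale so that the optimiser cannot make it negative.
nll <- function(par, data) {
  -sum(dnorm(data, mean = par[1], sd = exp(par[2]), log = TRUE))
}

fit <- optim(c(0, 0), nll, data = x)
mu.hat <- fit$par[1]
sigma.hat <- exp(fit$par[2])
```

Minimising the negative log-likelihood is equivalent to maximising the likelihood; working with logarithms avoids numerical underflow when many small densities are multiplied together.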

Correlations with Uncertainty: Classical Solution

2013-08-13 R

Following up on my previous post as a result of an excellent suggestion from Andrej Spiess. The data are indeed very heteroscedastic! Andrej suggested that an alternative way to attack this problem would be to use weighted correlation with weights being the inverse of the measurement variance.
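A weighted correlation of this kind can be sketched with cov.wt() from the stats package. The data and the uncertainties below are synthetic, and the weights are taken from the uncertainty on y only, which is a simplifying assumption.

```r
set.seed(1)
x <- 1:20
sigma <- seq(0.5, 5, length.out = 20)   # hypothetical measurement uncertainties
y <- 2 * x + rnorm(20, sd = sigma)

# Weights are the inverse of the measurement variance.
w <- 1 / sigma^2
weighted <- cov.wt(cbind(x, y), wt = w, cor = TRUE)$cor["x", "y"]
unweighted <- cor(x, y)
```

Points with large uncertainties contribute little to the weighted estimate, so it is dominated by the well-measured observations.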

Correlations with Uncertainty: Bootstrap Solution

2013-08-11 R

A week or so ago a colleague of mine asked if I knew how to calculate correlations for data with uncertainties. Now, if we are going to be honest, then all data should have some level of experimental or measurement error. However, I suspect that in the majority of cases these uncertainties are ignored when considering correlations. To what degree are uncertainties important? A moment’s thought would suggest that if the uncertainties are large enough then they should have a rather significant effect on correlation, or more properly, the uncertainty measure associated with the correlation. What is the best (or at least correct) way to proceed? Somewhat surprisingly a quick Google search did not turn up anything too helpful.
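One plausible way to attack the question numerically, sketched here under stated assumptions rather than as the method of the post, is to resample the observations with replacement and to perturb each resampled value according to its uncertainty. The data, the uncertainties (sx, sy) and the number of replicates are made up.

```r
set.seed(1)
n <- 30
x <- rnorm(n)
y <- 0.8 * x + rnorm(n, sd = 0.5)
sx <- rep(0.3, n)   # hypothetical uncertainties on x
sy <- rep(0.3, n)   # hypothetical uncertainties on y

boot.cor <- replicate(2000, {
  # Resample observations, then jitter each one within its uncertainty.
  idx <- sample(n, replace = TRUE)
  cor(rnorm(n, x[idx], sx[idx]), rnorm(n, y[idx], sy[idx]))
})

# Median and 95% interval of the resulting correlations.
quantile(boot.cor, c(0.025, 0.5, 0.975))
```

The spread of boot.cor then reflects both the sampling variability and the measurement uncertainty.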

Finding Your MetaTrader Log Files

2013-08-08

Debugging an indicator or expert advisor (EA) can be a tricky business. Especially when you are doing the debugging remotely. I write my MQL code to include copious amounts of debugging information to log files. The contents of these log files can be used to diagnose any problems. This article tells you where you can find those files.

A Chart of Recent Comrades Marathon Winners

2013-07-30 R running

Continuing on my quest to document the Comrades Marathon results, today I have put together a chart showing the winners of both the men’s and ladies’ races since 1980. Click on the image below to see a larger version.

Comrades Marathon Inference Trees

2013-07-19 R running

Following up on my previous posts regarding the results of the Comrades Marathon, I was planning on putting together a set of models which would predict likelihood to finish and probable finishing time. Along the way I got distracted by something else that is just as interesting and which produces results which readily yield to qualitative interpretation: Conditional Inference Trees as implemented in the R package party.

Just to recall what the data look like:

Optimising a Noisy Objective Function

2013-07-16 R

I am busy with a project where I need to calibrate the Heston Model to some Asian options data. The model has been implemented as a function which executes a Monte Carlo (MC) simulation. As a result, the objective function is rather noisy. A number of algorithms exist for this sort of problem, and here I simply give a brief overview of some of them.

Compiling Indicators and Expert Advisors

2013-06-25 MetaTrader

When you receive the code for an expert advisor or indicator which we have developed for you, it will come in a package consisting of include files (with a .mqh extension) and source code files (with a .mq4 extension). What do you do with them?

Are Green Number Runners More Likely to Bail?

2013-06-22 R running

Comrades Marathon runners are awarded a permanent green race number once they have completed 10 journeys between Durban and Pietermaritzburg. For many runners, once they have completed the race a few times, achieving a green number becomes a possibility. And once the idea takes hold, it can become something of a compulsion. I can testify to this: I am thoroughly compelled! For runners with this goal in mind, every finish is one step closer to a green number. They are slowly chipping away, year after year and the idea of bailing is anathema. However, once the green number is in the bag, does the imperative to complete the race fade?

The Green Number Effect

2013-06-18 R running

Following up on a suggestion from my previous post, here are the statistics for medal count versus age. Every point on the plot is the number (see colour legend on right) of athletes who have achieved a given number of medals by a particular age.

Age Distribution of Comrades Marathon Athletes

2013-06-18 R running

I can clearly remember watching the end of the 1989 Comrades Marathon on television and seeing Wally Hayward coming in just before the final gun, completing the epic race at the age of 80! I was in awe.

Since I have been delving into the Comrades Marathon data (looking at attrition rates and medal allocations), this got me thinking about the typical age distribution of athletes taking part. The plot below indicates the ages of athletes who finished the race, going all the way back to 1984. You can clearly spot the two years when Wally Hayward ran (1988 and 1989). My data indicate that he was only 79 on the day of the 1989 Comrades Marathon, but I am not going to quibble over a year and I am more than happy to accept that he was 80!

Medal Allocations at the Comrades Marathon

2013-06-09 R running


Comrades Marathon Attrition Rate

2013-06-07 R

It is a bit of a mission to get the complete data set for this year’s Comrades Marathon. The full results are easily accessible, but come as an HTML file. Embedded in this file are links to the splits for individual athletes. With a bit of scripting wizardry it is also possible to download the HTML files for each of the individual athletes. Parsing all of these yields the complete result set, which is the starting point for this analysis.

Analysis of Cable Morning Trade Strategy

2013-05-29 R

A couple of years ago I implemented an automated trading algorithm for a strategy called the “Cable Morning Trade”. The basis of the strategy is the range of GBPUSD during the interval 05:00 to 09:00 London time. Two buy stop orders are placed 5 points above the highest high for this period; two sell stop orders are placed 5 points below the lowest low. All orders have a protective stop at 40 points. When either the buy or sell orders are filled, the other orders are cancelled. Of the filled orders, one exits at a profit equal to the stop loss, while the other is left to run until the close of the London session.
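The order levels described above can be written down in a few lines. This is only a sketch: the helper cable.levels is hypothetical, the prices are invented, and a point is assumed to be 0.0001 for GBPUSD.

```r
# Hypothetical helper: order levels from the highs and lows of the 05:00 to 09:00 bars.
cable.levels <- function(high, low, point = 0.0001, offset = 5, stop.loss = 40) {
  buy  <- max(high) + offset * point   # buy stop above the highest high
  sell <- min(low)  - offset * point   # sell stop below the lowest low
  list(
    buy.stop    = buy,
    buy.sl      = buy - stop.loss * point,
    buy.target  = buy + stop.loss * point,    # profit equal to the stop loss
    sell.stop   = sell,
    sell.sl     = sell + stop.loss * point,
    sell.target = sell - stop.loss * point
  )
}

orders <- cable.levels(high = c(1.5510, 1.5525, 1.5518),
                       low  = c(1.5490, 1.5502, 1.5497))
```

The target applies only to the first order of each pair; the second has no target and is left to run until the close of the London session.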

Balanced Data with {MatchIt}

2013-05-23 R

A balanced experimental design is one in which the distribution of the covariates is the same in both the control and treatment groups. However, although often achievable in an experiment, for observational data this ideal is seldom achieved.

xkcd Style Bubble Plot

2013-05-23 R

A package was recently released to generate plots in the style of xkcd using R. Being a big fan of the cartoon, I could not resist trying it out. I set out to produce something like one of Hans Rosling’s bubble plots.

Swing Alert Indicator

2013-05-23

I’ve just finished coding a swing alert indicator for a client. The rules are rather straightforward and it all depends on two simple moving averages (by default with periods of 25 and 5).

Package {party}: Conditional Inference Trees

2013-05-21 R {party}

I am going to be using the party package for one of my projects, so I spent some time today familiarising myself with it. The details of the package are described in Hothorn, T., Hornik, K., & Zeileis, A. (1999). “party: A Laboratory for Recursive Partytioning” which is available from CRAN.

Plotting categorical variables

2013-05-20 R

In the previous installment we generated a few plots using numerical data straight out of the National Health and Nutrition Examination Survey. This time we are going to incorporate some of the categorical variables into the plots. Although going from raw numerical data to categorical data bins (like we did for age and BMI) does give you less precision, it can make drawing conclusions from plots a lot easier.

We will start off with a simple plot of two numerical variables: age against BMI.

Plotting numerical variables

2013-05-18 R

In the previous installment we generated some simple descriptive statistics for the National Health and Nutrition Examination Survey data. Now we are going to move on to an area in which R really excels: making plots and visualisations.

Descriptive Statistics

2013-05-18 R

In the previous installment we derived two categorical variables. This time we will extract descriptive statistics.

Categorical Variables

2013-05-12 R

In the previous installment we sucked some data from the National Health and Nutrition Examination Survey into R and did some preliminary work. Now we are going to play with some categorical data.
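Deriving a categorical variable from a numerical one is typically done with cut(). The sketch below uses toy values rather than real survey records, and the break points and labels are assumptions (the BMI breaks follow the commonly quoted 18.5, 25 and 30 thresholds).

```r
# Toy values standing in for the survey data (not real NHANES records).
people <- data.frame(age = c(8, 17, 34, 52, 71),
                     bmi = c(16.2, 21.5, 27.8, 31.4, 24.0))

# right = FALSE makes each interval closed on the left, e.g. [18, 40).
people$age.category <- cut(people$age, breaks = c(0, 18, 40, 65, Inf),
                           labels = c("child", "young adult", "adult", "senior"),
                           right = FALSE)
people$bmi.category <- cut(people$bmi, breaks = c(0, 18.5, 25, 30, Inf),
                           labels = c("underweight", "normal", "overweight", "obese"),
                           right = FALSE)

table(people$bmi.category)
```

The result is a factor, so functions such as table() and the plotting routines treat the bins as categories rather than as numbers.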

Loading Data from a Tab Delimited File

2013-05-12 R

I have just started preparing a series of talks aimed at introducing the use of R to a rather broad audience consisting of physicists, chemists, statisticians, biologists and computer scientists (plus a few other disciplines thrown in for good measure). I want to use a single consistent set of data throughout the talks. Finding something that would resonate with such a disparate set of people was quite a challenge. After playing around with a couple of options, I settled on using data for age, height and mass. These are things that we can all identify with. The next challenge was to actually find a suitable data set, which was surprisingly difficult. Eventually I stumbled upon the data from the National Health and Nutrition Examination Survey (NHANES). The data from the survey are available here. These data have been divided into a number of sets, each of which has been excellently curated and has a detailed codebook.
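Reading a tab-delimited file into R usually comes down to a single call to read.delim(). The sketch below first writes a tiny file so that it is self-contained; the column names and values are invented and are not taken from NHANES.

```r
# Write a small tab-delimited file so that the example is self-contained.
path <- tempfile(fileext = ".txt")
writeLines(c("id\tage\theight\tmass",
             "1\t34\t172.5\t68.2",
             "2\t51\t160.1\t74.9",
             "3\t27\t181.0\tNA"), path)

# read.delim() assumes a header row and a tab separator by default.
survey <- read.delim(path, header = TRUE, stringsAsFactors = FALSE)
str(survey)
```

Missing values written as NA in the file are read in as NA, so they are handled consistently by later summaries.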

Support & Resistance Indicator

2013-05-06

I was recently browsing through the variety of MetaTrader indicators for support and resistance levels. None of them ticked all of my boxes. Either they were not aesthetically pleasing (making a mess of my pristine charts) or they failed to produce what I consider to be reasonable levels. Embracing my pioneering spirit, I set out to fashion my own indicator, one which will ultimately tick all of my boxes.

Sample output from v1.2 of my support-resistance indicator is shown below for the weekly chart of GBPUSD.

Locations of Geosynchronous Satellites

2013-04-16 R

A year or so ago I went to a talk which included a diagram showing the locations of Earth’s fleet of geosynchronous satellites. According to the speaker, the information in this diagram was already quite dated: the satellites and their locations had changed.
