Andrew B. Collier / @datawookie

Social links and a link to my CV.

Public datasets:

British Canoeing Results

Web Scraping and “invalid multibyte string”

2016-08-02 R web scraping

A couple of my collaborators have had trouble using read_html() from the xml2 package to access this Wikipedia page.

Sportsbook Betting (Part 1): Odds

2016-08-01 R gambling

This series of articles was written as support material for Statistics exercises in a course that I’m teaching for iXperience. In the series I’ll be using illustrative examples for wagering on a variety of Sportsbook events including Horse Racing, Rugby and Tennis. The same principles can be applied across essentially all betting markets.

Arthur Benjamin: Teach statistics before calculus!

2016-07-29 TED Talk teaching

Arthur Benjamin thinks that the end goal of teaching Mathematics at school should be Statistics rather than Calculus. He has a point: in terms of understanding things in the real world, Statistics is definitely more powerful. These ideas are quite compatible with those of Conrad Wolfram, who thinks that we should be using computers more extensively in Mathematics education.

Building a Life Table

2016-07-28 R

Read More →

Calculating Pi using Buffon’s Needle

2016-07-26 R

Read More →

Conrad Wolfram: Teaching kids real math with computers

2016-07-25 TED Talk teaching

Conrad Wolfram gives a thought provoking talk on a different way to teach Mathematics in schools.

Mortality by Year and Age

2016-07-22 R

Taking another look at the data from the lifespan package. Plot below shows the evolution of mortality in the US as a function of year and age.

Life Expectancy by Country

2016-07-20 R

I was rather inspired by this plot on Wikipedia’s List of Countries by Life Expectancy.

Mortality Rate by Age

2016-07-19 R

Working further with the mortality data from https://www.cdc.gov/, I’ve added a breakdown of deaths by age and gender to the lifespan package on GitHub.

Escalating Life Expectancy

2016-07-18 R

I’ve added mortality data to the lifespan package. A result that immediately emerges from these data is that average life expectancy is steadily climbing.

Birth Month by Gender

2016-07-16 R

Based on some feedback to a previous post I normalised the birth counts by the (average) number of days in each month. As pointed out by a reader, the results indicate a gradual increase in the number of conceptions during (northern hemisphere) Autumn and Winter, roughly up to the end of December. Normalising the data to give births per day also shifts the peak from August to September.

Most Probable Birth Month

2016-07-15 R

In a previous post I showed that the data from www.baseball-reference.com support Malcolm Gladwell’s contention that more professional baseball players are born in August than any other month. Although this might be explained by the 31 July cutoff for admission to baseball leagues, it was suggested that it could also be linked to a larger proportion of babies being born in August.

Streaming from zip to bz2

2016-07-08 Linux

I’ve got a massive bunch of zip archives, each of which contains only a single file. And the name of the enclosed file varies. Dealing with these data is painful.

It’d be a lot more convenient if the files were compressed with gzip or bzip2 and had a consistent naming convention. How would you go about making that conversion without actually unpacking the zip archive, finding the name of the enclosed file and then recompressing? Enter funzip.

Major League Baseball Birth Months

2016-07-05 R

The cutoff date for almost all nonschool baseball leagues in the United States is July 31, with the result that more major league players are born in August than in any other month. Malcolm Gladwell, Outliers

A quick analysis to confirm Gladwell’s assertion above. Used data scraped from www.baseball-reference.com.

satRday in Cape Town

2016-05-26 R

We are planning to host one of the three inaugural satRday conferences in Cape Town during 2017. The [R Consortium](https://www.r-consortium.org/) has committed to funding three of these events: one will be in Hungary, another will be somewhere in the USA and the third will be at an international destination. At present Cape Town is dicing it out with Monterrey (Mexico) for the third location.

R Saturday [satRday] in Cape Town

2016-05-12 R

Read More →

The Next Rembrandt

2016-04-06 Machine Learning

Creating The Next Rembrandt: using data to touch the human soul. How a team from ING, Microsoft, TU Delft, Mauritshuis and Rembrandthuis used technology to synthesise a painting in the style of the Dutch master, Rembrandt, almost 350 years after his death.

International Open Data Day

2016-03-05 R

As part of International Open Data Day we spent the morning with a bunch of like minded people poring over some open Census South Africa data. Excellent initiative, @opendatadurban, I’m very excited to see where this is all going and look forward to contributing to the journey!

R, HDF5 Data and Lightning

2016-02-23 R

I used to spend an inordinate amount of time digging through lightning data. These data came from a number of sources, the World Wide Lightning Location Network (WWLLN) and LIS/OTD being the most common. I recently needed to work with some Hierarchical Data Format (HDF) data. HDF is something of a niche format and, since that was the format used for the LIS/OTD data, I went to review those old scripts. It was very pleasant rediscovering work I did some time ago.

Ira Glass on the Creative Process

2016-02-12

Read More →

Automating R scripts under Windows

2016-02-11 R

Setting up an automated job under Linux is a cinch thanks to cron. Doing the same under Windows is a little more tricky, but still eminently doable.

flipsideR: Support for ASX Option Chain Data

2016-02-08 R

I previously wrote about some R code for downloading Option Chain data from Google Finance. I finally wrapped it up into a package called flipsideR, which is now available via GitHub. Since I last wrote on this topic I’ve also added support for downloading option data from the Australian Securities Exchange (ASX).

Kaggle: Santa’s Stolen Sleigh

2016-01-22 R

This morning I read Wendy Kan’s interesting post on Creating Santa’s Stolen Sleigh. I hadn’t really thought too much about the process of constructing an optimisation competition, but Wendy gave some interesting insights on the considerations involved in designing a competition which was both fun and challenging but still computationally feasible without military grade hardware.

This seems like an opportune time to jot down some of my personal notes and also take a look at the results. I know that this sort of discussion is normally the prerogative of the winners and I trust that my ramblings won’t be viewed as presumptuous.

Lawyers and Politics

2016-01-20 visualisation

Read More →

Casting a Wide (and Sparse) Matrix in R

2016-01-19 R

I routinely use melt() and cast() from the reshape2 package as part of my data munging workflow. Recently I’ve noticed that the data frames I’ve been casting are often extremely sparse. Stashing these in a dense data structure just feels wasteful. And the dismal drone of page thrashing is unpleasant.

I had a look around for an alternative. As it turns out, it’s remarkably easy to cast a sparse matrix using sparseMatrix() from the Matrix package. Here’s an example.

Kaggle: Walmart Trip Type Classification

2016-01-15 R

Walmart Trip Type Classification was my first real foray into the world of Kaggle and I’m hooked. I previously dabbled in What’s Cooking but that was as part of a team and the team didn’t work out particularly well. As a learning experience the competition was second to none. My final entry put me at position 155 out of 1061 entries which, although not a stellar performance by any means, is just inside the top 15% and I’m pretty happy with that. Below are a few notes on the competition.

MongoDB: Installing on Windows 7

2016-01-13 MongoDB

It’s not my personal choice, but I have to spend a lot of my time working under Windows. Installing MongoDB under Ubuntu is a snap. Getting it going under Windows seems to require jumping through a few more hoops. Here are my notes. I hope that somebody will find them useful.

Mastering Python Scientific Computing

2016-01-11 Python book review

Read More →

Review: Learning Shiny

2016-01-05 R Shiny

I was asked to review Learning Shiny (Hernán G. Resnizky, Packt Publishing, 2015). I found the book to be useful, motivating and generally easy to read. I’d already spent some time dabbling with Shiny, but the book helped me graduate from paddling in the shallows to wading out into the Shiny sea.

Using Checksum to Guess Message Length: Not a Good Idea!

2015-12-22 R

A question posed by one of my colleagues: can a checksum be used to guess message length? My immediate response was negative and, as it turns out, a simple simulation supported this knee-jerk reaction.

Making Sense of Logarithmic Loss

2015-12-14 R

Logarithmic Loss, or simply Log Loss, is a classification loss function often used as an evaluation metric in Kaggle competitions. Since success in these competitions hinges on effectively minimising the Log Loss, it makes sense to have some understanding of how this metric is calculated and how it should be interpreted.

Log Loss quantifies the accuracy of a classifier by penalising false classifications. Minimising the Log Loss is basically equivalent to maximising the accuracy of the classifier, but there is a subtle twist which we’ll get to in a moment.

Installing XGBoost on Ubuntu

2015-12-09 R Python

Read More →

2015 Data Science Salary Survey

2015-12-04 R

The recently published 2015 Data Science Salary Survey conducted by O’Reilly takes a look at the salaries received, tools used and other interesting facts about Data Scientists around the World. Download the report as PDF. It’s based on a survey of over 600 respondents from a variety of industries. The entire report is well worth a read, but I’ve picked out some highlights below.

The majority (67%) of the respondents in the survey were from the United States. They also demanded the highest median salaries across the globe. At the other end of the spectrum (and relevant to me personally), only 1% of the respondents were from Africa. These represented only one country: South Africa. The lowest salaries overall were recorded in Africa, while the lowest median salaries were found in Latin America.

Evolution of First Names: Unisex Names and Nicknames

2015-11-23

Read More →

Evolution of First Names: Fashionable and Popular Names

2015-11-16

Last week I took a high level look at the trends in children’s names over the last century. Today I’ll dig a little deeper and examine the ebb and flow in popularity of some specific names.

Graph from Sparse Adjacency Matrix

2015-11-12 R

I spent a decent chunk of my morning trying to figure out how to construct a sparse adjacency matrix for use with graph.adjacency(). I’d have thought that this would be rather straight forward, but I tripped over a few subtle issues with the Matrix package. My biggest problem (which in retrospect seems rather trivial) was that elements in my adjacency matrix were occupied by the pipe symbol.

First Names Evolution: Changes over Last Century

2015-11-09

In light of recent developments, a bit of work that I did almost two years ago has become rather relevant.

LIBOR and Bond Yields

2015-11-06 R

I’ve just been looking at the historical relationship between the London Interbank Offered Rate (LIBOR) and government bond yields. LIBOR data can be found at Quandl and comes in CSV format, so it’s pretty simple to digest. The bond data can be sourced from the US Department of the Treasury. It comes as XML and requires a little more work.

treasury.xml = xmlParse('data/treasury-yield.xml')
xml.field = function(name) {
  xpathSApply(xmlRoot(treasury.xml), paste0('//ns:entry/ns:content//d:', name),
              function(x) {xmlValue(x)},
              namespaces = c(ns = 'https://www.w3.org/2005/Atom',
                             d = 'http://schemas.microsoft.com/ado/2007/08/dataservices'))
}
bonds = data.frame(
  date = strptime(xml.field('NEW_DATE'), format = '%Y-%m-%dT%H:%M:%S', tz = 'GMT'),
  yield_1m = as.numeric(xml.field('BC_1MONTH')),
  yield_6m = as.numeric(xml.field('BC_6MONTH')),
  yield_1y = as.numeric(xml.field('BC_1YEAR')),
  yield_5y = as.numeric(xml.field('BC_5YEAR')),
  yield_10y = as.numeric(xml.field('BC_10YEAR'))
)

Once I had a data frame for each time series, the next step was to convert them each to xts objects. With the data in xts format it was a simple matter to enforce temporal overlap and merge the data into a single time series object. The final step in the analysis was to calculate the linear coefficient, or beta, for a least squares fit of LIBOR on bond yield. This was to be done with both a 1 month and a 1 year moving window. Both of these could be achieved quite easily using rollapply() from the zoo package.

Guy Kawasaki on Personal Branding

2015-11-03

Kelsey Jones of Search Engine Journal interviews Guy Kawasaki of Canva. The key take-home message is that maintaining a personal brand is vital even if you are permanently employed. Specifically, it’s important to keep a visible record of who you have worked for and your personal successes.

I'm living proof. I did one thing right for Apple thirty years ago. I've been coasting ever since. Just need to do one thing really right. Guy Kawasaki

The quote above is, of course, tongue in cheek, but it bears a nugget of truth: showcase your achievements on LinkedIn and other social media because they all contribute to your personal brand.

Day 38: Imaging

2015-10-30 Julia Month of Julia

Read More →

Day 37: Fourier Techniques

2015-10-26 Julia FTT Month of Julia

Read More →

Day 36: Markdown

2015-10-19 Julia Month of Julia

Markdown is a lightweight format specification language developed by John Gruber. Markdown can be converted to HTML, LaTeX or other document formats. You probably knew all that already. The syntax is pretty simple. Check out this useful cheatsheet.

Beautiful Data

2015-10-15 Python R book review

I’ve just finished reading Beautiful Data (published by O’Reilly in 2009), a collection of essays edited by Toby Segaran and Jeff Hammerbacher. The 20 essays from 39 contributors address a diverse array of topics relating to data and how it’s collected, analysed and interpreted.

Day 35: Mapping

2015-10-15 Julia Month of Julia

Read More →

Day 34: Networking

2015-10-13 Julia Month of Julia

Read More →

Installing LightTable and Juno on Ubuntu

2015-10-12 Julia

The recipe below works for Light Table v. 0.7.2 and Julia v. 0.4.0. It might work for other versions too, but these are the ones I can vouch for.

Day 33: Evolutionary Algorithms

2015-10-12 Julia Month of Julia

Read More →

Day 32: Classification

2015-10-09 Julia Month of Julia

Read More →

Day 31: Regression

2015-10-08 Julia Month of Julia

Read More →

Day 30: Clustering

2015-10-07 Julia Month of Julia

Read More →

1
2
3
8
9
10
12