Andrew B. Collier / @datawookie

Hosting a Plumber API on AWS

2017-09-14 AWS R Plumber

I’ve been putting together a small proof-of-concept API using R and {plumber}. It works flawlessly on my local machine and I was planning on deploying it on an EC2 instance to demo it for a client. However, I ran into a snag: despite opening the required port in my Security Group I was not able to access the API. This is what I needed to do to get it working.

Creating an AWS Spot Instance

2017-09-13 AWS

EC2 Spot Instances can provide very affordable computing on EC2 by allowing access to unused capacity at significant discounts.

Building a Local OSRM Instance

2017-09-11 R OSRM

The Open Source Routing Machine (OSRM) is a library for calculating routes, distances and travel times between spatial locations. It can be accessed via either an HTTP or C++ API. Since it’s open source you can also install it locally, download appropriate map data and start making efficient travel calculations.

These are the instructions for getting OSRM installed on an Ubuntu machine and hooking up the osrm R package.

Global Variables in R Packages

2017-09-07 R

I know that global variables are from the Devil, but sometimes you just can’t get around them.

I’m building a small package for a client that relies on a data file. For various reasons that file is not part of the package and can reside in different locations on users’ machines. Furthermore there are users on both Windows and Linux machines.

Driving AWS from the Command Line

2017-08-31 AWS

Although it’s very handy (and easy) to set up some cloud resources using the AWS Management Console, once you know what you need it makes a lot of sense to automate the process. Fortunately there’s a handy little command line tool, aws, which makes this eminently possible. The AWS CLI Command Reference is the definitive resource for this tool. There’s a mind-boggling array of possibilities. We’ll take a look at a small selection of them.

Route Asymmetry in Google Maps

2017-08-23 R

I have been retrieving some route information using Rodrigo Azuero’s gmapsdistance package and noted that there was some asymmetry in the results: the time and distance for the trip from A to B was not necessarily always the same as the time and distance for the trip from B to A. Although in retrospect this seems self-evident, it merited further investigation.

Retrieve Kaggle Data from the Command Line

2017-08-21 Kaggle AWS

We’ve been building some models for Kaggle competitions using an EC2 instance for compute. I initially downloaded the data locally and then pushed it onto EC2 using SCP. But there had to be a more efficient way to do this, especially given the blazing fast bandwidth available on AWS.

Enter kaggle-cli.

Update: Apparently that project has been deprecated in favour of kaggle-api. More information below.

Setting Up Time Zones in BASH

2017-08-20 BASH

Ensuring that your account is configured to run with appropriate time zone information can make your life a lot easier.

Of course, if you administer your own system then you can simply set your system time to local time. However, it’s generally a better idea to set system time to Universal Time (UTC) and then configure time zone information on a per-user basis.

Why does this make sense? Well, suppose that you have remote users logging onto your system. It’s very likely that a remote user will be operating in a different time zone and it’d be handy for them to have system time converted into their local time.
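To make the idea concrete, here is a minimal Python sketch of keeping a single UTC clock and converting it per user. The user names and their fixed UTC offsets are hypothetical, and fixed offsets are a simplification: a real setup would use named time zones (for example via the TZ environment variable) so that daylight saving changes are handled.

```python
from datetime import datetime, timedelta, timezone

# The system clock is kept in UTC (a fixed example instant).
system_time = datetime(2017, 8, 20, 12, 0, tzinfo=timezone.utc)

# Hypothetical per-user settings, expressed as fixed offsets from UTC.
user_offsets = {
    "alice": timezone(timedelta(hours=2)),
    "bob": timezone(timedelta(hours=-4)),
}


def local_time_for(user, moment):
    """Convert a UTC-aware datetime into the given user's local time."""
    return moment.astimezone(user_offsets[user])


for user in user_offsets:
    print(user, local_time_for(user, system_time).isoformat())
```

Each user sees a different wall-clock time, but all of them refer to the same instant, which is the point of keeping the system itself on UTC.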

Setting Up Time Zones in MySQL

2017-08-20 MySQL Django

I’m in the process of setting up a Zinnia blog on one of my Django sites. After putting all of the necessary plumbing in place I got the following message on first visiting the blog URL:

Database returned an invalid value in QuerySet.datetimes(). Are time zone definitions for your database and pytz installed?

The solution to this is to copy your system’s time zone information across to the database.

Adding a Volume to an Ubuntu EC2 Instance

2017-08-10 AWS

Some quick notes on adding a storage volume to an EC2 instance.

Remote Desktop on an Ubuntu EC2 Instance

2017-08-08 AWS

A couple of options for remote access to desktop applications on an EC2 host.

A Timeline History of R

2017-08-05 R

A record of some more or less important events in the history of R.

This is a work in progress. The information is cobbled together from a range of sources. If you have pertinent items to add, please let me know via the comments.

Adding Users to an EC2 Ubuntu Instance

2017-07-24 AWS Linux SSH

By default an EC2 instance has only a single user other than root. For example, on an Ubuntu instance, that user is ubuntu. If there will be multiple people accessing the instance then it’s generally necessary for each of them to have their own account. Setting this up is pretty simple: it just requires sorting out some authentication details.

Docker: Persisting User Data

2017-07-20 Docker

I’m busy putting together a Docker image for a multi-user Jupyter Notebook installation. I aim to have an independent login for each of the users and each of them should also have their own storage space. That space should exist outside the container though, so that even if the container stops, the data lives on. This should mitigate user rage.

Deploying Jupyter on AWS using Docker

2017-07-18 Jupyter Docker AWS

Amazon’s EC2 Container Services (ECS) is an orchestrated system for deploying Docker containers on AWS. This post is about not using ECS.

RStudio Environment on DigitalOcean with Docker

2017-07-11 R Docker

I’ll be running a training course in a few weeks which will use RStudio as the main computational tool. Since it’s a short course I don’t want to spend a lot of time sorting out technical issues. And with multiple operating systems (and versions) these issues can be numerous and pervasive. Setting up an RStudio server which everyone can access (and that requires no individual configuration!) makes a lot of sense.

Installing Hadoop on Ubuntu

2017-07-04 Linux Hadoop

This is what I did to set up Hadoop on my Ubuntu machine.

Installing Spark on Ubuntu

2017-07-04 Linux Spark

I’m busy experimenting with Spark. This is what I did to set up a local cluster on my Ubuntu machine. Before you embark on this you should first set up Hadoop.

Accessing PySpark from a Jupyter Notebook

2017-07-04 Jupyter Spark

Increasing MySQL Packet Maximum Size

2017-07-01 MySQL

In the process of uploading a massive CSV file to my Django application my session data are getting pretty big. As a result I’m getting these errors:

(1153, "Got a packet bigger than 'max_allowed_packet' bytes") and
(2006, 'MySQL server has gone away').

The second error is potentially unrelated.

After some research it became apparent that the source of the problem is my max_allowed_packet setting.

Setting up ExpressVPN on Ubuntu

2017-06-23 Linux

I’ve been meaning to set up a VPN and this morning seemed like a good time to tick it off the bucket list. This is a quick outline of my experience, which included one minor hiccup.

Setting up Jupyter with Python 3 on Ubuntu

2017-06-23 Jupyter Linux

A sample Jupyter notebook.

A short note on how to set up Jupyter Notebooks with Python 3 on Ubuntu. The instructions are specific to Xenial Xerus (16.04) but are likely to be helpful elsewhere too.

Deploying a Minimal Plumber API on DigitalOcean

2017-06-21 R

RSelenium and Java Heap Space

2017-06-09 R web scraping Selenium

I’m in the process of deploying a scraper on a DigitalOcean instance. The scraper uses RSelenium with the PhantomJS browser. I ran into a problem though. Although it worked flawlessly on my local machine, on the remote instance it broke with the following error.

Clustering Time Series Data

2017-04-25 Machine Learning

I have been looking at methods for clustering time domain data and recently read TSclust: An R Package for Time Series Clustering by Pablo Montero and José Vilar. Here are the results of my initial experiments with the TSclust package.

Bulgaria Web Summit

2017-04-16 Conference

The Bulgaria Web Summit happened on 7 and 8 April 2017 at the Inter Expo Center in Sofia, Bulgaria.

Bayesian Marathon Predictions

2017-02-28 running Bayesian R

There are a variety of ways to predict running times over the standard marathon distance (42.2 km). You could dust off your copy of The Lore of Running (Tim Noakes). My treasured Third Edition discusses predicting likely marathon times on p. 366, referring to tables published by other authors to actually make predictions. There’s also a variety of online services, for example:

Runners’ World’s Race Time Predictor (based on Riegel’s Formula),
Running for Fitness’s Race Predictor and
Race Result Predictor.

Of these I particularly like the offering from Running for Fitness which produces a neatly tabulated set of predicted times over an extensive range of distances using a selection of techniques including Riegel’s Formula and Cameron’s Model.
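As a rough illustration of the kind of calculation behind such predictors, here is a minimal Python sketch of Riegel’s Formula, T2 = T1 × (D2 / D1)^1.06. The function names and the 50 minute 10 km input are my own hypothetical choices, and this is not necessarily how any of the services above implement it.

```python
def riegel_predict(known_time_min, known_dist_km, target_dist_km, exponent=1.06):
    """Predict a time (minutes) over target_dist_km from a known performance.

    Riegel's Formula: T2 = T1 * (D2 / D1) ** exponent, with exponent 1.06.
    """
    if known_time_min <= 0 or known_dist_km <= 0 or target_dist_km <= 0:
        raise ValueError("times and distances must be positive")
    return known_time_min * (target_dist_km / known_dist_km) ** exponent


def format_hms(minutes):
    """Format a duration in minutes as H:MM:SS."""
    total_seconds = round(minutes * 60)
    hours, remainder = divmod(total_seconds, 3600)
    mins, secs = divmod(remainder, 60)
    return f"{hours}:{mins:02d}:{secs:02d}"


# Hypothetical runner: 10 km in 50 minutes, predicted over 42.2 km.
marathon_minutes = riegel_predict(50.0, 10.0, 42.2)
print(format_hms(marathon_minutes))
```

The exponent is exposed as a parameter because it is the only tunable part of the formula: a value above 1 encodes the fact that pace slows as distance increases.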

Amazon is Getting Inside my Head

2017-02-26

Amazon seems to really understand me. Or, at least, my reading preferences. Running, garlic, data. Yup, that pretty much sums me up.

Google Quick, Draw!

2016-11-17

Spent a very diverting few minutes playing with Quick, Draw! this morning, which is one of the cool AI Experiments hosted by Google.

Simple School Maths Problem

2016-11-15

A simple problem sent through to me by one of my running friends:

There are 6 red cards and 1 black card in a box. Busi and Khanya take turns to draw a card at random from the box, with Busi being the first one to draw. The first person who draws the black card will win the game (assume that the game can go on indefinitely). If the cards are drawn with replacement, determine the probability that Khanya will win, showing all working.
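One way to check an answer to this problem is a short exact calculation in Python. This is a sketch under the stated assumptions (draws with replacement, so every draw finds the black card with probability 1/7, and Busi draws first). Khanya can only win on draws 2, 4, 6 and so on, which gives a geometric series with first term (6/7)(1/7) and ratio (6/7)².

```python
from fractions import Fraction

p_black = Fraction(1, 7)   # probability that any single draw is the black card
q_red = 1 - p_black        # probability that any single draw is a red card

# Khanya wins on draw 2, 4, 6, ...: sum of q^(2k+1) * p over k = 0, 1, 2, ...
# Closed form of the geometric series: (q * p) / (1 - q^2).
p_khanya = (q_red * p_black) / (1 - q_red ** 2)

# Busi wins on draw 1, 3, 5, ...: sum of q^(2k) * p, which is p / (1 - q^2).
p_busi = p_black / (1 - q_red ** 2)

print("Khanya:", p_khanya)
print("Busi:  ", p_busi)
```

Using Fraction keeps the arithmetic exact, so the two probabilities can be compared and summed without floating point rounding.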

satRday Cape Town: Call for Submissions

2016-10-26 R Conference

satRday Cape Town will happen on 18 February 2017 at Workshop 17, Victoria & Alfred Waterfront, Cape Town, South Africa.

fast-neural-style: Real-Time Style Transfer

2016-10-07 Machine Learning

I followed up a reference to fast-neural-style from Twitter and spent a glorious hour experimenting with this code. Very cool stuff indeed. It’s documented in Perceptual Losses for Real-Time Style Transfer and Super-Resolution by Justin Johnson, Alexandre Alahi and Fei-Fei Li.

The basic idea is to use feed-forward convolutional neural networks to generate image transformations. The networks are trained using perceptual loss functions and effectively apply style transfer.

What is “style transfer”? You’ll see in a moment.

Fitting a Statistical Distribution to Sampled Data

2016-10-05 R

I’m generally not too interested in fitting analytical distributions to my data. With large enough samples (which I am normally fortunate enough to have!) I can safely assume normality for most statistics of interest.

Recently I had a relatively small chunk of data and finding a decent analytical approximation was important. I had a look at the tools available in R for addressing this problem. The {fitdistrplus} package seemed like a good option. Here’s a sample workflow.

Python: First Steps with MongoDB

2016-09-28 MongoDB Python

I’m busy working my way through Kyle Banker’s MongoDB in Action. Much of the example code in the book is given in Ruby. Despite the fact that I’d love to learn more about Ruby, for the moment it makes more sense for me to follow along with Python.

Chrome Developer Tools: Throttling Connection

2016-09-20 web scraping

Sometimes you’ll want to see how a site behaves on a slower connection. This can be easily emulated using Chrome DevTools. Go to the Network tab and press the “No throttling” dropdown, which will give you a selection of presets and the option to configure custom connections.

Chrome Developer Tools: View POST Data

2016-09-19 web scraping

When figuring out how to formulate the contents of a POST request it’s often useful to see the “typical” fields submitted directly from a web form.

Deleting All Nodes and Relationships

2016-09-15 Neo4j

Seems that I am doing this a lot: deleting my entire graph (all nodes and relationships) and rebuilding from scratch. I guess that this is part of the learning process.

Remote Access to Neo4j on Windows

2016-09-13 Neo4j

Accessing the Neo4j server running on your local machine is simple: just point your browser to http://localhost:7474/. But with the default configuration the server is not accessible from other machines. This means that other folk can’t share in the wonder of your nodes and edges.

Installing Neo4j on Ubuntu

2016-09-06 Neo4j Linux

Some instructions for installing Neo4j on Ubuntu 16.04. More for my own benefit than anything else.

PLOS Subject Keywords: Association Rules

2016-09-01 R Association Rules

In a previous post I detailed the process of compiling data on subject keywords used in articles published in PLOS journals. In this instalment I’ll be using those data to mine Association Rules with the arules package.

ubeR: A Package for the Uber API

2016-08-31 R

Uber exposes an extensive API for interacting with their service. ubeR is an R package for working with that API which Arthur Wu and I put together during a Hackathon at iXperience.

PLOS Subject Keywords: Gathering Data

2016-08-24 R Association Rules Collaborative Filtering

I’m putting together a couple of articles on Collaborative Filtering and Association Rules. Naturally, the first step is finding suitable data for illustrative purposes.

Sportsbook Betting (Part 3): Evolving Odds

2016-08-23 R gambling

Lead runners in the ladies 800 metre race at the Rio Olympic Games.

In previous instalments in this series I have not taken into account how odds can change over time.

Garmin ANT on Ubuntu

2016-08-22 Linux

I finally got tired of booting up Windows to download data from my Garmin 910XT. I tried to get my old Ubuntu 15.04 system to recognise my ANT stick but failed. Now that I have a stable Ubuntu 16.04 system the time seems ripe.

Anthony Goldbloom: The jobs we’ll lose to machines

2016-08-22 Machine Learning TED Talk

Sportsbook Betting (Part 2): Bookmakers’ Odds

2016-08-10 R gambling

In the first instalment of this series we gained an understanding of the various types of odds used in Sportsbook betting and the link between those odds and implied probabilities. We noted that the implied probabilities for all possible outcomes in an event may sum to more than 100%. At first sight this seems a bit odd. It certainly appears to violate the basic principles of statistics. However, this anomaly is the mechanism by which bookmakers assure their profits. A similar principle applies in a casino.
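A small Python sketch shows how implied probabilities can sum to more than 100%. It assumes decimal odds, for which the implied probability is 1 divided by the odds. The three prices are made up for illustration and the function names are my own.

```python
def implied_probability(decimal_odds):
    """Implied probability of an outcome priced at the given decimal odds."""
    if decimal_odds <= 1.0:
        raise ValueError("decimal odds must be greater than 1")
    return 1.0 / decimal_odds


def overround(decimal_odds_list):
    """Excess of the summed implied probabilities over 1 (the bookmaker's margin)."""
    return sum(implied_probability(odds) for odds in decimal_odds_list) - 1.0


# Hypothetical three-outcome event (for example win, draw, loss).
odds = [1.8, 3.5, 4.0]
total = sum(implied_probability(o) for o in odds)

print(f"Sum of implied probabilities: {total:.1%}")
print(f"Overround: {overround(odds):.1%}")
```

With fair odds the implied probabilities would sum to exactly 100% and the overround would be zero; any excess is the margin built into the prices.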

Animated Mortality

2016-08-09 R

feedeR: Reading RSS and Atom Feeds from R

2016-08-08 R

I’m working on a project in which I need to systematically parse a number of RSS and Atom feeds from within R. I was somewhat surprised to find that no package currently exists on CRAN to handle this task. This presented the opportunity for a bit of DIY.

You can find the fruits of my morning’s labour here.

Web Scraping and “invalid multibyte string”

2016-08-02 R web scraping

A couple of my collaborators have had trouble using read_html() from the xml2 package to access this Wikipedia page.

Sportsbook Betting (Part 1): Odds

2016-08-01 R gambling

This series of articles was written as support material for Statistics exercises in a course that I’m teaching for iXperience. In the series I’ll be using illustrative examples for wagering on a variety of Sportsbook events including Horse Racing, Rugby and Tennis. The same principles can be applied across essentially all betting markets.
