Andrew B. Collier / @datawookie

Fitting a Statistical Distribution to Sampled Data

2016-10-05 R

I’m generally not too interested in fitting analytical distributions to my data. With large enough samples (which I am normally fortunate enough to have!) I can safely assume normality for most statistics of interest.

Recently I had a relatively small chunk of data and finding a decent analytical approximation was important. I had a look at the tools available in R for addressing this problem. The {fitdistrplus} package seemed like a good option. Here’s a sample workflow.

Python: First Steps with MongoDB

2016-09-28 MongoDB Python

I’m busy working my way through Kyle Banker’s MongoDB in Action. Much of the example code in the book is given in Ruby. Despite the fact that I’d love to learn more about Ruby, for the moment it makes more sense for me to follow along with Python.

Chrome Developer Tools: Throttling Connection

2016-09-20 web scraping

Sometimes you’ll want to see how a site behaves on a slower connection. This is easy to emulate using Chrome DevTools. Go to the Network tab and click the “No throttling” dropdown, which will give you a selection of presets and the option to configure custom connections.

Chrome Developer Tools: View POST Data

2016-09-19 web scraping

When figuring out how to formulate the contents of a POST request it’s often useful to see the “typical” fields submitted directly from a web form.
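Once the fields are known, they can be replayed from a script. Here is a minimal Python sketch using only the standard library. The URL and the field names are hypothetical placeholders, not taken from any real form, and the request is built but never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical form fields, as they might be listed in DevTools.
fields = {"username": "alice", "query": "canoe results", "page": "1"}

# Encode the fields the way a browser encodes a standard HTML form.
body = urlencode(fields).encode("utf-8")

# Supplying data makes this a POST request. The URL is a placeholder.
request = Request(
    "http://example.com/search",
    data=body,
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(request.get_method())
print(body.decode("utf-8"))
```

Passing the request to urllib.request.urlopen() would actually submit it. Comparing the encoded body with what DevTools shows is a quick way to check that nothing is missing.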

Deleting All Nodes and Relationships

2016-09-15 Neo4j

Seems that I am doing this a lot: deleting my entire graph (all nodes and relationships) and rebuilding from scratch. I guess that this is part of the learning process.

Remote Access to Neo4j on Windows

2016-09-13 Neo4j

Accessing the Neo4j server running on your local machine is simple: just point your browser to http://localhost:7474/. But with the default configuration the server is not accessible from other machines. This means that other folk cannot share in the wonder of your nodes and edges.

Installing Neo4j on Ubuntu

2016-09-06 Neo4j Linux

Some instructions for installing Neo4j on Ubuntu 16.04. More for my own benefit than anything else.

PLOS Subject Keywords: Association Rules

2016-09-01 R Association Rules

In a previous post I detailed the process of compiling data on subject keywords used in articles published in PLOS journals. In this instalment I’ll be using those data to mine Association Rules with the arules package.

ubeR: A Package for the Uber API

2016-08-31 R

Uber exposes an extensive API for interacting with their service. ubeR is an R package for working with that API, which Arthur Wu and I put together during a Hackathon at iXperience.

PLOS Subject Keywords: Gathering Data

2016-08-24 R Association Rules Collaborative Filtering

I’m putting together a couple of articles on Collaborative Filtering and Association Rules. Naturally, the first step is finding suitable data for illustrative purposes.

Sportsbook Betting (Part 3): Evolving Odds

2016-08-23 R gambling

Lead runners in the ladies 800 metre race at the Rio Olympic Games.

In previous instalments in this series I have not taken into account how odds can change over time.

Garmin ANT on Ubuntu

2016-08-22 Linux

I finally got tired of booting up Windows to download data from my Garmin 910XT. I tried to get my old Ubuntu 15.04 system to recognise my ANT stick but failed. Now that I have a stable Ubuntu 16.04 system the time seems ripe.

Anthony Goldbloom: The jobs we’ll lose to machines

2016-08-22 Machine Learning TED Talk


Sportsbook Betting (Part 2): Bookmakers’ Odds

2016-08-10 R gambling

In the first instalment of this series we gained an understanding of the various types of odds used in Sportsbook betting and the link between those odds and implied probabilities. We noted that the implied probabilities for all possible outcomes in an event may sum to more than 100%. At first sight this seems a bit odd. It certainly appears to violate the basic principles of statistics. However, this anomaly is the mechanism by which bookmakers assure their profits. A similar principle applies in a casino.
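The arithmetic behind that excess is easy to sketch. The Python snippet below is an illustration only: it assumes decimal odds, for which the implied probability is the reciprocal of the quoted price, and the odds themselves are made-up values for a hypothetical three-way market. The amount by which the total exceeds 100% is commonly called the overround.

```python
def implied_probability(decimal_odds):
    """Implied probability of an outcome quoted at the given decimal odds."""
    return 1.0 / decimal_odds


# Hypothetical decimal odds for a three-way market (not real prices).
odds = {"home": 2.10, "draw": 3.40, "away": 3.60}

implied = {outcome: implied_probability(price) for outcome, price in odds.items()}

# For a fair book this would be exactly 1; a bookmaker's total is larger.
total = sum(implied.values())
overround = total - 1.0

for outcome, probability in implied.items():
    print(f"{outcome}: {probability:.3f}")
print(f"total implied probability: {total:.3f}")
print(f"overround: {overround:.1%}")
```

Dividing each implied probability by the total rescales them to sum to 100%, which gives one simple estimate of the probabilities with the bookmaker's margin removed.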

Animated Mortality

2016-08-09 R


feedeR: Reading RSS and Atom Feeds from R

2016-08-08 R

I’m working on a project in which I need to systematically parse a number of RSS and Atom feeds from within R. I was somewhat surprised to find that no package currently exists on CRAN to handle this task. This presented the opportunity for a bit of DIY.

You can find the fruits of my morning’s labour here.

Web Scraping and “invalid multibyte string”

2016-08-02 R web scraping

A couple of my collaborators have had trouble using read_html() from the xml2 package to access this Wikipedia page.

Sportsbook Betting (Part 1): Odds

2016-08-01 R gambling

This series of articles was written as support material for Statistics exercises in a course that I’m teaching for iXperience. In the series I’ll be using illustrative examples for wagering on a variety of Sportsbook events including Horse Racing, Rugby and Tennis. The same principles can be applied across essentially all betting markets.

Arthur Benjamin: Teach statistics before calculus!

2016-07-29 TED Talk teaching

Arthur Benjamin thinks that the end goal of teaching Mathematics at school should be Statistics rather than Calculus. He has a point: in terms of understanding things in the real world, Statistics is definitely more powerful. These ideas are quite compatible with those of Conrad Wolfram, who thinks that we should be using computers more extensively in Mathematics education.

Building a Life Table

2016-07-28 R

