Blog Posts by Andrew B. Collier / @datawookie


Birth Month by Gender

Based on feedback on a previous post, I normalised the birth counts by the (average) number of days in each month. As a reader pointed out, the results indicate a gradual increase in the number of conceptions during (northern hemisphere) autumn and winter, roughly up to the end of December. Normalising the data to give births per day also shifts the peak from August to September.
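As a rough illustration of the normalisation described above, here is a minimal Python sketch. It is not the code from the post: the `births_per_day` function and the `DAYS_IN_MONTH` table are hypothetical names, and the February figure of 28.25 days assumes a simple one-leap-year-in-four average.

```python
# Average number of days in each month. February is taken as 28.25,
# assuming one leap year in every four (an approximation).
DAYS_IN_MONTH = {
    "Jan": 31, "Feb": 28.25, "Mar": 31, "Apr": 30,
    "May": 31, "Jun": 30, "Jul": 31, "Aug": 31,
    "Sep": 30, "Oct": 31, "Nov": 30, "Dec": 31,
}


def births_per_day(monthly_counts):
    """Convert raw monthly birth counts into average births per day."""
    return {
        month: count / DAYS_IN_MONTH[month]
        for month, count in monthly_counts.items()
    }
```

Because the shorter months are divided by a smaller number, a 30-day month with the same raw count as a 31-day month comes out higher per day, which is how a peak can move between adjacent months of different lengths.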


Streaming from zip to bz2

I’ve got a massive bunch of zip archives, each of which contains only a single file, and the name of the enclosed file varies from archive to archive. Dealing with these data is painful.

It’d be a lot more convenient if the files were compressed with gzip or bzip2 and had a consistent naming convention. How would you go about making that conversion without actually unpacking the zip archive, finding the name of the enclosed file and then recompressing? Enter funzip.
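For comparison, the same idea can be sketched in Python using only the standard library instead of funzip. This is an illustration, not the approach the post takes: the `zip_to_bz2` function and its arguments are hypothetical, and it assumes each archive really does contain exactly one file.

```python
import bz2
import shutil
import zipfile


def zip_to_bz2(zip_path, bz2_path):
    """Recompress the single file inside zip_path to bz2_path, streaming in chunks."""
    with zipfile.ZipFile(zip_path) as archive:
        members = archive.namelist()
        if len(members) != 1:
            raise ValueError(
                f"expected exactly one file in {zip_path}, found {len(members)}"
            )
        # Open the enclosed file by position in the listing, so its name never matters.
        with archive.open(members[0]) as src, bz2.open(bz2_path, "wb") as dst:
            # Copy chunk by chunk; the uncompressed file is never written to disk.
            shutil.copyfileobj(src, dst)
```

The caller chooses the output path, so a consistent naming convention can be imposed regardless of what the enclosed file was called.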


Major League Baseball Birth Months

“The cutoff date for almost all nonschool baseball leagues in the United States is July 31, with the result that more major league players are born in August than in any other month.” (Malcolm Gladwell, Outliers)

A quick analysis to check Gladwell’s assertion above, using data scraped from www.baseball-reference.com.


satRday in Cape Town

We are planning to host one of the three inaugural satRday conferences in Cape Town during 2017. The [R Consortium](https://www.r-consortium.org/) has committed to funding three of these events: one will be in Hungary, another will be somewhere in the USA and the third will be at an international destination. At present Cape Town is dicing it out with Monterrey (Mexico) for the third location.


R, HDF5 Data and Lightning

I used to spend an inordinate amount of time digging through lightning data. These data came from a number of sources, the World Wide Lightning Location Network (WWLLN) and the Lightning Imaging Sensor / Optical Transient Detector (LIS/OTD) being the most common. I recently needed to work with some Hierarchical Data Format (HDF) data. HDF is something of a niche format and, since it was the format used for the LIS/OTD data, I went back to review my old scripts. It was very pleasant rediscovering work I did some time ago.


Kaggle: Santa’s Stolen Sleigh

This morning I read Wendy Kan’s post on Creating Santa’s Stolen Sleigh. I hadn’t really thought much about the process of constructing an optimisation competition, but Wendy gave some interesting insights into the considerations involved in designing a competition that is fun and challenging, yet still computationally feasible without military-grade hardware.

This seems like an opportune time to jot down some of my personal notes and also take a look at the results. I know that this sort of discussion is normally the prerogative of the winners and I trust that my ramblings won’t be viewed as presumptuous.
