Andrew B. Collier / @datawookie

Docker Images for R: r-base versus r-apt

2019-01-21 R Docker

I need to deploy a Plumber API in a Docker container. The API has some R package dependencies which need to be baked into the Docker image. A few options for the base image:

The first option, r-base, would require building the dependencies from source, a somewhat time-consuming operation. The last option, r-apt, makes it possible to install most packages using apt, which is likely to be much quicker. I’ll immediately eliminate the other option, tidyverse: although it already contains a load of packages, many of those are not required and, in addition, it incorporates RStudio Server, which is definitely not necessary for this project.

RServe: Getting Started

2019-01-21 R

Rserve is a server which allows other programs to use the facilities of R via TCP/IP.

JSON Payload for POST Request

2019-01-10 R

I’m starting with a JSON body because that is how most API documentation presents example payloads.
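To make the idea of a JSON body concrete, here is a minimal sketch in Python (the post itself works in R). It only builds the request and does not send it. The URL, the payload fields and the helper name `build_post_request` are all hypothetical, chosen purely for illustration.

```python
import json
import urllib.request

# Hypothetical endpoint and payload, used purely for illustration.
URL = "https://api.example.com/items"
payload = {"name": "widget", "quantity": 3}


def build_post_request(url, payload):
    """Build (but do not send) a POST request carrying a JSON body."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_post_request(URL, payload)
# Sending it would be: urllib.request.urlopen(request)
```

The body is just the serialised dictionary, which is why a payload example copied from API documentation can usually be used almost verbatim.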

Where does .Renviron live on Citrix?

2019-01-08 R

At one of my clients I run RStudio under Citrix in order to have access to their data.

For the most part this works fine. However, every time I visit them I spend the first few minutes of my day installing packages because my environment does not seem to be persisted from one session to the next.

I finally had a gap and decided to fix the problem.

Where are the packages being installed?

Installed packages just spontaneously disappear… That’s weird. Where are they being installed?

Survey Raking: An Illustration

2018-12-26 R survey

Analysing survey data can be tricky. There’s often a mismatch between the characteristics of the survey respondents and those of the general population. If the discrepancies are not accounted for then the survey results can (and generally will!) be misleading.
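As a rough illustration of what raking does, here is a small Python sketch (the post itself uses R). It repeatedly rescales respondent weights so that the weighted share of each category matches a population target, one variable at a time. The respondents, the target shares and the function names are invented for the example, and it assumes every target category appears in the sample.

```python
def rake(respondents, targets, iterations=50):
    """Return one weight per respondent so weighted shares match the targets.

    respondents: list of dicts, e.g. {"sex": "F", "age": "young"}
    targets: {variable: {category: population share}}
    """
    weights = [1.0] * len(respondents)
    for _ in range(iterations):
        for variable, shares in targets.items():
            total = sum(weights)
            # Current weighted total for each category of this variable.
            current = {category: 0.0 for category in shares}
            for person, weight in zip(respondents, weights):
                current[person[variable]] += weight
            # Rescale so that this variable's margin hits its target.
            weights = [
                weight * shares[person[variable]] * total / current[person[variable]]
                for person, weight in zip(respondents, weights)
            ]
    return weights


# Invented sample: 60% female and 70% young.
respondents = (
    [{"sex": "F", "age": "young"}] * 4
    + [{"sex": "F", "age": "old"}] * 2
    + [{"sex": "M", "age": "young"}] * 3
    + [{"sex": "M", "age": "old"}] * 1
)
# Invented population shares.
targets = {
    "sex": {"F": 0.5, "M": 0.5},
    "age": {"young": 0.4, "old": 0.6},
}
weights = rake(respondents, targets)


def weighted_share(variable, category):
    """Weighted proportion of respondents in the given category."""
    matching = sum(
        w for p, w in zip(respondents, weights) if p[variable] == category
    )
    return matching / sum(weights)
```

After raking, `weighted_share("sex", "F")` and `weighted_share("age", "young")` should sit very close to the target shares, even though the raw sample was skewed on both variables.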

Citrix Receiver on Ubuntu

2018-12-14 Linux Citrix

There’s a Debian package available for Citrix Receiver, so in principle this task should be trivial.

It’s not.

Scraping the Turkey Accordion

2018-12-12 R web scraping

One of the things I like most about web scraping is that almost every site comes with a new set of challenges.

The Accordion Concept

I recently had to scrape a few product pages from the site of a large retailer. I discovered that these pages use an “accordion” to present the product attributes. Only a single panel of the accordion is visible at any one time. For example, you toggle the Details panel open to see the associated content.

RStudio & Shiny Servers with NGINX & SSL

2018-11-14 R Shiny

I fairly often set up servers to host both Shiny and RStudio servers. This is my recipe.

Installing RStudio & Shiny Servers

2018-11-13 R Shiny

I did a remote install of Ubuntu Server today. This was somewhat novel because it’s the first time that I have not had physical access to the machine I was installing on. The server install went very smoothly indeed.

The next tasks were to install RStudio Server and Shiny Server. The installation process for each of these is well documented on the RStudio web site:

These are my notes. Essentially the same, with some small variations.

Accessing Open Data from AWS

2018-11-04 AWS

There’s a magnificent variety of open data available on AWS. To see the full list, head over to the Registry of Open Data on AWS.

Embedding Dependencies into a HTML File

2018-10-31 tool speaking

I use HTML to generate slide decks. Usually my HTML will reference a host of other files on my machine (CSS, JavaScript and images). If I want to distribute my deck then I have a couple of options:

just send the HTML file without all of the dependencies or
send the HTML file and dependencies (normally wrapped up in some sort of archive).

Both of these have problems. In the former case the HTML just ends up looking like ass because it relies on all of those dependencies to sort out the aesthetics. In the latter case I need to take care of the directory structure and, if those dependencies are distributed across my file system (which they generally are!) then this can be a challenge.
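One way around both problems is to embed the dependencies in the HTML itself. The Python sketch below shows the idea for images only, by replacing each local `src` with a base64 data URI. It is not necessarily the tool the post goes on to use. The function names and the `read_bytes` callback are hypothetical, and the regular expression only handles double-quoted `src` attributes.

```python
import base64
import mimetypes
import re


def to_data_uri(data, mime_type):
    """Encode raw bytes as a base64 data URI."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def inline_images(html, read_bytes):
    """Replace each local <img src="..."> with an embedded data URI.

    read_bytes is a caller-supplied function mapping a path to bytes,
    for example: lambda path: open(path, "rb").read()
    """
    def replace(match):
        path = match.group(2)
        if path.startswith(("http:", "https:", "data:")):
            return match.group(0)  # Leave remote or already embedded images alone.
        mime_type = mimetypes.guess_type(path)[0] or "application/octet-stream"
        return match.group(1) + to_data_uri(read_bytes(path), mime_type) + match.group(3)

    return re.sub(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', replace, html)


# Stand-in for the file system so that the example is self-contained.
fake_files = {"logo.png": b"abc"}
embedded = inline_images('<img src="logo.png">', fake_files.__getitem__)
```

The same approach extends to CSS and JavaScript by inlining their contents into `<style>` and `<script>` elements, at the cost of a larger HTML file.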

DNS on Ubuntu

2018-10-25 Ubuntu

For years it’s been simple to set up DNS on a Linux machine. Just add a couple of entries to /etc/resolv.conf and you’re done.

@pyconza (2018): Data Science and Bayes with Python

2018-10-15 Python conference

I’ve just returned from PyConZA (2018), held at the Birchwood Hotel in Boksburg North (Johannesburg) on 11-12 October. A great conference with a super selection of talks and great catering.

Obviously when the PyCon call for papers came out I was feeling ambitious because I submitted a Workshop and a Talk. They were both accepted, so that put the pressure on a bit.

Workshop

I gave a full day pre-conference workshop on 10 October entitled “Introduction to Python for Data Science”. In retrospect it would have been a better idea to call it “Introduction to Data Science using Python”.

Docker Images for Spark

2018-09-28 Docker Spark

I recently put together a short training course on Spark. One of the initial components of the course involved deploying a Spark cluster on AWS. I wanted to have Jupyter Notebook and RStudio servers available on the master node too and the easiest way to make that happen was to install Docker and then run appropriate images.

There’s already a jupyter/pyspark-notebook image which includes Spark and Jupyter. It’s a simple matter to extend the rocker/verse image (which already includes RStudio server, the Tidyverse, devtools and some publishing utilities) to include the sparklyr package.

MySQL Server Replication using Binary Logs

2018-09-17 MySQL

Suppose you want to create a replica of your MySQL database. The replica should:

start with a complete snapshot of the current (initial) state of the master database and
be updated with any changes to the master database.

This post will outline how MySQL server replication can be done using binary logs.

DIY VPN with Docker

2018-09-11 Docker VPN

I’ve worked with both ExpressVPN and NordVPN. Both are great services but, from my perspective, have one major shortcoming: they’re currently blocked by Amazon Web Services (AWS). When using either of them you are simply not able to access any of the AWS services.

The most common scenario in which I’d be using a VPN is if I’m on a restrictive network where I’m only able to access web sites. Typically just ports 80, 8080 and 443 are open. Forget about SSH (port 22), SMTP (ports 25, 465 and 587) or NTP (port 123). I want to be able to connect by SSH to my AWS servers, send mail over SMTP and synchronise my clock. The latter items are normally possible over commercial VPN providers (like ExpressVPN and NordVPN) but not being able to connect to AWS is a deal breaker.

Refining an AWS IAM Policy for Flintrock

2018-09-08 Spark AWS

Diagnosing RStudio Startup Issues

2018-09-07 R

Yesterday I tried to start RStudio and something weird happened: the window launched but it was blank and unresponsive.

Chairing a Conference Session

2018-08-09 speaking chairing conference

There are many factors which can determine the success of a conference: the location, the venue, the catering, the speakers, the social programme, the contents of the swag bag… However, in my opinion, one of the most important components of an enjoyable conference is a collection of competent chairpersons, for they will ensure that all aspects of the sessions (the very core of a conference!) run smoothly.

Setup for using Stan with Julia

2018-07-25 Julia

I’m busy preparing a poster about Stan.jl for JuliaCon 2018. Getting set up is pretty simple, although there are some minor details that I thought I’d document.

Updating R on Ubuntu

2018-07-09 R

Today I finally got around to updating my R to 3.5 (or, more specifically, 3.5.1). The complete instructions for doing the update on Ubuntu are available here. I’ve paraphrased them below.

eRum (2018) Top Twenty

2018-05-18 R conference

My Top 20 highlights from eRum (2018) in Budapest.

Travelling Salesman with ggmap

2018-05-10 R

I’ve been testing out some ideas around the Travelling Salesman Problem using TSP and ggmap.

Classification: Get the Balance Right

2018-04-21 R machine learning

For classification problems the positive class (which is what you’re normally trying to predict) is often sparsely represented in the data. Unless you do something to address this imbalance, your classifier is likely to be rather underwhelming.

Achieving a reasonable balance in the proportions of the target classes is seldom emphasised. Perhaps it’s not very sexy. But it can have a massive effect on a model.
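Random oversampling is one simple way to rebalance the classes. It is sketched below in Python as an illustration and is not necessarily the approach taken in the post. The function name, the row layout and the 95/5 split are invented for the example. In practice this would be applied to the training data only, never to the test set.

```python
import random


def oversample_minority(rows, label_of, seed=0):
    """Randomly duplicate rows until every class is as large as the biggest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    largest = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Sample with replacement to make up the shortfall.
        balanced.extend(rng.choices(group, k=largest - len(group)))
    return balanced


# Invented data: 95 negative rows and 5 positive rows, label in position 1.
rows = [("negative", 0)] * 95 + [("positive", 1)] * 5
balanced = oversample_minority(rows, label_of=lambda row: row[1])
positives = sum(1 for row in balanced if row[1] == 1)
negatives = sum(1 for row in balanced if row[1] == 0)
```

Undersampling the majority class is the mirror image of this and trades lost data for a smaller training set.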

Workshop: Web Scraping with R

2018-04-12 Training

Join Andrew Collier and Hanjo Odendaal for a workshop on using R for Web Scraping.

Who should attend?

This workshop is aimed at beginner and intermediate R users who want to learn more about using R for data acquisition and management, with a specific focus on web scraping.

What will you learn?

You will learn:

data manipulation with dplyr, tidyr and purrr;
tools for accessing the DOM;
scraping static sites with rvest;
scraping dynamic sites with RSelenium; and
setting up an automated scraper in the cloud.

See programme below for further details.

Tips for Lightning Talks

2018-04-06 speaking R

It seems a little counter-intuitive, but a 5 minute lightning talk is far more difficult to prepare (and present!) than a standard 20 minute or longer talk. The principal challenge is fitting everything that you want to say into the allotted time, while still maintaining an engaging narrative.

At the recent satRday conference in Cape Town (17 March 2018) we had a number of great lightning talks. A few of the speakers gave us their tips on creating a brilliant lightning talk.

Restoring a Django Backup

2018-02-23 Django

It took me a little while to figure out the correct sequence for restoring a Django backup. If you have borked your database, this is how to put it back together.

Installing DataGrip on Ubuntu

2018-02-16 SQL Linux

SQL Server from Ubuntu

2018-02-05 SQL Linux

Setting up the requisites to access a SQL Server database from Ubuntu.

Installing rJava

2018-02-05 R Linux Docker

Installing the {rJava} package on Ubuntu is not quite as simple as most other R packages. Some quick notes on how to do it.

Linux VM on Azure

2018-02-05 Azure Linux

A quick tutorial on how to create a Linux VM on Azure.

Ethereum: DIY Tools for Smart Contracts

2018-01-19 Ethereum

What tools do you need to start working with Ethereum smart contracts?

The Solidity Online Compiler provides a quick way to experiment with smart contracts without installing any software on your machine. Another promising online alternative is Cosmo.

However at some stage you’ll probably want to put together a local Ethereum development environment. Here are some suggestions for how to do that on an Ubuntu machine.

Since I’m just feeling my way into this new domain, I’m not sure to what degree all of these are necessary. I do know for sure that Truffle and testrpc are crucial.

Ethereum: Running a Node

2018-01-19 Ethereum

Once you’ve installed Geth you’re ready to run your own Ethereum node.

NTP: Synchronise Your Watches

2018-01-11 NTP

Just like an old-fashioned grandfather clock, time on your computer’s clock can slowly drift. You can quickly verify the accuracy of your clock by comparing it to https://time.is/. It’s not unusual for it to be anything from a few seconds to a couple of minutes out. For most purposes this is not a major issue, but there are some applications which are very time sensitive.

NTP (Network Time Protocol) is a protocol which synchronises your computer’s clock with a network of accurate time servers, so that it stays accurate.

There’s a lot to be said about NTP, but this is a quick guide to getting it up and running on an Ubuntu machine.

An Ethereum Package for R

2018-01-07 Ethereum

Charts showing number of Ethereum transactions and unique addresses.

Bitcoin has become synonymous with “cryptocurrency”. Ethereum is another cryptocurrency which, although not as hyped as Bitcoin, presents some attractive characteristics. The foremost of these is the ability to create sophisticated smart contracts.

This post introduces the new ether package for interacting with the Ethereum network from R.

Moving a Running Process to screen

2017-12-30 Linux

I am not sure how many times this has happened to me, but it’s not infrequent. I’m working on a remote session and I start a long running job. Then some time later I want to disconnect from the session but realise that if I do then the job will be killed.

I should have started the job in screen or tmux!

Is it possible to transfer the running process to screen? (Or, equally, to tmux?) Well, it turns out that it is, using the reptyr utility. I discovered this thanks to a LinkedIn post by Bruce Werdschinski. A slight refinement of his process is documented below.

Creating an Amazon Machine Image

2017-12-04 AWS

Creating an Amazon Machine Image (AMI) makes it quick and simple to rebuild a specific EC2 setup. This post illustrates the process by creating an AMI with ethminer and NVIDIA GPU drivers. Of course you’d never use this for mining Ether because the hardware costs are still too high!

Using Large Maps with OSRM

2017-11-27 OSRM

How to deal with large data sets in OSRM? Some quick notes on processing monster PBF files and getting them ready to serve with OSRM.

Something to consider up front: if you are RAM limited then this process is going to take a very long time due to swapping. It might make sense to spin up a big cloud instance (like an r4.8xlarge) for a couple of hours. You’ll get the job done much more quickly and it’ll definitely be worth it.

EC2 Missing Disk Space

2017-11-23 AWS

This morning I created an r3.xlarge spot instance on EC2. The job I’m planning on running requires a good wad of data to be uploaded, which is why I chose the r3.xlarge instance: it’s cost effective and, according to AWS, has 80 GB of SSD storage.

I was a little surprised when I connected to the running instance and found that the root partition was only around 8 GB. This is what I did to claim that missing disk space.

Variable Names: Camel Case to Underscore Delimited

2017-11-20 R

A project I’m working on has a bunch of different data sources. Some of them have column names in Camel Case. Others are underscore delimited. My OCD rebels at this disarray and demands either one or the other.

If it were just a few columns and I was only going to have to do this once, then I’d probably just quickly do it by hand. But there are many columns and it’s very likely that there will be more data in the future and the process will need to be repeated.

Seems like something that should be easy to automate.
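A minimal sketch of the conversion in Python, using two regular expression passes. The post itself works in R, so this is only an illustration, and the function name and sample column names are made up.

```python
import re


def camel_to_snake(name):
    """Convert a CamelCase or camelCase name to underscore-delimited lower case."""
    # Insert an underscore between a lowercase letter or digit and a capital.
    step = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", name)
    # Split a run of capitals from a following capitalised word ("HTTPResponse").
    step = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1_\2", step)
    return step.lower()


# Made-up column names covering the common cases.
columns = ["CamelCase", "userID", "HTTPResponse", "already_snake"]
renamed = [camel_to_snake(column) for column in columns]
```

Names that are already underscore delimited pass through unchanged, so the function can be applied to every column regardless of its source.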

Analysis of Feedback from satRday [Cape Town] 2017

2017-11-15 R satRday Conference

We recently announced the second satRday (Cape Town) conference scheduled to take place on 17 March 2018. Obviously we want this to be bigger and better than this year’s event, so we are paying careful attention to the feedback that we received from the first event.

This is a quick analysis of the feedback. We sold 192 tickets and gave out 11 complimentary tickets to the event. There were 107 responses to the feedback survey, which means that we heard back from more than half of the people who attended, which is hopefully a representative sample.
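The "more than half" figure can be checked with a couple of lines of arithmetic, shown here in Python. The only assumption is that everybody holding a ticket actually attended.

```python
tickets_sold = 192
complimentary = 11
responses = 107

# Assumes that every ticket holder attended.
attendees = tickets_sold + complimentary
response_rate = responses / attendees

print(f"{attendees} attendees, response rate {response_rate:.1%}")
# prints "203 attendees, response rate 52.7%"
```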

Durban Twitter Analysis

2017-11-10 R sentiment

I was invited to give a talk at Digifest (Durban University of Technology) on 10 November 2017. Looking at the other speakers and talks on the programme I realised that my normal range of topics would not be suitable.

Installing NVIDIA Graphics Driver on Ubuntu

2017-10-07 Linux GPU

Recipe for installing the NVIDIA binary drivers on Ubuntu.

Running OSRM with Docker

2017-10-07 Docker OSRM

I’ve now been through the process of setting up OSRM a few times. While it’s not exactly taxing, it seemed like a prime candidate for automation.

Exporting HTML Presentations to PDF

2017-10-05 speaking

Building a presentation with reveal.js is such a pleasure. And the results look so good. Seriously doubt that I will ever use anything like PowerPoint again. Although it’s possible to export a presentation directly to PDF using a style sheet, this doesn’t always work perfectly (IMHO).

Fortunately there’s another way: decktape. It works with reveal.js and a bunch of other HTML5 presentation frameworks.

Quick WordPress Install with Docker

2017-09-22 WordPress MySQL NGINX Docker Linux

I’ve just put together a WordPress site for my older daughter. It’s hosted on DigitalOcean and all of the infrastructure is handled with Docker. This post describes the steps in the (easy) install process.

Diagnosing Killed Jobs on EC2

2017-09-21 Linux AWS

I’ve got a long running optimisation problem on an EC2 instance. Yesterday it was mysteriously killed. I shrugged it off as an anomaly and restarted the job. However, this morning it was killed again. Definitely not a coincidence! I investigated. This is what I found and how I am resolving the problem.

Removing Redundant Hostnames with NGINX

2017-09-15 NGINX Google Analytics

Redundant hostnames dialog in Google Analytics.

Creating a S3 Bucket

2017-09-14 AWS

There are many good reasons to use S3 (Simple Storage Service) storage. This is a quick overview of how to create an S3 bucket.

Installing Docker on Ubuntu

2017-09-14 Docker Linux

This procedure works on both my laptop and a fresh EC2 instance.
