Blog Posts by Andrew B. Collier / @datawookie


Embedding Dependencies into an HTML File

I use HTML to generate slide decks. Usually my HTML will reference a host of other files on my machine (CSS, JavaScript and images). If I want to distribute my deck then I have a couple of options:

  • just send the HTML file without all of the dependencies or
  • send the HTML file and dependencies (normally wrapped up in some sort of archive).

Both of these have problems. In the former case the HTML just ends up looking like ass because it relies on all of those dependencies to sort out the aesthetics. In the latter case I need to take care of the directory structure and, if those dependencies are distributed across my file system (which they generally are!), this can be a challenge.
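A third route is to fold the dependencies into the HTML itself. The sketch below is only an illustration of that idea, not necessarily the approach the full post takes: it inlines local stylesheets as `<style>` blocks and rewrites local image references as base64 data URIs. The function names (`to_data_uri`, `embed_dependencies`) are hypothetical, and the regular expressions assume simple, double-quoted markup; JavaScript files and anything more complicated would need a proper HTML parser.

```python
import base64
import mimetypes
import re
from pathlib import Path


def to_data_uri(path: Path) -> str:
    """Encode a local file as a base64 data URI."""
    mime, _ = mimetypes.guess_type(path.name)
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime or 'application/octet-stream'};base64,{payload}"


def embed_dependencies(html: str, base_dir: Path) -> str:
    """Inline local stylesheets and images referenced by an HTML string."""

    def inline_css(match: re.Match) -> str:
        # Replace the <link> tag with the stylesheet's contents.
        css = (base_dir / match.group(1)).read_text(encoding="utf-8")
        return f"<style>\n{css}\n</style>"

    def inline_img(match: re.Match) -> str:
        # Keep the tag prefix, swap the file path for a data URI.
        return f'{match.group(1)}{to_data_uri(base_dir / match.group(2))}"'

    # Assumption: attributes are double-quoted and rel comes before href.
    html = re.sub(r'<link rel="stylesheet" href="([^"]+)"\s*/?>', inline_css, html)
    # Skip sources that are already data URIs or remote URLs.
    html = re.sub(r'(<img[^>]*\bsrc=")(?!data:|https?:)([^"]+)"', inline_img, html)
    return html
```

The result is a single file that can be sent on its own, at the cost of being larger than the original HTML.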

Read More →

DNS on Ubuntu

For years it’s been simple to set up DNS on a Linux machine. Just add a couple of entries to /etc/resolv.conf and you’re done.
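For reference, those entries are plain `nameserver` lines, optionally preceded by a `search` line. The snippet below is a small sketch that renders such content as a string rather than touching the real file. The helper name `render_resolv_conf` is hypothetical, and the addresses in the usage note are well-known public resolvers used purely as examples.

```python
import ipaddress


def render_resolv_conf(nameservers, search_domains=()):
    """Build resolv.conf-style text from a list of nameserver addresses."""
    lines = []
    if search_domains:
        lines.append("search " + " ".join(search_domains))
    for address in nameservers:
        # Raises ValueError if the address is not valid IPv4 or IPv6.
        ipaddress.ip_address(address)
        lines.append(f"nameserver {address}")
    return "\n".join(lines) + "\n"
```

Calling `render_resolv_conf(["8.8.8.8", "1.1.1.1"])` yields two `nameserver` lines, one per address.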

Read More →

@pyconza (2018): Data Science and Bayes with Python

I’ve just returned from PyConZA (2018), held at the Birchwood Hotel in Boksburg North (Johannesburg) on 11-12 October. A great conference with a super selection of talks and great catering.

Obviously when the PyCon call for papers came out I was feeling ambitious because I submitted a Workshop and a Talk. They were both accepted, so that put the pressure on a bit.

Workshop

I gave a full day pre-conference workshop on 10 October entitled “Introduction to Python for Data Science”. In retrospect it would have been a better idea to call it “Introduction to Data Science using Python”.

Read More →

Docker Images for Spark

I recently put together a short training course on Spark. One of the initial components of the course involved deploying a Spark cluster on AWS. I wanted to have Jupyter Notebook and RStudio servers available on the master node too and the easiest way to make that happen was to install Docker and then run appropriate images.

There’s already a jupyter/pyspark-notebook image which includes Spark and Jupyter. It’s a simple matter to extend the rocker/verse image (which already includes RStudio server, the Tidyverse, devtools and some publishing utilities) to include the sparklyr package.

Read More →

MySQL Server Replication using Binary Logs

Suppose you want to create a replica of your MySQL database. The replica should:

  • start with a complete snapshot of the current (initial) state of the master database and
  • be updated with any changes to the master database.

This post will outline how MySQL server replication can be done using binary logs.

Read More →

DIY VPN with Docker

I’ve worked with both ExpressVPN and NordVPN. Both are great services but, from my perspective, have one major shortcoming: they’re currently blocked by Amazon Web Services (AWS). When using either of them you are simply not able to access any of the AWS services.

The most common scenario in which I’d be using a VPN is if I’m on a restrictive network where I’m only able to access web sites. Typically just ports 80, 8080 and 443 are open. Forget about SSH (port 22), SMTP (ports 25, 465 and 587) or NTP (port 123). I want to be able to connect by SSH to my AWS servers, send mail over SMTP and synchronise my clock. The latter items are normally possible over commercial VPN providers (like ExpressVPN and NordVPN) but not being able to connect to AWS is a deal breaker.

Read More →

Chairing a Conference Session

Being the Chair: Notes on Running a Conference Session.

There are many factors which can determine the success of a conference: the location, the venue, the catering, the speakers, the social programme, the contents of the swag bag… However, in my opinion, one of the most important components of an enjoyable conference is a collection of competent chairpersons, for they will ensure that all aspects of the sessions (the very core of a conference!) run smoothly.

Read More →

Updating R on Ubuntu

Today I finally got around to updating my R to 3.5 (or, more specifically, 3.5.1). The complete instructions for doing the update on Ubuntu are available here. I’ve paraphrased them below.

Read More →

Classification: Get the Balance Right

For classification problems the positive class (which is what you’re normally trying to predict) is often sparsely represented in the data. Unless you do something to address this imbalance, your classifier is likely to be rather underwhelming.

Achieving a reasonable balance in the proportions of the target classes is seldom emphasised. Perhaps it’s not very sexy. But it can have a massive effect on a model.
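One simple way to address the imbalance is random oversampling: duplicate randomly chosen minority-class records until every class is as large as the biggest one. The sketch below shows only that one technique, under the assumption that the data fit in plain Python lists; it is not necessarily the method covered in the full post, and the function name `oversample_minority` is hypothetical. Undersampling the majority class and class weights are common alternatives.

```python
import random
from collections import Counter


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate records so that all classes are equally frequent."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # Records belonging to this class in the original data.
        pool = [row for row, label in zip(rows, labels) if label == cls]
        # Sample with replacement to make up the shortfall (none for the largest class).
        extra = rng.choices(pool, k=target - n)
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels
```

Resampling like this should be applied to the training data only, so that the test set still reflects the real class proportions.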

Read More →

Workshop: Web Scraping with R

Join Andrew Collier and Hanjo Odendaal for a workshop on using R for Web Scraping.

Who should attend?

This workshop is aimed at beginner and intermediate R users who want to learn more about using R for data acquisition and management, with a specific focus on web scraping.

What will you learn?

You will learn:

  • data manipulation with dplyr, tidyr and purrr;
  • tools for accessing the DOM;
  • scraping static sites with rvest;
  • scraping dynamic sites with RSelenium; and
  • setting up an automated scraper in the cloud.

See programme below for further details.

Read More →

Tips for Lightning Talks

It seems a little counter-intuitive, but a 5-minute lightning talk is far more difficult to prepare (and present!) than a standard 20-minute or longer talk. The principal challenge is fitting everything that you want to say into the allotted time, while still maintaining an engaging narrative.

At the recent satRday conference in Cape Town (17 March 2018) we had a number of great lightning talks. A few of the speakers gave us their tips on creating a brilliant lightning talk.

Read More →

Restoring a Django Backup

It took me a little while to figure out the correct sequence for restoring a Django backup. If you have borked your database, this is how to put it back together.

Read More →