Blog Posts by Andrew B. Collier / @datawookie


{emayili} Rendering Plain Markdown

We’ve been able to attach text and HTML content to messages with {emayili}. But something that I’ve really been wanting to do is render Markdown directly into an email.

In version 0.4.19 I’ve added the ability to directly render Plain Markdown into a message. That version is not on CRAN, so you’ll need to install from GitHub.

Read More →

{clockify} Time Tracking from R

At Fathom Data we use Clockify to keep detailed records of the time that we spend working on our clients' projects. Up until fairly recently we manually generated timesheets at the end of each month that were sent through to the clients along with their invoices. Our experience has been that providing detailed timesheets helps foster trust and transparency. However, with a growing team and an expanding clientele, generating these timesheets has become progressively more laborious. Time to automate!

Read More →

Setting up a Tiny HTTP Proxy

It’s often handy to have access to an HTTP proxy. I use this recipe from time to time to quickly fling together a proxy server which I can use to relay HTTP requests from a different origin.

Read More →

Pre-Commit Hook for Processing README.Rmd

When writing an R package I usually create a README.Rmd file that I render to README.md. I use {pkgdown} to then create documentation. I run the last step via CI, so once it’s set up I never need to think about it again. The problem is that I regularly forget to process the README.Rmd file, which means that despite keeping that up to date, everything else lags behind.

Read More →

{emayili} Rudimentary Email Address Validation

A recent issue on the {emayili} GitHub repository prompted me to think a bit more about email address validation. When I started looking into this I was somewhat surprised to learn that it’s such a complicated problem. Who would have thought that something as apparently simple as an email address could be linked with such complexity?

Read More →

Old ‘Hood, New ‘Hood

Image adapted from the cover of 'Old Hat New Hat' by Stan and Jan Berenstain.

I recently moved from suburban South Africa to rural England. I’m figuring out my new environment. Making some maps seemed to be a good way to get familiar with the surroundings.

In the process I wanted to figure out two things:

  • how to get maps with a consistent aspect ratio at different latitudes; and
  • how to overlay a partially transparent map layer.

To make things more interesting I’ll create maps of both my old and new locations.
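The first point comes down to simple geometry: a degree of longitude covers less ground as you move away from the equator (by a factor of the cosine of the latitude), so a bounding box specified in degrees changes shape with latitude. Below is a minimal sketch of one way around this, written in Python and assuming a spherical Earth. It is not the approach from the post itself. The function name `bounding_box`, the constant `KM_PER_DEGREE` and the two centre points are all hypothetical, chosen purely for illustration.

```python
import math

# Approximate length of one degree of latitude, assuming a spherical Earth.
KM_PER_DEGREE = 111.32


def bounding_box(lat, lon, width_km, height_km):
    """Return (west, south, east, north) in degrees for a box centred on (lat, lon).

    The box is specified by its ground extent in kilometres, so it has the
    same aspect ratio on the ground regardless of latitude.
    """
    half_dlat = height_km / 2 / KM_PER_DEGREE
    # A degree of longitude shrinks by cos(latitude) away from the equator,
    # so more degrees are needed to span the same distance.
    half_dlon = width_km / 2 / (KM_PER_DEGREE * math.cos(math.radians(lat)))
    return (lon - half_dlon, lat - half_dlat, lon + half_dlon, lat + half_dlat)


# Hypothetical centre points, one in each hemisphere (illustration only).
for name, lat, lon in [("south", -30.0, 31.0), ("north", 51.0, -2.5)]:
    west, south, east, north = bounding_box(lat, lon, 20, 20)
    print(name, "longitude span:", round(east - west, 3),
          "latitude span:", round(north - south, 3))
```

The latitude span is the same for both boxes, while the longitude span grows with distance from the equator. Both boxes cover the same 20 km by 20 km patch of ground.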

Read More →

Websockify & noVNC behind an NGINX Proxy

At Fathom Data we are developing a framework which will enable remote access to Linux desktops via a browser. There’s nothing new to this idea. However, we have a very specific application in mind, so we need to roll our own solution. Importantly, there need to be multiple independent connections catering for a group of users. In this post I’ll show how we used the following tools to make this possible:

Read More →

TomTom Routing

While working with the Google Mobility Data I stumbled upon the TomTom Traffic Index. I then learned that TomTom has a public API which exposes a bunch of useful and interesting data. Seemed like another opportunity to create a small R package. Enter {tomtom}.

{tomtom} Package

The {tomtom} package can be found here. Install the package.

    remotes::install_github("datawookie/tomtom")

Load the package.

    library(tomtom)

API Key

Getting a key for the API is quick and painless.

Read More →

Fixing Truncated Logs on GitLab CI/CD

I’ve got a few CI/CD jobs running on GitLab which produce long logs. So much so that the logs get truncated.

    Job's log exceeded limit of 4194304 bytes.

There’s a fundamental problem with this though: if something’s going to break then it’s inevitably going to happen after the logs have been truncated so I won’t be able to actually see what’s broken. Fortunately this is easily fixed, provided that you have access to the configuration for your GitLab Runner.

Read More →

Mobility & Unrest in South Africa

Did the recent unrest in South Africa have a detectable effect on mobility patterns?

Google Mobility Data

Google has used anonymised personal location data to gather information on mobility during the COVID-19 pandemic. These data are freely available in CSV format and regularly updated.

{mobility} Package

I created a small R package, {mobility}, which wraps the Google Mobility Data. The package is available from GitHub and the data are updated daily using GitHub Actions.

Read More →

SSH Tunnel from Docker

I’m building a crawler which I’m going to wrap up in a Docker image. The crawler writes data to a remote MySQL database. However, there’s a catch: the database connection is via an SSH tunnel. Another wrinkle: the crawler is going to be run on ECS, so the whole thing (including setting up the SSH tunnel) needs to be baked into the Docker image. This post illustrates the process of connecting to a remote MySQL database via an SSH tunnel from Docker.

Read More →

Shiny on ECS

A recipe for setting up a simple Shiny app on ECS.

Docker Image

📢 If you want to simply use my Docker image for testing, then you can skip this section and go straight to deploying.

Since this is for illustration purposes, I’m going to keep the app itself super simple, just using a slightly modified version of the “Old Faithful Geyser Data” template app, faithful.R. We need to wrap this up in a Docker image.

Read More →

Adding Swap Space on Ubuntu

Most people running a Linux system would agree that you should set up swap. According to the poll below, only 28% believe that no swap is required. And I think that they are misguided. Always put some swap on your system. You’ll never regret it.

There are two approaches to adding swap to your system: a separate swap partition or a swap file on an existing partition.

Swap Partition

You can set up an entire separate partition which is dedicated to swap.

Read More →

Scrapy with a Rotating Tor Proxy

This post shows an approach to using a rotating Tor proxy with Scrapy. I’m using the scrapy-rotating-proxies download middleware package to rotate through a set of proxies, ensuring that my requests are originating from a selection of IP addresses. However, I need to have those IP addresses evolve over time too, so I’m using the Tor network.

Setup

I’ve got the following in the settings.py for my Scrapy project:

    DOWNLOADER_MIDDLEWARES = { 'rotating_proxies.

Read More →

RAM & CPU Requirements for a Selenium Crawler

How much memory and CPU resources should be allocated to a simple Selenium crawler? I’ve been fudging these parameters but the time has come to man up and do this right. I want my task to have sufficient resources that it’s able to perform its function. It should never be starved of resources! But, at the same time, I also don’t want to extravagantly allocate excess resources. More resources → higher costs.

Read More →