Andrew B. Collier / @datawookie

AWS Containers #1: Creating an ECS Cluster

2021-04-25 ECS AWS

In the last few posts we’ve looked at several ways to set up the infrastructure for a Selenium crawler, using Docker to run both the crawler and Selenium. In this post we’ll launch this setup in the cloud using AWS Elastic Container Service (ECS).

Read More →

Selenium Crawler #3: Docker Compose

2021-04-19 Python Selenium Docker

In two previous posts we’ve looked at how to set up a simple scraper which uses Selenium in Docker, communicating via the host network and bridge network. Both of those setups have involved launching separate containers for the scraper and Selenium. In this post we’ll see how to wrap everything up in a single entity using Docker Compose.

Read More →

Selenium Crawler #2: Docker Bridge Network

2021-04-18 Python Selenium Docker

In the previous post we set up a scraper template which used Selenium on Docker via the host network. Now we’re going to do essentially the same thing but using a bridge network.

Read More →

Selenium Crawler #1: Docker Host Network

2021-04-17 Python Selenium Docker

This post will show you how to set up the following:

a Selenium instance and
a simple script connecting to Selenium.

Both of these will run in Docker containers and will communicate over the host network.
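As a rough illustration of what such a script might look like, here is a minimal Python sketch. It is not the code from the post: the port (4444), the `/wd/hub` path, the choice of Chrome, and the helper names `selenium_url` and `fetch_title` are all assumptions made for the example.

```python
def selenium_url(host="localhost", port=4444):
    """Build the address of the Selenium server.

    Assumption: Selenium listens on port 4444 and accepts the /wd/hub path.
    """
    return f"http://{host}:{port}/wd/hub"


def fetch_title(page, host="localhost", port=4444):
    """Open `page` in a remote Chrome session and return the page title."""
    # Imported lazily so that selenium_url() works without Selenium installed.
    from selenium import webdriver

    driver = webdriver.Remote(
        command_executor=selenium_url(host, port),
        options=webdriver.ChromeOptions(),
    )
    try:
        driver.get(page)
        return driver.title
    finally:
        # Always end the session so the browser is released on the server.
        driver.quit()
```

On the host network both containers share the host's network stack, so the script can reach Selenium at `localhost` without any port mapping or container name lookup.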

Read More →

{hagr} Linnaean Classification

2021-04-16 {hagr} R

I’ve taken another look at the {hagr} data, which I wrote about previously. This time I’m focusing on the hierarchy of creatures.

Taxonomic Rank

The Linnaean Taxonomy is a hierarchical classification system for organisms devised by Carl Linnaeus. An organism is assigned to the following levels in the hierarchy (in increasing order of granularity):

domain
kingdom
phylum
class
order
family
genus and
species.

The relative level of a group of organisms in this hierarchy determines its taxonomic rank.
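The ordering of ranks can be sketched as an ordered enumeration. This is an illustrative Python sketch only (the post itself works in R), and the names `Rank` and `is_more_specific` are hypothetical, not part of {hagr}.

```python
from enum import IntEnum


class Rank(IntEnum):
    """Taxonomic ranks, numbered from least to most granular."""
    DOMAIN = 1
    KINGDOM = 2
    PHYLUM = 3
    CLASS = 4
    ORDER = 5
    FAMILY = 6
    GENUS = 7
    SPECIES = 8


def is_more_specific(a, b):
    """Return True if rank `a` sits below rank `b` in the hierarchy."""
    return a > b
```

Because the ranks are integers, comparing two groups reduces to comparing their positions in the hierarchy, for example `is_more_specific(Rank.GENUS, Rank.FAMILY)`.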

Read More →

{hagr} Database of Animal Ageing and Longevity

2021-04-12 {hagr} R

I came across the Human Ageing Genomic Resources. They are doing some fascinating work and expose some engrossing data. I wanted to make the data easier for me to work with, and an R package seemed to be the natural vehicle to do this.

For more information on these data, take a look at this article: Tacutu, Craig, Budovsky, Wuttke, Lehmann, Taranukha, Costa, Fraifeld and de Magalhaes, “Human Ageing Genomic Resources: Integrated databases and tools for the biology and genetics of ageing,” Nucleic Acids Research 41(D1):D1027-D1033, 2013.

Read More →

The Easter Bunny is Cashing In

2021-04-03 {trundler} R

Has the price of Easter Eggs shot up since last year? Let’s use data from Trundler to investigate. I’ll do the analysis in R using the {trundler} package.

Read More →

Making the Most of Mobility

2021-04-02 R

The Google Mobility Data (or Community Mobility Reports) refers to the datasets provided by Google which track how people move and congregate in various locations during specific time periods. The data is based on anonymised location information from users who have opted into Location History on their Google accounts.

Read More →

An Environment for Reliably Rendering Figures in R

2021-03-23 Docker R

Fathom Data is working on a project to reproduce the figures from the CORE textbook The Economy using R and {ggplot2}. There’s a strict style guide which specifies the figure aesthetics including colours and font. We’re a team of seven people working on as many different setups. The principal challenges have been package versions and fonts.

Read More →

Flexible Environment Variables for a Docker Image

2021-03-22 Docker CI GitLab

I’ve been following an excellent tutorial for deploying a Docker image on an EC2 instance via GitLab CI/CD. It covers every step in the process in great detail. If you follow the steps then you’ll definitely end up with a working pipeline.

However, I still wasn’t quite sure how to handle the environment variables and credentials that I wanted to bake into the image, and which varied between my local development environment and the final deployed image.

Read More →

Install GitLab Runner with Docker

2021-03-21 Docker CI GitLab

📢 An updated version of this post reflecting recent changes in GitLab Runner can be found here.

I’ve got a project which takes a long time to build. And I rebuild it regularly. I’ve been using the shared runners on GitLab. However, the total time constraint has become a limitation. I’m going to install GitLab Runner as a Docker service on an underutilised EC2 instance.

Read More →

{emayili} UTF-8 Filenames & Setting Sender

2021-03-08 {emayili} R

Two new features in the {emayili} (0.4.6) package for easily sending emails from R.

Package Setup

If you have not already installed the package, then grab it from CRAN or GitHub.

# From CRAN.
install.packages("emayili")
# From GitHub.
remotes::install_github("datawookie/emayili")

Load the package.

library(emayili)

Check that you have the current version.

packageVersion("emayili")

[1] ‘0.4.6’

Let’s quickly set up an SMTP server. We’ll use SMTP Bucket, which is incredibly convenient for testing.

SMTP_SERVER   = "mail.smtpbucket.com"
SMTP_PORT     = 8025

smtp <- server(host = SMTP_SERVER, port = SMTP_PORT)

UTF-8 Characters in Attachment Filenames

It’s now possible to attach files with names that include non-ASCII characters. Suppose I wanted to send this image (source) of Wenceslao Moreno.

Read More →

Resurrecting MySQL into PostgreSQL with PGLoader

2021-03-02 MySQL PostgreSQL Docker PGLoader

I’ve been hosting a MySQL database on a DigitalOcean server for a few years. The project has been on hold for a while. Entropy kicked in and the server became unreachable. Fortunately I was still able to access the server via a recovery console to export the database using mysqldump and download the resulting SQL dump file.

Now I want to resurrect the database locally but I also want to migrate it to PostgreSQL.

Read More →

{blogdown}: Optimise PNG Image Size

2021-02-21 {blogdown} R

Inspired by the informative post from Jumping Rivers about selecting the correct image file type, I decided to optimise PNG file size as part of this blog’s CI pipeline.

Read More →

{emayili} Sending Birthday Messages

2021-02-18 {emayili} R

Suppose that you want to use {emayili} to send birthday messages (this post was motivated by issue #61).

Read More →

Setting up postref Shortcode for Remote Blog

2021-02-10 {blogdown} Hugo R

I ran into a bit of a snag when updating a {blogdown} site. Suddenly, inexplicably, the images were no longer present.

Read More →

Launching Selenium with JavaScript Disabled

2021-02-03 Python Selenium web scraping

I have a rather obscure situation where I want to launch Selenium… but with JavaScript disabled.

Read More →

Levies, Tax and the Fuel Price in South Africa

2021-02-01 {saffer} R

According to the Automobile Association (AA) the fuel price is the sum of four main components:

the basic fuel price
the general fuel levy
the Road Accident Fund (RAF) levy and
wholesale and retail margins, distribution and transport costs.

This article suggests that almost 70% of the fuel price in South Africa is due to taxes and levies.

How much of South Africa's petrol price goes to taxes https://t.co/m96j0TaKlS
— BusinessTech (@BusinessTechSA) January 29, 2021

I used data from {saffer} to examine this assertion.

Read More →

This is not Rain: It’s a Trickle

2021-01-30 Selenium web scraping R

I started using rain (a South African ISP) back in March 2019. The coverage was good (the only place I couldn’t get a signal was at Lanseria Airport), while the bandwidth was consistently high. I loved the fact that it was affordable, reliable and portable.

Much has changed.

Read More →

Persistent Selenium Sessions

2021-01-28 Python Selenium

I have a project where I need to have a persistent Selenium session. There’s a script which will leave a browser window open when it exits. When the script runs again it should connect to the same window.

Read More →