Andrew B. Collier / @datawookie


Social links and a link to my CV.

Public datasets:


Fixing Truncated Logs on Gitlab CI/CD

I’ve got a few CI/CD jobs running on GitLab that produce long logs, which in turn get truncated. Since the most interesting stuff normally happens towards the end of the logs (like errors!), this can be really counter-productive.

Job's log exceeded limit of 4194304 bytes.

There’s a fundamental problem with this though: if something’s going to break then it’s inevitably going to happen after the logs have been truncated so I won’t be able to actually see what’s broken.

Read More →

SSH Tunnel from Docker

I’m building a crawler which I’m going to wrap up in a Docker image. The crawler writes data to a remote MySQL database. However, there’s a catch: the database connection is via an SSH tunnel. Another wrinkle: the crawler is going to be run on ECS, so the whole thing (including setting up the SSH tunnel) needs to be baked into the Docker image.

This post illustrates the process of connecting to a remote MySQL data via a SSH tunnel from Docker. I’m not sure how secure this is. And there are probably better ways to do this. But it’s a start and it works!

Read More →

Adding Swap Space on Ubuntu

Most people running a Linux system would agree that you should set up swap. According to the poll below, only 28% believe that no swap is required. And I think that they are misguided. Always put some swap on your system. You’ll never regret it.

Read More →

Scrapy with a Rotating Tor Proxy

This post shows an approach to using a rotating Tor proxy with Scrapy.

I’m using the scrapy-rotating-proxies download middleware package to rotate through a set of proxies, ensuring that my requests are originating from a selection of IP addresses. However, I need to have those IP addresses evolve over time too, so I’m using the Tor network.

Setup

I’ve got the following in the settings.py for my Scrapy project:

Read More →

RAM & CPU Requirements for a Selenium Crawler

How much memory and CPU resources should be allocated to a simple Selenium crawler? I’ve been fudging these parameters but the time has come to man up and do this right.

I want my task to have sufficient resources that it’s able to perform its function. It should never be starved of resources! But, at the same, I also don’t want to extravagantly allocate excess resources. More resources → higher costs. I want to allocate the minimal resources to get the job done.

Read More →

Shiny Inception: JavaScript in Rendered Markdown

I’m busy helping a colleague with a Shiny application. The application includes HTML content rendered from a .Rmd document. However, there’s a catch: the .Rmd uses the {DT} package to render a dynamic DataTable. It turns out that this doesn’t immediately work because the JavaScript in the embedded document isn’t run.

I’ll use a simple document and application structure to illustrate the problem.

Static Document

Let’s start with a .Rmd document which renders two different static views of the Palmer Archipelago (Antarctica) Penguin Data.

Read More →

Building an Airflow Environment in Docker

We’re developing some training about Apache Airflow and need to have a robust and portable environment for running demos and labs which we can make available to the class. This will reduce the frustration and time wasted getting everybody set up and ensure that everybody is working in the same environment.

Read More →

Desktop in Docker

We’re building a new training program around Apache Airflow. The major technical challenge with delivering this sort of program is ensuring that everybody in the class has access to a working version of the technology. Since there is generally a diverse range of setups (operating systems, corporate firewalls and personal configurations) this can really be a nightmare.

Read More →

AWS EC2: Setting up a Load Balancer

An Application Load Balancer receives requests and distributes them across a selection of processing resources. These processing resources are divided into Target Groups (see previous post for how to set one up).

Creating an Application Load Balancer

We’re setting up a Flask API which is deployed as a Docker image and running on ECS. We’re going to create a load balancer which will accept requests on port 80 and route them to port 5000 on the API container.

Read More →

AWS EC2: Creating a Target Group

A banner image for AWS/EC2.

If we want to have an ECS service which is visible to the public, then we need to set up an Application Load Balancer. There are a couple of steps to this process, the first of which is creating a Target Group.

Read More →

AWS Containers #4: Dependencies

A banner image for AWS/ECS.

We saw in a previous post that it’s important to ensure that the Selenium container is running and accepting requests before the crawler actually gets started. This is because the crawler depends on Selenium being available. We can use ECS task dependencies to assert this dependency.

Read More →

AWS Containers #5: Health Checks

A banner image for AWS/ECS.

Can we create a health check that will check if the Selenium service is available? Yes! We will need to do two things:

  • tell the crawler container to wait for the Selenium container to be HEALTHY and
  • add a health check to the Selenium container.

Let’s do it!

Read More →