Chrome & ChromeDriver in Docker

When I containerised Selenium crawlers in the past, I normally had the crawler connect to Selenium through a remote driver: Selenium ran in a separate container and the crawler reached it on port 4444. This has proven to be a robust design. However, it means running two containers rather than one, which raises both the maintenance burden and the resource requirements.

What about simply embedding Chrome and ChromeDriver directly into the crawler image? It requires a bit more work, but it’s worth it. The critical point is ensuring compatible versions of Chrome and ChromeDriver.

Recent distributions of Chrome and ChromeDriver can be found here. The Dockerfile extract below installs from that location.

RUN apt-get update -qq -y && \
    apt-get install -y \
        libasound2 \
        libatk-bridge2.0-0 \
        libgtk-4-1 \
        libnss3 \
        xdg-utils \
        unzip \
        wget && \
    wget -q -O chrome-linux64.zip https://bit.ly/chrome-linux64-121-0-6167-85 && \
    unzip chrome-linux64.zip && \
    rm chrome-linux64.zip && \
    mv chrome-linux64 /opt/chrome/ && \
    ln -s /opt/chrome/chrome /usr/local/bin/ && \
    wget -q -O chromedriver-linux64.zip https://bit.ly/chromedriver-linux64-121-0-6167-85 && \
    unzip -j chromedriver-linux64.zip chromedriver-linux64/chromedriver && \
    rm chromedriver-linux64.zip && \
    mv chromedriver /usr/local/bin/
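
Since compatible versions are the critical point, it can be worth confirming that the two binaries agree once the image is built. The following is a minimal sketch rather than a definitive check: the helper names are hypothetical, it assumes that both `chrome` and `chromedriver` are on the `PATH` and print a four-part version number when run with `--version`, and it treats a matching major.minor.build as "compatible", which is an assumption of this example.

```python
import re
import subprocess

# Matches a four-part version number such as 121.0.6167.85.
VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)\.(\d+)")


def parse_version(output: str) -> tuple[int, ...]:
    """Extract the first four-part version number from `--version` output."""
    match = VERSION_RE.search(output)
    if match is None:
        raise ValueError(f"no version number found in {output!r}")
    return tuple(int(part) for part in match.groups())


def versions_compatible(chrome_output: str, driver_output: str) -> bool:
    """Assumption of this sketch: equal major.minor.build means compatible."""
    return parse_version(chrome_output)[:3] == parse_version(driver_output)[:3]


def check_installed(chrome: str = "chrome", driver: str = "chromedriver") -> bool:
    """Run both binaries with --version and compare what they report."""
    outputs = [
        subprocess.run(
            [binary, "--version"], capture_output=True, text=True, check=True
        ).stdout
        for binary in (chrome, driver)
    ]
    return versions_compatible(*outputs)
```

Calling `check_installed()` inside the container (for example from a smoke test) should return `True` when both downloads come from the same release.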

Fedora Base Image

The setup above assumes that you are using a Debian base image. Here are the dependencies for images using the dnf or yum package managers.

RUN dnf update -y && \
    dnf install -y \
        wget \
        unzip \
        gtk3 \
        nss \
        atk \
        at-spi2-atk \
        cups-libs \
        libdrm \
        libxkbcommon \
        libXcomposite \
        libXdamage \
        libXrandr && \
    dnf clean all

Launching Selenium

Node

If you’re going to be running a Node application then you’ll want to use a node base image.

FROM node:18

# 🚀 Install Chrome and Chromedriver here!

WORKDIR /usr/src/app
COPY package*.json ./

ENV CHROMEDRIVER_SKIP_DOWNLOAD=true

RUN npm install --omit=dev
RUN npm install chromedriver

COPY . .

CMD [ "npm", "start" ]

Setting the CHROMEDRIVER_SKIP_DOWNLOAD environment variable is important: it stops the chromedriver package from downloading its own, possibly incompatible, version of ChromeDriver when it is installed.

Test with something like this:

import { Builder } from 'selenium-webdriver';
import chrome from 'selenium-webdriver/chrome.js';

console.log("Start the browser.")

let chromeOptions = new chrome.Options();
chromeOptions.addArguments('--headless', '--disable-gpu', '--no-sandbox');

let driver = new Builder()
    .forBrowser('chrome')
    .setChromeOptions(chromeOptions)
    .build();

console.log("Done!")

console.log("Open Google.")
await driver.get("https://google.com");
console.log("Done!")

const html = await driver.getPageSource();

await driver.quit();

Possibly not all of those options are necessary. The --no-sandbox option is required, though: without it Chrome fails to launch with an error about running as the root user.

Python

Similarly, for Python you’ll want a python base image.

FROM python:3.11.4

# 🚀 Install Chrome and Chromedriver here!

RUN pip3 install selenium==4.18.1

Test with something like this:

from selenium.webdriver import Chrome, ChromeOptions

options = ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

driver = Chrome(options=options)

driver.get('https://google.com')

print(driver.page_source)

driver.quit()
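
Because --no-sandbox is only needed to get around the root user error, one option is to add it only when the process really is running as root. This is a rough sketch with hypothetical helper names (`chrome_arguments`, `is_root`, `build_options`). Whether Chrome's sandbox then works for a non-root user depends on how the container is configured, so treat it as a starting point rather than a guarantee.

```python
import os


def chrome_arguments(running_as_root: bool) -> list[str]:
    """Return the Chrome flags used above, adding --no-sandbox only for root."""
    arguments = ["--headless", "--disable-gpu", "--disable-dev-shm-usage"]
    if running_as_root:
        arguments.append("--no-sandbox")
    return arguments


def is_root() -> bool:
    """True when the current process has an effective UID of 0 (POSIX only)."""
    return hasattr(os, "geteuid") and os.geteuid() == 0


def build_options():
    """Build ChromeOptions from the flags; selenium is imported lazily."""
    from selenium.webdriver import ChromeOptions

    options = ChromeOptions()
    for argument in chrome_arguments(is_root()):
        options.add_argument(argument)
    return options
```

The result can be passed straight to the driver with `Chrome(options=build_options())`.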