When I containerised Selenium crawlers in the past, I normally used a remote driver connection from the crawler to Selenium, which ran in a separate Docker container and was reached via port 4444. This has proven to be a robust design. However, it does mean two containers rather than just one, leading to a higher maintenance burden and elevated resource requirements.
What about simply embedding Chrome and ChromeDriver directly into the crawler image? It requires a bit more work, but it’s worth it. The critical point is ensuring compatible versions of Chrome and ChromeDriver.
Recent distributions of Chrome and ChromeDriver can be found here. The Dockerfile extract below installs from that location.
# unzip is needed to unpack the archives; both downloads are pinned to the same version.
RUN apt-get update -qq -y && \
    apt-get install -y \
        libasound2 \
        libatk-bridge2.0-0 \
        libgtk-4-1 \
        libnss3 \
        xdg-utils \
        unzip \
        wget && \
    wget -q -O chrome-linux64.zip https://bit.ly/chrome-linux64-121-0-6167-85 && \
    unzip chrome-linux64.zip && \
    rm chrome-linux64.zip && \
    mv chrome-linux64 /opt/chrome/ && \
    ln -s /opt/chrome/chrome /usr/local/bin/ && \
    wget -q -O chromedriver-linux64.zip https://bit.ly/chromedriver-linux64-121-0-6167-85 && \
    unzip -j chromedriver-linux64.zip chromedriver-linux64/chromedriver && \
    rm chromedriver-linux64.zip && \
    mv chromedriver /usr/local/bin/
Fedora Base Image
The setup above assumes that you are using a Debian base image. Here are the dependencies for images using the dnf or yum package managers.
RUN dnf update -y && \
    dnf install -y \
        wget \
        unzip \
        gtk3 \
        nss \
        atk \
        at-spi2-atk \
        cups-libs \
        libdrm \
        libxkbcommon \
        libXcomposite \
        libXdamage \
        libXrandr && \
    dnf clean all
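Since the critical point is keeping Chrome and ChromeDriver on compatible versions, it may be worth checking the installed pair once the image is built. The following is a minimal Python sketch, not part of the original setup. It assumes that both binaries are on the PATH as chrome and chromedriver (as in the Debian extract above), that each prints a dotted version number in response to --version, and that agreement on the major version is a sufficient check. The function names are hypothetical.

```python
import re
import shutil
import subprocess
from typing import Optional, Tuple


def parse_version(output: str) -> Tuple[int, ...]:
    """Extract the first dotted version (e.g. 121.0.6167.85) from --version output."""
    match = re.search(r"(\d+(?:\.\d+)+)", output)
    if match is None:
        raise ValueError(f"no version number found in {output!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def versions_compatible(chrome_output: str, driver_output: str) -> bool:
    """Assumption: the pair is compatible when the major versions agree."""
    return parse_version(chrome_output)[0] == parse_version(driver_output)[0]


def installed_version_output(binary: str) -> Optional[str]:
    """Return the --version output of a binary on the PATH, or None if absent."""
    path = shutil.which(binary)
    if path is None:
        return None
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout


if __name__ == "__main__":
    chrome = installed_version_output("chrome")
    driver = installed_version_output("chromedriver")
    if chrome is None or driver is None:
        print("chrome or chromedriver was not found on the PATH")
    elif versions_compatible(chrome, driver):
        print("Chrome and ChromeDriver major versions match")
    else:
        print("Version mismatch:", chrome.strip(), "vs", driver.strip())
```

Run inside the container, this reports a mismatch before the crawler itself fails with a less obvious session error.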
Launching Selenium
Node
If you’re going to be running a Node application then you’ll want to use a node base image.
FROM node:18
# 🚀 Install Chrome and ChromeDriver here!
WORKDIR /usr/src/app
COPY package*.json ./
ENV CHROMEDRIVER_SKIP_DOWNLOAD=true
RUN npm install --omit=dev
RUN npm install chromedriver
COPY . .
CMD [ "npm", "start" ]
Setting the CHROMEDRIVER_SKIP_DOWNLOAD environment variable is important: it stops the chromedriver package from downloading its own ChromeDriver binary, which may not be compatible with the Chrome installed in the image.
Test with something like this:
import { Builder } from 'selenium-webdriver';
import chrome from 'selenium-webdriver/chrome.js';

console.log("Start the browser.");
// Run Chrome headless; --no-sandbox is needed when running as root.
const chromeOptions = new chrome.Options();
chromeOptions.addArguments('--headless', '--disable-gpu', '--no-sandbox');
const driver = await new Builder()
  .forBrowser('chrome')
  .setChromeOptions(chromeOptions)
  .build();
console.log("Done!");

try {
  console.log("Open Google.");
  await driver.get("https://google.com");
  console.log("Done!");
  const html = await driver.getPageSource();
  console.log(`Retrieved ${html.length} characters of HTML.`);
} finally {
  // Always shut the browser down, even if the page load fails.
  await driver.quit();
}
Possibly not all of those options are necessary. The --no-sandbox option is required though, otherwise you get an error related to launching Chrome as the root user.
Python
Similarly, for Python you’ll want a python base image.
FROM python:3.11.4
# 🚀 Install Chrome and ChromeDriver here!
RUN pip3 install selenium==4.18.1
Test with something like this:
from selenium.webdriver import Chrome, ChromeOptions
options = ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = Chrome(options=options)
driver.get('https://google.com')
print(driver.page_source)
driver.quit()
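The test above can be tightened a little for use in a real crawler. The sketch below is one possible arrangement, not a definitive implementation. It adds --no-sandbox only when the process runs as root (a non-root container may still need it, depending on its security settings), points Selenium explicitly at the binaries installed by the Debian extract, and uses try/finally so the browser is always closed. The paths /opt/chrome/chrome and /usr/local/bin/chromedriver are taken from that extract and are an assumption for any other layout. The helper names chrome_arguments, build_driver and fetch_page_source are hypothetical.

```python
import os
from typing import List, Optional

# Locations used by the Debian Dockerfile extract; adjust for other layouts.
CHROME_BINARY = "/opt/chrome/chrome"
CHROMEDRIVER_BINARY = "/usr/local/bin/chromedriver"


def chrome_arguments(as_root: Optional[bool] = None) -> List[str]:
    """Build the Chrome flags, adding --no-sandbox only for the root user."""
    if as_root is None:
        as_root = os.geteuid() == 0
    arguments = ["--headless", "--disable-gpu", "--disable-dev-shm-usage"]
    if as_root:
        arguments.append("--no-sandbox")
    return arguments


def build_driver():
    """Start Chrome using the explicitly pinned browser and driver binaries."""
    # Imported here so chrome_arguments() can be used without Selenium installed.
    from selenium.webdriver import Chrome, ChromeOptions
    from selenium.webdriver.chrome.service import Service

    options = ChromeOptions()
    options.binary_location = CHROME_BINARY
    for argument in chrome_arguments():
        options.add_argument(argument)
    service = Service(executable_path=CHROMEDRIVER_BINARY)
    return Chrome(service=service, options=options)


def fetch_page_source(url: str) -> str:
    """Load a page and return its HTML, quitting the browser even on failure."""
    driver = build_driver()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()
```

Inside the container, calling print(fetch_page_source('https://google.com')) reproduces the original test.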