Selenium Crawler #2: Docker Bridge Network

In the previous post we set up a scraper template which used Selenium on Docker via the host network. Now we’re going to do essentially the same thing but using a bridge network.

Default Network

We’ll start by using the Docker default bridge network.

docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
ea5ebd23a086        bridge              bridge              local
bb80a2809880        host                host                local
00b74ecbf970        none                null                local

These three networks will always be available: bridge, host and none. We’re only interested in the first one.

Let’s create a Selenium container.

docker run -d --rm --name selenium selenium/standalone-chrome:3.141
cede2a2e6fc279fcb2014f290cc5e324d86f2033d04cca1b2da59c03e121aec5

Now if we inspect the bridge network we’ll see that the selenium container is connected.

docker network inspect bridge
[
    {
        "Name": "bridge",
        "IPAM": {
            "Config": [
                {
                    "Subnet": "172.17.0.0/16",
                    "Gateway": "172.17.0.1"
                }
            ]
        },
        "Containers": {
            "cede2a2e6fc279fcb2014f290cc5e324d86f2033d04cca1b2da59c03e121aec5": {
                "Name": "selenium",
                "MacAddress": "02:42:ac:11:00:02",
                "IPv4Address": "172.17.0.2/16",
                "IPv6Address": ""
            }
        }
    }
]
The above output has been abridged for clarity.

We can see that the gateway between the host and the bridge network has an IP of 172.17.0.1 and that the selenium container is at 172.17.0.2.
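
The same details can be pulled out programmatically. The sketch below is an illustration rather than part of the original setup: it parses an abridged copy of the inspect output (with the container ID shortened) and checks that each container address really does fall inside the network’s subnet. The `container_addresses` helper name is made up for this example.

```python
import ipaddress
import json

# Abridged `docker network inspect bridge` output, based on the listing above
# (container ID shortened for readability).
INSPECT_OUTPUT = """
[
    {
        "Name": "bridge",
        "IPAM": {
            "Config": [
                {"Subnet": "172.17.0.0/16", "Gateway": "172.17.0.1"}
            ]
        },
        "Containers": {
            "cede2a2e6fc2": {
                "Name": "selenium",
                "IPv4Address": "172.17.0.2/16"
            }
        }
    }
]
"""


def container_addresses(inspect_json):
    """Map container name -> IPv4 address for the first network in the output."""
    network = json.loads(inspect_json)[0]
    subnet = ipaddress.ip_network(network["IPAM"]["Config"][0]["Subnet"])
    addresses = {}
    for container in network["Containers"].values():
        # "172.17.0.2/16" is an interface address; .ip drops the prefix length.
        address = ipaddress.ip_interface(container["IPv4Address"]).ip
        if address not in subnet:
            raise ValueError(f"{address} is outside {subnet}")
        addresses[container["Name"]] = str(address)
    return addresses


print(container_addresses(INSPECT_OUTPUT))
```

On a live system the same function could be fed the real output, for example from `subprocess.run(["docker", "network", "inspect", "bridge"], capture_output=True, text=True).stdout`.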

Launching a shell inside the selenium container we can see what the network looks like from its perspective.

root@cede2a2e6fc2:/# ip -br -c a
lo               UNKNOWN        127.0.0.1/8 
eth0@if68        UP             172.17.0.2/16
To get this to work you'll need to install the `iproute2` package on the container.

Okay, now let’s try connecting to the selenium container via the default bridge network. In order to do this we need to use its IP address.

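from selenium import webdriver

# Address of the selenium container on the default bridge network.
SELENIUM_URL = "172.17.0.2:4444"

# Open a remote Chrome session on the standalone Selenium server.
browser = webdriver.Remote(f"http://{SELENIUM_URL}/wd/hub", {'browserName': 'chrome'})

browser.get("https://www.google.com")

print(f"Retrieved URL: {browser.current_url}.")

# quit() ends the whole session; close() would only close the current window.
browser.quit()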

We have to hard-code the IP address of the selenium container, which is not ideal. Docker hands out addresses on the default bridge as containers start, so there is no guarantee that the selenium container will always get the same one, and the script will become hard to maintain.
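
One way to soften this (a sketch, not something from the previous post) is to keep the address out of the script and read it from the environment. The variable names `SELENIUM_HOST` and `SELENIUM_PORT` and the `hub_url` helper are assumptions made for this example.

```python
import os


def hub_url(env=None):
    """Build the Selenium hub URL from environment variables.

    SELENIUM_HOST and SELENIUM_PORT are hypothetical names chosen for this
    sketch; the defaults match the container address used above.
    """
    env = os.environ if env is None else env
    host = env.get("SELENIUM_HOST", "172.17.0.2")
    port = env.get("SELENIUM_PORT", "4444")
    return f"http://{host}:{port}/wd/hub"


# With nothing set, the helper falls back to the defaults.
print(hub_url({}))
# An override changes the target without editing the script.
print(hub_url({"SELENIUM_HOST": "172.17.0.3"}))
```

The result would be passed to `webdriver.Remote` in place of the hard-coded URL. This only moves the problem, though: somebody still has to look up the current IP address each time it changes.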

Stop the existing selenium container.

docker stop selenium

User-Defined Network

We’re able to build a more robust setup if we create a user-defined network.

docker network create --driver bridge google
57a868c4124e4339a35b13dd6125f36835530e561e48a85e26de02e31d44460b

List the Docker networks again.

docker network ls
NETWORK ID          NAME                DRIVER              SCOPE
ea5ebd23a086        bridge              bridge              local
57a868c4124e        google              bridge              local
bb80a2809880        host                host                local
00b74ecbf970        none                null                local

The google network has been added to the list.

Now launch the Selenium container again, but this time using the --network argument to connect it to the google network.

docker run -d --rm --name selenium --network google selenium/standalone-chrome:3.141
f8a0a0dd21f4f27773c5ce260df21cb6d509815b56638ccfd6be5f05dbb8172b

If we inspect the google network then we’ll see the details of the selenium container.

docker network inspect google
[
    {
        "Name": "google",
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "172.21.0.0/16",
                    "Gateway": "172.21.0.1"
                }
            ]
        },
        "Containers": {
            "f8a0a0dd21f4f27773c5ce260df21cb6d509815b56638ccfd6be5f05dbb8172b": {
                "Name": "selenium",
                "MacAddress": "02:42:ac:15:00:02",
                "IPv4Address": "172.21.0.2/16",
                "IPv6Address": ""
            }
        }
    }
]
The above output has been abridged for clarity.

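On a user-defined network, containers can be located either by IP address or by name: Docker’s automatic service discovery resolves the name to an IP address internally. This means that, rather than addressing the selenium container by its IP address, we can simply refer to it by name. This is a much more robust setup, provided we consistently use the same name for this container, because the name stays valid even if the container is assigned a different IP address. Note that name resolution only works between containers on the same network, so the script below needs to run in a container attached to the google network.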

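from selenium import webdriver

# The container name is resolved to an IP address by Docker.
SELENIUM_URL = "selenium:4444"

# Open a remote Chrome session on the standalone Selenium server.
browser = webdriver.Remote(f"http://{SELENIUM_URL}/wd/hub", {'browserName': 'chrome'})

browser.get("https://www.google.com")

print(f"Retrieved URL: {browser.current_url}.")

# quit() ends the whole session; close() would only close the current window.
browser.quit()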
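
To check that name resolution is working before launching a browser session, a small lookup like the one below can help. It is only a sketch: the `resolve` helper is made up for this example, and looking up `selenium` will only succeed somewhere that can see Docker’s internal DNS, which in practice means inside a container attached to the google network.

```python
import socket


def resolve(hostname):
    """Return the IPv4 address for hostname, or None if it does not resolve."""
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror (a subclass of OSError) signals a failed lookup.
        return None


# "localhost" is used here only so the example runs anywhere.
print(resolve("localhost"))
```

Inside a container on the google network, `resolve("selenium")` should return the address reported by `docker network inspect google`. On the host it would be expected to return `None`.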

Scraper Template in Docker with User-Defined Bridge Network

Let’s wrap this up by putting our little scraper into a Docker image.

FROM python:3.8.5-slim AS base

RUN pip3 install selenium==3.141.0

COPY google-selenium-bridge-user-defined.py /

CMD python3 google-selenium-bridge-user-defined.py

Now build the image.

docker build -t google-selenium-bridge-user-defined .

And run it.

docker run --net=google google-selenium-bridge-user-defined
Retrieved URL: https://www.google.com/.

We specified --net=google (an alias for the --network option used earlier) so that this container is attached to the google network and can reach the selenium container by name.
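
The run above takes for granted that the selenium container is already accepting sessions. If the scraper is started immediately after the Selenium container, it may help to wait first. The sketch below polls the server’s status endpoint; it assumes that the 3.141 image serves `/wd/hub/status` with a JSON body containing `value.ready`, and the `fetch_status` and `wait_for_hub` names are invented for this example.

```python
import json
import time
import urllib.request


def fetch_status(url):
    """GET a Selenium status endpoint and return the decoded JSON body."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def wait_for_hub(hub_url, timeout=30.0, interval=1.0, fetch=fetch_status):
    """Poll <hub_url>/status until the server reports that it is ready.

    Returns True once ready, or False if `timeout` seconds pass first.
    The `fetch` parameter exists so the network call can be swapped out.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            if fetch(f"{hub_url}/status").get("value", {}).get("ready"):
                return True
        except (OSError, ValueError):
            # Connection refused, DNS failure or a non-JSON reply: not up yet.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

In the scraper this could be called as `wait_for_hub("http://selenium:4444/wd/hub")` before creating the `webdriver.Remote` session, exiting with an error if it returns `False`.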

Our setup now does everything that the one in the previous post did, but all of the networking stays within Docker: the scraper reaches Selenium over the google network and no ports are published on the host.

Cleaning Up

It’s always good practice to clean up afterwards: stop the selenium container and remove the google network.

docker stop selenium
docker network rm google