Chrome DevTools Protocol & Selenium

Do you do any web scraping? If so, then you probably spend a lot of time scratching around in your browser’s Developer Tools, figuring out the DOM structure and understanding how various bits of a site are delivered. Wouldn’t it be cool to access the Developer Tools functionality from inside your scraper? Well, you can. The Chrome DevTools Protocol (CDP) provides a low-level interface for interacting with Chrome. And you can tap into that interface via Selenium.

Setup

You’ll need to have a local instance of Selenium running. The execute_cdp_cmd() method used below is not available via the remote WebDriver.

I normally run Selenium in a Docker container and access it via the remote WebDriver. For the purpose of demonstration we’ll still use the container, but run Python inside it so that the driver is local. First launch Selenium.

docker run --name selenium -it selenium/standalone-chrome-debug

Now connect to a Bash session in the running container.

docker exec -it selenium /bin/bash

In the Bash session we’ll install pip and the selenium package.

sudo apt update
sudo apt install python3-pip
pip install selenium

Then run Python.

If you’re already running a local Selenium instance then this setup is not necessary.

Open a Page

We can then launch a browser window and open a page.

from selenium import webdriver

driver = webdriver.Chrome()

driver.get("https://www.google.com/")

Domain: DOM

Use the execute_cdp_cmd() method to interact with CDP. The functionality exposed by CDP is divided into domains, each of which supports a number of commands and events. A command is addressed as the domain name followed by the command name, separated by a dot.
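Because every command is addressed by domain and command name, a small wrapper can make that naming explicit. This is only a sketch: cdp() is a hypothetical convenience function, not part of Selenium, and it assumes that driver is a Chrome WebDriver which provides execute_cdp_cmd().

```python
def cdp(driver, domain, command, **params):
    """Run a CDP command addressed as "<domain>.<command>".

    `cdp` is a hypothetical helper, not part of Selenium. It assumes that
    `driver` provides execute_cdp_cmd(), as the Chrome WebDriver does.
    """
    # Keyword arguments become the command's parameter dictionary.
    return driver.execute_cdp_cmd(f"{domain}.{command}", params)


# With a live session this could be called as, for example:
#   cdp(driver, "DOM", "getDocument")
#   cdp(driver, "DOM", "describeNode", nodeId=13, depth=0)
```

The rest of this post calls execute_cdp_cmd() directly, which keeps the full command name visible.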

Let’s run the getDocument command from the DOM domain.

document = driver.execute_cdp_cmd(cmd="DOM.getDocument", cmd_args={})

The result is a dictionary with numerous keys. If we dump it to JSON then we can examine the structure.
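One way to do the dump is with the standard json module. The helper name to_pretty_json is just for illustration; sorting the keys gives the alphabetical ordering seen in the output that follows.

```python
import json


def to_pretty_json(result):
    """Serialise a CDP result dictionary as indented JSON with sorted keys."""
    return json.dumps(result, indent=2, sort_keys=True)


# With a live session:
#   print(to_pretty_json(document))
```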

{
  "root": {
    "backendNodeId": 1,
    "baseURL": "https://www.google.com/",
    "childNodeCount": 2,
    "children": [
      {
        "backendNodeId": 2,
        "localName": "",
        "nodeId": 10,
        "nodeName": "html",
        "nodeType": 10,
        "nodeValue": "",
        "parentId": 9,
        "publicId": "",
        "systemId": ""
      },
      {
        "attributes": [
          "itemscope",
          "",
          "itemtype",
          "http://schema.org/WebPage",
          "lang",
          "en-GB"
        ],
        "backendNodeId": 3,
        "childNodeCount": 2,
        "children": [
          {
            "attributes": [],
            "backendNodeId": 50,
            "childNodeCount": 10,
            "localName": "head",
            "nodeId": 12,
            "nodeName": "HEAD",
            "nodeType": 1,
            "nodeValue": "",
            "parentId": 11
          },
          {
            "attributes": [
              "jsmodel",
              "hspDDf",
              "class",
              "EM1Mrb"
            ],
            "backendNodeId": 51,
            "childNodeCount": 9,
            "localName": "body",
            "nodeId": 13,
            "nodeName": "BODY",
            "nodeType": 1,
            "nodeValue": "",
            "parentId": 11
          }
        ],
        "frameId": "6CB4D87F15E8ACD6C74B71B491E7EE60",
        "localName": "html",
        "nodeId": 11,
        "nodeName": "HTML",
        "nodeType": 1,
        "nodeValue": "",
        "parentId": 9
      }
    ],
    "compatibilityMode": "NoQuirksMode",
    "documentURL": "https://www.google.com/",
    "localName": "",
    "nodeId": 9,
    "nodeName": "#document",
    "nodeType": 9,
    "nodeValue": "",
    "xmlVersion": ""
  }
}

It gives us the high-level structure of the page as well as some metadata. We’ll use it to get the document URL.

document["root"]["documentURL"]

'https://www.google.com/'

We can delve deeper into the DOM. From the data above we see that the <body> element has a node ID of 13. Let’s get some more information about it.

driver.execute_cdp_cmd(cmd="DOM.describeNode", cmd_args={"nodeId": 13, "depth": 0})

{
  "node": {
    "attributes": [
      "jsmodel",
      "hspDDf",
      "class",
      "EM1Mrb"
    ],
    "backendNodeId": 51,
    "childNodeCount": 9,
    "localName": "body",
    "nodeId": 0,
    "nodeName": "BODY",
    "nodeType": 1,
    "nodeValue": ""
  }
}

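You can vary the depth parameter to determine how many levels of child nodes are included in the result. With a depth of 0, as above, only the node itself is described. According to the CDP documentation a depth of -1 returns the entire subtree.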
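Hard-coding 13 works for this session, but node IDs are assigned by the browser and may well differ on another run. Here is a minimal sketch of looking the ID up instead, assuming a dictionary with the shape returned by DOM.getDocument above. The find_node function is a hypothetical helper, not a CDP or Selenium function, and it only searches the nodes that are actually present in the dictionary.

```python
def find_node(node, local_name):
    """Depth-first search for the first node whose localName matches.

    Returns the node dictionary, or None if no such node is present.
    """
    if node.get("localName") == local_name:
        return node
    # Not every node carries a "children" list, so fall back to an empty one.
    for child in node.get("children", []):
        found = find_node(child, local_name)
        if found is not None:
            return found
    return None


# With a live session:
#   body = find_node(document["root"], "body")
#   body["nodeId"]
```

Matching on localName rather than nodeName avoids picking up the doctype node, which has a nodeName of "html" but an empty localName.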

We can also get the box model for that element.

driver.execute_cdp_cmd(cmd="DOM.getBoxModel", cmd_args={"nodeId": 13})

{
  "model": {
    "border": [0, 0, 1042, 0, 1042, 849, 0, 849],
    "content": [0, 0, 1042, 0, 1042, 849, 0, 849],
    "height": 849,
    "margin": [0, 0, 1042, 0, 1042, 849, 0, 849],
    "padding": [0, 0, 1042, 0, 1042, 849, 0, 849],
    "width": 1042
  }
}
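Each of the border, content, margin and padding entries is a quad: a flat list of eight numbers holding the x and y coordinates of four corners. A small sketch of turning a quad into something easier to work with follows. The helper names are made up for illustration, and the bounding box assumes you only want the enclosing rectangle.

```python
def quad_to_points(quad):
    """Split a flat [x1, y1, ..., x4, y4] quad into four (x, y) tuples."""
    return list(zip(quad[0::2], quad[1::2]))


def bounding_box(quad):
    """Return (left, top, width, height) of the rectangle enclosing a quad."""
    xs, ys = quad[0::2], quad[1::2]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


# The "content" quad from the output above.
content = [0, 0, 1042, 0, 1042, 849, 0, 849]

print(quad_to_points(content))  # → [(0, 0), (1042, 0), (1042, 849), (0, 849)]
print(bounding_box(content))    # → (0, 0, 1042, 849)
```

The width and height of the bounding box agree with the width and height fields reported in the model.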

Domain: Network

What about another domain? The Network domain can be used to retrieve cookies via the getCookies command.

cookies = driver.execute_cdp_cmd(cmd="Network.getCookies", cmd_args={})

for cookie in cookies["cookies"]:
    print(cookie["name"])

CONSENT
__Secure-ENID
AEC

We can use the clearBrowserCookies command to clear those cookies.

driver.execute_cdp_cmd(cmd="Network.clearBrowserCookies", cmd_args={})

Check the cookies again. They should all be gone.
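That check can be wrapped up in a couple of small functions. This is a sketch only: cookie_names and clear_cookies are hypothetical helpers built on the two commands used above, and they assume that driver provides execute_cdp_cmd().

```python
def cookie_names(driver):
    """Names of the cookies reported by Network.getCookies."""
    response = driver.execute_cdp_cmd("Network.getCookies", {})
    return [cookie["name"] for cookie in response.get("cookies", [])]


def clear_cookies(driver):
    """Clear the browser cookies and return the names that remain."""
    driver.execute_cdp_cmd("Network.clearBrowserCookies", {})
    return cookie_names(driver)


# With a live session:
#   remaining = clear_cookies(driver)
#   print(remaining)
```

If the returned list is not empty, it’s worth checking whether the page has set fresh cookies in the meantime.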

Domain: Browser

Use the Browser domain to find out more about the browser.

driver.execute_cdp_cmd(cmd="Browser.getVersion", cmd_args={})

{
  "jsVersion": "9.4.146.16",
  "product": "Chrome/94.0.4606.61",
  "protocolVersion": "1.3",
  "revision": "@418b78f5838ed0b1c69bb4e51ea0252171854915",
  "userAgent": "Mozilla/5.0 (X11; Linux) AppleWebKit/537.36 Chrome/94.0.4606.61 Safari/537.36"
}
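A scraper might use this to check which Chrome release it’s talking to. Here is a minimal sketch, assuming the product field always has the name/version form shown above. The function chrome_major_version is a hypothetical helper.

```python
def chrome_major_version(version_info):
    """Extract the major version number from a Browser.getVersion result."""
    # "product" looks like "Chrome/94.0.4606.61": name, slash, dotted version.
    _, _, version = version_info["product"].partition("/")
    return int(version.split(".")[0])


print(chrome_major_version({"product": "Chrome/94.0.4606.61"}))  # → 94
```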

Conclusion

Chrome DevTools Protocol is another tool to add to your web scraping arsenal. There’s a wealth of information to be retrieved from the various domains. This is a rather niche tool and, for me, it feels a bit like a hammer looking for a nail. But I think it’s just a matter of time before I need something like this, and then it’s going to be invaluable.