Do you do any web scraping? If so, then you probably spend a lot of time scratching around in your browser’s Developer Tools, figuring out the DOM structure and understanding how various bits of a site are delivered. Wouldn’t it be cool to access the Developer Tools functionality from inside your scraper? Well, you can. The Chrome DevTools Protocol (CDP) provides a low-level interface for interacting with Chrome. And you can tap into that interface via Selenium.
Setup
You’ll need a local Chrome driver for this. In the Python bindings, execute_cdp_cmd() is a method of the Chrome WebDriver and is not available on the generic remote WebDriver.
I normally run Selenium in a Docker container and access it via the remote WebDriver. For the purpose of demonstration we’ll use the same container, but run Python inside it so that the driver is local. First launch Selenium.
docker run --name selenium -it selenium/standalone-chrome-debug
Now connect to a Bash session in the running container.
docker exec -it selenium /bin/bash
In the Bash session we’ll install pip and the selenium package.
sudo apt update
sudo apt install python3-pip
pip install selenium
Then start an interactive Python session.
If you’re already running a local Selenium instance then this setup is not necessary.
Open a Page
We can then launch a browser window and open a page.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.google.com/")
Domain: DOM
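Use the execute_cdp_cmd() method to interact with CDP. It takes the name of a command and a dictionary of arguments. The functionality exposed by CDP is divided into domains, each of which supports a number of commands and events, and a command is addressed as Domain.command.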
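Every CDP call in this post has the same shape: a command name plus a dictionary of arguments. If that gets repetitive, a thin wrapper can build the dictionary from keyword arguments. This is only a sketch: cdp is a hypothetical helper name (it is not part of Selenium), and it assumes that driver is a Chrome WebDriver exposing execute_cdp_cmd().

```python
def cdp(driver, method, **params):
    """Send a CDP command such as "DOM.getDocument" and return its result.

    `cdp` is a hypothetical convenience wrapper, not a Selenium API.
    `driver` is assumed to be a Chrome WebDriver with execute_cdp_cmd().
    """
    if "." not in method:
        raise ValueError(f"expected 'Domain.command', got {method!r}")
    # Keyword arguments become the command's argument dictionary.
    return driver.execute_cdp_cmd(method, params)
```

With this in place, cdp(driver, "DOM.describeNode", nodeId=13, depth=0) would send the same request as the explicit execute_cdp_cmd() form. The rest of the post sticks to the explicit form.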
Let’s run the getDocument command from the DOM domain.
document = driver.execute_cdp_cmd(cmd="DOM.getDocument", cmd_args={})
The result is a dictionary with numerous keys. If we dump it to JSON then we can examine the structure.
{
  "root": {
    "backendNodeId": 1,
    "baseURL": "https://www.google.com/",
    "childNodeCount": 2,
    "children": [
      {
        "backendNodeId": 2,
        "localName": "",
        "nodeId": 10,
        "nodeName": "html",
        "nodeType": 10,
        "nodeValue": "",
        "parentId": 9,
        "publicId": "",
        "systemId": ""
      },
      {
        "attributes": [
          "itemscope",
          "",
          "itemtype",
          "http://schema.org/WebPage",
          "lang",
          "en-GB"
        ],
        "backendNodeId": 3,
        "childNodeCount": 2,
        "children": [
          {
            "attributes": [],
            "backendNodeId": 50,
            "childNodeCount": 10,
            "localName": "head",
            "nodeId": 12,
            "nodeName": "HEAD",
            "nodeType": 1,
            "nodeValue": "",
            "parentId": 11
          },
          {
            "attributes": [
              "jsmodel",
              "hspDDf",
              "class",
              "EM1Mrb"
            ],
            "backendNodeId": 51,
            "childNodeCount": 9,
            "localName": "body",
            "nodeId": 13,
            "nodeName": "BODY",
            "nodeType": 1,
            "nodeValue": "",
            "parentId": 11
          }
        ],
        "frameId": "6CB4D87F15E8ACD6C74B71B491E7EE60",
        "localName": "html",
        "nodeId": 11,
        "nodeName": "HTML",
        "nodeType": 1,
        "nodeValue": "",
        "parentId": 9
      }
    ],
    "compatibilityMode": "NoQuirksMode",
    "documentURL": "https://www.google.com/",
    "localName": "",
    "nodeId": 9,
    "nodeName": "#document",
    "nodeType": 9,
    "nodeValue": "",
    "xmlVersion": ""
  }
}
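A listing like the one above can be produced with the standard json module. Here’s a minimal sketch. The name pretty is a hypothetical helper, and sorting the keys is an assumption made to match the alphabetical key order in the listing. The sample dictionary is a small stand-in for a real CDP result.

```python
import json


def pretty(result):
    """Render a CDP result dictionary as indented JSON with sorted keys."""
    return json.dumps(result, indent=2, sort_keys=True)


# Small stand-in for the dictionary returned by DOM.getDocument.
sample = {"root": {"nodeName": "#document", "nodeId": 9}}
print(pretty(sample))
```

In the live session you’d call print(pretty(document)) instead.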
The result gives us the high-level structure of the page as well as some metadata. We’ll use it to get the document URL.
document["root"]["documentURL"]
'https://www.google.com/'
We can delve into the DOM. From the data above we see that the <body> element has a node ID of 13 (node IDs are assigned per session, so yours may differ). Let’s get some more information.
driver.execute_cdp_cmd(cmd="DOM.describeNode", cmd_args={"nodeId": 13, "depth": 0})
{
  "node": {
    "attributes": [
      "jsmodel",
      "hspDDf",
      "class",
      "EM1Mrb"
    ],
    "backendNodeId": 51,
    "childNodeCount": 9,
    "localName": "body",
    "nodeId": 0,
    "nodeName": "BODY",
    "nodeType": 1,
    "nodeValue": ""
  }
}
You can vary the depth parameter to control how many levels of children are returned. According to the CDP documentation it defaults to 1, and -1 returns the entire subtree.
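Reading node IDs off a JSON dump by eye gets tedious, and the attributes come back as a flat list of alternating names and values. Both can be handled with a few lines of Python. This is a sketch that assumes the dictionary layout shown earlier. The names find_node and attrs_to_dict are hypothetical helpers, and the sample data is a trimmed copy of the getDocument result.

```python
def find_node(node, name):
    """Depth-first search of a DOM.getDocument-style tree for a node name.

    Returns the first node dictionary whose "nodeName" equals `name`,
    or None. Only nodes present in the dictionary are searched, so
    subtrees that CDP did not return are not visited.
    """
    if node.get("nodeName") == name:
        return node
    for child in node.get("children", []):
        found = find_node(child, name)
        if found is not None:
            return found
    return None


def attrs_to_dict(flat):
    """Turn a flat [name, value, name, value, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


# Trimmed copy of the structure shown earlier.
document = {"root": {"nodeName": "#document", "nodeId": 9, "children": [
    {"nodeName": "html", "nodeId": 10},
    {"nodeName": "HTML", "nodeId": 11, "children": [
        {"nodeName": "HEAD", "nodeId": 12, "attributes": []},
        {"nodeName": "BODY", "nodeId": 13,
         "attributes": ["jsmodel", "hspDDf", "class", "EM1Mrb"]},
    ]},
]}}

body = find_node(document["root"], "BODY")
print(body["nodeId"], attrs_to_dict(body["attributes"]))
```

The match is case sensitive, which matters here because the doctype node is named "html" while the element is "HTML".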
We can also get the box model for that tag.
driver.execute_cdp_cmd(cmd="DOM.getBoxModel", cmd_args={"nodeId": 13})
{
  "model": {
    "border": [0, 0, 1042, 0, 1042, 849, 0, 849],
    "content": [0, 0, 1042, 0, 1042, 849, 0, 849],
    "height": 849,
    "margin": [0, 0, 1042, 0, 1042, 849, 0, 849],
    "padding": [0, 0, 1042, 0, 1042, 849, 0, 849],
    "width": 1042
  }
}
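Each of border, content, margin and padding is a quad: eight numbers giving the (x, y) coordinates of four corners in clockwise order. For scraping it’s often handier to have a rectangle or a centre point. Here’s a minimal sketch using the content quad from the output above. The names quad_to_rect and quad_center are hypothetical helpers, and the bounding box only equals the element’s shape if the element has not been rotated or skewed.

```python
def quad_to_rect(quad):
    """Convert a CDP quad [x1, y1, ..., x4, y4] to (x, y, width, height).

    Uses the bounding box of the four corners, which equals the quad
    itself when the element is an unrotated rectangle.
    """
    xs, ys = quad[0::2], quad[1::2]
    return min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys)


def quad_center(quad):
    """Return the centre of the quad's bounding box as (x, y)."""
    x, y, width, height = quad_to_rect(quad)
    return x + width / 2, y + height / 2


content = [0, 0, 1042, 0, 1042, 849, 0, 849]  # "content" quad from the output
print(quad_to_rect(content))  # (0, 0, 1042, 849)
print(quad_center(content))   # (521.0, 424.5)
```

The width and height recovered this way agree with the width and height keys in the model.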
Domain: Network
What about another domain? The Network domain can be used to retrieve cookies via the getCookies command.
cookies = driver.execute_cdp_cmd(cmd="Network.getCookies", cmd_args={})
for cookie in cookies["cookies"]:
    print(cookie["name"])
CONSENT
__Secure-ENID
AEC
We can use the clearBrowserCookies command to clear those cookies.
driver.execute_cdp_cmd(cmd="Network.clearBrowserCookies", cmd_args={})
Check the cookies again. They should all be gone.
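That check can be wrapped in a small function so it’s easy to repeat. This is a sketch: cookie_names is a hypothetical helper, and it assumes that driver is the Chrome WebDriver created earlier.

```python
def cookie_names(driver):
    """Return the names of the cookies reported by Network.getCookies.

    `driver` is assumed to be a Chrome WebDriver with execute_cdp_cmd().
    """
    result = driver.execute_cdp_cmd("Network.getCookies", {})
    return [cookie["name"] for cookie in result["cookies"]]
```

After the call to Network.clearBrowserCookies, cookie_names(driver) would be expected to return an empty list.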
Domain: Browser
Use the Browser domain to find out more about the browser.
driver.execute_cdp_cmd(cmd="Browser.getVersion", cmd_args={})
{
  "jsVersion": "9.4.146.16",
  "product": "Chrome/94.0.4606.61",
  "protocolVersion": "1.3",
  "revision": "@418b78f5838ed0b1c69bb4e51ea0252171854915",
  "userAgent": "Mozilla/5.0 (X11; Linux) AppleWebKit/537.36 Chrome/94.0.4606.61 Safari/537.36"
}
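One practical use for this is checking which Chrome release a scraper is actually talking to. Here’s a minimal sketch that pulls the major version out of the product field. The name chrome_major_version is a hypothetical helper, and it assumes the "Name/version" format seen in the output above.

```python
def chrome_major_version(version_info):
    """Extract the major version number from Browser.getVersion output.

    Assumes "product" looks like "Chrome/94.0.4606.61", as in the
    output above; other formats are not handled.
    """
    _name, _sep, version = version_info["product"].partition("/")
    return int(version.split(".")[0])


# Trimmed copy of the Browser.getVersion result.
info = {"product": "Chrome/94.0.4606.61", "protocolVersion": "1.3"}
print(chrome_major_version(info))  # 94
```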
Conclusion
Chrome DevTools Protocol is another tool to add to your web scraping arsenal. There’s a wealth of information to be retrieved from the various domains. This is a rather niche tool and, for me, it feels a bit like a hammer looking for a nail. But I think it’s just a matter of time before I need something like this, and then it’s going to be invaluable.