Do you do any web scraping? If so, then you probably spend a lot of time scratching around in your browser’s Developer Tools, figuring out the DOM structure and understanding how various bits of a site are delivered. Wouldn’t it be cool to access the Developer Tools functionality from inside your scraper? Well, you can. The Chrome DevTools Protocol (CDP) provides a low-level interface for interacting with Chrome. And you can tap into that interface via Selenium.
Setup
You’ll need to have a local instance of Selenium running. CDP is not accessible via the remote WebDriver.
I normally run Selenium in a Docker container and access it via the remote WebDriver. For the purpose of demonstration we’ll do something similar: launch Selenium in a container, but run Python inside that container so that the driver is local. First launch Selenium.
docker run --name selenium -it selenium/standalone-chrome-debug
Now connect to a BASH session in the running container.
docker exec -it selenium /bin/bash
In the BASH session we’ll install pip and the selenium package.
sudo apt update
sudo apt install python3-pip
pip install selenium
Then run Python.
If you’re already running a local Selenium instance then this setup is not necessary.
Open a Page
We can then launch a browser window and open a page.
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://www.google.com/")
Domain: DOM
Use the execute_cdp_cmd() method to interact with the CDP. The functionality exposed by CDP is divided into domains, each of which supports a number of commands and events.
Let’s run the getDocument command from the DOM domain.
document = driver.execute_cdp_cmd(cmd="DOM.getDocument", cmd_args={})
The result is a dictionary with numerous keys. If we dump it to JSON then we can examine the structure.
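One way to produce such a dump is with the standard json module. This is a minimal sketch: to_pretty_json is a hypothetical helper name, and the sample dictionary is a small stand-in for the real result so that the snippet runs without a browser.

```python
import json


def to_pretty_json(result):
    """Serialise a CDP result dictionary as indented JSON with sorted keys."""
    return json.dumps(result, indent=2, sort_keys=True)


# Stand-in for the dictionary returned by execute_cdp_cmd() (an assumption,
# used here only so the example is self-contained).
sample = {"root": {"nodeName": "#document", "nodeId": 9}}
print(to_pretty_json(sample))
```

With the live session above, print(to_pretty_json(document)) should give output along the lines of the dump that follows.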
{
"root": {
"backendNodeId": 1,
"baseURL": "https://www.google.com/",
"childNodeCount": 2,
"children": [
{
"backendNodeId": 2,
"localName": "",
"nodeId": 10,
"nodeName": "html",
"nodeType": 10,
"nodeValue": "",
"parentId": 9,
"publicId": "",
"systemId": ""
},
{
"attributes": [
"itemscope",
"",
"itemtype",
"http://schema.org/WebPage",
"lang",
"en-GB"
],
"backendNodeId": 3,
"childNodeCount": 2,
"children": [
{
"attributes": [],
"backendNodeId": 50,
"childNodeCount": 10,
"localName": "head",
"nodeId": 12,
"nodeName": "HEAD",
"nodeType": 1,
"nodeValue": "",
"parentId": 11
},
{
"attributes": [
"jsmodel",
"hspDDf",
"class",
"EM1Mrb"
],
"backendNodeId": 51,
"childNodeCount": 9,
"localName": "body",
"nodeId": 13,
"nodeName": "BODY",
"nodeType": 1,
"nodeValue": "",
"parentId": 11
}
],
"frameId": "6CB4D87F15E8ACD6C74B71B491E7EE60",
"localName": "html",
"nodeId": 11,
"nodeName": "HTML",
"nodeType": 1,
"nodeValue": "",
"parentId": 9
}
],
"compatibilityMode": "NoQuirksMode",
"documentURL": "https://www.google.com/",
"localName": "",
"nodeId": 9,
"nodeName": "#document",
"nodeType": 9,
"nodeValue": "",
"xmlVersion": ""
}
}
It gives us the high-level structure of the page as well as some metadata. We’ll use it to get the document URL.
document["root"]["documentURL"]
'https://www.google.com/'
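The dump above is a nested tree: each node can carry a children list. Rather than reading node IDs off the dump by eye, a small recursive search can do the lookup. This is only a sketch: find_node is a hypothetical helper, the sample data is trimmed from the dump above, and the search can only see nodes that are actually present in the returned dictionary.

```python
def find_node(node, node_name):
    """Depth-first search for the first node with the given nodeName."""
    if node.get("nodeName") == node_name:
        return node
    for child in node.get("children", []):
        found = find_node(child, node_name)
        if found is not None:
            return found
    return None


# Trimmed copy of the "root" entry from the dump above.
sample_root = {
    "nodeId": 9,
    "nodeName": "#document",
    "children": [
        {"nodeId": 10, "nodeName": "html", "nodeType": 10},
        {
            "nodeId": 11,
            "nodeName": "HTML",
            "children": [
                {"nodeId": 12, "nodeName": "HEAD"},
                {"nodeId": 13, "nodeName": "BODY"},
            ],
        },
    ],
}

body = find_node(sample_root, "BODY")
print(body["nodeId"])  # 13
```

With a live session the same call would be find_node(document["root"], "BODY"). Note that the comparison is case-sensitive, so "html" matches the doctype node while "HTML" matches the element.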
We can delve into the DOM. From the data above we see that the <body> tag has a node ID of 13. Let’s get some more information.
driver.execute_cdp_cmd(cmd="DOM.describeNode", cmd_args={"nodeId": 13, "depth": 0})
{
"node": {
"attributes": [
"jsmodel",
"hspDDf",
"class",
"EM1Mrb"
],
"backendNodeId": 51,
"childNodeCount": 9,
"localName": "body",
"nodeId": 0,
"nodeName": "BODY",
"nodeType": 1,
"nodeValue": ""
}
}
You can vary the depth parameter to determine how deep to delve into the DOM.
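The attributes field in the output above is a flat list that alternates attribute names and values. A minimal sketch for turning it into a dictionary follows; attributes_to_dict is a hypothetical helper name, and the input is copied from the output above.

```python
def attributes_to_dict(attributes):
    """Pair up a flat [name1, value1, name2, value2, ...] list."""
    names = attributes[0::2]
    values = attributes[1::2]
    return dict(zip(names, values))


# Attribute list of the <body> node, as returned above.
body_attributes = ["jsmodel", "hspDDf", "class", "EM1Mrb"]
print(attributes_to_dict(body_attributes))
# {'jsmodel': 'hspDDf', 'class': 'EM1Mrb'}
```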
We can also get the box model for that tag.
driver.execute_cdp_cmd(cmd="DOM.getBoxModel", cmd_args={"nodeId": 13})
{
"model": {
"border": [0, 0, 1042, 0, 1042, 849, 0, 849],
"content": [0, 0, 1042, 0, 1042, 849, 0, 849],
"height": 849,
"margin": [0, 0, 1042, 0, 1042, 849, 0, 849],
"padding": [0, 0, 1042, 0, 1042, 849, 0, 849],
"width": 1042
}
}
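Each of the border, content, margin and padding entries is a quad: eight numbers giving the x and y coordinates of the four corners in turn. A short sketch for unpacking a quad into corner points follows; quad_to_points is a hypothetical helper name, and the input is copied from the output above.

```python
def quad_to_points(quad):
    """Convert [x1, y1, x2, y2, x3, y3, x4, y4] into four (x, y) tuples."""
    if len(quad) != 8:
        raise ValueError("a quad must contain exactly eight numbers")
    return list(zip(quad[0::2], quad[1::2]))


# The "content" quad of the <body> node, as returned above.
content = [0, 0, 1042, 0, 1042, 849, 0, 849]
print(quad_to_points(content))
# [(0, 0), (1042, 0), (1042, 849), (0, 849)]
```

Here all four quads coincide, which suggests that this <body> has no margin, border or padding.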
Domain: Network
What about another domain? The Network domain can be used to retrieve cookies via the getCookies command.
cookies = driver.execute_cdp_cmd(cmd="Network.getCookies", cmd_args={})
for cookie in cookies["cookies"]:
    print(cookie["name"])
CONSENT
__Secure-ENID
AEC
We can use the clearBrowserCookies command to clear those cookies.
driver.execute_cdp_cmd(cmd="Network.clearBrowserCookies", cmd_args={})
Check the cookies again. They should all be gone.
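That check can be wrapped in a small function. This is a sketch only: cookie_names is a hypothetical helper, and it assumes nothing more than an object with an execute_cdp_cmd() method, such as the driver created earlier.

```python
def cookie_names(driver):
    """Return the names of the cookies reported by Network.getCookies."""
    result = driver.execute_cdp_cmd("Network.getCookies", {})
    return [cookie["name"] for cookie in result["cookies"]]


# With the live session from above (not run here):
# driver.execute_cdp_cmd("Network.clearBrowserCookies", {})
# print(cookie_names(driver))
```

If the cookies were cleared successfully then the list should be empty.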
Domain: Browser
Use the Browser domain to find out more about the browser.
driver.execute_cdp_cmd(cmd="Browser.getVersion", cmd_args={})
{
"jsVersion": "9.4.146.16",
"product": "Chrome/94.0.4606.61",
"protocolVersion": "1.3",
"revision": "@418b78f5838ed0b1c69bb4e51ea0252171854915",
"userAgent": "Mozilla/5.0 (X11; Linux) AppleWebKit/537.36 Chrome/94.0.4606.61 Safari/537.36"
}
Conclusion
Chrome DevTools Protocol is another tool to add to your web scraping arsenal. There’s a wealth of information to be retrieved from the various domains. This is a rather niche tool and, for me, it feels a bit like a hammer looking for a nail. But I think it’s just a matter of time before I need something like this, and then it’s going to be invaluable.