One of my standard approaches to scraping content from a dynamic website is to diagnose the API behind the site and then use it to retrieve data directly. This means that I can make efficient HTTP requests using the requests package and I don’t need to worry about all of the complexity around scraping with Selenium. However, it’s often the case that the API requests require a collection of cookies and headers, and those need to be gathered using Selenium.
In this case I have a two-step method:
- open the page in Selenium and retrieve the cookies and headers; and
- use the required cookies and/or headers to submit further requests using the requests package.
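As a rough sketch of the second step, the snippet below loads the gathered cookies and a chosen subset of headers into a requests session. The helper name build_session and the header names in FORWARDED_HEADERS are hypothetical; which headers (if any) an API insists on varies from site to site.

```python
import requests

# Hypothetical allow-list: the headers an API actually expects are site-specific.
FORWARDED_HEADERS = {"accept-language", "x-csrf-token"}


def build_session(cookies, headers, forwarded=FORWARDED_HEADERS):
    """Return a requests.Session primed with cookies and selected headers."""
    session = requests.Session()
    # Every request made through the session will send these cookies.
    session.cookies.update(cookies)
    # Only forward allow-listed headers; things like content-length describe
    # a response and should not be replayed on a request.
    session.headers.update(
        {name: value for name, value in headers.items() if name.lower() in forwarded}
    )
    return session
```

Requests made through the returned session (for example with session.get() against the API endpoint) would then carry those cookies and headers automatically.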
Getting Cookies & Headers
Here’s the function that I use to retrieve the cookies and headers.
import re


def get_cookies_headers(driver):
    # Get cookies from browser & unpack into a dictionary.
    cookies = {cookie["name"]: cookie["value"] for cookie in driver.get_cookies()}

    # Re-request the current page with a synchronous XHR and read the
    # response headers from that request.
    script = """
    var xhr = new XMLHttpRequest();
    xhr.open('GET', window.location.href, false);
    xhr.send(null);
    return xhr.getAllResponseHeaders();
    """
    headers = driver.execute_script(script)

    # Unpack headers (one "name: value" pair per line) into a dictionary.
    headers = headers.splitlines()
    headers = dict([re.split(": +", header, maxsplit=1) for header in headers])

    return cookies, headers
Getting the cookies is relatively simple because the Selenium driver has a get_cookies() method. The object returned by get_cookies() is a list of dictionaries (one per cookie), which we transform into a single dictionary that maps each cookie name to its value.
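To make that transformation concrete, here is a small sketch using a hand-written sample shaped like the list that get_cookies() returns. The sample values are illustrative assumptions rather than captured output. Note that flattening discards the other fields, such as domain and path.

```python
# Illustrative sample only: each cookie is a dictionary with "name" and
# "value" keys plus metadata such as "domain", "path" and "secure".
sample_cookies = [
    {"name": "CONSENT", "value": "PENDING+054", "domain": ".google.com",
     "path": "/", "secure": True, "httpOnly": False},
    {"name": "example_id", "value": "abc123", "domain": ".google.com",
     "path": "/", "secure": True, "httpOnly": True},
]

# Same comprehension as in get_cookies_headers(): keep only name and value.
cookies = {cookie["name"]: cookie["value"] for cookie in sample_cookies}

print(cookies)
```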
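A little more work is required for the headers. There’s no dedicated method to get the headers, so we run a snippet of JavaScript that issues a synchronous GET request for the current URL and returns the response headers. Note that these are the headers from this second request, not from the original page load. The result is returned as a multi-line string, which is then parsed into a dictionary.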
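The parsing step can be tried out without a browser. The sketch below is a slightly more forgiving variant of the approach described above, not the function from this post: it splits on the first colon with str.partition(), skips any line without a colon and lower-cases the header names. The name parse_raw_headers and the sample string are assumptions made for illustration.

```python
def parse_raw_headers(raw):
    """Parse a block of "name: value" header lines into a dictionary."""
    headers = {}
    for line in raw.splitlines():
        # Split on the first colon only, so values containing colons
        # (dates, for example) stay intact.
        name, separator, value = line.partition(":")
        if not separator:
            # Blank or malformed line: nothing to record.
            continue
        headers[name.strip().lower()] = value.strip()
    return headers


# Hand-written sample using CRLF line endings.
raw = (
    "content-type: text/html; charset=UTF-8\r\n"
    "server: gws\r\n"
    "date: Tue, 31 Oct 2023 14:12:59 GMT\r\n"
)
print(parse_raw_headers(raw))
```

If a header name appears more than once, the last value wins in this sketch.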
Driver
Let’s hook that up with a driver and see how well it works. I’ve got Selenium running in a Docker container and will access it via port 4444. Also I’m using the selenium==4.9.0 package.
import atexit

from selenium import webdriver
from selenium.webdriver import ChromeOptions

from util import get_cookies_headers

SELENIUM_SERVER_URL = "http://127.0.0.1:4444/wd/hub"

chrome_options = ChromeOptions()
chrome_options.add_argument("--disable-gpu")

driver = webdriver.Remote(
    command_executor=SELENIUM_SERVER_URL,
    options=chrome_options,
)

atexit.register(lambda: driver.quit())

driver.get("https://www.google.com/")

cookies, headers = get_cookies_headers(driver)
Both cookies and headers are dictionaries, as required for use with the requests package. Dumping a subset of the cookies as JSON gives:
{
  "CONSENT": "PENDING+054",
  "AEC": "Ackid1R8aA4SMd3lRtqdNWfmyuStZ8asnsieORbONgKWNabhDCMFZebYafY"
}
And here are selected headers:
{
  "alt-svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000",
  "cache-control": "private, max-age=0",
  "content-encoding": "br",
  "content-length": "72050",
  "content-type": "text/html; charset=UTF-8",
  "cross-origin-opener-policy": "same-origin-allow-popups; report-to=\"gws\"",
  "date": "Tue, 31 Oct 2023 14:12:59 GMT",
  "expires": "-1",
  "permissions-policy": "unload=()",
  "server": "gws",
  "strict-transport-security": "max-age=31536000",
  "x-frame-options": "SAMEORIGIN",
  "x-xss-protection": "0"
}
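JSON dumps like the ones above could be produced along these lines. This is a sketch only: dump_subset is a hypothetical helper, and the small headers dictionary here stands in for the one returned by get_cookies_headers().

```python
import json


def dump_subset(mapping, keys):
    """Return selected entries of a dictionary as indented JSON."""
    subset = {key: mapping[key] for key in keys if key in mapping}
    return json.dumps(subset, indent=2)


# Stand-in for the headers dictionary gathered via Selenium.
headers = {
    "server": "gws",
    "expires": "-1",
    "date": "Tue, 31 Oct 2023 14:12:59 GMT",
}

print(dump_subset(headers, ["server", "expires"]))
```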
Conclusion
Being able to retrieve cookies and headers from a dynamic website using Selenium can be handy when the underlying API requires specific cookies and/or headers.