Cookies & Headers from Selenium

One of my standard approaches to scraping content from a dynamic website is to diagnose the API behind the site and then use it to retrieve data directly. This means that I can make efficient HTTP requests using the requests package and I don’t need to worry about all of the complexity around scraping with Selenium. However, it’s often the case that the API requests require a collection of cookies and headers, and those need to be gathered using Selenium.

In this case I have a two-step method:

  1. open the page in Selenium and retrieve the cookies and headers; and
  2. use the required cookies and/or headers to submit further requests using the requests package.

Getting Cookies & Headers

Here’s the function that I use to retrieve the cookies and headers.

import re

def get_cookies_headers(driver):
    # Get cookies from browser & unpack into a dictionary.
    #    
    cookies = {cookie["name"]: cookie["value"] for cookie in driver.get_cookies()}

    # Use a synchronous request to retrieve response headers.
    #
    script = """
    var xhr = new XMLHttpRequest();
    xhr.open('GET', window.location.href, false);
    xhr.send(null);
    return xhr.getAllResponseHeaders();
    """
    headers = driver.execute_script(script)
    
    # Unpack headers into a dictionary, one "name: value" pair per line.
    #
    headers = headers.splitlines()
    headers = dict(re.split(": +", header, maxsplit=1) for header in headers)

    return cookies, headers

Getting the cookies is relatively simple because the Selenium driver has a get_cookies() method. The object returned by get_cookies() is a list of dictionaries, which we transform into a single dictionary.
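To make that transformation concrete, here is a minimal sketch that uses a hand-written list in the shape that get_cookies() returns. The cookie names and values (session_id, theme) and the example.com domain are made up for illustration, and real entries usually carry further keys such as expiry or sameSite.

```python
# Hypothetical sample in the shape returned by driver.get_cookies():
# a list with one dictionary per cookie.
raw_cookies = [
    {"name": "session_id", "value": "abc123", "domain": "example.com", "path": "/", "secure": True, "httpOnly": True},
    {"name": "theme", "value": "dark", "domain": "example.com", "path": "/", "secure": False, "httpOnly": False},
]

# Keep only the name and value of each cookie.
cookies = {cookie["name"]: cookie["value"] for cookie in raw_cookies}

print(cookies)  # {'session_id': 'abc123', 'theme': 'dark'}
```

The other attributes (domain, path and so on) are dropped, which is usually fine when the cookies are only going to be replayed against the same site.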

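A little more work is required for the headers. Selenium has no dedicated method for reading headers, so we run a snippet of JavaScript that issues a synchronous GET request for the current URL and returns the response headers from that request. They come back as a single multi-line string, which is then parsed into a dictionary.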
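As an illustration of the parsing step, the sketch below applies the same splitting logic to a hard-coded string. The sample headers are invented (x-example in particular is hypothetical), but the layout matches what getAllResponseHeaders() returns: one name: value pair per line, with lines separated by \r\n.

```python
import re

# Hypothetical raw string in the format returned by xhr.getAllResponseHeaders().
raw_headers = (
    "content-type: text/html; charset=UTF-8\r\n"
    "date: Tue, 31 Oct 2023 14:12:59 GMT\r\n"
    "x-example: key: value\r\n"
)

# splitlines() copes with the \r\n separators. maxsplit=1 splits on the first
# ": " only, so a value that itself contains ": " stays intact.
headers = dict(re.split(": +", line, maxsplit=1) for line in raw_headers.splitlines())

print(headers["date"])       # Tue, 31 Oct 2023 14:12:59 GMT
print(headers["x-example"])  # key: value
```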

Driver

Let’s hook that up with a driver and see how well it works. I’ve got Selenium running in a Docker container and will access it via port 4444. I’m using version 4.9.0 of the selenium package, and the function above is saved in util.py.

import atexit

from selenium import webdriver
from selenium.webdriver import ChromeOptions

from util import get_cookies_headers

SELENIUM_SERVER_URL = "http://127.0.0.1:4444/wd/hub"

chrome_options = ChromeOptions()
chrome_options.add_argument("--disable-gpu")

driver = webdriver.Remote(
    command_executor=SELENIUM_SERVER_URL,
    options=chrome_options,
)

# Close the browser session when the script exits.
atexit.register(driver.quit)

driver.get("https://www.google.com/")

cookies, headers = get_cookies_headers(driver)

Both cookies and headers are dictionaries, as required for use with the requests package. Dumping a subset of the cookies as JSON gives:

{
  "CONSENT": "PENDING+054",
  "AEC": "Ackid1R8aA4SMd3lRtqdNWfmyuStZ8asnsieORbONgKWNabhDCMFZebYafY"
}

And here are selected headers:

{
  "alt-svc": "h3=\":443\"; ma=2592000,h3-29=\":443\"; ma=2592000",
  "cache-control": "private, max-age=0",
  "content-encoding": "br",
  "content-length": "72050",
  "content-type": "text/html; charset=UTF-8",
  "cross-origin-opener-policy": "same-origin-allow-popups; report-to=\"gws\"",
  "date": "Tue, 31 Oct 2023 14:12:59 GMT",
  "expires": "-1",
  "permissions-policy": "unload=()",
  "server": "gws",
  "strict-transport-security": "max-age=31536000",
  "x-frame-options": "SAMEORIGIN",
  "x-xss-protection": "0"
}
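For step 2 of the method, one way to carry these values over to the requests package is sketched below. The headers collected above are response headers, so it rarely makes sense to send all of them back. The hypothetical build_session() helper therefore copies every cookie but only the headers that are explicitly named. The helper name, its header_names parameter, the x-api-token header and the example.com URL are all assumptions for illustration.

```python
import requests


def build_session(cookies, headers, header_names=()):
    """Create a requests.Session preloaded with the browser's cookies.

    Only the headers listed in header_names are copied across.
    """
    session = requests.Session()
    session.cookies.update(cookies)
    session.headers.update(
        {name: headers[name] for name in header_names if name in headers}
    )
    return session


# Hypothetical values standing in for the output of get_cookies_headers().
cookies = {"CONSENT": "PENDING+054"}
headers = {"x-api-token": "secret", "content-length": "72050"}

session = build_session(cookies, headers, header_names=["x-api-token"])

# Build (but do not send) a request to inspect what would go over the wire.
prepared = session.prepare_request(requests.Request("GET", "https://example.com/api"))
print(prepared.headers["Cookie"])
print(prepared.headers["x-api-token"])
```

From here, calls such as session.get() would send the cookies and the selected headers with each request.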

Conclusion

Being able to retrieve cookies and headers from a dynamic website using Selenium can be handy when the underlying API requires specific cookies and/or headers.