Zyte API Cookie Management

In a previous post I looked at various ways to use the Zyte API to retrieve web content. Now I’m going to delve into options for managing cookies via the Zyte API.

What are Cookies?

I’m a great exponent of cookies. Ginger. Oat. Chocolate digestives. Chocolate chip. Triple chocolate. I sense there’s a pattern emerging. But I digress. These are not the cookies I’m concerned with here.

In the context of the web, cookies are small pieces of text data stored by a web browser when it visits a web site. They allow sites to retain user-specific information, such as login credentials, preferences or session data. Their purpose is ostensibly to enhance user experience and enable personalised functionality.

For web scraping, cookies are essential for replicating browser behaviour and maintaining session continuity, especially when scraping authenticated or dynamic sites. Proper cookie management helps requests to be recognised as valid, which reduces the chance of tripping anti-bot mechanisms. Sloppy cookie management can lead to failed requests, blocked access or incomplete data extraction.

Although these cookies are just small pieces of text, they still come in a couple of flavours:

  • Request Cookies — These are sent along with the request. You have control over these.
  • Response Cookies — These are returned with the response and are generated by the server. Generally the response cookies are stored by your browser and will influence the request cookies sent along with subsequent requests.

Your browser will maintain a local database of cookies. These will be used to populate the cookies for future requests. Firefox maintains a cookie database in cookies.sqlite. You can open up that database and browse around. For example, below are some of the cookies for https://www.checkers.co.za.

sqlite> select name, value, host, path, expiry from moz_cookies where host = "www.checkers.co.za";
name                       value         host                path  expiry    
-------------------------  ------------  ------------------  ----  ----------
cookie-notification        NOT_ACCEPTED  www.checkers.co.za  /     1772349673
cookie-promo-alerts-popup  true          www.checkers.co.za  /     1772349677
webp_supported             true          www.checkers.co.za  /     1769325677
checkersZA-preferredStore  40206         www.checkers.co.za  /     1769346703
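The same query can be run from Python with the standard sqlite3 module. This is a minimal sketch: it assumes only the moz_cookies columns used in the query above and, so that it runs anywhere, builds a small in-memory stand-in for cookies.sqlite rather than opening a real profile. The cookies_for_host name is hypothetical, and the expiry values are treated as Unix timestamps in seconds, which is what the numbers above appear to be.

```python
import sqlite3
from datetime import datetime, timezone


def cookies_for_host(conn, host):
    """Return the cookies stored for a host as a list of dictionaries."""
    conn.row_factory = sqlite3.Row
    rows = conn.execute(
        "SELECT name, value, host, path, expiry FROM moz_cookies WHERE host = ?",
        (host,),
    )
    return [dict(row) for row in rows]


# Stand-in for cookies.sqlite with only the columns used above (an assumption:
# the real table has more columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE moz_cookies "
    "(name TEXT, value TEXT, host TEXT, path TEXT, expiry INTEGER)"
)
conn.executemany(
    "INSERT INTO moz_cookies VALUES (?, ?, ?, ?, ?)",
    [
        ("webp_supported", "true", "www.checkers.co.za", "/", 1769325677),
        ("checkersZA-preferredStore", "40206", "www.checkers.co.za", "/", 1769346703),
    ],
)

for cookie in cookies_for_host(conn, "www.checkers.co.za"):
    # Assumed to be a Unix timestamp in seconds.
    expires = datetime.fromtimestamp(cookie["expiry"], tz=timezone.utc)
    print(f"{cookie['name']} = {cookie['value']} (expires {expires:%Y-%m-%d})")
```

To try this against a real profile, point sqlite3.connect at a copy of cookies.sqlite, since the live file may be locked while the browser is running.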

Each cookie defines a few fields:

  • name
  • value
  • domain or host
  • path
  • expires or expiry
  • httpOnly or isHttpOnly; and
  • secure or isSecure.

The names of the fields depend on where they are being accessed (via the API versus in the browser database).
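Moving a cookie from the browser database to the API therefore involves renaming fields. The sketch below follows the pairs in the list above, taking the first name in each pair as the API name and the second as the database column. The to_api_cookie helper and the flag values in the sample row are hypothetical, and it is worth checking the API documentation for the exact types expected for each field.

```python
# Browser database column -> API field name, following the pairs listed above.
FIELD_MAP = {
    "name": "name",
    "value": "value",
    "host": "domain",
    "path": "path",
    "expiry": "expires",
    "isHttpOnly": "httpOnly",
    "isSecure": "secure",
}


def to_api_cookie(row):
    """Rename browser database fields to their API equivalents."""
    cookie = {FIELD_MAP[key]: value for key, value in row.items() if key in FIELD_MAP}
    # SQLite stores flags as 0/1, so convert them to booleans.
    for flag in ("httpOnly", "secure"):
        if flag in cookie:
            cookie[flag] = bool(cookie[flag])
    return cookie


# Sample row: the flag values are illustrative, not taken from a real profile.
row = {
    "name": "checkersZA-preferredStore",
    "value": "40206",
    "host": "www.checkers.co.za",
    "path": "/",
    "expiry": 1769346703,
    "isHttpOnly": 0,
    "isSecure": 0,
}

print(to_api_cookie(row))
```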

Enough background. It’s time to crack open the Zyte API.

Getting Response Cookies

By default the Zyte API will not provide any information on the cookies returned with a response. But you can easily get this extra information. Set the "responseCookies" key (see documentation) to receive the response cookies.

Let’s take a look at the cookies returned by Univar Solutions (the same site I looked at in the previous post).

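import logging
import os
from zyte_api import ZyteAPI

# Configure logging. The format is an assumption chosen to match the output below.
#
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)7s] %(message)s",
)
logger = logging.getLogger(__name__)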

ZYTE_API_KEY = os.getenv("ZYTE_API_KEY")

URL = "https://www.univarsolutions.com/diethylenetriamine-3275000"

client = ZyteAPI(api_key=ZYTE_API_KEY)

# Send a request to the Zyte API, specifying that we want the response cookies.
#
response = client.get(
    {
        "url": URL,
        "browserHtml": True,
        "responseCookies": True,
    }
)

# Unpack the response cookies.
#
for cookie in response.get("responseCookies", []):
    name = cookie.get("name")
    value = cookie.get("value")

    logger.info(f"{name} | {value}")

Here are the names and values for a selection of the returned cookies. 💡 These are the response cookies returned when no cookies are sent along with the request. This effectively mimics the behaviour of visiting the site for the first time in your browser (easily replicated by creating an incognito browser session).

2025-01-23 06:14:12,634 [   INFO] _ga | GA1.1.982475565.1737612849
2025-01-23 06:14:12,634 [   INFO] PHPSESSID | 5b94ad5e5889ea4d1a76b00b35107349
2025-01-23 06:14:12,634 [   INFO] form_key | GS3EWLmDNikyawII
2025-01-23 06:14:12,634 [   INFO] notice_behavior | implied,eu

The _ga cookie is for Google Analytics. The PHPSESSID cookie records the PHP session ID, which should remain constant during a session. The form_key cookie is used by Magento (a PHP e-commerce platform) for form submission, protecting data from Cross-Site Request Forgery (CSRF). Finally, the notice_behavior cookie records preferences regarding cookie consent.

There’s also a cf_clearance cookie (which has a long and convoluted value, so it’s not shown) that contains a token used to maintain session access through Cloudflare.
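Once the response cookies are in hand it is often handy to look one up by name rather than loop over the whole list. The sketch below is one way to do that. The cookies_by_name helper is hypothetical, and the sample list only mimics the shape of the "responseCookies" data using the names and values from the log above.

```python
def cookies_by_name(response_cookies):
    """Index a list of cookie dictionaries by cookie name."""
    return {cookie["name"]: cookie for cookie in response_cookies or []}


# Sample data in the shape of the "responseCookies" list (values from the log above).
response_cookies = [
    {"name": "PHPSESSID", "value": "5b94ad5e5889ea4d1a76b00b35107349"},
    {"name": "form_key", "value": "GS3EWLmDNikyawII"},
    {"name": "notice_behavior", "value": "implied,eu"},
]

cookies = cookies_by_name(response_cookies)

# Pull out the session ID and check whether a Cloudflare token was returned.
session_id = cookies["PHPSESSID"]["value"]
has_clearance = "cf_clearance" in cookies

print(session_id, has_clearance)
```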

Setting Request Cookies

It’s useful to understand the cookies sent back with the response. However, this is completely passive. What if you want to actively specify the cookies that accompany a request? Use the "requestCookies" key (see documentation).

I’ll start with the httpbin service, which allows me to verify that the request cookies propagate through the Zyte API and are forwarded to the target site.

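import json
import logging
import os
from base64 import b64decode
from zyte_api import ZyteAPI

# Configure logging. The format is an assumption chosen to match the output below.
#
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s [%(levelname)7s] %(message)s",
)
logger = logging.getLogger(__name__)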

ZYTE_API_KEY = os.getenv("ZYTE_API_KEY")

URL = "https://httpbin.org/cookies"

# Create a couple of cookies. Names and values are chosen for illustration only.
# Each cookie is a separate dictionary.
#
cookies = [
    {
        "name": "location",
        "value": "London",
        "domain": ".httpbin.org",
    },
    {
        "name": "username",
        "value": "datawookie",
        "domain": ".httpbin.org",
        "path": "/cookies",
    },
]

client = ZyteAPI(api_key=ZYTE_API_KEY)

# Send a request to the Zyte API, specifying the request cookies.
#
response = client.get(
    {
        "url": URL,
        "httpResponseBody": True,
        "requestCookies": cookies,
    }
)

# The response body is Base64 encoded, so decode it before parsing the JSON.
#
body = response.get("httpResponseBody")
if body:
    data = b64decode(body).decode("utf-8")

    logger.info(json.loads(data))

The response returned by https://httpbin.org/cookies includes the two specified cookies. 📢 This is not the way that most sites react to request cookies. This httpbin service is specifically for interrogating requests.

2025-01-25 07:10:01,478 [   INFO] {'cookies': {'location': 'London', 'username': 'datawookie'}}
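Rather than eyeballing the log, the comparison between the cookies sent and the cookies echoed back can be automated. The sketch below is hedged accordingly: echoed_cookies and missing_cookies are hypothetical helpers, and the response is simulated (built by hand in the same Base64 shape as "httpResponseBody") so that it runs without an API key.

```python
import json
from base64 import b64decode, b64encode


def echoed_cookies(response):
    """Decode the Base64 body of an httpbin /cookies response into a dictionary."""
    body = response.get("httpResponseBody")
    if not body:
        return {}
    return json.loads(b64decode(body).decode("utf-8")).get("cookies", {})


def missing_cookies(sent, response):
    """Return the names of sent cookies that were not echoed back unchanged."""
    echoed = echoed_cookies(response)
    return [c["name"] for c in sent if echoed.get(c["name"]) != c["value"]]


sent = [
    {"name": "location", "value": "London", "domain": ".httpbin.org"},
    {"name": "username", "value": "datawookie", "domain": ".httpbin.org"},
]

# Simulated API response, standing in for the result of client.get().
payload = json.dumps({"cookies": {"location": "London", "username": "datawookie"}})
response = {"httpResponseBody": b64encode(payload.encode("utf-8")).decode("ascii")}

print(missing_cookies(sent, response))
```

An empty list means every cookie made the round trip. Any names in the list point to cookies that were dropped or altered along the way.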

That feels a little dry and theoretical and might leave you wondering “Why?”. Below are a few applications which I hope will give some insight.

Sir Vape Cookies

Sir Vape will only admit you to their site if you declare that you’re over 18 years old. When you visit the site for the first time you’re blocked by a dialog that asks you for your date of birth.

Sir Vape dialog requesting date of birth.

If, for whatever reason, you wanted to use the Zyte API to access the content of the site then this dialog would most certainly get in the way. However, with a bit of investigation I found that the declaration sets a cookie and I can send that cookie along with the request and immediately satisfy the age requirement.

import os
from base64 import b64decode
from zyte_api import ZyteAPI

ZYTE_API_KEY = os.getenv("ZYTE_API_KEY")

URL = "https://www.sirvape.co.za/"

# A cookie that specifies that the request comes from an adult.
#
cookies = [
    {
        "name": "__age_checker-history",
        "value": "pass",
        "domain": ".sirvape.co.za",
    },
]

client = ZyteAPI(api_key=ZYTE_API_KEY)

response = client.get(
    {
        "url": URL,
        "requestCookies": cookies,
        "screenshot": True,
    }
)

with open("screenshot-sirvape.png", "wb") as file:
    file.write(b64decode(response["screenshot"]))

Sending that cookie effectively tells the site that I have already satisfied the age declaration and I can immediately proceed to the content.

Sir Vape landing page without age declaration dialog.
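The screenshot arrives as a Base64 string, so it needs to be decoded before it is written to disk. A small helper can guard against a response that has no screenshot at all. This is only a sketch: save_screenshot is a hypothetical name, and the response below is simulated with placeholder bytes rather than a real image.

```python
import tempfile
from base64 import b64decode, b64encode
from pathlib import Path


def save_screenshot(response, path):
    """Decode the Base64 "screenshot" field and write it to disk.

    Returns the number of bytes written, or 0 if there is no screenshot.
    """
    encoded = response.get("screenshot")
    if not encoded:
        return 0
    return Path(path).write_bytes(b64decode(encoded))


# Simulated response: the bytes are a placeholder, not a real image.
fake_response = {"screenshot": b64encode(b"not-a-real-image").decode("ascii")}

with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "screenshot-example.png"
    written = save_screenshot(fake_response, target)
    saved = target.read_bytes()

print(f"wrote {written} bytes")
```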

Checkers Cookies

Checkers uses a cookie to persist a selected store. This is what the site looks like without any cookies set (or, equivalently, how it would appear upon first visit or in an incognito session).

Checkers landing page with default store selected.

There are two things that I want to address:

  1. The promo banner. This is not really a problem, just an annoyance.
  2. The selected store, which is currently set to the default, Checkers Mall of the North. Since pricing and availability of items may vary by store I want to select another store that’s of more direct relevance to me, specifically Checkers Hyper Gateway.

I can achieve these two goals by setting a pair of cookies:

  • checkersZA-preferredStore — set to a specific numeric store code; and
  • cookie-promo-alerts-popup — set to true.

These cookies and their preferred values were found by inspecting the request submitted by a live browser session.

import os
from base64 import b64decode
from zyte_api import ZyteAPI

ZYTE_API_KEY = os.getenv("ZYTE_API_KEY")

URL = "https://www.checkers.co.za/"

cookies = [
    {
        "name": "checkersZA-preferredStore",
        "value": "40206",
        "domain": ".checkers.co.za",
    },
    {
        "name": "cookie-promo-alerts-popup",
        "value": "true",
        "domain": ".checkers.co.za",
    },
]


client = ZyteAPI(api_key=ZYTE_API_KEY)

response = client.get(
    {
        "url": URL,
        "requestCookies": cookies,
        "screenshot": True,
    }
)

with open("screenshot-checkers.png", "wb") as file:
    file.write(b64decode(response["screenshot"]))

Sending those cookies along with the request achieves both of my objectives. If you look carefully at the screenshot below you’ll see that the promo banner is gone and the required store is selected.

Checkers landing page with specific store selected.
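When cookies are found by inspecting a request in a live browser session, they typically appear as a single Cookie header of name=value pairs separated by semicolons. The sketch below turns such a string into a list of cookie dictionaries like the ones used above. The parse_cookie_header helper is hypothetical, and the domain has to be supplied separately because the header does not carry it.

```python
def parse_cookie_header(header, domain):
    """Convert a Cookie header string into a list of cookie dictionaries."""
    cookies = []
    for pair in header.split(";"):
        # Split on the first "=" only, since values may themselves contain "=".
        name, separator, value = pair.strip().partition("=")
        if name and separator:
            cookies.append({"name": name, "value": value, "domain": domain})
    return cookies


# Header value as it might be copied from the browser's developer tools.
header = "checkersZA-preferredStore=40206; cookie-promo-alerts-popup=true"

for cookie in parse_cookie_header(header, ".checkers.co.za"):
    print(cookie)
```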

Conclusion

Cookies are central to the operation of many modern web sites. Managing cookies effectively can make or break some web scraping projects. The Zyte API provides mechanisms to access the cookies returned with a response and specify the cookies sent along with a request.

In the next post I’ll show how sessions can be used with the Zyte API to make the use of cookies quite painless.

If you need help building web crawlers or advice on data acquisition, then get in touch with Fathom Data. The team has over a decade’s experience building bespoke data solutions. As an AWS Partner they can also set up and maintain your cloud infrastructure.