Web Scraping with the Zyte API

Zyte is a data extraction platform, useful for web scraping and data processing at scale. It’s intended to simplify data collection and, based on my experience, it certainly does!

The extensive API documentation is a good place to start. There were three services that were of particular interest to me:

  • HTTP requests
  • browser requests
  • automated extraction

Each of these has a place in different workflows. I’ll dig into them below.

Why Zyte?

But first it’s worth considering why one would use a service like Zyte. Here are the reasons that are relevant to me:

  1. efficiency and scale (infrastructure to allow a high volume of requests)
  2. bypassing challenges (handles CAPTCHAs and IP bans) ⭐
  3. browser automation (no need to run Selenium, Playwright or the like locally) ⭐
  4. data extraction (results in JSON rather than HTML)
  5. compliance (avoid legal issues or running foul of terms of service)
  6. proxies!

I’ve highlighted the most important items for me. Services like Cloudflare that throw up browser challenges can be rather belligerent. And it can also be tedious to use browser automation simply to retrieve page content. The Zyte API has saved me a lot of time with challenging targets and helped me build simpler and more robust scrapers.

We’re going to try to extract price data from the UnivarSolutions site. This is an instructive target since the site is being served via Cloudflare.

UnivarSolutions product page for Diethylenetriamine.

Furthermore, some of the key content on the site is rendered via JavaScript. Below is the site with JavaScript disabled.

UnivarSolutions product page for Diethylenetriamine with JavaScript disabled.

None of the important pricing information is available. Since this is likely the most compelling information on the page, it will determine our approach to scraping.

API

In most cases you simply need to send a POST request to the https://api.zyte.com/v1/extract endpoint. You need to provide an API key for authentication. Actions are specified via the payload, which includes the target URL and the details of what you want done.

Direct Approach

Let’s see what happens when we send a GET request directly to the target URL.

import logging
import httpx
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)7s] %(message)s")
logger = logging.getLogger(__name__)

URL = "https://www.univarsolutions.com/diethylenetriamine-3275000"

response = httpx.get(URL)

logger.info(f"Status code: {response.status_code}.")

soup = BeautifulSoup(response.content, "html.parser")

title = soup.select_one("title")
logger.info(f"Title: {title}.")

Here’s the output.

2025-01-19 06:08:32,609 [   INFO] Status code: 403.
2025-01-19 06:08:32,611 [   INFO] Title: <title>Just a moment...</title>.

The 403 (Forbidden) status code indicates that the request has been blocked. And the “Just a moment…” page title suggests that the response contains a JavaScript challenge. Ah, Cloudflare! 🚧 Clearly this is going to be an uphill battle!

HTTP

If you’re scraping a static site or hitting an API then you only need to send a simple HTTP(S) request. You do this by setting the "httpResponseBody" key to true in the request payload.

import logging
import os
import httpx
from base64 import b64decode
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)7s] %(message)s")
logger = logging.getLogger(__name__)

ZYTE_API_KEY = os.getenv("ZYTE_API_KEY")

URL = "https://www.univarsolutions.com/diethylenetriamine-3275000"

response = httpx.post(
    "https://api.zyte.com/v1/extract",
    auth=(ZYTE_API_KEY, ""),
    json={
        "url": URL,
        "httpResponseBody": True,
        "httpRequestMethod": "GET",  # (default)
    },
)

logger.info(f"API status code:  {response.status_code}.")

# RESPONSE PAYLOAD =============================================================

payload = response.json()

url = payload.get("url")
status_code = payload.get("statusCode")

logger.info(f"Site status code: {status_code}.")
logger.info(f"URL: {url}.")

body = payload.get("httpResponseBody")
if body:
    html = b64decode(body).decode("utf-8")

    soup = BeautifulSoup(html, "lxml")

    title = soup.select_one("head > title")

    wrapper = soup.select_one("span.price-wrapper")
    price = wrapper.get("data-price-amount") if wrapper else None

    logger.info(f"Title: {title.text}.")
    logger.info(f"Price: {price}.")

    with open("univarsolutions-http.html", "wt") as f:
        f.write(html)

It’s important to differentiate between the status code of the API (was the API request successful?) and the status code of the target site (was Zyte’s request to the target site successful?). In this case both yield 200 (OK) status codes. ✅

2025-01-19 06:13:59,876 [   INFO] API status code:  200.
2025-01-19 06:13:59,879 [   INFO] Site status code: 200.

Parsing the content of the HTML yields the following:

2025-01-20 04:48:52,681 [   INFO] URL: https://www.univarsolutions.com/diethylenetriamine-3275000.
2025-01-20 04:48:52,740 [   INFO] Title: Diethylenetriamine | Univar Solutions.
2025-01-20 04:48:52,740 [   INFO] Price: None.

No price data are retrieved (despite using the appropriate selector!) because the price is rendered via JavaScript. Since the page was not processed by a JavaScript interpreter this is precisely what was expected. If we want those data then we need to use browser automation, which we’ll get to in a moment.

🚨 Zyte is not infallible and I found that after running this script a few times in succession (with a couple of minutes delay between each run) I received a 520 (Unknown Status Code) response from the API. According to the Zyte API documentation:

Zyte API sends an HTTP 520 response when a temporary error, usually a ban that could not be avoided in a timely fashion, prevents downloading the requested URL.

In this case you don’t get back any HTML, so there’s nothing to parse.

2025-01-19 06:27:40,082 [   INFO] API status code:  520.
2025-01-19 06:27:40,082 [   INFO] Site status code: None.
2025-01-19 06:27:40,082 [   INFO] URL: None.
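
Since these 520s are transient, a simple retry with back-off often suffices. A rough sketch (the helper and the back-off scheme are my own, not part of the Zyte API):

```python
import time

def fetch_with_retries(send, retries: int = 3, backoff: float = 5.0):
    """Call send() until it returns a non-520 response (hypothetical helper)."""
    response = None
    for attempt in range(retries):
        response = send()
        if response.status_code != 520:
            break
        time.sleep(backoff * (attempt + 1))  # linear back-off between attempts
    return response

# Usage (illustrative): wrap the httpx.post() call from the script above.
#   response = fetch_with_retries(lambda: httpx.post("https://api.zyte.com/v1/extract", ...))
```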

Extensions

Zyte’s HTTP requests offer a number of optional features.

  • Multiple URLs — You can submit a list of URLs, in which case each will be requested sequentially.
  • Request Method — Although the API will send a GET request by default, you can specify any other HTTP method.
  • Request Body — Use the "httpRequestText" or "httpRequestBody" key to specify a request payload.
  • Request Headers — Use the "customHttpRequestHeaders" key to specify custom headers.
  • Forms — It’s a bit of a chore but you can submit forms too.
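
Several of these options can be combined in a single request payload. A sketch (the target URL, body and header values are illustrative):

```python
# Illustrative payload: POST a JSON body with a custom header via the Zyte API.
payload = {
    "url": "https://httpbin.org/post",  # stand-in target
    "httpResponseBody": True,
    "httpRequestMethod": "POST",
    "httpRequestText": '{"query": "diethylenetriamine"}',
    "customHttpRequestHeaders": [
        {"name": "Content-Type", "value": "application/json"},
    ],
}
```

This dictionary would be passed as the json= argument of the POST to the extract endpoint, exactly as before.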

Browser

You can use Zyte’s browser automation to retrieve the target via a browser. This enables many of the features that you’d have access to locally via Selenium, Playwright or Puppeteer. To enable browser automation you set the "browserHtml" field in the API request payload.

import logging
import os
import httpx
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)7s] %(message)s")
logger = logging.getLogger(__name__)

ZYTE_API_KEY = os.getenv("ZYTE_API_KEY")

URL = "https://www.univarsolutions.com/diethylenetriamine-3275000"

response = httpx.post(
    "https://api.zyte.com/v1/extract",
    auth=(ZYTE_API_KEY, ""),
    json={
        "url": URL,
        "browserHtml": True,
    },
    timeout=30,
)

logger.info(f"API status code:  {response.status_code}.")

# RESPONSE PAYLOAD =============================================================

payload = response.json()

url = payload.get("url")
status_code = payload.get("statusCode")

logger.info(f"Site status code: {status_code}.")
logger.info(f"URL: {url}.")

html = payload["browserHtml"]

soup = BeautifulSoup(html, features="lxml")

title = soup.select_one("head > title")

wrapper = soup.select_one("span.price-wrapper")
price = wrapper.get("data-price-amount") if wrapper else None

logger.info(f"Title: {title.text}.")
logger.info(f"Price: {price}.")

with open("univarsolutions-browser.html", "wt") as f:
    f.write(html)

💡 I needed to extend the timeout (default is just 5 seconds) to cater for the slightly longer response time from the Zyte API.

Both the API and the target site give a 200 (OK) status code.

2025-01-19 06:56:55,282 [   INFO] API status code:  200.
2025-01-19 06:56:55,284 [   INFO] Site status code: 200.

And now, since the page is rendered via a JavaScript interpreter, we are also able to extract the price data.

2025-01-19 06:56:55,316 [   INFO] Title: Diethylenetriamine | Univar Solutions.
2025-01-19 06:56:55,316 [   INFO] Price: 1,305.33.

Extensions

Zyte’s browser automation is flexible. Here are some of the options available:

  • Multiple URLs — As for HTTP requests.
  • Screenshots — Use the "screenshot" key to request a screenshot from the browser. This can be particularly useful for debugging.
  • Actions — Use actions to specify a series of browser actions to perform before capturing output. There’s enormous flexibility in what can be done, from simply waiting to detailed page interactions.
  • Network Capture — You can capture network responses received by the browser. You can also filter those responses since there can be an awful lot of network activity behind the scenes!
  • JavaScript Toggle — Although you generally want the browser to be running its JavaScript engine, it’s possible to toggle it off. One immediate advantage of this is faster response times. Whether or not disabling JavaScript is feasible will depend on the requirements of the target site.
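
A request payload combining a couple of these browser options might be sketched as follows (the action names follow my reading of Zyte’s actions schema and the selector is illustrative, so check the documentation before relying on them):

```python
# Illustrative payload: wait for the price element to appear, scroll to the
# bottom of the page, then return both the rendered HTML and a screenshot.
payload = {
    "url": "https://www.univarsolutions.com/diethylenetriamine-3275000",
    "browserHtml": True,
    "screenshot": True,
    "actions": [
        {
            "action": "waitForSelector",
            "selector": {"type": "css", "value": "span.price-wrapper"},
        },
        {"action": "scrollBottom"},
    ],
}
```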

Python Package

The zyte-api Python package wraps all of this functionality, making it easily accessible from within a Python session without having to resort to direct interactions with the API. The package is Open Source and the repository is hosted on GitHub. Installation is simple.

pip install zyte-api

As you’ll see below, this package is really just a light wrapper around the API.

Automatic Extraction

The Zyte API also offers automated extraction. This comes close to a generic “scraper in the cloud”. You specify one or more structured data types and the corresponding content is extracted from the page and returned as JSON. Here are some of the data types available:

  • product — details of a single product
  • article — the body and metadata of an article or blog post
  • jobPosting — details of a job advertisement

The product data type seems most suitable for the target we have been considering.

This time, instead of accessing the API directly, we’ll make use of the Python package installed above. We still pass in a dictionary with the specifications of what we want done. The result is a dictionary, so there’s no need to call the json() method on a response object.

import logging
import os
import json
from zyte_api import ZyteAPI

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)7s] %(message)s")
logger = logging.getLogger(__name__)

ZYTE_API_KEY = os.getenv("ZYTE_API_KEY")

URL = "https://www.univarsolutions.com/diethylenetriamine-3275000"
slug = URL.rstrip("/").split("/")[-1]

client = ZyteAPI(api_key=ZYTE_API_KEY)

response = client.get(
    {
        "url": (URL),
        "product": True,
    }
)

url = response.get("url")
status_code = response.get("statusCode")

logger.info(f"Site status code: {status_code}.")
logger.info(f"URL: {url}.")

product = response.get("product")

with open(f"univarsolutions-extract-{slug}.json", "wt") as f:
    json.dump(product, f, indent=2)

Below is the liberally trimmed response (nothing added, just some extra content removed for clarity).

{
  "url": "https://www.univarsolutions.com/diethylenetriamine-3275000",
  "statusCode": 200,
  "product": {
    "name": "Diethylenetriamine, Technical Grade, Liquid, 438 lb Drum",
    "price": "1305.33",
    "currency": "USD",
    "currencyRaw": "$",
    "sku": "70052",
    "metadata": {
      "probability": 0.9896460175514221,
      "dateDownloaded": "2025-01-19T07:05:38Z"
    }
  }
}

For the purpose of comparison, here are the corresponding results for another product:

{
  "url": "https://www.univarsolutions.com/citric-acid-50-food-ko-797167",
  "statusCode": 200,
  "product": {
    "name": "Citric Acid 50% Solution - Food Grade (Kosher) - 575 lb Drum",
    "price": "942.98",
    "currency": "USD",
    "currencyRaw": "$",
    "availability": "InStock",
    "sku": "16141628",
    "metadata": {
      "probability": 0.9894679188728333,
      "dateDownloaded": "2025-01-20T06:08:11Z"
    }
}

The probability field can be used to filter out pages which do not actually conform to the selected data type. For example, requesting product data from a page showing a job post would result in a low probability.
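
This lends itself to a simple filter. A sketch, where the threshold is my own choice and should be tuned to your data:

```python
# Hypothetical filter: keep only extractions the model is confident about.
PROBABILITY_THRESHOLD = 0.1

def is_probably_product(product: dict, threshold: float = PROBABILITY_THRESHOLD) -> bool:
    probability = product.get("metadata", {}).get("probability", 0.0)
    return probability >= threshold

is_probably_product({"metadata": {"probability": 0.9896}})  # True
```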

Proxy Mode

These services (HTTP requests, browser requests and automated extraction) are handy. But what if you have an existing scraper and you just need to submit your requests via a proxy? Perhaps you prefer having more granular control over the scraping process? Or you get a kick out of watching your scraper automagically interact with a browser via Selenium? In this case you can make use of Zyte’s proxy mode.
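
Proxy mode needs no special payload: the API key goes into the username slot of the proxy URL and the password is left empty. A sketch of the URL construction (the client call in the comments is illustrative):

```python
import os

ZYTE_API_KEY = os.getenv("ZYTE_API_KEY", "YOUR_KEY")

# API key as the proxy username, with an empty password.
PROXY_URL = f"http://{ZYTE_API_KEY}:@api.zyte.com:8011"

# This URL can then be handed to any HTTP client or browser driver, e.g.
#   httpx.Client(proxy=PROXY_URL, verify=False)
# (verification disabled here because the proxy intercepts TLS; alternatively
# install Zyte's CA certificate and keep verification on).
```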

We’ll change our target site to one that clearly illustrates how this proxy service works. WhatIsMyIPAddress gives you your IP address and location (or, in this case, the IP address and location of the proxy).

import logging
import os
from playwright.sync_api import sync_playwright, TimeoutError
from bs4 import BeautifulSoup

logging.basicConfig(level=logging.INFO, format="%(asctime)s [%(levelname)7s] %(message)s")
logger = logging.getLogger(__name__)

ZYTE_API_KEY = os.getenv("ZYTE_API_KEY")

URL = "https://whatismyipaddress.com"

CSS = "#ipv4 > a"

with sync_playwright() as p:
    browser = p.chromium.launch(
        proxy={
            "server": "http://api.zyte.com:8011",
            "username": ZYTE_API_KEY,
            "password": "",
        },
        headless=True,
    )
    context = browser.new_context(
        ignore_https_errors=True, viewport={"width": 1920, "height": 1080}
    )
    page = context.new_page()

    page.goto(URL, timeout=60000)

    page.wait_for_selector(CSS, state="visible")

    try:
        page.get_by_role("button", name="AGREE", exact=True).click(timeout=5000)
    except TimeoutError:
        pass

    page.screenshot(path="whatismyipaddress.png", full_page=True)

    html = page.content()

    browser.close()

soup = BeautifulSoup(html, features="lxml")

address = soup.select_one(CSS)
city = soup.select_one(
    ".ip-information .inner .information:nth-of-type(2) span:last-child"
)

logger.info(f"IP: {address.text:15} [{city.text}]")

Run it.

2025-01-22 05:53:36,503 [   INFO] IP: 154.16.212.118  [El Segundo]

The effective IP address is definitely not my usual one. Looking at the screenshot below we can see the location, which is also certainly not where I’m situated! So the proxy has effectively teleported my request halfway around the globe.

WhatIsMyIPAddress site showing IP address and location for proxy.

Now run it a few more times.

2025-01-22 05:59:09,905 [   INFO] IP: 89.117.105.93   [Salt Lake City]
2025-01-22 05:59:51,745 [   INFO] IP: 191.96.47.196   [Chicago]
2025-01-22 06:00:20,663 [   INFO] IP: 181.215.130.50  [New York City]
2025-01-22 06:01:09,879 [   INFO] IP: 191.96.47.129   [Chicago]
2025-01-22 06:01:49,609 [   INFO] IP: 23.94.58.16     [Northfield]
2025-01-22 06:06:24,088 [   INFO] IP: 84.46.235.34    [Los Angeles]
2025-01-22 06:12:33,869 [   INFO] IP: 181.215.131.236 [Dublin]
2025-01-22 06:13:32,364 [   INFO] IP: 89.117.162.163  [Sterling]
2025-01-22 06:14:17,905 [   INFO] IP: 89.117.82.178   [Dallas]
2025-01-22 06:14:54,038 [   INFO] IP: 198.46.253.184  [Buffalo]

The proxy IP is rotating as you’d hope and the location is skipping around. There seems to be a bit of a US bias, but there are other countries too. It’s also apparent that the service providers used for the proxies would be classified as residential.

Conclusion

Web scraping from a target with anti-bot measures doesn’t have to be a headache. The Zyte API and proxy service offer some alternatives that will help you to get around these issues. If you’re curious about how I’ve used the Zyte API to streamline my workflow, let’s have a conversation!