Avoiding Duplication

Avoiding data duplication is a persistent challenge with acquiring data from websites or APIs. You can try to brute force it: pull the data again and then compare it locally to establish whether it’s fresh or stale. But there are other approaches that, if supported, can make this a lot simpler.

In this post we’ll look at the use of the following headers:

  • Last-Modified
  • If-Modified-Since
  • If-Unmodified-Since
  • ETag and
  • If-None-Match.

Last-Modified

The Last-Modified response header contains the date and time that the remote resource was last modified. Provided that this is correctly implemented on the remote server, this can help you avoid duplication.

Suppose that we’re pulling the US Energy Information Administration’s Weekly Natural Gas Storage Report. These data are available as JSON here.

import requests
import json

response = requests.get("https://ir.eia.gov/ngs/wngsr.json")

data = json.loads(response.content.decode('utf-8-sig'))

The JSON document starts with a UTF-8 Byte Order Mark (BOM), so we decode it with the utf-8-sig codec (which strips the BOM) before unpacking the data. I dumped the formatted JSON here for reference. The response headers (from response.headers) are printed below.

{
  "Content-Type": "text/plain",
  "Content-Length": "7187",
  "Connection": "keep-alive",
  "Last-Modified": "Thu, 10 Oct 2024 14:15:17 GMT",
  "Accept-Ranges": "bytes",
  "Date": "Sat, 12 Oct 2024 04:34:43 GMT",
  "Cache-Control": "public, max-age=30",
  "X-Cache": "Hit from cloudfront",
  "Age": "22"
}

Notice that the Last-Modified header tells us when the data was last updated on the server. If we’ve already downloaded the data since then, there’s no need to grab it again.

In the code above, however, I’d sent a GET request, so the data had already been downloaded. We can be frugal with resources by sending a HEAD request. This will simply retrieve the response headers. Based on these we can then decide whether or not to download the data.

import requests

response = requests.head("https://ir.eia.gov/ngs/wngsr.json")

print(response.headers["Last-Modified"])
Thu, 10 Oct 2024 14:15:17 GMT
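
The decision step can be sketched as follows: parse the Last-Modified value into a timezone-aware datetime and compare it with the time of the previous download. This is only a minimal sketch. The names needs_download and last_download are hypothetical, and it assumes that you record the time of each successful download yourself.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_download(last_modified, last_download):
    """Return True if the resource changed after our previous download.

    last_modified: value of the Last-Modified response header, or None.
    last_download: timezone-aware datetime of our previous download, or None.
    """
    if last_modified is None or last_download is None:
        # No header or no previous download: play it safe and fetch.
        return True
    return parsedate_to_datetime(last_modified) > last_download


# Hypothetical record of when we last pulled the data.
last_download = datetime(2024, 10, 11, 9, 0, tzinfo=timezone.utc)

print(needs_download("Thu, 10 Oct 2024 14:15:17 GMT", last_download))
```

In practice you would pass response.headers.get("Last-Modified") from the HEAD request as the first argument.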

The Last-Modified header is useful. But it’s passive. How about being proactive and specifying a modification date as a request header?

If-Modified-Since

The If-Modified-Since request header can be used to avoid downloading stale data. Not all sites support this. However, you might get a hint from looking at the response headers.

{
  "Server": "nginx/1.18.0 (Ubuntu)",
  "Date": "Sat, 12 Oct 2024 04:52:37 GMT",
  "Content-Type": "text/xml, application/rss+xml; charset=UTF-8",
  "Transfer-Encoding": "chunked",
  "Connection": "keep-alive",
  "Cache-Control": "private, must-revalidate",
  "Pragma": "no-cache",
  "Expires": "0",
  "Access-Control-Allow-Origin": "*",
  "Access-Control-Allow-Credentials": "true",
  "Access-Control-Allow-Headers": "Accept,Origin,DNT,user_token,
    Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,
    Content-Type,Content-Range,Range",
  "Access-Control-Allow-Methods": "GET,POST,OPTIONS,PUT,DELETE,PATCH"
}

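The Access-Control-Allow-Headers header lists If-Modified-Since. Strictly speaking this header only says which request headers are allowed in cross-origin requests, so it’s a hint rather than a guarantee that the If-Modified-Since request header is supported.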
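
If you want to look for this hint programmatically, something like the sketch below might do. The function name advertises_header is hypothetical, and a positive result is only a hint: the header describes what is allowed in cross-origin requests, not what the server actually honours.

```python
def advertises_header(response_headers, name):
    """Check whether `name` is listed in Access-Control-Allow-Headers."""
    allowed = response_headers.get("Access-Control-Allow-Headers", "")
    listed = {item.strip().lower() for item in allowed.split(",")}
    return name.lower() in listed


# Abbreviated copy of the response headers shown above.
sample = {
    "Access-Control-Allow-Headers": "Accept,Origin,DNT,user_token,"
    "Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,"
    "Content-Type,Content-Range,Range",
}

print(advertises_header(sample, "If-Modified-Since"))
print(advertises_header(sample, "If-None-Match"))
```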

The way that the If-Modified-Since request header works is as follows:

  • if the resource has been modified then you’ll get a 200 response
  • otherwise you’ll get a 304 response.

Let’s use the same site as before (which, incidentally, does not publicise its support of the If-Modified-Since header). We’re going to make two requests: for the first we’ll use a time before the last modified time from above (so we know that the content has changed since then!) and for the second we’ll use a time just a few minutes ago.

import requests

URL = "https://ir.eia.gov/ngs/wngsr.json"

headers = {
  "If-Modified-Since": "Thu, 10 Oct 2024 04:00:00 GMT" # A few days ago.
}

response = requests.get(URL, headers=headers)

print(response.status_code)
print(len(response.content))

headers = {
  "If-Modified-Since": "Sat, 12 Oct 2024 04:00:00 GMT" # A few minutes ago.
}

response = requests.get(URL, headers=headers)

print(response.status_code)
print(len(response.content))

Now check the output. The results of the first request are:

200
7187

The status code is 200 (indicating that the resource has been modified since the time we specified) and we get 7187 bytes of data back. What about the second request?

304
0

The 304 status code indicates that the resource has not been modified and we get an empty response (0 bytes of data).

There’s no need to first check with HEAD. You can just fire off a GET request and if the data has not been modified then you’ll get nothing back.
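
That pattern could be wrapped in a small helper along the following lines. This is a sketch under a few assumptions: fetch_if_modified is a hypothetical name, session is assumed to be a requests.Session (or anything with a compatible get method), and the timeout of 30 seconds is arbitrary.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def fetch_if_modified(session, url, since):
    """GET url only if it changed after `since` (an aware UTC datetime).

    `session` is assumed to be a requests.Session, or any object whose
    .get(url, headers=..., timeout=...) returns a response with
    .status_code, .content and .raise_for_status().
    Returns the response body as bytes, or None on 304 Not Modified.
    """
    headers = {"If-Modified-Since": format_datetime(since, usegmt=True)}
    response = session.get(url, headers=headers, timeout=30)
    if response.status_code == 304:
        return None
    response.raise_for_status()  # Surface 4xx/5xx instead of treating them as data.
    return response.content


# Usage (needs network access):
#
#   import requests
#   since = datetime(2024, 10, 12, 4, 0, tzinfo=timezone.utc)
#   body = fetch_if_modified(requests.Session(), "https://ir.eia.gov/ngs/wngsr.json", since)
#   print("unchanged" if body is None else f"{len(body)} bytes")
```

Returning None for a 304 keeps the calling code simple: anything other than None is fresh data.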

🚨 The time format used for this header ("%a, %d %b %Y %H:%M:%S GMT") is very specific! The time must be in GMT and the day and month names must be the English abbreviations.
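
One way to produce a correctly formatted value from a datetime is sketched below. It uses format_datetime from the standard library’s email.utils module, which always emits English day and month names and, with usegmt=True, the literal GMT suffix. The constant HTTP_DATE_FORMAT is a hypothetical name used only for comparison.

```python
from datetime import datetime, timezone
from email.utils import format_datetime

HTTP_DATE_FORMAT = "%a, %d %b %Y %H:%M:%S GMT"

# usegmt=True requires an aware datetime whose tzinfo is timezone.utc.
when = datetime(2024, 10, 12, 4, 0, 0, tzinfo=timezone.utc)

# Locale-independent: always produces English day and month names.
print(format_datetime(when, usegmt=True))

# Gives the same result under an English (or "C") locale, but %a and %b
# follow the current locale, so elsewhere this can yield an invalid header.
print(when.strftime(HTTP_DATE_FORMAT))
```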

There’s also an If-Unmodified-Since request header which performs a somewhat related role, but it applies mainly to state-changing requests such as POST or PUT rather than to GET or HEAD requests.

ETag

ETags (or Entity Tags) are a pivotal component of website caching. An ETag is an opaque identifier for a specific version of a resource, often a hash of its content, which can be used to check whether or not the resource has been updated. First send a GET request to the Fastly homepage.

import requests

response = requests.get("https://www.fastly.com/")

Here are selected response headers:

{
  "Connection": "keep-alive",
  "Content-Length": "191681",
  "content-type": "text/html",
  "etag": "\"57aee7bfdde6404f4c2d55dcb79b4ec4\"",
  "Content-Encoding": "gzip",
  "Date": "Sat, 12 Oct 2024 06:11:04 GMT",
  "X-Cache": "MISS, HIT, HIT",
  "X-Cache-Hits": "0, 55, 1",
  "X-Timer": "S1728713465.766493,VS0,VE1",
  "Vary": "Accept-Encoding",
  "X-XSS-Protection": "1; mode=block",
  "Cache-Control": "max-age=0, must-revalidate"
}

Looking at the values of specific response headers we see that 191681 bytes were returned. The etag header is typically a hashed (most commonly MD5 or SHA-1) representation of the current content on the site. The Cache-Control header indicates that the data should be considered immediately stale and should be revalidated on all subsequent requests.
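
If you want to act on Cache-Control in code, the header first needs to be split into its directives. Below is a rough sketch. The function parse_cache_control is a hypothetical helper, and the simple comma split does not cope with quoted values that themselves contain commas.

```python
def parse_cache_control(value):
    """Split a Cache-Control header into a dict of directives.

    Directives without a value (such as must-revalidate) map to True.
    """
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, arg = part.partition("=")
        directives[name.strip().lower()] = arg.strip().strip('"') if arg else True
    return directives


directives = parse_cache_control("max-age=0, must-revalidate")

# A max-age of zero (or a no-cache directive) means: revalidate every time.
must_check = directives.get("max-age") == "0" or "no-cache" in directives
print(directives, must_check)
```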

We can use the If-None-Match request header to pass along an ETag.

import requests

URL = "https://www.fastly.com/"

headers = {
  "If-None-Match": '"57aee7bfdde6404f4c2d55dcb79b4ec4"' # Current etag
}

response = requests.get(URL, headers=headers) # 304

headers = {
  "If-None-Match": '"ebeb4dac1352c124452375a81286c21b"' # Older etag
}

response = requests.get(URL, headers=headers) # 200

Depending on whether or not the provided ETag matches the current ETag, the response status code will be either:

  • 304 — resource has not been modified (ETag matches) or
  • 200 — resource has been modified (ETag has been superseded).

In the case of a 304 status the response will be empty.

🚨 Note that the value of the etag response header is quoted and the value of the If-None-Match request header also needs to be quoted! Also, the If-None-Match header can contain a comma-separated list of quoted ETags.
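
Putting this together, a conditional request based on ETags might look like the sketch below. The name fetch_if_changed is hypothetical, session is assumed to be a requests.Session (or anything with a compatible get method), and the ETags are passed along exactly as they were received, quotes included.

```python
def fetch_if_changed(session, url, known_etags):
    """GET url unless its current ETag is one we already hold.

    known_etags: list of ETag strings as received, quotes included.
    Returns (body, etag) where body is None on 304 Not Modified.
    """
    headers = {}
    if known_etags:
        # Several ETags are sent as a comma-separated list.
        headers["If-None-Match"] = ", ".join(known_etags)
    response = session.get(url, headers=headers, timeout=30)
    if response.status_code == 304:
        return None, response.headers.get("ETag")
    response.raise_for_status()
    return response.content, response.headers.get("ETag")


# Usage (needs network access):
#
#   import requests
#   body, etag = fetch_if_changed(
#       requests.Session(),
#       "https://www.fastly.com/",
#       ['"57aee7bfdde6404f4c2d55dcb79b4ec4"'],
#   )
```

Store the returned ETag alongside the data so that it can be sent with the next request.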

There are some ETag variants:

  • Sometimes there can be a “W/” prefix, indicating a “weak” validator.
  • In principle an ETag should represent the uncompressed content of the resource. However, sometimes it will be based on the compressed content, in which case there might be a “-gzip” or “-br” suffix on the ETag.
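
If you store ETags and compare them yourself, these variants can cause spurious mismatches. The sketch below shows one possible way to compare them loosely. Both function names are hypothetical, and treating the “-gzip” and “-br” suffixes as irrelevant is a heuristic based on the conventions described above rather than anything guaranteed by a server.

```python
def split_etag(etag):
    """Return (is_weak, opaque_tag) for an ETag header value."""
    etag = etag.strip()
    is_weak = etag.startswith("W/")
    if is_weak:
        etag = etag[2:]
    return is_weak, etag.strip('"')


def same_content(etag_a, etag_b, suffixes=("-gzip", "-br")):
    """Loose comparison: ignore the W/ prefix and compression suffixes."""

    def core(etag):
        _, tag = split_etag(etag)
        for suffix in suffixes:
            if tag.endswith(suffix):
                return tag[: -len(suffix)]
        return tag

    return core(etag_a) == core(etag_b)


print(split_etag('W/"57aee7bf"'))
print(same_content('"57aee7bf-gzip"', 'W/"57aee7bf"'))
```

When sending an ETag back in If-None-Match, use the value exactly as received and leave the loose comparison for your own bookkeeping.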

Conclusion

Use these headers to make your data acquisition more efficient by only updating when content has actually changed.
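
To benefit from these headers across separate runs of a scraper, the validators need to be stored somewhere between runs. Here is one possible sketch that keeps them in a small JSON file. The function names and the file layout are assumptions, not part of any library.

```python
import json
from pathlib import Path


def load_validators(path):
    """Read previously saved validators, or return {} on the first run."""
    path = Path(path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_validators(path, response_headers):
    """Keep only the ETag and Last-Modified values from a response."""
    validators = {
        name: response_headers[name]
        for name in ("ETag", "Last-Modified")
        if name in response_headers
    }
    Path(path).write_text(json.dumps(validators), encoding="utf-8")
    return validators


def conditional_headers(validators):
    """Build request headers that echo the stored validators back."""
    headers = {}
    if "ETag" in validators:
        headers["If-None-Match"] = validators["ETag"]
    if "Last-Modified" in validators:
        headers["If-Modified-Since"] = validators["Last-Modified"]
    return headers
```

On each run you would call conditional_headers(load_validators(path)), pass the result as the headers argument of requests.get, and call save_validators(path, response.headers) whenever the status code is 200.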