
JSON-LD (JavaScript Object Notation for Linked Data) is a lightweight, flexible and standardised format intended to provide context and meaning to the data on a webpage. It’s easy and convenient for both humans and machines to read and write.
Why JSON-LD?
The primary reason for including JSON-LD data in a webpage is to make the content easier for search engines to parse and classify. However, as we’ll see, JSON-LD data is a game-changer if you’re web scraping.
My heart leaps for joy,
and with my song I praise Him. Psalm 28:7
That is literally the way that I feel when I find that a webpage has JSON-LD data. I probably won’t need to worry about crafting any complex CSS selectors or XPath because most of the data I’m looking for is conveniently embedded in a JSON object. Sometimes there might be important data that’s not included in the JSON-LD, but this is rare. If it’s been well designed then this object will capture most of the pertinent details from the page.
What does JSON-LD look like?
You’ll typically find JSON-LD embedded in the <head> of a webpage within a <script> tag of type application/ld+json. It’s possible to have it in the <body>, but this is generally discouraged because having it in the <head> ensures that it’s available early in the page load.
Here’s a simplified example of JSON-LD data in a <script> tag. It was based on the entry in this position description.
```html
<script type="application/ld+json">
{
  "@context": "http://schema.org/",
  "@type": "JobPosting",
  "identifier": {
    "@type": "PropertyValue",
    "name": "id",
    "value": "16607479"
  },
  "title": "Senior Manufacturing Process Scientist",
  "datePosted": "2025-01-31T16:41:14+00:00",
  "hiringOrganization": {
    "@type": "Organization",
    "name": "Universal Display Corporation"
  },
  "jobLocation": {
    "@type": "Place",
    "address": {
      "@type": "PostalAddress",
      "addressLocality": "Ewing",
      "addressRegion": "NJ"
    }
  }
}
</script>
```
My approach to accessing these data has four phases:
- Get the webpage data (almost always the JSON-LD will be a static part of the page).
- Locate the relevant tag and extract the JSON content.
- Parse the JSON.
- Profit and pleasure.
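The four phases can be sketched end-to-end on a tiny inline page (a stand-in for a live response; in practice the HTML would come from `requests.get(...).text`):

```python
import json

from bs4 import BeautifulSoup

# Phase 1: get the webpage data. A small inline page stands in for a live
# response here so the sketch is self-contained.
HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "http://schema.org/", "@type": "JobPosting", "title": "Research Chemist"}
</script>
</head><body></body></html>
"""

# Phase 2: locate the relevant tag and extract the JSON content.
soup = BeautifulSoup(HTML, "html.parser")
script = soup.find("script", type="application/ld+json")

# Phase 3: parse the JSON.
data = json.loads(script.string)

# Phase 4: profit.
print(data["title"])  # Research Chemist
```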
Reserved Keywords
The JSON-LD standard includes a selection of reserved keywords that have special significance. These have names that begin with an “at” (@) symbol. A few of the most important keywords are:
- @context — the schema or vocabulary (most often the “standard” http://schema.org/);
- @type — the kind of thing that the data describes;
- @id — a unique identifier for the thing; and
- @language — the language used for the corresponding value.
In the sample data above you’ll find both the @context and @type keywords.
Schema
The top-level @context and @type fields tell you where to find the relevant schema, which specifies the structure of the rest of the data. In the example above the type is JobPosting, which means that the data should adhere to the JobPosting schema. Knowing the schema makes the data easier to process because you’ll know which fields to expect and what their names are. Standards are great! 🚀
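One detail worth knowing before you branch on @type: it can hold either a single string or a list of strings (the Allrecipes data later in this post uses a list). A small hypothetical helper that normalises both cases:

```python
def json_ld_types(data: dict) -> list[str]:
    """Return the @type of a JSON-LD object as a list, whether it was a
    string or a list of strings. (Helper name is my own.)"""
    kind = data.get("@type", [])
    return [kind] if isinstance(kind, str) else list(kind)

job = {"@type": "JobPosting", "title": "Research Chemist"}
recipe = {"@type": ["Recipe", "NewsArticle"], "name": "Banana Bread"}

print(json_ld_types(job))     # ['JobPosting']
print(json_ld_types(recipe))  # ['Recipe', 'NewsArticle']
```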
There are many other schemas. Take a look here for a complete list. A few sparked my interest, and you’ll see examples of some of these below.

Example: Job Posting
Let’s start by extracting the JSON-LD data from a Research Chemist position posted on the Universal Display careers pages.
```python
import json
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

URL_BASE = "https://oled.catsone.com"
URL_PATH = "/careers/103309-General/jobs/16508281-Research-Chemist"

# Retrieve the raw HTML for the webpage.
response = requests.get(urljoin(URL_BASE, URL_PATH))
response.raise_for_status()

# Parse the raw HTML.
soup = BeautifulSoup(response.text, "lxml")

# Locate the <script type="application/ld+json"> tag.
script = soup.find("script", type="application/ld+json")

# Load the JSON content of that tag.
data = json.loads(script.string)

# Dump the JSON to a file.
with open("job-posting.json", "w") as f:
    json.dump(data, f, indent=2)
```
The script retrieves the raw HTML for the webpage via a GET request using the get() function from the requests package. It then parses the HTML using Beautiful Soup and locates the <script> tag with JSON-LD content. The content of that tag is parsed as JSON and then dumped to a file. The resulting JSON data can be found here.
In practice you might extract relevant data from the JSON document rather than persisting the whole thing. Below is the lightly redacted (for clarity!) version of the JSON-LD data. These data capture all of the pertinent details from the webpage and are much easier to work with.
```json
{
  "@context": "http://schema.org/",
  "@type": "JobPosting",
  "identifier": {
    "@type": "PropertyValue",
    "name": "id",
    "value": "16508281"
  },
  "title": "Research Chemist",
  "datePosted": "2024-07-19T12:54:42+00:00",
  "hiringOrganization": {
    "@type": "Organization",
    "name": "Universal Display Corporation",
    "sameAs": "https://oled.com/about/careers/"
  },
  "jobLocation": {
    "@type": "Place",
    "address": {
      "@type": "PostalAddress",
      "addressLocality": "New Castle",
      "addressRegion": "DE",
      "postalCode": ""
    }
  }
}
```
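As an illustration of extracting just the relevant fields, here’s a minimal sketch that flattens the nested JSON-LD into a simple record (the data is a trimmed copy embedded for self-containment, and the flat field names are my own):

```python
# A trimmed copy of the job posting parsed above.
data = {
    "@type": "JobPosting",
    "identifier": {"@type": "PropertyValue", "name": "id", "value": "16508281"},
    "title": "Research Chemist",
    "datePosted": "2024-07-19T12:54:42+00:00",
    "hiringOrganization": {"@type": "Organization", "name": "Universal Display Corporation"},
    "jobLocation": {
        "@type": "Place",
        "address": {"@type": "PostalAddress", "addressLocality": "New Castle", "addressRegion": "DE"},
    },
}

# Flatten the nested JSON-LD into a simple record (field names are my own).
address = data["jobLocation"]["address"]
record = {
    "id": data["identifier"]["value"],
    "title": data["title"],
    "company": data["hiringOrganization"]["name"],
    "posted": data["datePosted"],
    "location": f'{address["addressLocality"]}, {address["addressRegion"]}',
}
print(record["location"])  # New Castle, DE
```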
📢 Entries like the description field often contain HTML and Unicode entities. For the purposes of persisting these data you might want to clean them first. I’ve got a post with some suggestions on how to do that.
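A quick sketch of one way to do that cleaning with Beautiful Soup (the sample description string is invented for illustration): parsing the fragment decodes the entities, and get_text() drops the tags.

```python
from bs4 import BeautifulSoup

# A description field as it might arrive in JSON-LD: HTML tags plus entities
# (this particular string is invented for illustration).
description = "<p>We&rsquo;re seeking a <b>Research Chemist</b> &amp; team player.</p>"

# Parsing decodes the entities; get_text() strips the markup.
text = BeautifulSoup(description, "html.parser").get_text()
print(text)  # We’re seeking a Research Chemist & team player.
```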
Example: Recipe
Gathering job postings is a useful application of web scraping and can have various commercial applications. However, the intrigue of job postings pales into insignificance relative to that of food! 🍣🍔🍕🥗
If you wanted to gather recipe data, then Allrecipes would be a good place to start. Let’s retrieve the specifications of Janet’s Famous Rich Banana Bread. We could scrape the page the old-fashioned way, but that’d require some CSS or XPath gymnastics. Is there JSON-LD data? Yes, indeed there is! 🤸
```python
import json

import requests
from bs4 import BeautifulSoup

URL = "https://www.allrecipes.com/recipe/17066/janets-rich-banana-bread/"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:135.0) Gecko/20100101 Firefox/135.0",
}

response = requests.get(URL, headers=headers)
response.raise_for_status()

soup = BeautifulSoup(response.text, "lxml")

script = soup.find("script", type="application/ld+json")

data = json.loads(script.string)

with open("recipe.json", "w") as f:
    json.dump(data, f, indent=2)
```
The data below have been vigorously redacted but still include the most interesting bits. This site publishes a spectacular volume of data in JSON-LD. 👏 There’s really no need to manually scrape anything from the page because everything of interest is available in easily parsed JSON.
```json
{
  "@context": "http://schema.org",
  "@type": [
    "Recipe",
    "NewsArticle"
  ],
  "datePublished": "2000-02-29T19:52:17.000-05:00",
  "dateModified": "2024-01-18T10:17:51.124-05:00",
  "name": "Janet's Rich Banana Bread",
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.8",
    "ratingCount": "10620"
  },
  "cookTime": "PT60M",
  "nutrition": {
    "@type": "NutritionInformation",
    "calories": "291 kcal",
    "carbohydrateContent": "36 g",
    "cholesterolContent": "30 mg",
    "fiberContent": "1 g",
    "proteinContent": "3 g",
    "saturatedFatContent": "8 g",
    "sodiumContent": "295 mg",
    "fatContent": "16 g",
    "unsaturatedFatContent": "0 g"
  },
  "prepTime": "PT10M",
  "recipeIngredient": [
    "1 cup white sugar",
    "0.5 cup butter, melted",
    "2 eggs",
    "1 teaspoon vanilla extract",
    "1.5 cups all-purpose flour",
    "1 teaspoon baking soda",
    "0.5 teaspoon salt",
    "0.5 cup sour cream",
    "0.5 cup chopped walnuts",
    "2 medium bananas, sliced"
  ]
}
```
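Note that cookTime and prepTime are ISO 8601 durations (PT60M is sixty minutes). There’s no stdlib parser for these, but a minimal sketch covering only the hours/minutes/seconds forms seen here might look like this:

```python
import re
from datetime import timedelta

def parse_duration(value: str) -> timedelta:
    """Parse simple ISO 8601 durations like PT60M or PT1H51M.
    (A sketch: only time components are handled, not dates.)"""
    match = re.fullmatch(r"PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?", value)
    if match is None:
        raise ValueError(f"Unsupported duration: {value}")
    hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

print(parse_duration("PT60M"))   # 1:00:00
print(parse_duration("PT10M"))   # 0:10:00
```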
📢 The name field includes an example of an embedded HTML entity. Refer to Handling HTML Entities and Unicode.
This bread is strong. It does not crumble like weak kingdoms before my horde. The richness of bananas is worthy of a conqueror's feast. But why stop at walnuts? Add more: honey, meat, the spoils of war! A true warrior eats to gain strength, not just for pleasure. And yet, even I, Attila, am pleased. Attila the Hun (Possibly misattributed)
Example: Book
Love food. Also love books. If you’re a bibliophile and not on Goodreads then you should take a look. It’s a great place to find unbiased reviews and recommendations (the shill proportion seems to be low!). It also happens to be a decent site for web scraping.
Let’s grab the data for The Unlikely Pilgrimage of Harold Fry, which I vigorously recommend.
```python
import json

import requests
from bs4 import BeautifulSoup

URL = "https://www.goodreads.com/book/show/13227454-the-unlikely-pilgrimage-of-harold-fry"

response = requests.get(URL)
response.raise_for_status()

soup = BeautifulSoup(response.text, "lxml")

script = soup.find("script", type="application/ld+json")

data = json.loads(script.string)

print(json.dumps(data, indent=2))
```
Here’s the lightly pruned result. I removed the lengthy list of awards!
```json
{
  "@context": "https://schema.org",
  "@type": "Book",
  "name": "The Unlikely Pilgrimage of Harold Fry (Harold Fry, #1)",
  "bookFormat": "Hardcover",
  "numberOfPages": 320,
  "inLanguage": "English",
  "isbn": "9780812993295",
  "author": [
    {
      "@type": "Person",
      "name": "Rachel Joyce",
      "url": "https://www.goodreads.com/author/show/5309857.Rachel_Joyce"
    }
  ],
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": 3.94,
    "ratingCount": 187262,
    "reviewCount": 21148
  }
}
```
It doesn’t capture all of the reviews, but the core information relating to the book is all there.
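One practical wrinkle worth flagging: find() returns only the first application/ld+json tag, and some pages carry more than one. A hedged sketch of collecting them all and picking out the relevant object (the inline HTML here is illustrative, not the actual Goodreads markup):

```python
import json

from bs4 import BeautifulSoup

# Illustrative sample, not the real page markup.
HTML = """
<head>
<script type="application/ld+json">{"@type": "BreadcrumbList"}</script>
<script type="application/ld+json">{"@type": "Book", "name": "The Unlikely Pilgrimage of Harold Fry"}</script>
</head>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Gather every JSON-LD block on the page.
blocks = [
    json.loads(tag.string)
    for tag in soup.find_all("script", type="application/ld+json")
]

# Pick out the one describing the book.
book = next(block for block in blocks if block.get("@type") == "Book")
print(book["name"])  # The Unlikely Pilgrimage of Harold Fry
```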
Example: Music
Reading and music go well together. Let’s take a look at what JSON-LD data are available for music on AllMusic. We’ll grab the data for “The Silence” by Manchester Orchestra. If you haven’t heard this, then do yourself a favour and listen to it now.
If you’re anything like me then you’ll need to take a few deep breaths to get over a visceral reaction to that song.
Let’s get back to web scraping.
```python
import json

import requests
from bs4 import BeautifulSoup

URL = "https://www.allmusic.com/song/the-silence-mt0054522571"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:135.0) Gecko/20100101 Firefox/135.0",
}

response = requests.get(URL, headers=headers)
response.raise_for_status()

soup = BeautifulSoup(response.text, "lxml")

script = soup.find("script", type="application/ld+json")

data = json.loads(script.string)

print(json.dumps(data, indent=2))
```
This is the complete JSON-LD data. It contains all of the key information (name of the song, album and artist) with links to associated pages (each of which also includes JSON-LD data).
```json
{
  "@context": "http://schema.org",
  "@type": "MusicRecording",
  "url": "https://www.allmusic.com/song/the-silence-mt0054522571",
  "name": "The Silence",
  "byArtist": [
    {
      "@type": "MusicGroup",
      "name": "Manchester Orchestra",
      "url": "https://www.allmusic.com/artist/manchester-orchestra-mn0000485793"
    }
  ],
  "inAlbum": [
    {
      "@type": "MusicAlbum",
      "name": "A Black Mile to the Surface",
      "url": "https://www.allmusic.com/album/a-black-mile-to-the-surface-mw0003065955",
      "byArtist": [
        {
          "@type": "MusicGroup",
          "name": "Manchester Orchestra",
          "url": "https://www.allmusic.com/artist/manchester-orchestra-mn0000485793"
        }
      ],
      "datePublished": "2017-07-28"
    }
  ]
}
```
Example: Movie
As a final example, let’s take a look at the JSON-LD data published by IMDb. We’ll pull the data for The Road, a movie I have not seen. But it’s based on a book by Cormac McCarthy that I loved.

Here’s the script to retrieve the JSON-LD. Essentially the same formula as the previous examples.
```python
import json

import requests
from bs4 import BeautifulSoup

URL = "https://www.imdb.com/title/tt0898367/"

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:135.0) Gecko/20100101 Firefox/135.0",
}

response = requests.get(URL, headers=headers)
response.raise_for_status()

soup = BeautifulSoup(response.text, "lxml")

script = soup.find("script", type="application/ld+json")

data = json.loads(script.string)

print(json.dumps(data, indent=2))
```
Below is the abridged data. I sliced out a few fields with long content, but this captures the core data. Essentially everything that you are likely to need, with links to other resources (all of which in turn have their own JSON-LD data!).
```json
{
  "@context": "https://schema.org",
  "@type": "Movie",
  "url": "https://www.imdb.com/title/tt0898367/",
  "name": "The Road",
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingCount": 261353,
    "bestRating": 10,
    "worstRating": 1,
    "ratingValue": 7.2
  },
  "genre": [
    "Drama",
    "Thriller"
  ],
  "datePublished": "2010-01-08",
  "keywords": "apocalypse,post apocalypse,on the road,based on novel,cataclysm",
  "actor": [
    {
      "@type": "Person",
      "url": "https://www.imdb.com/name/nm0001557/",
      "name": "Viggo Mortensen"
    },
    {
      "@type": "Person",
      "url": "https://www.imdb.com/name/nm0000234/",
      "name": "Charlize Theron"
    },
    {
      "@type": "Person",
      "url": "https://www.imdb.com/name/nm2240346/",
      "name": "Kodi Smit-McPhee"
    }
  ],
  "director": [
    {
      "@type": "Person",
      "url": "https://www.imdb.com/name/nm0384825/",
      "name": "John Hillcoat"
    }
  ],
  "duration": "PT1H51M"
}
```
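Two small quirks in this data: duration is an ISO 8601 duration (as with the recipe times earlier), and keywords arrives as a single comma-separated string rather than a JSON list. Turning the latter into a clean list is a one-liner:

```python
# The keywords field from the movie data above.
keywords = "apocalypse,post apocalypse,on the road,based on novel,cataclysm"

# Split the comma-separated string into a list of tags, trimming whitespace.
tags = [tag.strip() for tag in keywords.split(",")]
print(tags)  # ['apocalypse', 'post apocalypse', 'on the road', 'based on novel', 'cataclysm']
```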
Conclusion
JSON-LD data can make the task of scraping a webpage a lot easier. If your target has JSON-LD then you’re probably not going to have to fight too hard to get what you need. Perhaps not quite as convenient as an API, but it’s a close second.
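The recurring pattern from the examples above can be wrapped in one small helper. A sketch (the function name and the skip-malformed-blocks error handling are my own choices):

```python
import json

from bs4 import BeautifulSoup


def extract_json_ld(html: str) -> list[dict]:
    """Return every JSON-LD object embedded in a page, skipping malformed blocks."""
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            blocks.append(json.loads(tag.string))
        except (TypeError, json.JSONDecodeError):
            # Empty or invalid blocks are silently skipped.
            continue
    return blocks


page = '<head><script type="application/ld+json">{"@type": "Movie", "name": "The Road"}</script></head>'
print(extract_json_ld(page))  # [{'@type': 'Movie', 'name': 'The Road'}]
```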
Sometimes you’ll find that the JSON-LD data contains HTML. You’ll probably want to clean that up before persisting it, specifically processing Unicode and HTML entities.