Unravelling an Enormous JSON

I have a challenge: extracting data from an enormous JSON file. The structure of the file is not ideal: it’s a mapping at the top level, which means that most standard approaches need to load the entire document before any processing can begin. It would have been so much easier if the top-level structure had been an array. But, alas. It’s almost as if the purveyors of the data have made it intentionally inaccessible.

Getting the Data

The data come from Transparency in Coverage and are in the form of a compressed JSON file.

wget -O united-healthcare-rates.json.gz https://tinyurl.com/united-healthcare-rates

Let’s see how big the file is.

ls -lh united-healthcare-rates.json.gz
-rw-rw-r-- 1 ubuntu ubuntu 4.0G Jul 24 20:27 united-healthcare-rates.json.gz

The compressed JSON is 4 GiB. But that’s just the beginning: now we need to decompress it.

gunzip united-healthcare-rates.json.gz

How big is it now?

ls -lh united-healthcare-rates.json
-rw-rw-r-- 1 ubuntu ubuntu 122G Aug  3 05:36 united-healthcare-rates.json

Whoa! 122 GiB. That’s a serious chunk of data.

The data schema is described here. However, that schema is of little use if we cannot load and parse the data.

Initial Attempts

My immediate reaction was: okay, I’ll process this in Python. Wrong! The file was far too big to read in one chunk (obviously!). What about an iterative or streaming approach with something like the ijson library? That seems like a good idea, and it would have worked neatly if the JSON file had been an array. But the usual recipe of streaming items out of a top-level array doesn’t apply to a top-level mapping, so that was out too.

What about loading the data into a document database like MongoDB and then using NoSQL queries to investigate the contents? Again, this seems like a reasonable idea. However, document databases constrain the size of individual documents (MongoDB, for example, caps them at 16 MB). Another dead end.

Simple Shell Solution

There’s a handy shell command for working with JSON data: jq. Install it if you don’t already have it.

sudo apt install jq

Document Keys

Let’s take a look at the mapping keys.

jq 'keys' united-healthcare-rates.json
[
  "in_network",
  "last_updated_on",
  "provider_references",
  "reporting_entity_name",
  "reporting_entity_type",
  "version"
]

That takes a while to run (there’s a lot of data to parse) and, since jq loads the entire document into memory by default, it also needs a machine with a serious amount of RAM (jq’s --stream mode is a thriftier, if more awkward, alternative). But we can now see the number and names of the items.

Scalar Components

A few of the top level items ("version", "last_updated_on", "reporting_entity_name" and "reporting_entity_type") are simple scalars. I can access these fairly easily using head and grep.
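That head-and-grep idea translates directly to Python, assuming (as the head approach does) that the scalar fields appear near the start of the file. A minimal sketch; the string below stands in for the first chunk of the real file, and the field values are made up for illustration:

```python
import re

# Stand-in for the first chunk of the real file, e.g.
#   head = open("united-healthcare-rates.json").read(1024)
# (the values below are invented for illustration).
head = '{"reporting_entity_name": "ACME Insurance", "version": "1.0.0", "last_updated_on": "2022-07-01", '

def scalar(field, text):
    # Pull out a top-level string scalar without parsing the whole document.
    m = re.search(r'"%s"\s*:\s*"([^"]*)"' % field, text)
    return m.group(1) if m else None

print(scalar("version", head))          # 1.0.0
print(scalar("last_updated_on", head))  # 2022-07-01
```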

Array Components

The interesting data is in the "in_network" and "provider_references" items. I decided to unpack those into separate files so that I could deal with each individually.

jq ".in_network" united-healthcare-rates.json >in-network.json
jq ".provider_references" united-healthcare-rates.json >provider-references.json

I now have two files to work with, each of which is a JSON array. I created another file, provider-references-sample.json, which contains a sample of the first few records from provider-references.json.

Let’s take a look at the sample.

[
  {
    "provider_groups": [
      {
        "npi": [
          1720734973
        ],
        "tin": {
          "type": "ein",
          "value": "870324719"
        }
      },
      {
        "npi": [
          1770627234
        ],
        "tin": {
          "type": "ein",
          "value": "870569774"
        }
      }
    ],
    "provider_group_id": 0
  },
  {
    "provider_groups": [
      {
        "npi": [
          1386945947,
          1588908388
        ],
        "tin": {
          "type": "ein",
          "value": "371705906"
        }
      }
    ],
    "provider_group_id": 1
  }
]

For this post I’m going to use R to unpack the arrays. You could equally use Python (or any other language that can work with JSON).

library(jsonlite)
# Also needed: tibble, tidyr, dplyr, purrr and readr (all part of the tidyverse).
library(tidyverse)

PROVIDER_REFERENCES <- "provider-references-sample.json"
IN_NETWORK <- "in-network-sample.json.bz2"

Now load the provider references sample data and do some rectangling to convert it into a tidy data frame.

unpack_group <- function(group) {
  with(
    group,
    tibble(
      provider_group_id,
      provider_groups
    )
  ) %>%
    unnest_wider(provider_groups) %>%
    unnest_wider(tin, names_sep = "_") %>%
    unnest_longer(npi) %>%
    select(provider_group_id, tin_type, tin_value, npi)
}

read_json(PROVIDER_REFERENCES) %>%
  map_dfr(unpack_group)
# A tibble: 4 × 4
  provider_group_id tin_type tin_value        npi
              <int> <chr>    <chr>          <int>
1                 0 ein      870324719 1720734973
2                 0 ein      870569774 1770627234
3                 1 ein      371705906 1386945947
4                 1 ein      371705906 1588908388

That looks reasonable and is in a convenient form for further analysis or merging with other data.
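Since any JSON-capable language would do, here’s the same rectangling as a pure-stdlib Python sketch, using an abbreviated copy of the sample shown above: one output row per (provider group, NPI) pair, which is the same unnesting that unnest_wider()/unnest_longer() perform in the R version.

```python
import json

# Abbreviated copy of the provider-references sample shown earlier.
sample = json.loads("""[
  {"provider_groups": [
     {"npi": [1720734973], "tin": {"type": "ein", "value": "870324719"}},
     {"npi": [1770627234], "tin": {"type": "ein", "value": "870569774"}}],
   "provider_group_id": 0},
  {"provider_groups": [
     {"npi": [1386945947, 1588908388], "tin": {"type": "ein", "value": "371705906"}}],
   "provider_group_id": 1}
]""")

# Flatten the nested structure: one row per NPI within each provider group.
rows = [
    (ref["provider_group_id"], g["tin"]["type"], g["tin"]["value"], npi)
    for ref in sample
    for g in ref["provider_groups"]
    for npi in g["npi"]
]

for row in rows:
    print(row)
```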

The structure of the "in_network" data is a little more complicated, but it can be attacked with a similar approach. Here’s a sample of the resulting data frame (I’ve included only a subset of the fields).

# A tibble: 3 × 5
  arrangement code  code_type code_type_version name                            
  <chr>       <chr> <chr>     <chr>             <chr>                           
1 ffs         0001A CPT       2022              IMM ADMN SARSCOV2 30MCG/0.3ML D…
2 ffs         0001U CPT       2022              RBC DNA HEA 35 AG 11 BLD GRP WH…
3 ffs         0001  MS-DRG    2022              Heart Transplant or Implant of …

Streaming

Using the read_json() function to load the JSON data will not be practical with larger files. However, since we now have JSON arrays to deal with we can use a streaming approach.

First convert the data to NDJSON format.

jq -c '.[]' provider-references-sample.json >provider-references-sample-streaming.json
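One nice side effect of NDJSON: because the records are independent lines, the file can be recompressed and then processed straight from the gzip archive, one record at a time, so the enormous decompressed copy never needs to exist on disk. A Python sketch, using a tiny in-memory archive as a stand-in for the real file:

```python
import gzip
import io
import json

# Tiny in-memory .gz stand-in for a compressed NDJSON file.
raw = b'{"provider_group_id": 0}\n{"provider_group_id": 1}\n'
archive = io.BytesIO(gzip.compress(raw))

ids = []
with gzip.open(archive, "rt") as f:
    for line in f:  # decompresses lazily, one line at a time
        ids.append(json.loads(line)["provider_group_id"])

print(ids)  # [0, 1]
```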

Now handle the document one line at a time using stream_in(), which avoids ever holding the entire document in memory. The code below could be a lot more efficient, but it’s enough to illustrate the approach.

PROVIDER_REFERENCES_CSV <- "provider-references-sample.csv"

unlink(PROVIDER_REFERENCES_CSV)

extract <- function(group) {
  unpack_group(unlist(group, recursive = FALSE)) %>%
    # Append data to the CSV file.
    write_csv(
      file = PROVIDER_REFERENCES_CSV,
      append = TRUE
    )
}

stream_in(
  file("provider-references-sample-streaming.json"),
  # Specify a custom handler to extract data from each chunk of JSON data.
  handler = extract,
  # Handle one record at a time. Increase page size for greater throughput.
  pagesize = 1,
  simplifyVector = FALSE
)

Take a look at the resulting CSV file.

cat "provider-references-sample.csv"
0,ein,870324719,1720734973
0,ein,870569774,1770627234
1,ein,371705906,1386945947
1,ein,371705906,1588908388

That’s the same data we extracted earlier, in the same clean, flat format, but produced by an approach that will scale to very large files.
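The same stream-and-append pattern translates directly to Python: read one NDJSON record per iteration and append rows to a CSV as you go, so memory use stays constant. A sketch, with small in-memory stand-ins for the NDJSON input and CSV output:

```python
import csv
import io
import json

# Stand-in for provider-references-sample-streaming.json (one record per line).
ndjson = io.StringIO(
    '{"provider_group_id": 0, "provider_groups": '
    '[{"npi": [1720734973], "tin": {"type": "ein", "value": "870324719"}}]}\n'
    '{"provider_group_id": 1, "provider_groups": '
    '[{"npi": [1386945947, 1588908388], "tin": {"type": "ein", "value": "371705906"}}]}\n'
)

out = io.StringIO()  # stands in for the CSV file on disk
writer = csv.writer(out)

# Only one record is ever held in memory at a time.
for line in ndjson:
    ref = json.loads(line)
    for g in ref["provider_groups"]:
        for npi in g["npi"]:
            writer.writerow(
                [ref["provider_group_id"], g["tin"]["type"], g["tin"]["value"], npi]
            )

print(out.getvalue())
```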

Conclusion

The key to cracking this problem was using the jq tool to break down the enormous JSON document into smaller parts. The resulting parts were still big, but had a different structure, which meant that I could use streaming techniques to process them.