I have a challenge: extracting data from an enormous JSON file. The structure of the file is not ideal: it’s a mapping at the top level, which means that for most standard approaches the entire document needs to be loaded before it can be processed. It would have been so much easier if the top level structure was an array. But, alas. It’s almost as if the purveyors of the data have made it intentionally inaccessible.
Getting the Data
The data come from the Transparency in Coverage collection provided by United HealthCare (UHC) and are in the form of a compressed JSON file. I provided an overview of the Transparency in Coverage regulations here.
wget -O united-healthcare-rates.json.gz https://tinyurl.com/united-healthcare-rates
Let’s see how big the file is.
ls -lh united-healthcare-rates.json.gz
-rw-rw-r-- 1 ubuntu ubuntu 4.0G Jul 24 20:27 united-healthcare-rates.json.gz
The compressed JSON is 4 GiB. But that’s just the beginning. Now we need to decompress.
gunzip united-healthcare-rates.json.gz
How big is it now?
ls -lh united-healthcare-rates.json
-rw-rw-r-- 1 ubuntu ubuntu 122G Aug 3 05:36 united-healthcare-rates.json
Wow! 122 GiB. That’s a serious chunk of data.
The data schema is described here. However, that schema is of little use if we cannot load and parse the data.
Initial Attempts
My immediate reaction was: okay, I’ll process this in Python. Wrong! The file was far too big to read in one chunk (obviously!). What about using an iterative or streaming approach with something like the ijson
library? Seems like a good idea and would have worked if the JSON file was an array, but since it’s a mapping, streaming is not a viable option.
What about loading the data into a document database like MongoDB and then using NoSQL to investigate the contents? Again, this seems like a reasonable idea. However, there are constraints on the size of documents which you can load into a document database. Another dead end.
Simple Shell Solution
There’s a handy shell command for working with JSON data: jq
. Install it if you don’t already have it.
sudo apt install jq
Document Keys
Let’s take a look at the mapping keys.
jq 'keys' united-healthcare-rates.json
[
"in_network",
"last_updated_on",
"provider_references",
"reporting_entity_name",
"reporting_entity_type",
"version"
]
That takes a while to run (because there’s a lot of data to parse), but we can now see the number and names of the items.
Scalar Components
A few of the top level items ("version"
, "last_updated_on"
, "reporting_entity_name"
and "reporting_entity_type"
) are simple scalars. I can access these fairly easily using head
and grep
.
Array Components
The interesting data is in the "in_network"
and "provider_references"
items. I decided to unpack those into separate files so that I could deal with each individually.
jq ".in_network" united-healthcare-rates.json >in-network.json
jq ".provider_references" united-healthcare-rates.json >provider-references.json
I now have two files to work with, each of which is a JSON array. I created another file, provider-references-sample.json
, which contains a sample of the first few records from provider-references.json
.
Let’s take a look at the sample.
[
{
"provider_groups": [
{
"npi": [
1720734973
],
"tin": {
"type": "ein",
"value": "870324719"
}
},
{
"npi": [
1770627234
],
"tin": {
"type": "ein",
"value": "870569774"
}
}
],
"provider_group_id": 0
},
{
"provider_groups": [
{
"npi": [
1386945947,
1588908388
],
"tin": {
"type": "ein",
"value": "371705906"
}
}
],
"provider_group_id": 1
}
]
For the purpose of this post I’m going to use R to unpack the arrays. You could equally use Python (or one of any other languages able to work with JSON).
library(jsonlite)
PROVIDER_REFERENCES <- "provider-references-sample.json"
IN_NETWORK <- "in-network-sample.json.bz2"
Now load the provider references sample data and do some rectangling to convert it into a tidy data frame.
unpack_group <- function(group) {
with(
group,
tibble(
provider_group_id,
provider_groups
)
) %>%
unnest_wider(provider_groups) %>%
unnest_wider(tin, names_sep = "_") %>%
unnest_longer(npi) %>%
select(provider_group_id, tin_type, tin_value, npi)
}
read_json(PROVIDER_REFERENCES) %>%
map_dfr(unpack_group)
# A tibble: 4 × 4
provider_group_id tin_type tin_value npi
<int> <chr> <chr> <int>
1 0 ein 870324719 1720734973
2 0 ein 870569774 1770627234
3 1 ein 371705906 1386945947
4 1 ein 371705906 1588908388
That looks reasonable and is in a convenient form for further analysis or merging with other data.
The structure of the in network data is a little more complicated but can be attacked using a similar approach. Here’s a sample of the resulting data frame (I’ve included only a subset of the fields).
# A tibble: 3 × 5
arrangement code code_type code_type_version name
<chr> <chr> <chr> <chr> <chr>
1 ffs 0001A CPT 2022 IMM ADMN SARSCOV2 30MCG/0.3ML DIL RECON 1ST DOSE
2 ffs 0001U CPT 2022 RBC DNA HEA 35 AG 11 BLD GRP WHL BLD CMN ALLEL
3 ffs 0001 MS-DRG 2022 Heart Transplant or Implant of Heart Assist System with MCC
Streaming
Using the read_json()
function to load the JSON data will not be practical with larger files. However, since we now have JSON arrays to deal with we can use a streaming approach.
First convert the data to NDJSON format.
jq -c '.[]' provider-references-sample.json >provider-references-sample-streaming.json
Now handle the document one line at a time using stream_in()
. This completely avoids having to load the entire document. The code below could be a lot more efficient, but it’s just to illustrate the approach.
PROVIDER_REFERENCES_CSV <- "provider-references-sample.csv"
unlink(PROVIDER_REFERENCES_CSV)
extract <- function(group) {
unpack_group(unlist(group, recursive = FALSE)) %>%
# Append data to the CSV file.
write_csv(
file = PROVIDER_REFERENCES_CSV,
append = TRUE
)
}
stream_in(
file("provider-references-sample-streaming.json"),
# Specify a custom handler to extract data from each chunk of JSON data.
handler = extract,
# Handle one record at a time. Increase page size for greater throughput.
pagesize = 1,
simplifyVector = FALSE
)
Take a look at the resulting CSV file.
cat "provider-references-sample.csv"
0,ein,870324719,1720734973
0,ein,870569774,1770627234
1,ein,371705906,1386945947
1,ein,371705906,1588908388
The same data that we extracted earlier, also in a clean, flat format. But now extracted using an approach that can handle large volumes of data.
Conclusion
The key to cracking this problem was using the jq
tool to break down the enormous JSON document into smaller parts. The resulting parts were still big, but had a different structure, which meant that I could use streaming techniques to process them.