Fathom Data has been doing a lot of work with the HCRIS (Healthcare Cost Report Information System) data. The underlying reports are submitted as a spreadsheet with multiple sheets. The data are then extracted and recorded in a simple tabular format, with each field linked to a worksheet code (wksht_cd
), column number (clmn_num
) and line number (clmn_num
). These three keys are then mapped to a single compound key. The resulting data look something like this:
key value
---------------------
a_c1_30 8960498
a_c2_30 2559809
a_c6_30 -1581948
a_c1_200 45448537
a_c2_200 100473288
a_c6_200 -23947062
a7_1_c1_1 2696198
a7_1_c1_6 87655935
a7_1_c2_6 4190056
a7_1_c1_8 168428817
a7_1_c2_8 4391574
a7_1_c1_10 168428817
a7_1_c2_10 4391574
Of course, for these data to make any sense at all it’s necessary to map backwards from the compound key to the location within the original spreadsheet. And once you know where the data belong, you’ll also want to know what the data mean.
Fortunately this relationship is documented and available as a PDF file.
This is useful, but for analytical purposes a PDF is not ideal. We used the {pdftools}
package to scrape the text from the PDF and, after a significant bit of wrangling ended up with a nice clean CSV file.
head(hcris_fields)
# A tibble: 6 × 6
key type wksht_cd clmn_num line_num label
<chr> <chr> <chr> <chr> <chr> <chr>
1 a7_1_c1_1 NUMERIC A700001 00100 00100 Land
2 a7_1_c1_10 NUMERIC A700001 00100 01000 Total
3 a7_1_c1_2 NUMERIC A700001 00100 00200 Land Improvements
4 a7_1_c1_3 NUMERIC A700001 00100 00300 Buildings and Fixtures
5 a7_1_c1_4 NUMERIC A700001 00100 00400 Building Improvements
6 a7_1_c1_5 NUMERIC A700001 00100 00500 Fixed Equipment
Resources
Here are the artefacts of this analysis:
- HCRIS record layout (CSV) and (XLSX)
- PostgreSQL table definition and
- MySQL table definition.