Working with Fairly Wide Data

The concept of “wide data” is relative. In some domains 100 columns is considered “wide”, while in others that’s perfectly normal and you’d need to have thousands (or tens of thousands!) of columns for it to be considered even remotely “wide”. The data that we work with at Fathom Data generally lies in the first domain, but from time to time we do work on data that is considerably wider.

This post touches on a couple of approaches for dealing with that sort of data. We’ll be using some HCRIS (Healthcare Cost Report Information System) data, which are available for download here. Specifically, we’ll be working with an extract from the hcris2552_10_2017.csv file, which contains “select variables in flat shape”.

How Wide is it?

How many columns are there in the CSV file?

awk -F, 'NR == 1 {print NF}' hcris2552_10_2017.csv
1708

This qualifies as “wide” in our books. The original file has numerous rows, but for this post, we will only be looking at the first 100 rows.

Opening Gambit: Cram it into a Database

My initial thought was to just pop the entire file contents into a database and then do the processing in SQL. However, I certainly don’t have the fortitude to create an appropriate table definition by hand. Fortunately, the csvsql tool from the csvkit package will generate that table definition for me. Zero fortitude required.

csvsql --dialect postgresql hcris2552_10_2017.csv

Take a look at the resulting SQL script. Neatly formatted and including appropriate constraints where relevant.

Useful options for csvsql (there are many others, but these are the ones that I’ve used frequently):

  • --dialect — the required flavour of SQL (choose from firebird, mssql, mysql, oracle, postgresql, sqlite, and sybase);
  • --date-format and --datetime-format — the format of date/time fields in the CSV (using strptime() placeholders);
  • --no-create — don’t output CREATE TABLE, just INSERT; and
  • --snifflimit — number of bytes to read for determining CSV format.

A Problem: Column Limit in PostgreSQL

But there’s a problem: the resulting table has too many columns! (PostgreSQL has a hard limit at 1600 columns per table.)

There are some options for fixing this (there are undoubtedly others, but these are the ones that came to mind immediately):

  • discard some columns (starting with the empty ones) or
  • split into multiple tables.

Neither of these was particularly palatable. Although csvsql could magically produce a table definition for me, I had to resort to another approach.

Tidy Data

Looking critically at the data I realised that most of the “columns” should not be columns at all. This was long data stored in wide format. To put it bluntly, this was untidy data.

We generally try to work with tidy data, which means that it satisfies the following rules:

  1. each variable must have its own column;
  2. each observation must have its own row; and
  3. each value must have its own cell.

The HCRIS data flagrantly violates rules 1 and 2.

Fortunately, R has tooling in the {tidyr} package that makes it simple to transform this kind of deviant data into something less perverse.
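
To make that concrete, here’s a tiny, made-up example (with invented column names in the spirit of the HCRIS ones) of the reshaping that {tidyr} makes easy: column names that are really values get collected into a single column, one row per value.

library(tidyr)

# A miniature version of the problem: the a_c*_* columns are values, not variables.
untidy <- tibble::tibble(
  rpt_rec_num = c(1, 2),
  a_c1_30     = c(100, 200),
  a_c2_30     = c(300, 400)
)

# Collect those columns into a name/value pair: one row per value.
pivot_longer(untidy, -rpt_rec_num, names_to = "cell", values_to = "value")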

Change of Plan

Pull in a few fundamental R libraries.

library(readr)
library(dplyr)
library(tidyr)
library(janitor)

Use read_csv() to swiftly load the HCRIS data into a data frame.

hcris <- read_csv("hcris2552_10_2017.csv")
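
As an aside, if you were starting from the full file rather than a pre-trimmed extract, readr can limit how much it reads up front; the n_max argument is one way to grab just the first 100 rows.

# Read only the first 100 data rows from the full file.
hcris <- read_csv("hcris2552_10_2017.csv", n_max = 100)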

How many rows and columns?

dim(hcris)
[1]  100 1708

What kind of data are we looking at?

table(sapply(hcris, class))
character   logical   numeric 
       61       624      1023 

Mostly numeric columns, with a decent number of logical and a smattering of character. A quick check reveals that almost all of the logical columns only have NA entries. These are effectively empty columns and they’re just wasting space.
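
One way to confirm that is to count the logical columns that contain nothing but NA; given the class counts before and after the cleanup in the next section, this should come to 623 of the 624.

# Count the logical columns that contain nothing but NA.
sum(sapply(hcris, function(x) is.logical(x) && all(is.na(x))))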

Removing Empty Columns

The first thing that we’ll do is strip out the empty columns using {janitor}.

hcris <- remove_empty(hcris, "cols")

How many columns are we left with?

ncol(hcris)
[1] 1085

Well, that would fit into a PostgreSQL table now, but we’ve already started down this road, so let’s press on.

table(sapply(hcris, class))

character   logical   numeric 
       61         1      1023 

We still have a mixture of column types. And one of them is logical. Since we’ve already removed the empty columns, that one must really contain Boolean data. I did a quick investigation and found that the column concerned, util_cd, actually contains text data but was deemed to be logical since all of its (non-missing) values were "F" (which gets translated into FALSE). Digging a little further (looking at data for other years) I found that this is a categorical column (no surprise given its name) with levels "F", "N" and "L".
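
A check along these lines (a sketch of the investigation rather than anything definitive) is enough to surface the problem:

# Which column is still logical, and what does it actually contain?
names(hcris)[sapply(hcris, is.logical)]
table(hcris$util_cd, useNA = "ifany")

Let’s fix this properly.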

hcris <- hcris %>%
  mutate(
    # FALSE entries become "F", NA entries stay NA, and the full set of levels is imposed.
    util_cd = factor(ifelse(!util_cd, "F", NA), levels = c("F", "N", "L"))
  )
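
An alternative, if you know about the problem up front, is to override the guessed column type when reading the file; something along these lines (using readr’s col_types machinery) would skip the logical detour entirely.

# Read util_cd as a factor with the known levels rather than letting readr guess.
hcris <- read_csv(
  "hcris2552_10_2017.csv",
  col_types = cols(util_cd = col_factor(levels = c("F", "N", "L")))
)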

One last look at the column types and counts.

table(sapply(hcris, class))
character    factor   numeric 
       61         1      1023 

Yup, that looks about right.

Pivoting

We’re ready to pivot the data. But there are still a few issues that we need to handle. One problem is that the columns that we are going to pivot don’t have a single type: there are both numeric and character values. We also need to identify the columns that we are not going to pivot.

# Unique row identifier.
#
RECID <- "rpt_rec_num"

# Which columns are not going to be pivoted?
#
COMMON <- c(
  "prvdr_ctrl_type_cd", "prvdr_num", "rpt_stus_cd", "initl_rpt_sw", "last_rpt_sw", "trnsmtl_num", "fi_num",
  "adr_vndr_cd", "util_cd", "fy_bgn_dt", "fy_end_dt", "proc_dt", "fi_creat_dt", "fi_rcpt_dt", "fyear"
)

We’ll create three separate data frames, one with common columns (not to be pivoted), another with numeric columns and finally one with character columns.

hcris_common <- hcris %>% select(all_of(c(RECID, COMMON)))

hcris_numeric <- hcris %>% select(
  any_of(RECID) |                      # Include unique ID
  where(is.numeric) &                  # and all numeric columns
  !any_of(COMMON)                      # but not common columns.
)
hcris_character <- hcris %>% select(
  any_of(RECID) |                      # Include unique ID
  where(is.character) &                # and all character columns
  !any_of(COMMON)                      # but not common columns.
)
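
If you want a quick sanity check that nothing has been dropped in the split, the three frames should between them account for every column of hcris (remembering that rpt_rec_num appears in all three).

# rpt_rec_num is counted three times, hence the "- 2".
ncol(hcris_common) + ncol(hcris_numeric) + ncol(hcris_character) - 2 == ncol(hcris)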

Okay, so let’s pivot the numeric data.

hcris_numeric <- hcris_numeric %>% pivot_longer(
  -all_of(RECID),
  names_to = "cell",
  values_to = "nmrc",
  values_drop_na = TRUE
)
# A tibble: 30,202 × 3
   rpt_rec_num cell           nmrc
         <dbl> <chr>         <dbl>
 1      629478 a_c1_30    25350799
 2      629478 a_c2_30     8295648
 3      629478 a_c6_30     -112553
 4      629478 a_c2_113    2977228
 5      629478 a_c1_200  160050242
 6      629478 a_c2_200  184197641
 7      629478 a_c6_200  -66508288
 8      629478 a7_1_c1_1  20064648
 9      629478 a7_1_c2_3   1016540
10      629478 a7_1_c1_5 104979100
# … with 30,192 more rows

Looks good! Now the character data.

hcris_character <- hcris_character %>% pivot_longer(
  -all_of(RECID),
  names_to = "cell",
  values_to = "alpha",
  values_drop_na = TRUE
)
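
A similar sanity check applies to the pivots themselves: with values_drop_na = TRUE, each long table should contain exactly one row per non-missing value in the columns that were pivoted.

# One row per non-missing value in the pivoted columns.
nrow(hcris_numeric) ==
  sum(!is.na(select(hcris, where(is.numeric) & !any_of(c(RECID, COMMON)))))
nrow(hcris_character) ==
  sum(!is.na(select(hcris, where(is.character) & !any_of(c(RECID, COMMON)))))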

Unpack Field Name

It might not surprise you to learn that the cell field in the pivoted data (previously the column names) contains values for multiple quantities. The image below (taken from the HCRIS documentation) indicates how this field can be unpacked to yield a worksheet code (wksht_cd), line and column number.

Unpacking the identifier to yield a worksheet code, line and column number.

A quick note about the data format seems in order. These data encode information from a large spreadsheet consisting of multiple worksheets. The `cell` column is an aggregate of the worksheet code, line and column number for the corresponding entry in the spreadsheet.

We could use separate() to tease this field apart. However, there are several intricacies and exceptions which are somewhat fiddly to deal with, so I’ll postpone those for another day.
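
As a rough sketch of what that unpacking might look like (ignoring those exceptions, and assuming that an identifier like a_c2_113 breaks down as worksheet "a", column 2, line 113, with clmn_num and line_num as invented names for the last two pieces):

# Naive unpacking: everything before the final "_c<digits>_<digits>" is
# treated as the worksheet code. The fiddly exceptions are ignored here.
hcris_numeric %>%
  extract(
    cell,
    into = c("wksht_cd", "clmn_num", "line_num"),
    regex = "^(.+)_c(\\d+)_(\\d+)$",
    remove = FALSE,
    convert = TRUE
  )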

Gambit Revisited

I didn’t show the SQL produced by csvsql because, well, there was a lot of it. But it’s still interesting to take a look. Let’s dump one of the pivoted tables to a CSV file now and run csvsql over it.

write_csv(hcris_numeric, file = "hcris-numeric.csv")

Take a look at the file.

head hcris-numeric.csv
rpt_rec_num,cell,nmrc
629478,a_c1_30,25350799
629478,a_c2_30,8295648
629478,a_c6_30,-112553
629478,a_c2_113,2977228
629478,a_c1_200,160050242
629478,a_c2_200,184197641
629478,a_c6_200,-66508288
629478,a7_1_c1_1,20064648
629478,a7_1_c2_3,1016540

And now run csvsql to generate a table definition suitable for PostgreSQL.

csvsql --dialect postgresql hcris-numeric.csv
CREATE TABLE "hcris-numeric" (
        rpt_rec_num DECIMAL NOT NULL, 
        cell VARCHAR NOT NULL, 
        nmrc DECIMAL NOT NULL
);

What about SQLite with a more civilised table name and no constraints on field length?

csvsql --dialect sqlite --tables hcris_nmrc --no-constraints hcris-numeric.csv
CREATE TABLE hcris_nmrc (
        rpt_rec_num FLOAT, 
        cell VARCHAR, 
        nmrc FLOAT
);

This really is a handy tool. And it’ll save you a bunch of time. Even if you need to tweak the table definition a bit, the fact that you can automagically get something that works straight away is priceless.
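
And if you’re already in an R session, a complementary option is to skip the CSV round trip and let the database interface create the table for you; a minimal sketch using {DBI} and {RSQLite} (an aside, rather than part of the workflow above):

library(DBI)

# Write the long numeric table straight into a SQLite database.
con <- dbConnect(RSQLite::SQLite(), "hcris.sqlite")
dbWriteTable(con, "hcris_nmrc", hcris_numeric)
dbDisconnect(con)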

Summary

Just to pull this all together: we started with a CSV file containing a large number of columns. First, an attempt was made to use csvsql to create a SQL table definition that would allow us to copy the raw data straight into a database. However, the resulting table was too wide for the database to ingest. Instead, we pivoted the data into a long format. Not only does this format translate easily into a SQL table, but for analytical purposes it’s also a lot easier to work with.
