Descriptive Statistics

In the previous installment we derived two categorical variables. This time we will extract descriptive statistics.

Let’s start by looking at a summary of the entire National Health and Nutrition Examination Survey data set. We exclude the identifier field (column 1), since it has no statistical significance. The call also drops column 7, which, judging by the output, appears to be age.category.

summary(DS0012[, c(-1, -7)])

 gender        age             mass            height           BMI             BMI.category
 M:4448   Min.   : 2.00   Min.   : 10.40   Min.   :0.815   Min.   :12.50   underweight:1759
 F:4413   1st Qu.:12.00   1st Qu.: 49.00   1st Qu.:1.503   1st Qu.:19.97   normal     :2589
          Median :33.00   Median : 68.70   Median :1.624   Median :25.16   overweight :2260
          Mean   :35.45   Mean   : 66.68   Mean   :1.561   Mean   :25.71   obese      :2253
          3rd Qu.:56.00   3rd Qu.: 85.20   3rd Qu.:1.717   3rd Qu.:30.08
          Max.   :80.00   Max.   :218.20   Max.   :2.038   Max.   :73.43


This gives the minimum, quartiles, mean and maximum for each of the numerical variables, and the counts for each level of the categorical variables. The mean age of the subjects is about 35 years. The subjects have masses between 10.4 and 218.2 kg.

We could have extracted these statistics for each of the numerical variables individually.

mean(DS0012$BMI)

[1] 25.7057

median(DS0012$BMI)

[1] 25.15504

quantile(DS0012$BMI)

      0%      25%      50%      75%     100%
12.50312 19.97228 25.15504 30.08150 73.42526
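
The same family of functions covers other quantiles and measures of spread. Here is a minimal sketch on a toy vector; the values are invented for illustration and are not taken from DS0012.

```r
# Toy BMI-like values, invented for illustration (not taken from DS0012).
bmi <- c(17.2, 19.8, 22.4, 25.1, 27.9, 31.3, 36.0)

# Deciles rather than the default quartiles.
deciles <- quantile(bmi, probs = seq(0.1, 0.9, by = 0.1))
print(deciles)

# Spread: standard deviation and interquartile range, then the full range.
print(c(sd = sd(bmi), IQR = IQR(bmi)))
print(range(bmi))

# With missing values, quantile() stops with an error unless na.rm = TRUE.
tails <- quantile(c(bmi, NA), probs = c(0.05, 0.95), na.rm = TRUE)
print(tails)
```

The probs argument accepts any probabilities between 0 and 1, so the default quartiles are just one choice among many.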


It gets a little painful to type out the name of the data frame every time, but we can attach DS0012 to R’s search path, which makes things much more compact.

attach(DS0012)
mean(BMI)

[1] 25.7057


That’s better. We can also get a table of counts for an individual categorical variable.

table(age.category)

age.category
   child teenager    adult   mature   senior
    2220      757     2105     1793     1986


These are the same kind of counts that summary() reports for a categorical variable: children make up the largest portion of the sample, followed by adults and then seniors. Teenagers are by far the smallest group. What about generating a contingency table which cross-tabulates two categorical variables?

table(age.category, BMI.category)

            BMI.category
age.category underweight normal overweight obese
    child           1537    519        117    47
    teenager         111    390        143   113
    adult             50    765        630   660
    mature            31    421        638   703
    senior            30    494        732   730


Now that is interesting: it seems that the majority of children in the data are underweight. Should we be concerned? No, the interpretation of BMI for children is different: the nominal thresholds between the categories no longer apply, and BMI is instead compared to typical values for children of similar age. Among teenagers and adults, normal is the most common BMI category. However, the overweight and obese categories for adults are already well populated. In the mature and senior portions of the sample, BMIs more often indicate overweight or obese.
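
Raw counts are hard to compare across age groups of different sizes, so it can help to convert each row to proportions. The sketch below rebuilds the two-way table by hand so that it runs without DS0012. The counts are transcribed from the tables in this post, with the adult row taken as the sum of the male and female adult counts in the three-way table further down. The object name counts is just a placeholder.

```r
# Counts transcribed from this post. The adult row is an assumption:
# it is the sum over gender of the adult column in the three-way table.
counts <- matrix(
  c(1537, 519, 117,  47,
     111, 390, 143, 113,
      50, 765, 630, 660,
      31, 421, 638, 703,
      30, 494, 732, 730),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    age.category = c("child", "teenager", "adult", "mature", "senior"),
    BMI.category = c("underweight", "normal", "overweight", "obese")
  )
)

# Row proportions: each age group sums to 1.
row.props <- prop.table(counts, margin = 1)
print(round(row.props, 3))

# Column totals, which should match the BMI.category counts in the summary.
print(colSums(counts))
```

With the real data, prop.table(table(age.category, BMI.category), margin = 1) should give the same proportions directly.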

Finally, let’s generate a three-way contingency table of BMI, age and gender.

(bmi.age.gender = table(BMI = BMI.category, age = age.category, gender))

, , gender = M

             age
BMI           child teenager adult mature senior
  underweight   818       67    14     10     15
  normal        259      199   393    184    227
  overweight     52       75   354    374    390
  obese          23       53   291    313    337

, , gender = F

             age
BMI           child teenager adult mature senior
  underweight   719       44    36     21     15
  normal        260      191   372    237    267
  overweight     65       68   276    264    342
  obese          24       60   369    390    393


It’s a little difficult to make sense of all that, but as we will see later on, there are great tools for understanding the contents of multiway contingency tables.
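
In the meantime, base R already offers a couple of ways to make a three-way table easier to read. This is only a sketch of one option, not necessarily the tools referred to above: ftable() flattens the table into a single grid, and margin.table() sums out a dimension. The counts are transcribed from the output above so that the example runs without DS0012.

```r
# Counts transcribed from the three-way table above. array() fills the
# first dimension fastest, so each line holds the four BMI counts for
# one age group, first for males and then for females.
counts <- array(
  c(818, 259,  52,  23,
     67, 199,  75,  53,
     14, 393, 354, 291,
     10, 184, 374, 313,
     15, 227, 390, 337,
    719, 260,  65,  24,
     44, 191,  68,  60,
     36, 372, 276, 369,
     21, 237, 264, 390,
     15, 267, 342, 393),
  dim = c(4, 5, 2),
  dimnames = list(
    BMI    = c("underweight", "normal", "overweight", "obese"),
    age    = c("child", "teenager", "adult", "mature", "senior"),
    gender = c("M", "F")
  )
)
bmi.age.gender <- as.table(counts)

# One flat grid: gender and age down the side, BMI across the top.
print(ftable(bmi.age.gender, row.vars = c("gender", "age")))

# Sum out gender to recover the two-way BMI by age table.
bmi.age <- margin.table(bmi.age.gender, margin = c(1, 2))
print(bmi.age)

# Totals per gender.
gender.totals <- margin.table(bmi.age.gender, margin = 3)
print(gender.totals)
```

Summing out gender recovers the BMI by age counts from the earlier two-way table, which is a handy consistency check.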

Right, that has given us a general feel for what the data looks like. The next step is to generate some plots.

The last thing that we need to do is detach DS0012 from the search path.

detach(DS0012)
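
If pairing every attach() with a detach() feels error prone, with() is a common alternative: it evaluates a single expression with the columns of a data frame in scope and leaves the search path alone. A minimal sketch, using a toy data frame with invented values standing in for DS0012:

```r
# Toy data frame standing in for DS0012; the values are invented.
toy <- data.frame(
  gender = factor(c("M", "F", "F", "M", "F")),
  BMI    = c(18.4, 22.9, 27.3, 31.0, 24.6)
)

# with() evaluates the expression with the columns in scope,
# without adding anything to the search path.
m <- with(toy, mean(BMI))
print(m)

# Several statistics in one go.
print(with(toy, c(mean = mean(BMI), median = median(BMI))))

# Counts for a categorical column work the same way.
print(with(toy, table(gender)))
```

Since nothing is attached, there is nothing to clean up afterwards.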