Scaling Density Plots

I’m a density plot devotee. And with geom_density() from {ggplot2}, these plots are effortless to produce. However, sometimes the results of geom_density() are not exactly what I’m after. Here’s how I tweak them to give me precisely what I need.

The Data

We’ll use a slightly modified version of the penguins data from the {palmerpenguins} package. The data have been filtered to reduce the number of records for Chinstrap penguins by 50% and the number of records for male penguins (all species) by 75%. The distribution of samples across the species and sex dimensions is now skewed, with male and Chinstrap penguins being relatively scarce.

I have also included data for the recently discovered (and possibly apocryphal) Sparkle penguin species (believed to have been named by a precocious 6-year-old with a passion for shiny things and unicorns).

# A tibble: 8 × 3
  species   sex    count
  <fct>     <fct>  <int>
1 Adelie    female    73
2 Adelie    male      18
3 Chinstrap female    17
4 Chinstrap male       4
5 Gentoo    female    58
6 Gentoo    male      15
7 Sparkle   female    80
8 Sparkle   male      10

The total sample count is 275, of which 47 are male and 228 are female.
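As a quick arithmetic check on those totals, the table can be rebuilt by hand and summed by sex. This is just a minimal base R sketch: the data frame name `penguin_counts` is my own invention for illustration and is not part of the data preparation described above.

```r
# Rebuild the summary table by hand (values copied from the table above).
penguin_counts <- data.frame(
  species = rep(c("Adelie", "Chinstrap", "Gentoo", "Sparkle"), each = 2),
  sex     = rep(c("female", "male"), times = 4),
  count   = c(73, 18, 17, 4, 58, 15, 80, 10)
)

# Total number of samples and the split by sex.
total_count  <- sum(penguin_counts$count)
count_by_sex <- tapply(penguin_counts$count, penguin_counts$sex, sum)

# Share of each sex in the sample, as a percentage.
share_by_sex <- round(100 * count_by_sex / total_count, 1)

print(total_count)
print(count_by_sex)
print(share_by_sex)
```

The shares make the skew explicit: females make up the large majority of the records.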

Sparkle Penguins

Let’s start by focusing our attention on those Sparkle penguins. The data consist of 10 male and 80 female Sparkle penguins. Let’s generate a density plot of flipper length using geom_density().

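ggplot(sparkle) +
  geom_density(aes(x = flipper_length_mm, fill = sex), alpha = 0.5) +
  # Dashed lines mark the bounds of the flipper length distribution.
  geom_vline(xintercept = c(180, 230), linetype = "dashed")
Density of flipper length by sex for Sparkle penguins.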

One of the unique (and remarkable!) characteristics of this species is that the length of their flippers is uniformly distributed between 180 and 230 mm. These bounds are indicated by the vertical dashed lines. The above plot is completely consistent with this: the flipper length density is the same (or at least very similar!) for the two sexes. The difference between the curves for males and females is an artifact of the kernel density estimator used by geom_density(). Since there are more observations of female Sparkle penguins, the estimator can smooth less aggressively, and the estimated distribution of flipper lengths is sharper (closer to square). The area under each curve is 1, which means that each curve can be interpreted as a probability density function (PDF).
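To convince yourself of both points, here’s a rough sketch in base R. It uses density() on simulated flipper lengths rather than the real Sparkle data, and I’m assuming that density() with its defaults is a fair stand-in for what geom_density() computes. The simulated vectors, the seed and the helper trapezoid_area() are all hypothetical and exist only for this illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Simulated flipper lengths (mm), uniform between 180 and 230.
# Sample sizes match the Sparkle data: 10 males and 80 females.
flipper_male   <- runif(10, min = 180, max = 230)
flipper_female <- runif(80, min = 180, max = 230)

# Kernel density estimates with the default kernel and bandwidth rule.
density_male   <- density(flipper_male)
density_female <- density(flipper_female)

# Hypothetical helper: area under a curve using the trapezoid rule.
trapezoid_area <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

area_male   <- trapezoid_area(density_male$x, density_male$y)
area_female <- trapezoid_area(density_female$x, density_female$y)

# Both areas should be very close to 1.
print(c(male = area_male, female = area_female))

# The selected bandwidths. The default rule shrinks as the sample grows,
# so the larger female sample usually gets the narrower kernel.
print(c(male = density_male$bw, female = density_female$bw))
```

A narrower bandwidth means less smearing at the edges of the distribution, which is why the curve for the larger sample looks closer to a rectangle.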

But what if we want to actually plot the density of observations (in penguins per mm)? To do this we need to add in a y aesthetic and use the after_stat() function to delay the mapping.

ggplot(sparkle) +
  geom_density(aes(x = flipper_length_mm, y = after_stat(count), fill = sex), alpha = 0.5) +
  # Rug marks show the individual observations in each facet.
  geom_rug(aes(x = flipper_length_mm)) +
  facet_grid(sex ~ .)
Density of flipper length observations by sex for Sparkle penguins.

The shape of the curves remains the same, but now the area under the curves reflects the number of samples for each sex and the height of the curve represents the density of penguins in the sample (in penguins per mm). I’ve split the plot into two facets and overlaid a rug onto each to show the actual distribution of the samples.
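The relationship between the two scalings is simple: the count curve is the density curve multiplied by the number of observations in the group, so the area under it is the group size. Here’s a minimal base R sketch of that idea, again with simulated data standing in for the female Sparkle penguins. The vector, the seed and the helper trapezoid_area() are hypothetical.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Simulated flipper lengths (mm) for 80 penguins, uniform on [180, 230].
flipper_female <- runif(80, min = 180, max = 230)
n_female <- length(flipper_female)

# Kernel density estimate: the area under this curve is approximately 1.
density_female <- density(flipper_female)

# Count scaling: multiply the density by the number of observations.
# The units become penguins per mm.
count_female <- density_female$y * n_female

# Hypothetical helper: area under a curve using the trapezoid rule.
trapezoid_area <- function(x, y) {
  sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2)
}

# The area under the count curve should be close to the sample size.
area_count <- trapezoid_area(density_female$x, count_female)
print(area_count)
```

For a perfectly uniform distribution over a 50 mm range, 80 penguins works out at 80 / 50 = 1.6 penguins per mm, which gives a feel for the heights to expect on the y axis.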

So now we have two different views of the data arising from geom_density():

  • a smoothed estimate of the underlying distribution and
  • a smoothed distribution of the actual samples.

All the Penguins

Let’s broaden our scope and include all of the penguins. First let’s take a look at the vanilla output from geom_density(). This shows us the distribution of flipper length across all species broken down by sex. Each curve gives the appropriate PDF. If we wanted to generate samples of flipper length with the appropriate distribution, then these are the curves that we would want to sample from.

ggplot(penguins) +
  geom_density(aes(x = flipper_length_mm, fill = sex), alpha = 0.5) +
  facet_grid(sex ~ .)
Density of flipper length by sex for all penguins.

If, however, we provide a delayed count as the y aesthetic, then we get the count density of penguin samples in the data. These curves tell us more about the actual sampled data than they do about the underlying distributions.

ggplot(penguins) +
  geom_density(aes(x = flipper_length_mm, y = after_stat(count), fill = sex), alpha = 0.5)
Density of flipper length observations by sex for all penguins.

Penguins on the Ridges

The {ggridges} package includes geoms which provide a complementary view to geom_density() and work particularly well when you need to break the data down into a number of categories. The same two views can be produced here too.

ggplot(penguins) +
  geom_density_ridges(
    aes(x = flipper_length_mm, y = species, fill = sex),
    scale = 1.5,
    alpha = 0.5
  )
Density of flipper length by sex for all penguins as a ridge plot.

Because geom_density_ridges() uses the y aesthetic to determine the ridge offset, we use the height aesthetic to specify the delayed count.

ggplot(penguins) +
  geom_density_ridges(
    aes(x = flipper_length_mm, y = species, fill = sex, height = after_stat(count)),
    stat = "density",
    scale = 1.5,
    alpha = 0.5
  )
Density of flipper length observations by sex for all penguins as a ridge plot.

Something similar could be achieved directly with {ggplot2} by using facets, but I think that ridgeline plots really are 🚀.