# Scaling Density Plots

I’m a density plot devotee, and `geom_density()` from `{ggplot2}` makes these plots effortless to produce. Sometimes, however, the default output of `geom_density()` is not exactly what I’m after. Here’s how I tweak it to get precisely what I need.

## The Data

We’ll use a slightly modified version of the `penguins` data from the `{palmerpenguins}` package. The data have been filtered to reduce the number of records for Chinstrap penguins by 50% and the number of records for male penguins (all species) by 75%. The distribution of samples across the `species` and `sex` dimensions is now skewed, with male and Chinstrap penguins being relatively scarce.

I have also included data for the recently discovered (and possibly apocryphal) Sparkle penguin species (believed to have been named by a precocious six-year-old with a passion for shiny things and unicorns).

```
# A tibble: 8 × 3
  species   sex    count
  <fct>     <fct>  <int>
1 Adelie    female    73
2 Adelie    male      18
3 Chinstrap female    17
4 Chinstrap male       4
5 Gentoo    female    58
6 Gentoo    male      15
7 Sparkle   female    80
8 Sparkle   male      10
```

The total sample count is 275, of which 47 are male and 228 are female.

## Sparkle Penguins

Let’s start by focusing our attention on those Sparkle penguins. The data consist of 10 male and 80 female Sparkle penguins. We’ll generate a density plot of flipper length using `geom_density()`.

```r
library(ggplot2)
# Density of flipper length for each sex (Sparkle penguins only).
ggplot(sparkle) +
  geom_density(aes(x = flipper_length_mm, fill = sex), alpha = 0.5)
```

One of the unique (and remarkable!) characteristics of this species is that flipper length is uniformly distributed between 180 and 230 mm. These bounds are indicated by the vertical dashed lines. The above plot is completely consistent with this: the flipper length density is the same (or at least very similar!) for the two sexes. The difference between the male and female curves is an artifact of the kernel density estimator used by `geom_density()`. Since there are more observations of female Sparkle penguins, the default bandwidth is typically narrower for that group and the estimated distribution is sharper (closer to square). The area under each curve is 1, which means that each curve can be interpreted as a probability density function (PDF).
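The two claims above (unit area, and a sharper curve for the larger group) can be checked with base R's `density()`, which uses the same default bandwidth rule (`bw.nrd0`) that `geom_density()` defaults to. This is only a rough sketch, not the original analysis: the evenly spaced `male` and `female` vectors are hypothetical stand-ins for the Sparkle samples, and `area_under()` is a helper introduced here for illustration.

```r
# Hypothetical stand-ins for the Sparkle samples: evenly spaced values
# covering the 180-230 mm range (10 males, 80 females).
male   <- seq(180, 230, length.out = 10)
female <- seq(180, 230, length.out = 80)

# Kernel density estimates with the default Gaussian kernel and bandwidth.
d_male   <- density(male)
d_female <- density(female)

# Trapezoidal approximation of the area under an estimated curve.
area_under <- function(d) {
  sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
}

area_male   <- area_under(d_male)    # close to 1
area_female <- area_under(d_female)  # close to 1

# The larger sample gets the narrower bandwidth, hence the squarer curve.
bw_male   <- d_male$bw
bw_female <- d_female$bw
```

With real, randomly sampled data the bandwidths also depend on the spread of each group, so the ordering is a tendency rather than a guarantee.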

But what if we want to actually plot the density of observations (in penguins per mm)? To do this we need to add in a `y` aesthetic and use the `after_stat()` function to delay the mapping.

```r
# Map y to the computed count (density * n) instead of the density.
ggplot(sparkle) +
  geom_density(aes(x = flipper_length_mm, y = after_stat(count), fill = sex), alpha = 0.5) +
  facet_grid(sex ~ .)
```

The shape of the curves remains the same, but now the area under the curves reflects the number of samples for each sex and the height of the curve represents the density of penguins in the sample (in penguins per mm). I’ve split the plot into two facets and overlaid a rug onto each to show the actual distribution of the samples.
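The computed `count` is the estimated density multiplied by the number of observations in the group, so the area under each curve should come out near the group size. A rough base R check, using hypothetical evenly spaced stand-ins for the Sparkle samples rather than the real data (`count_area()` is a helper made up for this sketch):

```r
# Hypothetical stand-ins for the 10 male and 80 female Sparkle penguins.
flippers <- list(
  male   = seq(180, 230, length.out = 10),
  female = seq(180, 230, length.out = 80)
)

# Area under the count-scaled curve (density * n), via the trapezoidal rule.
count_area <- function(x) {
  d <- density(x)
  count <- d$y * length(x)  # penguins per mm
  sum(diff(d$x) * (head(count, -1) + tail(count, -1)) / 2)
}

# One area per sex; each should be close to the group's sample size.
areas <- sapply(flippers, count_area)
print(round(areas, 1))
```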

So now we have two different views of the data arising from `geom_density()`:

• a smoothed estimate of the underlying distribution and
• a smoothed distribution of the actual samples.

## All the Penguins

Let’s broaden our scope and include all of the penguins. First let’s take a look at the vanilla output from `geom_density()`. This shows us the distribution of flipper length across all species, broken down by sex. Each curve gives the appropriate PDF. If we wanted to generate samples of flipper length with the appropriate distribution, then these are the curves that we would want.

```r
# Default output: one PDF per sex, each with unit area.
ggplot(penguins) +
  geom_density(aes(x = flipper_length_mm, fill = sex), alpha = 0.5) +
  facet_grid(sex ~ .)
```
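On the point about generating samples: drawing from a Gaussian kernel density estimate amounts to resampling the observations with replacement and adding normal noise whose standard deviation equals the bandwidth. Here is a minimal sketch, assuming a made-up `flipper` vector in place of one group's measurements (`draw_from_kde()` is a hypothetical helper, not part of `{ggplot2}`):

```r
set.seed(42)

# Made-up flipper lengths (mm) standing in for one group of penguins.
flipper <- c(181, 186, 190, 193, 195, 198, 201, 208, 215, 222)

# Draw n values from the Gaussian kernel density estimate of x:
# pick observations at random, then jitter each by the bandwidth.
draw_from_kde <- function(x, n) {
  bw <- bw.nrd0(x)  # default bandwidth rule of density()
  sample(x, size = n, replace = TRUE) + rnorm(n, mean = 0, sd = bw)
}

simulated <- draw_from_kde(flipper, 1000)
summary(simulated)
```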

If, however, we provide a delayed count as the `y` aesthetic, then we get the count density of penguin samples in the data. These curves tell us more about the actual sampled data than they do about the underlying distributions.

```r
# Count-scaled densities: the area under each curve is the number of samples.
ggplot(penguins) +
  geom_density(aes(x = flipper_length_mm, y = after_stat(count), fill = sex), alpha = 0.5)
```

## Penguins on the Ridges

The `{ggridges}` package includes geoms which provide a complementary view to `geom_density()` and work particularly well when you need to break the data down into a number of categories. The same two views can be produced here too.

```r
library(ggridges)

# One ridge per species, with a separate density for each sex.
ggplot(penguins) +
  geom_density_ridges(
    aes(x = flipper_length_mm, y = species, fill = sex),
    scale = 1.5,
    alpha = 0.5
  )
```

Because `geom_density_ridges()` uses the `y` aesthetic to determine the ridge offset, we use the `height` aesthetic to specify the delayed count.

```r
# stat = "density" exposes the computed count, which is mapped to height.
ggplot(penguins) +
  geom_density_ridges(
    aes(x = flipper_length_mm, y = species, fill = sex, height = after_stat(count)),
    stat = "density",
    scale = 1.5,
    alpha = 0.5
  )
```

Something similar could be achieved directly with `{ggplot2}` by using facets, but I think that ridgeline plots really are 🚀.
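For comparison, a facet-based version might look like the sketch below. It assumes the unmodified `penguins` data from `{palmerpenguins}` (not the filtered data used above, and without the Sparkle penguins) and stacks one panel per species.

```r
library(ggplot2)

# One panel per species, with count-scaled densities split by sex.
# Assumption: the unmodified palmerpenguins data stands in for the
# filtered data used in this post.
p <- ggplot(palmerpenguins::penguins) +
  geom_density(
    aes(x = flipper_length_mm, y = after_stat(count), fill = sex),
    alpha = 0.5
  ) +
  facet_grid(species ~ .)

print(p)
```

The panels share a common y axis by default, so the relative sample sizes stay comparable across species, but the curves cannot overlap the way ridges do.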