What is stat_density calculating in stacked density charts?

102 Views Asked by At

As an example, here a stacked density chart based on diamonds. A simple table suggests that "Fair" cut should be quite uncommon, and "Ideal" the most common, etc..., but the five cuts seem to occupy a more or less equal area/ proportion on the stacked charts. That is surprising to me.

This is probably a conceptual misunderstanding from my side and I'd be grateful for someone to help. So, what does stacked density exactly show?

library(tidyverse)
library(patchwork)

p1 <- diamonds %>%
  ggplot() +
  geom_density(aes(x = price, fill = cut), position = "stack")

p2 <- diamonds %>%
  ggplot() +
  geom_density(aes(x = price, fill = cut), position = "fill")

p1 / p2 + plot_layout(guides = "collect")


table(diamonds$cut)
#> 
#>      Fair      Good Very Good   Premium     Ideal 
#>      1610      4906     12082     13791     21551

Created on 2023-11-12 with reprex v2.0.2

1

There are 1 best solutions below

7
On BEST ANSWER

The densities are being normalized (i.e. scaled to area 1) before being combined. If you don't want that, use after_stat(count), e.g.

library(tidyverse)

diamonds %>%
  ggplot() +
  geom_density(aes(x = price, y = after_stat(count), 
                   fill = cut), 
               position = "stack")

Created on 2023-11-12 with reprex v2.0.2

Edited to add:

Here's how this works: geom_density() always computes the normalized density for each group. Setting y = after_stat(count) tells it to multiply the density by the count for the group, so you get "number of cases per unit of price" instead of "probability per unit of price", the usual units of the density.

If you'd like to keep the density scale, you can use

diamonds %>%
  ggplot() +
  geom_density(aes(x = price, 
                   y = after_stat(count/sum(tapply(n, group, unique))), 
                   fill = cut), 
               position = "stack") +
  ylab("Density")

This will actually show "subdensities", where the total area is 1, and the area of each region is the proportion of cases of that class. (The expression sum(tapply(n, group, unique)) is a fancy way to compute nrow(diamonds) without referring to the particular dataset. It says: find the unique values of n in each group, and add them up. Since n is the same for all entries in the group, that gives the overall total. Thanks to @Stefan for the trick.)