How to draw 2D quantile-based densities of a scatterplot for the outcome variable in R?

95 Views Asked by At

I realize this question has been asked in similar ways multiple times here. I am not asking for a scatterplot which includes a density heatmap of the data, as this captures the density of both variables as a smooth function. What I am looking for is something like this, which takes a "slice" of the distribution of the outcome variable and overlays it on the scatterplot:

enter image description here

The best I could come up with something like this:

#### Load Library ####
library(tidyverse)

#### Get IQR ####
q <- quantile(iris$Sepal.Length, 
              probs = c(.25,.5,.75))
q

#### Label Quantile Regions ####
qiris <- iris %>% 
  mutate(qs = ifelse(Sepal.Length >= q[3],
                     "Q75",
                     ifelse(Sepal.Length >= q[2],
                            "Q50","Q25")))

#### Plot Density and Scatter ####
ggplot()+
  geom_point(aes(x=Sepal.Width,
                 y=Sepal.Length),
             data=iris)+
  geom_density(aes(y=Sepal.Length,
                   fill=qs),
               data=qiris)

But predictably this fails because it doesn't place the "slice" of the distribution in relation to the predictor values.

enter image description here

I then came up with a slightly better solution that locates the distribution of values correctly:

library(ggridges)

ggplot(qiris, 
       aes(x = Sepal.Length,
           y = qs)) + 
  stat_density_ridges(quantiles = c(0.25,0.5,0.75),
                      geom="density_ridges_gradient",
                      jittered_points = TRUE,
                      position = "raincloud", 
                      alpha = 0.6, 
                      scale = 0.6)+
  coord_flip()

Which gives me this:

enter image description here

However there are still three issues here. First, I can't fit a regression line through it. Second, I would prefer the data points to be next to each other like a normal scatter plot rather than separated spatially by quantile so they're too far away. Third, this isn't including the other variable at all, which is important.

Edit

The answer by Allen looks good at first, but I think there is something I'm not seeing in his code. To try to figure this out, I tried using another dataset and saving the inputs as objects in R to make it easier to swap everything. When I do this, I get flat lines across the plot.

#### Load Library ####
library(tidyverse)

#### Save Objects ####
dfy <- mtcars$mpg # y var
dfx <- mtcars$hp # x var
data <- mtcars # dataset

#### QDATA ####
qdata <- data %>% 
  mutate(cut_group = cut(dfy, 
                         quantile(dfy, c(0.125, 0.375, 0.625, 0.875)),
                         labels = c('Q25', 'Q50', 'Q75')),
         baseline = quantile(dfy, 
                             c(0.25, 0.5, 0.75))[as.numeric(cut_group)]) %>%
  filter(complete.cases(.)) %>%
  group_by(cut_group) %>%
  reframe(dfxx = density(dfx)$x,
          dfy = first(baseline) - density(dfx, bw = 0.5)$y/3) %>%
  rename(dfx = dfxx) 

ggplot(data,
       aes(dfy,
           dfx)) +
  geom_smooth(method = 'lm', 
              color = 'gray',
              se = FALSE) +
  geom_point(color = 'navy',
             shape = 21,
             fill = NA) +
  geom_path(data = qdata,
            aes(group = cut_group), 
            color = 'darkgreen',
            linewidth = 1.5) +
  theme_classic() +
  theme(panel.border = element_rect(fill = NA, 
                                    linewidth = 1))

Like so:

enter image description here

1

There are 1 best solutions below

5
On

I would probably do this by precalculating density of the quantiles and drawing them on as geom_path:

quartiles <- quantile(iris$Sepal.Width)
midpoints <- quartiles[-5] + 0.5 * diff(quartiles)

qiris <- iris %>% 
  mutate(Q = cut(Sepal.Width, quartiles, labels = paste0('Q', 1:4)),
         baseline = midpoints[as.numeric(Q)]) %>%
  filter(complete.cases(.)) %>%
  group_by(Q) %>%
  reframe(SepalLength = density(Sepal.Length)$x,
          Sepal.Width = first(baseline) - density(Sepal.Length, bw = 0.5)$y/3) %>%
  rename(Sepal.Length = SepalLength) 

ggplot(iris, aes(Sepal.Width, Sepal.Length)) +
  annotate('rect', xmin = quartiles[-5], xmax = quartiles[-1], ymin = -Inf,
           ymax = Inf, fill = c('gray', NA, 'gray', NA), alpha = 0.2) +
  annotate('text', x = midpoints, y = 9, label = paste0('Q', 1:4)) +
  geom_smooth(method = 'lm', color = 'gray', se = FALSE) +
  geom_point(color = 'navy', shape = 21, fill = NA) +
  geom_path(data = qiris, aes(group = Q), color = 'darkgreen',
            linewidth = 1.5, alpha = 0.5) +
  theme_classic() +
  theme(panel.border = element_rect(fill = NA, linewidth = 1))

enter image description here

For the mtcars example, you will need to select a different bandwidth and multiplier for the density to get it into approximately the same scale as the existing variables:

quartiles <- quantile(mtcars$mpg)
midpoints <- quartiles[-5] + 0.5 * diff(quartiles)

qmtcars <- mtcars %>% 
  mutate(Q = cut(mpg, quartiles, labels = paste0('Q', 1:4)),
         baseline = midpoints[as.numeric(Q)]) %>%
  filter(complete.cases(.)) %>%
  group_by(Q) %>%
  reframe(HP = density(hp)$x,
          mpg = first(baseline) - density(hp, bw = 100)$y * 500) %>%
  rename(hp = HP) 

ggplot(mtcars, aes(mpg, hp)) +
  annotate('rect', xmin = quartiles[-5], xmax = quartiles[-1], ymin = -Inf,
           ymax = Inf, fill = c('gray', NA, 'gray', NA), alpha = 0.2) +
  annotate('text', x = midpoints, y = 450, label = paste0('Q', 1:4)) +
  geom_smooth(method = 'lm', color = 'gray', se = FALSE) +
  geom_point(color = 'navy', shape = 21, fill = NA) +
  geom_path(data = qmtcars, aes(group = Q), color = 'darkgreen',
            linewidth = 1.5, alpha = 0.5) +
  theme_classic() +
  theme(panel.border = element_rect(fill = NA, linewidth = 1))

enter image description here