Why do some violin plots of continuous data look wavy?

Question

Asked by user22332364 on 03 August 2023 at 11:09

I am making violin plots of ATAC-seq peaks (z-scores) over time. In some of my plots (A) the violins look very "wavy", while in others (B) the pattern is barely apparent.

I assumed this is due to the high number of peaks in A (28,258 peaks) affecting the kernel density estimate, versus the lower number of peaks in B (6,438 peaks).

Could anyone confirm whether my assumption is correct, and whether I should "correct" A?

I used this code:

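library(tidyr)    # gather(); also re-exports the %>% pipe
library(ggplot2)

# C12 is the asker's own data frame (not shown); columns 22 to 27 are
# reshaped to long format with one row per Stage / value pair
C12[, 22:27] %>%
  gather(key = "Stage", value = "val") %>%
  ggplot(aes(x = Stage, y = val, fill = Stage)) +
  scale_fill_viridis_d() +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_violin() +
  theme(axis.text.x = element_blank()) +
  ggtitle("Cluster 12") +
  ylab("Z-score")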

A: wavy violin

B: smooth violin

**Allan Cameron** · Answer 1 · 2023-08-03T12:21:02.370000

You haven't included any data in the question, but from looking at your plots, my guess is that your data are rounded to one decimal place. When you have relatively few points, this does not matter much because the smoothing kernel will be relatively wide, but when you have several thousand points, the smoothing kernel becomes narrower. This means that the empty areas between the 0.1 unit steps in your data are no longer 'filled in'.
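The effect of kernel width on rounded values can be checked numerically. The sketch below is not from the original answer: `kde_at` and `gap_ratio` are hypothetical helper names, and the two bandwidths (0.07 and 0.03) are illustrative assumptions chosen to bracket the values a rule-of-thumb selector might return. It evaluates a hand-rolled Gaussian kernel density estimate midway between two 0.1 steps and compares it with the estimate on the steps themselves.

```r
# Illustrative sketch (assumed helper names and bandwidths, not from the answer).
set.seed(1)

# Gamma-distributed values rounded to one decimal place, as guessed above.
y <- round(rgamma(8000, shape = 2, rate = 5), 1)

# Gaussian kernel density estimate at a single point x0;
# bw is the standard deviation of the kernel.
kde_at <- function(x0, y, bw) mean(dnorm(x0, mean = y, sd = bw))

# Density midway between two steps (0.45) relative to the average
# density on the neighbouring steps (0.4 and 0.5).
gap_ratio <- function(y, bw) {
  kde_at(0.45, y, bw) / mean(c(kde_at(0.4, y, bw), kde_at(0.5, y, bw)))
}

ratio_wide   <- gap_ratio(y, bw = 0.07)  # wide kernel: the gap is filled in
ratio_narrow <- gap_ratio(y, bw = 0.03)  # narrow kernel: a dip between steps
```

With these assumed bandwidths, `ratio_wide` should come out close to 1 while `ratio_narrow` should be well below 1, which is the dip that shows up as a wave in the violin outline.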

We can see this in a simple example where all the data are drawn from the same distribution, but we increase the sample size:

set.seed(1)

df <- data.frame(y = round(rgamma(15500, 2, 5), 1) - 1.4,
                 x = factor(rep(paste('n =', c(500, 1e3, 2e3, 4e3, 8e3)),
                                times = c(500, 1e3, 2e3, 4e3, 8e3)),
                            paste('n =', c(500, 1e3, 2e3, 4e3, 8e3))))

library(ggplot2)

ggplot(df, aes(x, y, fill = x)) +
  geom_violin() +
  scale_fill_viridis_d()

Since geom_violin uses StatYdensity, which by default calculates the bandwidth using stats::bw.nrd0, we can find out the bandwidth used in each of the above groups using:

tapply(df$y, df$x, bw.nrd0)
#>    n = 500   n = 1000   n = 2000   n = 4000   n = 8000 
#> 0.06298354 0.06573885 0.04406086 0.03835721 0.03339189

Here we see that the smoothing kernel in the last group is about half the width of the kernel in the first.
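The shrinkage follows from the rule of thumb behind `bw.nrd0`, which scales the bandwidth by n^(-1/5). As a rough check (a sketch, with `manual_nrd0` being a hypothetical name for a hand-written copy of the documented formula), going from 500 to 8000 points multiplies the bandwidth by 16^(-1/5), or about 0.57, if the spread of the data stays the same:

```r
# Sketch: reproduce the bw.nrd0 rule of thumb by hand (manual_nrd0 is an
# illustrative name, not a function in the stats package).
set.seed(1)
y <- round(rgamma(8000, shape = 2, rate = 5), 1)

# 0.9 * min(standard deviation, IQR / 1.34) * n^(-1/5).
# bw.nrd0 has extra fallbacks when this minimum is zero; those are omitted here.
manual_nrd0 <- function(x) {
  0.9 * min(sd(x), IQR(x) / 1.34) * length(x)^(-1/5)
}

# Agrees with the built-in selector on this sample
same_as_builtin <- isTRUE(all.equal(manual_nrd0(y), bw.nrd0(y)))

# Factor by which the bandwidth shrinks from n = 500 to n = 8000
# when the spread term is unchanged
shrink_factor <- (8000 / 500)^(-1/5)
```

The observed bandwidths will not follow this factor exactly, because the spread term (standard deviation or IQR) varies from sample to sample, especially with rounded data.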

If we switch the bandwidth selector to bw.bcv (biased cross-validation), the bandwidth shrinks much less as the sample size grows:

tapply(df$y, df$x, bw.bcv)
#>    n = 500   n = 1000   n = 2000   n = 4000   n = 8000 
#> 0.06525889 0.05905210 0.05958200 0.05858528 0.05353357

This means you should get greater consistency between your plots with:

ggplot(df, aes(x, y, fill = x)) +
  geom_violin(bw = 'bcv') +
  scale_fill_viridis_d()
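A further option, not part of the original answer, is to pass a fixed numeric bandwidth so that every group is smoothed with the same kernel width. `geom_violin` accepts a number for `bw` and treats it as the standard deviation of the kernel. The value 0.06 below is an assumption picked to sit near the bcv results above; it is not a recommendation for real data.

```r
library(ggplot2)

# Rebuild the example data so this block runs on its own
set.seed(1)
sizes  <- c(500, 1e3, 2e3, 4e3, 8e3)
labels <- paste('n =', sizes)
df <- data.frame(y = round(rgamma(sum(sizes), 2, 5), 1) - 1.4,
                 x = factor(rep(labels, times = sizes), labels))

# Same kernel width for every group (0.06 is an illustrative assumption)
p <- ggplot(df, aes(x, y, fill = x)) +
  geom_violin(bw = 0.06) +
  scale_fill_viridis_d()
```

Calling `print(p)` draws the plot. A fixed bandwidth makes the violins directly comparable, at the cost of giving up the data-driven choice within each group.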

Whether this is something you should do, as Roland points out in the comments, is perhaps a discussion for a different forum...
