I am making violin plots of ATAC-seq peaks (z-scores) over time. I notice that in some of my plots (A), the plots are very "wavy", while in others (B) this pattern is not very apparent.
I have assumed that this is due to the high number of peaks in A (28,258 peaks) affecting the kernel density, versus a lower number of peaks in B (6,438 peaks).
Could anyone confirm if my assumption is correct and/or if I should "correct" A.
I used this code:
C12[,22:27] %>%
gather(key = "Stage", value = "val") %>%
ggplot(aes (x = Stage, y = val, fill = Stage)) +
scale_fill_viridis_d()+
geom_hline(yintercept = c(0), linetype = 2) +
geom_violin()+
theme(axis.text.x = element_blank())+
ggtitle("Cluster 12")+
ylab("Z-score")
A: wavy violin

B: smooth violin

You haven't included any data in the question, but from looking at your plots, my guess is that your data are rounded to one decimal place. When you have relatively few points, this does not matter much because the smoothing kernel will be relatively wide, but when you have several thousand points, the smoothing kernel becomes narrower. This means that the empty areas between the 0.1 unit steps in your data no longer become 'filled in'.
We can see this in a simple example where all the data are drawn from the same distribution, but we increase the sample size:
Since
geom_violinusesStatYDensity, which in turn calculates the bandwidth usingstats::bw.nrd0, we can find out the bandwidth used in each of the above groups using:Where we see that the smoothing kernel is about half the width in the last group compared to the first.
If we change the kernel estimation to
bw.bcv, the bandwidth is reduced much less dramatically:Which means we should get greater consistency between your plots if we do:
Whether this is something you should do, as Roland points out in the comments, is perhaps a discussion for a different forum...