I have a very large data file (>300k rows) and each row is part of a unique sample (>3000 samples). I want to generate a kernel density estimator for each individual sample and extract relevant information (minimum value, maximum value, maximum probability of density estimator, median of density estimator, mean of density estimator) into a separate table along with the sample name.
I have tried to extract information from the ggplot function stat_density_ridges() using the approaches laid out in "Adding a mean to geom_density_ridges" and "draw line on geom_density_ridges", which extract data from stat_density_ridges and from ggplot_build with purrr::pluck, but neither provides all the information that I want.
The following generates some synthetic data similar to what I want:
set.seed(1)
x = runif( 50, max = 40, min = 20 )
set.seed(2)
y = runif( 50, max = 300, min = 100 )
sample.number = c( rep( 1, 20 ), rep( 2, 15 ), rep( 3, 5 ), rep( 4, 10 ) )
d <- data.frame( x, y , sample.number )
And the plot in ggplot that shows the distribution:
ggplot( data = d, aes( x = x, y = as.factor( sample.number ) ) ) +
labs( x = expression( paste( "x" ) ),
y = expression( paste( "sample number" ) ) ) +
stat_density_ridges()
I would like to end up with a data table with the following information: sample.name, max(x), min(x), the maximum height of the kernel density estimator and its x location, the median height of the kernel density estimator and its x location, etc.
The only thing I can think to do is to create a long and arduous loop
sample.numbers <- rep( NA, times = max( d$sample.number ) )
max.x <- rep( NA, times = max( d$sample.number ) )
min.x <- rep( NA, times = max( d$sample.number ) )
for( i in 1:max( d$sample.number ) ) {
temp.d = d[ d$sample.number == i, ]
sample.numbers[ i ] = i
max.x[ i ] = max( temp.d$x )
min.x[ i ] = min( temp.d$x )
}
and then somehow adding in a bit that creates a density estimator and extracts the information from that. I'm guessing the indexing in R, perhaps using group_by, presents an easier way to get through this for the many thousands of samples I have, but I cannot figure it out. Note, I'm still having trouble getting my head around piping in R, so some simple explanation might be required if solutions use it.
There are different ways to do this. Using dplyr and the pipe operator is, in my opinion, the easiest way. I tried adding comments in the code to make it easier to understand. Take a look at this dplyr cheat sheet.
Basically, you use group_by to divide your data frame into groups according to sample.number. Then you use summarise to compute summary metrics of the column x inside each group. To compute the density you can use density() from base R inside the summarise. This will return a list with a sample of (x, y) values of the density function. To extract quantiles from this density function, you can use the package spatstat.
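As a minimal sketch of those two pieces (my own illustration on the synthetic data d above, not the answer's original code; it assumes that loading spatstat provides a quantile() method for "density" objects):

library(dplyr)
library(spatstat)   # loading spatstat adds a quantile() method for "density" objects

# summary metrics of x inside each sample.number group
d %>%
  group_by( sample.number ) %>%
  summarise( min.x = min( x ),
             max.x = max( x ) )

# what density() returns, shown for a single group
dens.1 <- density( d$x[ d$sample.number == 1 ] )
max( dens.1$y )                      # maximum height of the density estimator
dens.1$x[ which.max( dens.1$y ) ]    # x location of that maximum
quantile( dens.1, probs = 0.5 )      # median of the density estimate (spatstat)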
One observation: density() computes a value for the bandwidth that depends on the data set. Since we are separating the different groups, each group can end up with a different bandwidth value. I used the function bw.nrd to estimate a single bandwidth value from the complete data set, and then used that single value for all computations.
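Putting those pieces together, a hedged sketch of the full pipeline (my own illustration of the approach described above, assuming dplyr and spatstat are loaded as in the previous snippet; column names such as dens.max.y are made up for the example):

bw.all <- bw.nrd( d$x )   # single bandwidth estimated from the complete data set

density.summary <- d %>%
  group_by( sample.number ) %>%
  summarise( min.x = min( x ),
             max.x = max( x ),
             dens  = list( density( x, bw = bw.all ) ) ) %>%   # same bandwidth for every group
  mutate( dens.max.y    = sapply( dens, function( k ) max( k$y ) ),                # peak height
          dens.max.x    = sapply( dens, function( k ) k$x[ which.max( k$y ) ] ),   # x at the peak
          dens.median.x = sapply( dens, function( k ) unname( quantile( k, probs = 0.5 ) ) ) ) %>%   # median of the density estimate
  select( -dens )

density.summary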