I'm developing an R package that needs to report a percentile rank for each value it returns. However, the reference distribution is huge (~10 million values).
Currently I generate an ecdf function, save that function to a file, and read it back in the package when needed. This is problematic because the saved file ends up being huge (~120 MB) and takes too long to load:
f = ecdf(rnorm(10000000))
save(f, file='tmp.Rsav')
Is there any way to make this more efficient, perhaps by approximating the percentile rank in R?
Thanks
Just compute the ecdf on a downsampled version of the distribution.
Note that you will probably want to think about how you downsample, since the example here will return slightly biased answers, but the general strategy should work.
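As a rough illustration of the downsampling strategy (a sketch under my own assumptions, not necessarily the code originally posted), something like the following should work. The sample size of 1e4, the grid of 1001 quantiles, and the names `f_sample`, `f_quant` and `rds_path` are arbitrary choices for the demo, and `x` is a smaller stand-in for the real 10-million-value distribution so that it runs quickly.

```r
set.seed(1)                        # only so the illustration is reproducible
x <- rnorm(1e6)                    # stand-in for the real (much larger) distribution

# Option 1: ecdf built from a simple random subsample
f_sample <- ecdf(sample(x, 1e4))

# Option 2: ecdf built from evenly spaced quantiles of the full data
# (deterministic; resolution is limited by the number of quantiles kept)
probs   <- seq(0, 1, length.out = 1001)
q       <- quantile(x, probs, names = FALSE)
f_quant <- ecdf(q)

# Compare both approximations against the full ecdf at a few points
f_full      <- ecdf(x)
test_points <- c(-2, -1, 0, 1, 2)
err_sample  <- max(abs(f_sample(test_points) - f_full(test_points)))
err_quant   <- max(abs(f_quant(test_points)  - f_full(test_points)))

# The approximating function is small enough to save and reload cheaply
rds_path <- tempfile(fileext = ".rds")
saveRDS(f_quant, rds_path)
f_loaded <- readRDS(rds_path)
size_bytes <- file.size(rds_path)
```

With the quantile version the approximation error is bounded by roughly the spacing of the probability grid, so you can trade file size against precision by changing `length.out`. The random subsample is simpler but its error varies from draw to draw.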