Converting data to percentage rank

I have data whose mean and variance change as a function of the independent variable. How do I convert the dependent variable into (estimated) conditional percentage ranks?

For example, say the data looks like Z below:

library(dplyr)
library(ggplot2)

data.frame(x = runif(1000, 0, 5)) %>%
  mutate(y = sin(x) + rnorm(n())*cos(x)/3) ->
  Z

We can plot it with Z %>% ggplot(aes(x,y)) + geom_point(): it looks like a noisy sine curve, where the variance around the curve varies with x. My goal is to convert each y value into a number between 0 and 1 that represents its percentage rank among observations with similar x. So values very close to the sine curve should be converted to about 0.5, values below it to numbers closer to 0, and values above it to numbers closer to 1 (how close depends on the variance around that x).

One quick way to do this is to bucket the data and then simply compute the rank of each observation in each bucket.
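To make the bucketing idea concrete, here is a minimal sketch. The number of buckets (n_buckets), the equal-width breaks, and the fixed seed are arbitrary assumptions for illustration, not something the data dictates; Z is regenerated so the block runs on its own.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
Z <- data.frame(x = runif(1000, 0, 5)) %>%
  mutate(y = sin(x) + rnorm(n()) * cos(x) / 3)

n_buckets <- 20  # hypothetical choice; more buckets = more local, but noisier ranks

Z_ranked <- Z %>%
  # split the x range [0, 5] into equal-width buckets
  mutate(bucket = cut(x,
                      breaks = seq(0, 5, length.out = n_buckets + 1),
                      include.lowest = TRUE)) %>%
  group_by(bucket) %>%
  # percent_rank() rescales the within-bucket rank of y to [0, 1]
  mutate(pct_rank = percent_rank(y)) %>%
  ungroup()

head(Z_ranked)
```

The obvious drawback is that the rank ignores where an observation sits inside its bucket, so the result jumps at bucket boundaries and depends on the choice of n_buckets.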

Another way, which I think is preferable, is to fit a quantile regression for a number of different quantiles (tau):

library(quantreg)
library(splines)

model.fit <- rq(y ~ bs(x, df = 5), tau = (1:9)/10, data = Z)

which can be plotted as follows:

library(tidyr)

data.frame(x = seq(0, 5, len = 100)) %>%
  data.frame(., predict(model.fit, newdata = .), check.names = FALSE) %>%
  gather(Tau, y, -x) %>% 
  ggplot(aes(x,y)) + 
  geom_point(data = Z, size = 0.1) +
  geom_line(aes(color = Tau), linewidth = 1)  # linewidth replaces size for lines in ggplot2 >= 3.4

Given model.fit, I could now use the estimated quantiles at each x value to convert each y value into a percentage rank (with the help of approx(...)), but I suspect that the quantreg package may do this more easily and better. Is there, in fact, a function in quantreg that automates this?
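For reference, the manual approx(...) route could look roughly like the sketch below. It only uses rq(), predict() and base R's approx(); the helper to_rank() is a hypothetical name introduced here, and the fixed seed and the sorting step are my own assumptions rather than anything quantreg prescribes.

```r
library(quantreg)
library(splines)

set.seed(1)  # assumption: fixed seed, only so the sketch is reproducible
x <- runif(1000, 0, 5)
Z <- data.frame(x = x, y = sin(x) + rnorm(1000) * cos(x) / 3)

taus <- (1:9) / 10
model.fit <- rq(y ~ bs(x, df = 5), tau = taus, data = Z)

# Fitted conditional quantiles: one row per observation, one column per tau
Q <- predict(model.fit, newdata = Z)

# Hypothetical helper: interpolate tau as a function of the fitted quantile values
to_rank <- function(q, y) {
  q <- sort(q)  # fitted quantile curves may cross; sorting is a crude fix
  # rule = 2 clamps y values outside the fitted quantiles to the nearest tau
  approx(x = q, y = taus, xout = y, rule = 2)$y
}

Z$pct_rank <- vapply(seq_len(nrow(Z)),
                     function(i) to_rank(Q[i, ], Z$y[i]),
                     numeric(1))

summary(Z$pct_rank)
```

With tau = (1:9)/10 the resulting ranks are confined to [0.1, 0.9], because anything beyond the outermost fitted quantiles is clamped; a denser tau grid would narrow that gap, at the cost of fitting more quantile curves.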
