Computing deciles over calendar years and across different columns using R


I have the following dataset that I created using dplyr and the function tbl_df():

         date       X1       X2
1  2001-01-31 4.698648 4.640957
2  2001-02-28 4.491493 4.398382
3  2001-03-30 4.101235 4.074065
4  2001-04-30 4.072041 4.217999
5  2001-05-31 3.856718 4.114061
6  2001-06-29 3.909194 4.142691
7  2001-07-31 3.489640 3.678374
8  2001-08-31 3.327068 3.534823
9  2001-09-28 2.476066 2.727257
10 2001-10-31 2.015936 2.299102
11 2001-11-30 2.127617 2.590702
12 2001-12-31 2.162643 2.777744
13 2002-01-31 2.221636 2.740961
14 2002-02-28 2.276458 2.834494
15 2002-03-28 2.861650 3.472853
16 2002-04-30 2.402687 3.026207
17 2002-05-31 2.426250 2.968679
18 2002-06-28 2.045413 2.523772
19 2002-07-31 1.468695 1.677434
20 2002-08-30 1.707742 1.920101
21 2002-09-30 1.449055 1.554702
22 2002-10-31 1.350024 1.466806
23 2002-11-29 1.541507 1.844471
24 2002-12-31 1.208786 1.392031

I am interested in computing deciles for each year and each column. For example, the deciles of 2001 for X1, deciles of 2001 for X2, deciles of 2002 for X1, deciles of 2002 for X2 and so on if I have more years and more columns. I tried:

quantile(x, probs = seq(0, 1, length = 11), type = 5), and also apply.yearly() with the quantile() function on an xts object built from x (my data frame above), but neither gives what I actually need. Any help would be appreciated.


There are 2 best solutions below

1
On

Assuming you have a simple data.frame, first, bin the dates by year:

df$year <- cut(as.Date(df$date), "year")

And then aggregate by year:

foo <- aggregate(. ~ year, subset(df, select = -date), quantile,
                 probs = seq(0, 1, length.out = 11), type = 5)

This returns a data frame in which X1 and X2 are matrix columns (one row per year, one column per decile), so it needs a bit of cleaning. Using unnest() from the dev version of tidyr together with lapply(), you could do the following. Note that within each column the first row is for 2001 and the second is for 2002.

devtools::install_github("hadley/tidyr")
library(tidyr)

unnest(lapply(foo[-1], as.data.frame), column)

#  column       0%      10%      20%      30%      40%      50%      60%      70%      80%      90%     100%
#1     X1 2.015936 2.094113 2.159140 2.561166 3.375840 3.673179 3.893451 4.055756 4.140261 4.553640 4.698648
#2     X1 1.208786 1.307653 1.439152 1.475976 1.591378 1.876578 2.168769 2.270976 2.405043 2.556870 2.861650
#3     X2 2.299102 2.503222 2.713601 2.853452 3.577888 3.876219 4.102062 4.139828 4.236037 4.471155 4.640957
#4     X2 1.392031 1.444374 1.545912 1.694138 1.867160 2.221936 2.675804 2.825141 2.974432 3.160201 3.472853
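
If installing a development version of tidyr is not an option, the same cleaning step can be sketched in base R. This is only a sketch under the assumption that the data look like the table in the question; the name flat is illustrative and does not appear in the original answer. It binds the two matrix columns of foo into one ordinary data frame and also keeps the year, so the rows no longer have to be identified by position.

```r
# Data from the question (row numbers dropped)
df <- read.table(header = TRUE, text = "date X1 X2
2001-01-31 4.698648 4.640957
2001-02-28 4.491493 4.398382
2001-03-30 4.101235 4.074065
2001-04-30 4.072041 4.217999
2001-05-31 3.856718 4.114061
2001-06-29 3.909194 4.142691
2001-07-31 3.489640 3.678374
2001-08-31 3.327068 3.534823
2001-09-28 2.476066 2.727257
2001-10-31 2.015936 2.299102
2001-11-30 2.127617 2.590702
2001-12-31 2.162643 2.777744
2002-01-31 2.221636 2.740961
2002-02-28 2.276458 2.834494
2002-03-28 2.861650 3.472853
2002-04-30 2.402687 3.026207
2002-05-31 2.426250 2.968679
2002-06-28 2.045413 2.523772
2002-07-31 1.468695 1.677434
2002-08-30 1.707742 1.920101
2002-09-30 1.449055 1.554702
2002-10-31 1.350024 1.466806
2002-11-29 1.541507 1.844471
2002-12-31 1.208786 1.392031")

# Same two steps as above: bin by year, then aggregate
df$year <- cut(as.Date(df$date), "year")
foo <- aggregate(. ~ year, subset(df, select = -date), quantile,
                 probs = seq(0, 1, length.out = 11), type = 5)

# foo$X1 and foo$X2 are matrices (one row per year, one column per decile).
# Turn each into a plain data frame and stack them.
# 'flat' is an illustrative name, not part of the original answer.
flat <- do.call(rbind, lapply(names(foo)[-1], function(col) {
  data.frame(column = col, year = foo$year, foo[[col]], check.names = FALSE)
}))

print(flat)
```

check.names = FALSE keeps the decile labels ("0%", "10%", ...) as column names instead of letting data.frame() rewrite them into syntactic names.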
0
On

You can try the following function:

df <- read.table(header = TRUE, text = 'date     X1    X2
1  2001/01/31 4.698648 4.640957
2  2001/02/28 4.491493 4.398382
3  2001/03/30 4.101235 4.074065
4  2001/04/30 4.072041 4.217999
5  2001/05/31 3.856718 4.114061
6  2001/06/29 3.909194 4.142691
7  2001/07/31 3.489640 3.678374
8  2001/08/31 3.327068 3.534823
9  2001/09/28 2.476066 2.727257
10 2001/10/31 2.015936 2.299102
11 2001/11/30 2.127617 2.590702
12 2001/12/31 2.162643 2.777744
13 2002/01/31 2.221636 2.740961
14 2002/02/28 2.276458 2.834494
15 2002/03/28 2.861650 3.472853
16 2002/04/30 2.402687 3.026207
17 2002/05/31 2.426250 2.968679
18 2002/06/28 2.045413 2.523772
19 2002/07/31 1.468695 1.677434
20 2002/08/30 1.707742 1.920101
21 2002/09/30 1.449055 1.554702
22 2002/10/31 1.350024 1.466806
23 2002/11/29 1.541507 1.844471
24 2002/12/31 1.208786 1.392031')

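find_quantile <- function(df, year, col, quant) {
  # Keep only the rows whose date starts with the requested year.
  # Indexing directly (rather than via subset()) avoids a clash if df
  # already has a column named "year".
  year_df <- df[substring(as.character(df$date), 1, 4) == year, ]
  # [[ ]] extracts the column as a plain vector, by index or by name
  quantile(year_df[[col]], probs = quant)
}
# where df is the data frame,
# year is the year you want (as a character string),
# col is the column whose quantile you want (as an index, i.e. 2 or 3 in your case),
# quant is the quantile (a probability between 0 and 1)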

For example:

> find_quantile(df,'2001',2,0.7) #specify the year as character
     70% 
4.023187
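
Since quantile() accepts a vector of probabilities, the same idea can be extended to return every decile for every year in one call. The following is a minimal sketch, not part of the original answer: deciles_by_year is a hypothetical helper name, and the type argument is exposed so that type = 5 from the question can be requested (the default, type = 7, is what the example above uses).

```r
# Data from the answer above (row numbers dropped)
df <- read.table(header = TRUE, text = "date X1 X2
2001/01/31 4.698648 4.640957
2001/02/28 4.491493 4.398382
2001/03/30 4.101235 4.074065
2001/04/30 4.072041 4.217999
2001/05/31 3.856718 4.114061
2001/06/29 3.909194 4.142691
2001/07/31 3.489640 3.678374
2001/08/31 3.327068 3.534823
2001/09/28 2.476066 2.727257
2001/10/31 2.015936 2.299102
2001/11/30 2.127617 2.590702
2001/12/31 2.162643 2.777744
2002/01/31 2.221636 2.740961
2002/02/28 2.276458 2.834494
2002/03/28 2.861650 3.472853
2002/04/30 2.402687 3.026207
2002/05/31 2.426250 2.968679
2002/06/28 2.045413 2.523772
2002/07/31 1.468695 1.677434
2002/08/30 1.707742 1.920101
2002/09/30 1.449055 1.554702
2002/10/31 1.350024 1.466806
2002/11/29 1.541507 1.844471
2002/12/31 1.208786 1.392031")

# Hypothetical helper: deciles of one column, one row per year.
deciles_by_year <- function(df, col, type = 7) {
  # The year is taken from the first four characters of the date
  years <- substring(as.character(df$date), 1, 4)
  # split() groups the column by year; sapply() returns deciles x years,
  # so transpose to get years in rows
  t(sapply(split(df[[col]], years), quantile,
           probs = seq(0, 1, by = 0.1), type = type))
}

# One matrix of deciles per column
result <- lapply(c(X1 = "X1", X2 = "X2"),
                 function(col) deciles_by_year(df, col))

# Same value as the find_quantile() example above
print(result$X1["2001", "70%"])
```

Passing type = 5 to deciles_by_year() should give the quantile definition asked for in the question; adding more columns only means extending the vector of names given to lapply().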