I am trying to expand a dataset in R using the values from different columns (Key2 - KeyX) and then use the column number in a formula to compute some value.
Example of a part of the dataset I want to expand:
Year Key2 Key3 Key4 Key5 ...
2001 150 105 140 140
2002 130 70 55 80
2003 590 375 355 385
...
Preferred result:
- i = index number
- col = column number (Key2 = 1, Key3 = 2, etc.)
- p = random number
- value = value calculated with the column number and p
year i   col p         value
2001 1   1   0.7481282 4.0150810
2001 2   1   0.8449366 2.0735090
2001 ... 1   0.1906882 0.9534411
2001 150 1   0.8030162 3.7406410
2001 1   2   0.4147019 4.2246831
2001 2   2   0.3716995 1.8584977
2001 ... 2   0.5280272 2.6401361
2001 105 2   0.8030162 3.7406410
2001 1   3   0.7651376 3.8256881
2001 2   3   0.2298984 1.1494923
2001 ... 3   0.5607825 2.8039128
2001 140 3   0.7222644 3.6113222
etc.
2002 1   1   0.1796613 0.8983065
2002 2   1   0.6390833 3.1954165
2002 ... 1   0.5280272 2.6401367
2002 130 1   0.4238842 2.1194210
2002 1   2   0.7651376 3.8256889
2002 2   2   0.2298984 1.1494928
2002 ... 2   0.5607825 2.8039125
2002 70  2   0.7222644 3.6113227
2002 1   3   0.7512801 3.7564000
2002 2   3   0.4484248 2.2421240
2002 ... 3   0.5662704 2.8313520
2002 55  3   0.7685377 3.8426884
etc.
I use the following code in R, but it is very slow with a large dataset.
I tried to keep the use of loops to a minimum by using rep(), but I still have two for-loops in the code. Is there a faster / more efficient way to do this in R? Using data.table?
library(data.table)   # for setDT()

val <- c(); i <- c(); cols <- c(); p <- c(); yr <- c()
for (r in 1:10) {                    # row index; reusing the name 'year' here and
                                     # appending to it below corrupts the loop index
  for (n in 2:25) {
    cl <- n - 1                      # column number; 'c' would mask base::c()
    pu <- runif(dataset1[[r, n]])
    p <- c(p, pu)
    tmp <- (cl - 1) * 5 + 5 * pu
    val <- c(val, tmp)
    ##
    i <- c(i, 1:dataset1[[r, n]])
    cols <- c(cols, rep(cl, dataset1[[r, n]]))
    yr <- c(yr, rep(dataset1[[r, 1]], dataset1[[r, n]]))
  }
}
res.df <- data.frame(year = yr, i = i, cols = cols, p = p, val = val)
res.df <- setDT(res.df)
The core of the problem is the expansion of the values in the `Key` columns into `i`. Here is another `data.table` solution employing `melt()`, differing in implementation details from David's comment. The remaining computations can be done like this (if I've understood the OP's intention right):
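The answer's original code block is not preserved in this copy. The following is a sketch of what such a `melt()`-based expansion can look like, built on the OP's sample data; the `val` formula is taken from the question, while the final ordering step and object names are assumptions:

```r
library(data.table)

# OP's sample data (3 years, Key2..Key5)
dataset1 <- data.table(
  Year = 2001:2003,
  Key2 = c(150, 130, 590),
  Key3 = c(105,  70, 375),
  Key4 = c(140,  55, 355),
  Key5 = c(140,  80, 385)
)

# melt to long format: one row per (Year, Key) holding the count n
long <- melt(dataset1, id.vars = "Year", variable.name = "Key", value.name = "n")
long[, col := rleid(Key)]                 # Key2 -> 1, Key3 -> 2, ...

res <- long[rep(seq_len(.N), n)]          # expand each row n times
res[, i := rowid(Year, col)]              # running index within each Year/col group
res[, p := runif(.N)]                     # one random number per expanded row
res[, val := (col - 1) * 5 + 5 * p]       # the OP's formula
setorder(res, Year, col, i)
res <- res[, .(year = Year, i, col, p, val)]
```

The row expansion `long[rep(seq_len(.N), n)]` is what replaces the OP's nested loops: it is a single vectorized subset instead of repeated `c()` concatenation, which is the main source of the slowdown.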
Benchmarking the different approaches
As the OP is asking for a faster / more efficient way, the three different approaches proposed so far are benchmarked:

- David's `data.table` solution (`david1`), plus a modification (`david2`) which ensures the result is identical with the expected result
- the `tidyverse` solution
- the `data.table` solution above (`uwe`)

Benchmark code
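The benchmark code itself is not reproduced in this copy. As an illustration of a `microbenchmark` setup, the sketch below times one candidate, a `melt()`-based expansion; the function name and body are illustrative stand-ins, not the actual `david1`/`david2`/`tidy` implementations:

```r
library(data.table)
library(microbenchmark)

# OP's sample data
dataset1 <- data.table(
  Year = 2001:2003,
  Key2 = c(150, 130, 590), Key3 = c(105, 70, 375),
  Key4 = c(140,  55, 355), Key5 = c(140, 80, 385)
)

# illustrative candidate: melt, expand, then compute the derived columns
uwe <- function(dt) {
  long <- melt(dt, id.vars = "Year", variable.name = "Key", value.name = "n")
  long[, col := rleid(Key)]
  res <- long[rep(seq_len(.N), n)]
  res[, `:=`(i = rowid(Year, col), p = runif(.N))]
  res[, val := (col - 1) * 5 + 5 * p][]
}

# each benchmarked solution would get its own named entry here
microbenchmark(uwe = uwe(dataset1), times = 100L)
```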
For benchmarking, the `microbenchmark` package is used. Note that references to `tidyverse` functions are made explicit in order to avoid name conflicts due to a cluttered name space. The modified `david2` variant converts factor levels to numbers.

Timing the small sample data set
With the small sample data set with 3 years and 4 `Key` columns provided by the OP, the timings are as follows. Even for this small problem size, the `data.table` solutions are orders of magnitude faster than the `tidyverse` approach, with slight advantages for solution `uwe`.

The results are checked to be equal:
Except for `david1`, which returns factors instead of integers and a different row ordering, all four results are identical.

Larger benchmark case
From the OP's code it can be concluded that his production data set consists of 10 years and 24 `Key` columns. In the sample data set, the overall mean of the `Key` values is 215. With these parameters, a larger data set is created: it contains 51600 rows, which is still of rather moderate size but 20 times larger than the small sample. Timings are as follows:
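The answer's data-generation code is not preserved in this copy. A reconstruction under the stated parameters (10 years, 24 `Key` columns, counts scattered around the mean of 215) could look like the following; the Poisson draw and the seed are assumptions:

```r
library(data.table)
set.seed(123)                          # assumption: any fixed seed for reproducibility

n_years  <- 10L
n_keys   <- 24L
mean_key <- 215

# counts drawn around the observed mean of the sample's Key values
dataset_large <- as.data.table(matrix(
  rpois(n_years * n_keys, lambda = mean_key),
  nrow = n_years,
  dimnames = list(NULL, paste0("Key", seq_len(n_keys) + 1L))
))
dataset_large[, Year := 2000L + seq_len(.N)]
setcolorder(dataset_large, "Year")

# expected number of expanded rows: about 10 * 24 * 215 = 51600
```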
For this problem size, `uwe` is nearly twice as fast as the other `data.table` implementations. The `tidyverse` approach is still orders of magnitude slower.