I am trying to expand a dataset in R using the values from different columns (Key2 - KeyX) and then use the column number in a formula to compute some value.
Example of a part of the dataset I want to expand:
Year Key2 Key3 Key4 Key5 ...
2001 150 105 140 140
2002 130 70 55 80
2003 590 375 355 385
...
Preferred result:
- i = index number
- col = column number (Key2 = 1, Key3 = 2, etc.)
- p = random number
- value = value calculated with the column number and p
year i   col p         value
2001 1   1   0.7481282 4.0150810
2001 2   1   0.8449366 2.0735090
2001 ... 1   0.1906882 0.9534411
2001 150 1   0.8030162 3.7406410
2001 1   2   0.4147019 4.2246831
2001 2   2   0.3716995 1.8584977
2001 ... 2   0.5280272 2.6401361
2001 105 2   0.8030162 3.7406410
2001 1   3   0.7651376 3.8256881
2001 2   3   0.2298984 1.1494923
2001 ... 3   0.5607825 2.8039128
2001 140 3   0.7222644 3.6113222
etc.
2002 1   1   0.1796613 0.8983065
2002 2   1   0.6390833 3.1954165
2002 ... 1   0.5280272 2.6401367
2002 130 1   0.4238842 2.1194210
2002 1   2   0.7651376 3.8256889
2002 2   2   0.2298984 1.1494928
2002 ... 2   0.5607825 2.8039125
2002 70  2   0.7222644 3.6113227
2002 1   3   0.7512801 3.7564000
2002 2   3   0.4484248 2.2421240
2002 ... 3   0.5662704 2.8313520
2002 55  3   0.7685377 3.8426884
etc.
I use the following code in R, but it is very slow with a large dataset.
I tried to keep the use of loops to a minimum by using rep(), but I still have two for-loops in the code. Is there a faster / more efficient way to do this in R? Using data.table?
library(data.table)   # for setDT()

val <- c(); i <- c(); cols <- c(); p <- c(); yr <- c()
for (r in 1:10) {                    # row index; reusing the name 'year' here and
                                     # appending to it below corrupts the loop index
  for (n in 2:25) {
    cl <- n - 1                      # column number; 'c' would mask base::c()
    pu <- runif(dataset1[[r, n]])
    p <- c(p, pu)
    tmp <- (cl - 1) * 5 + 5 * pu
    val <- c(val, tmp)
    ##
    i <- c(i, 1:dataset1[[r, n]])
    cols <- c(cols, rep(cl, dataset1[[r, n]]))
    yr <- c(yr, rep(dataset1[[r, 1]], dataset1[[r, n]]))
  }
}
res.df <- data.frame(year = yr, i = i, cols = cols, p = p, val = val)
res.df <- setDT(res.df)
The core of the problem is the expansion of the values in the `Key` columns into `i`. Here is another `data.table` solution employing `melt()`, differing in implementation details from David's comment. The remaining computations can be done like this (if I've understood the OP's intention right):
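The answer's original code block is not preserved in this copy. The following is a sketch of what such a `melt()`-based expansion can look like, built on the OP's sample data; the `val` formula is taken from the question, while the final ordering step and object names are assumptions:

```r
library(data.table)

# OP's sample data (3 years, Key2..Key5)
dataset1 <- data.table(
  Year = 2001:2003,
  Key2 = c(150, 130, 590),
  Key3 = c(105,  70, 375),
  Key4 = c(140,  55, 355),
  Key5 = c(140,  80, 385)
)

# melt to long format: one row per (Year, Key) holding the count n
long <- melt(dataset1, id.vars = "Year", variable.name = "Key", value.name = "n")
long[, col := rleid(Key)]                 # Key2 -> 1, Key3 -> 2, ...

res <- long[rep(seq_len(.N), n)]          # expand each row n times
res[, i := rowid(Year, col)]              # running index within each Year/col group
res[, p := runif(.N)]                     # one random number per expanded row
res[, val := (col - 1) * 5 + 5 * p]       # the OP's formula
setorder(res, Year, col, i)
res <- res[, .(year = Year, i, col, p, val)]
```

The row expansion `long[rep(seq_len(.N), n)]` is what replaces the OP's nested loops: it is a single vectorized subset instead of repeated `c()` concatenation, which is the main source of the slowdown.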
Benchmarking the different approaches
As the OP is asking for a faster / more efficient way, the three different approaches proposed so far are benchmarked:

- David's `data.table` solution (`david1`), plus a modification (`david2`) which ensures the result is identical with the expected result
- the `tidyverse` solution
- the `data.table` solution above (`uwe`)

Benchmark code
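The benchmark code itself is not reproduced in this copy. As an illustration of a `microbenchmark` setup, the sketch below times one candidate, a `melt()`-based expansion; the function name and body are illustrative stand-ins, not the actual `david1`/`david2`/`tidy` implementations:

```r
library(data.table)
library(microbenchmark)

# OP's sample data
dataset1 <- data.table(
  Year = 2001:2003,
  Key2 = c(150, 130, 590), Key3 = c(105, 70, 375),
  Key4 = c(140,  55, 355), Key5 = c(140, 80, 385)
)

# illustrative candidate: melt, expand, then compute the derived columns
uwe <- function(dt) {
  long <- melt(dt, id.vars = "Year", variable.name = "Key", value.name = "n")
  long[, col := rleid(Key)]
  res <- long[rep(seq_len(.N), n)]
  res[, `:=`(i = rowid(Year, col), p = runif(.N))]
  res[, val := (col - 1) * 5 + 5 * p][]
}

# each benchmarked solution would get its own named entry here
microbenchmark(uwe = uwe(dataset1), times = 100L)
```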
For benchmarking, the `microbenchmark` package is used. Note that references to `tidyverse` functions are made explicit in order to avoid name conflicts due to a cluttered name space. The modified `david2` variant converts factor levels to numbers.

Timing the small sample data set
With the small sample data set with 3 years and 4 `Key` columns provided by the OP, the timings are as follows. Even for this small problem size, the `data.table` solutions are orders of magnitude faster than the `tidyverse` approach, with slight advantages for solution `uwe`.

The results are checked to be equal:
Except for `david1`, which returns factors instead of integers and a different row ordering, all four results are identical.

Larger benchmark case
From the OP's code it can be concluded that his production data set consists of 10 years and 24 `Key` columns. In the sample data set, the overall mean of the `Key` values is 215. With these parameters, a larger data set is created: it contains 51600 rows, which is still of rather moderate size but 20 times larger than the small sample. Timings are as follows:
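The answer's data-generation code is not preserved in this copy. A reconstruction under the stated parameters (10 years, 24 `Key` columns, counts scattered around the mean of 215) could look like the following; the Poisson draw and the seed are assumptions:

```r
library(data.table)
set.seed(123)                          # assumption: any fixed seed for reproducibility

n_years  <- 10L
n_keys   <- 24L
mean_key <- 215

# counts drawn around the observed mean of the sample's Key values
dataset_large <- as.data.table(matrix(
  rpois(n_years * n_keys, lambda = mean_key),
  nrow = n_years,
  dimnames = list(NULL, paste0("Key", seq_len(n_keys) + 1L))
))
dataset_large[, Year := 2000L + seq_len(.N)]
setcolorder(dataset_large, "Year")

# expected number of expanded rows: about 10 * 24 * 215 = 51600
```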
For this problem size, `uwe` is nearly twice as fast as the other `data.table` implementations. The `tidyverse` approach is still orders of magnitude slower.