Pair each combination of two columns and calculate sum for a third column in data.table


I have two very large data frames, df1 and df2. df1 contains the columns "from1", "to1" and "count". The values in "from1" and "to1" represent commuting points and can occur multiple times:

from1 to1 count
10020 10020 20
10020 10020 10
10020 22001 NA
30030 20020 2
45001 32001 100
45001 32001 NA
45001 45001 1
90080 45002 NA

In df2, I would like to create every possible combination of "from1" and "to1". Then I would like to fill a new column, "count_total", with the sum of commuters for each pair. If a combination does not occur in df1, or its count is NA, the value should be 0. My desired output:

from2 to2 count_total
10020 10020 30
10020 22001 0
10020 20020 0
10020 32001 0
10020 45001 0
10020 45002 0
30030 10020 0
30030 22001 0
30030 20020 2

...

I tried the following, but it did not sum the values for "count_total" correctly:

    df2 <- CJ(from2 = unique(df1$from1), 
                to2 = unique(df1$to1))


    df2[, count_total := sum(df1$count[
             df1$from1 == from2 &
               df1$to1 == to2
                ]), by = .(from2, to2)]

What did I do wrong? Thanks!

There are 3 best solutions below

r2evans On BEST ANSWER

We can do a merge-then-summarize:

    library(data.table)
    setDT(df1)
    # Right join onto all combinations so every matching df1 row is kept
    # (an update join with := would keep only the last match per pair)
    df1[CJ(from2 = unique(df1$from1), to2 = unique(df1$to1)),
        on = .(from1 == from2, to1 == to2)
      ][, .(count2 = sum(c(0, count), na.rm = TRUE)), by = .(from2 = from1, to2 = to1)]
    #     from2   to2 count2
    #     <int> <int>  <num>
    #  1: 10020 10020     30
    #  2: 10020 20020      0
    #  3: 10020 22001      0
    #  4: 10020 32001      0
    #  5: 10020 45001      0
    #  6: 10020 45002      0
    #  7: 30030 10020      0
    #  8: 30030 20020      2
    #  9: 30030 22001      0
    # 10: 30030 32001      0
    # ---
    # 15: 45001 22001      0
    # 16: 45001 32001    100
    # 17: 45001 45001      1
    # 18: 45001 45002      0
    # 19: 90080 10020      0
    # 20: 90080 20020      0
    # 21: 90080 22001      0
    # 22: 90080 32001      0
    # 23: 90080 45001      0
    # 24: 90080 45002      0
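
For a reproducible check, the sample data from the question can be rebuilt as in the sketch below (the integer column types are an assumption). It also suggests that the attempt in the question probably only needs `na.rm = TRUE`: `sum()` returns `NA` as soon as one matching `count` is `NA`, while pairs with no match already sum over an empty vector and give 0.

```r
library(data.table)

# Sample data from the question; integer column types are an assumption
df1 <- data.table(
  from1 = c(10020L, 10020L, 10020L, 30030L, 45001L, 45001L, 45001L, 90080L),
  to1   = c(10020L, 10020L, 22001L, 20020L, 32001L, 32001L, 45001L, 45002L),
  count = c(20L, 10L, NA, 2L, 100L, NA, 1L, NA)
)

# Every combination of the observed origins and destinations
df2 <- CJ(from2 = unique(df1$from1), to2 = unique(df1$to1))

# Same expression as in the question, plus na.rm = TRUE:
# NA counts are dropped before summing, and unmatched pairs sum to 0
df2[, count_total := sum(df1$count[df1$from1 == from2 & df1$to1 == to2],
                         na.rm = TRUE),
    by = .(from2, to2)]
```

This scans all of df1 once per combination, so on very large tables a join-based approach is likely to be faster.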
Mark On

Steps:

  1. Load required packages
  2. Wrap dt with lazy_dt(), so dplyr verbs can be used on it and are translated to data.table code
  3. Summarise, to combine rows with the same from1 and to1
  4. Complete the data, which creates rows for every combination of from1 and to1, with default values of 0
  5. Since the pipeline is lazy, call as.data.table() to have it actually do the work

    # dplyr and tidyr supply summarise() and complete(); dtplyr translates them
    pacman::p_load(data.table, dtplyr, dplyr, tidyr)

    dt <- lazy_dt(df1)

    dt |>
      summarise(count = sum(count, na.rm = TRUE), .by = c(from1, to1)) |>
      complete(from1, to1, fill = list(count = 0)) |>
      as.data.table()

Output:

    from1   to1 count
 1: 10020 10020    30
 2: 10020 20020     0
 3: 10020 22001     0
 4: 10020 32001     0
 5: 10020 45001     0
 6: 10020 45002     0
 7: 30030 10020     0
 8: 30030 20020     2
 9: 30030 22001     0
10: 30030 32001     0
11: 30030 45001     0
12: 30030 45002     0
13: 45001 10020     0
14: 45001 20020     0
15: 45001 22001     0
16: 45001 32001   100
17: 45001 45001     1
18: 45001 45002     0
19: 90080 10020     0
20: 90080 20020     0
21: 90080 22001     0
22: 90080 32001     0
23: 90080 45001     0
24: 90080 45002     0
    from1   to1 count
s_baldur On

Using set():

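    library(data.table)
    setDT(df1)

    # All unique (from1, to1) combinations, with counts starting at 0
    df2 <- df1[, CJ(from1, to1, unique = TRUE)][, count := 0L]

    # CJ has already created a key on (from1, to1), so df2[df1[i]] is a keyed lookup
    for (i in seq_len(nrow(df1))) {
      if (is.na(df1$count[i])) next              # treat NA as 0
      row <- df2[df1[i], which = TRUE]           # row of df2 matching this pair
      set(df2, row, "count", value = df2[row, count] + df1$count[i])
    }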
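
Since the loop visits df1 one row at a time, it may be slow on very large tables. A possible vectorized variant, offered only as a sketch, sums df1 per pair first and then overwrites the zeros in a single update join. The helper name `agg` and the integer sample data are assumptions for illustration.

```r
library(data.table)

# Sample data from the question; integer column types are an assumption
df1 <- data.table(
  from1 = c(10020L, 10020L, 10020L, 30030L, 45001L, 45001L, 45001L, 90080L),
  to1   = c(10020L, 10020L, 22001L, 20020L, 32001L, 32001L, 45001L, 45002L),
  count = c(20L, 10L, NA, 2L, 100L, NA, 1L, NA)
)

# One row per observed pair, with NA counts dropped before summing
agg <- df1[, .(count = sum(count, na.rm = TRUE)), by = .(from1, to1)]

# All combinations, defaulting to 0
df2 <- df1[, CJ(from1 = from1, to1 = to1, unique = TRUE)][, count := 0L]

# Update join: agg has exactly one row per pair, so each match is unambiguous
df2[agg, count := i.count, on = .(from1, to1)]
```

Aggregating before the join matters here: an update join keeps only one matching value per row of df2, so duplicated pairs in df1 have to be collapsed first.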