How to sum non-empty successive rows in deeply nested tibbles in R?

I have a large tibble containing data from several experimental tasks. Each task has many subjects, and each subject has their own data frame of results (for context, it is an eye-tracking experiment). I keep everything in nested tibbles to have a clean, reproducible environment instead of countless objects.

For each subject, I'd like to sum up the lines that follow each other and contain a value (in the experimental context, this means grouping together the frames from a single area that the subject has looked at). Here is an example of the structure of the data:

library(tidyverse)

df <- tibble(
  task_name = c("task_1", "task_2"),
  task_data = list(
    tibble(
        subject = c("S1", "S2"),
        data = list(
          tibble(
            var_1 = c(NA, 20, 12, NA, NA, NA),
            var_2 = c(NA, NA, NA, NA, 51, 10),
          ),
          tibble(
            var_1 = c(NA, NA, 15, 54, NA, NA),
            var_2 = c(NA, 13, NA, NA, NA, 10),
          )
        )
      ),
    tibble(
        subject = c("S1", "S2"),
        data = list(
          tibble(
            var_3 = c(3,  NA, NA, 10,  5, NA),
            var_4 = c(NA, NA, NA, NA, NA, NA)
          ),
          tibble(
            var_3 = c(3,  NA, NA, NA,  7, NA),
            var_4 = c(NA, NA, 20, 30, NA, NA)
          )
        )
      )
    )
  )

I want to sum the numbers in var_1/2/3/4, but only if they are successive, e.g. summing 20 and 12 in var_1 and 51 and 10 in var_2 for subject 1. Each row always has a value in a single column, so the column lengths are not an issue. The expected outcome for S1 in task_1 for instance is that var_1 is c(0, 32, 0, 0) and var_2 is c(0, 0, 0, 61), therefore in a shorter tibble of 4 rows.

What I've tried, based on the ideas of several posts similar to this one, is to create lagged columns to check the values before and after each row, sum them under specific conditions, then delete the helper lagged columns created in this process, and finally keep only one of the duplicate summed rows created.

It looks like this:

df_summed <- 
  df |> 
  rowwise() |>
  mutate(
    task_data = case_when(
      # We work only on task_1 for now for testing purposes
      task_name == "task_1" ~
        list(
          task_data |>
            rowwise() |>
            mutate(
              data = list(
                data |> 
                  mutate(
                    # Offsets the column to see what's behind
                    behind = lag(var_1),
                    # Replaces NAs with 0
                    behind = if_else(is.na(behind), 0, behind),
                    # Same with what's ahead
                    ahead  = lead(var_1),
                    ahead  = if_else(is.na(ahead), 0, ahead),
                    # If the value is NA, returns 0; otherwise adds the neighbouring values
                    summed = if_else(is.na(var_1), 0, var_1 + behind + ahead)
                  ) |> 
                  # Removes the columns that were created for the summing process
                  select(!c(var_1, behind, ahead)) |>
                  rename(var_1 = summed) |>
                  # Keeps only one of the two duplicate rows
                  unique()
                )
              )
          ),
      # Let's not touch task_2 right away, for convenience
      TRUE ~ list(task_data)
    )
  )

This lag/lead mechanism is verbose, and I intended to turn it into a pipe-friendly function to apply to all the variables, but I can't even make it work in this reprex. What puzzles me the most is that this code works perfectly if the variables are identical across tasks (i.e. if I rename var_3/4 to var_1/2 here), but this is not the case in the real dataset. These differing variable names are the reason why I used case_when to isolate the tasks, but this does not cut it.

As mentioned above, the expected output of the code pasted here is to modify only task_1: var_1 should equal c(0, 32, 0, 0, 0) for S1 and c(0, 0, 69, 0, 0) for S2. The length of var_2 will also change in the process, but task_2 should not be modified at all.

The following error is returned:

Error in `mutate()`:
ℹ In argument: `task_data = case_when(...)`.
Caused by error in `case_when()`:
! Failed to evaluate the right-hand side of formula 1.
Caused by error in `mutate()`:
ℹ In argument: `data = list(...)`.
Caused by error in `mutate()`:
ℹ In argument: `behind = lag(var_1)`.
Caused by error in `lag()`:
! object 'var_1' not found
Backtrace:
  1. dplyr::mutate(...)
 29. dplyr::lag(var_1)

I think I understand the error, but I can't figure out why case_when doesn't take care of isolating variables.

Do you see better ways to do this? Also, do you have any ideas as to why the variable names make a difference and why case_when couldn't handle it?

PS: this is my very first question on Stack Overflow, so I'm sorry if it's too verbose. I tried to make it as explicit and reproducible as possible since it's a very specific issue.

Thank you!

Best answer

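We can wrap the lag/lead logic in a user-defined function and apply it with across() to every column whose name starts with "var", which keeps the pipeline pipe-friendly and independent of the exact variable names in each task.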

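summed <- function(x) {
  # Neighbouring values, with every NA (including the padded ends) treated as 0
  behind <- coalesce(lag(x), 0)
  ahead  <- coalesce(lead(x), 0)
  # NA rows become 0; other rows get their own value plus both neighbours
  if_else(is.na(x), 0, x + behind + ahead)
}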
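
A side note on padding, as a small sketch: the `default` argument of `lag()` and `lead()` only fills the position created by the shift. NAs that were already in the vector are shifted along unchanged, and adding them to a value yields NA. Wrapping the shifted vector in `coalesce()` replaces every NA instead. The vector `x` below is made up for illustration.

```r
library(dplyr)

# Illustrative vector: two successive values surrounded by NAs
x <- c(NA, 20, 12, NA)

# `default` fills only the first slot; the NA that was already in x is shifted along
padded <- lag(x, default = 0)   # 0 NA 20 12

# coalesce() replaces every NA in the shifted vector
filled <- coalesce(lag(x), 0)   # 0 0 20 12

# The interior NA propagates through the sum, the coalesced version does not
x + padded   # NA NA 32 NA
x + filled   # NA 20 32 NA
```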



df_summed <- 
  df |> 
  rowwise() |>
  mutate(
    task_data = case_when(
      # We work only on task_1 for now for testing purposes
      task_name == "task_1" ~
        list(
          task_data |>
            rowwise() |>
            mutate(
              data = list(
                data |> 
                  mutate(across(starts_with("var"),~summed(.x))
                  ) |> 
                  # Keeps only one of the two duplicate rows
                  unique()
              )
            )
        ),
      # Let's not touch task_2 right away, for convenience
      TRUE ~ list(task_data)
    )
  )

> (df_summed[[2]][[1]])[[2]][[1]]
# A tibble: 3 × 2
  var_1 var_2
  <dbl> <dbl>
1     0     0
2    32     0
3     0    61
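
On the second part of the question: `case_when()` is a vectorised function, so it evaluates every right-hand side before choosing which result to keep. Under `rowwise()`, the `task_1` branch is therefore also run on the `task_2` row, whose tibbles have no `var_1`, hence the error. With identical column names in every task, that same evaluation simply succeeds. A plain `if`/`else` only runs the branch that applies. The sketch below combines this with a hypothetical helper, `collapse_runs()` (not part of any package), that sums each run of successive rows holding a value in the same column. It assumes that rows with no value at all should each stay as a separate row of zeros, which matches the 4-row result described in the question.

```r
library(dplyr)
library(purrr)
library(tibble)

# Same example data as in the question
df <- tibble(
  task_name = c("task_1", "task_2"),
  task_data = list(
    tibble(
      subject = c("S1", "S2"),
      data = list(
        tibble(var_1 = c(NA, 20, 12, NA, NA, NA), var_2 = c(NA, NA, NA, NA, 51, 10)),
        tibble(var_1 = c(NA, NA, 15, 54, NA, NA), var_2 = c(NA, 13, NA, NA, NA, 10))
      )
    ),
    tibble(
      subject = c("S1", "S2"),
      data = list(
        tibble(var_3 = c(3, NA, NA, 10, 5, NA), var_4 = c(NA, NA, NA, NA, NA, NA)),
        tibble(var_3 = c(3, NA, NA, NA, 7, NA), var_4 = c(NA, NA, 20, 30, NA, NA))
      )
    )
  )
)

# Hypothetical helper: one output row per run of successive rows that have
# values in the same column(s); all-NA rows are kept one by one as zeros.
collapse_runs <- function(d) {
  filled <- !is.na(as.matrix(d))                  # TRUE where a cell has a value
  key <- apply(filled, 1, paste, collapse = "")   # one pattern string per row
  has_value <- rowSums(filled) > 0
  # A new run starts when the pattern changes, and on every all-NA row
  new_run <- c(TRUE, key[-1] != key[-length(key)]) | !has_value
  d |>
    mutate(across(everything(), \(col) replace(col, is.na(col), 0))) |>
    group_by(.run = cumsum(new_run)) |>
    summarise(across(everything(), sum), .groups = "drop") |>
    select(-.run)
}

df_summed <- df |>
  mutate(
    task_data = map2(task_name, task_data, \(name, td) {
      # Plain `if` only evaluates the branch that applies, unlike case_when()
      if (name == "task_1") {
        mutate(td, data = map(data, collapse_runs))
      } else {
        td
      }
    })
  )

# S1 in task_1: var_1 is 0, 32, 0, 0 and var_2 is 0, 0, 0, 61
df_summed$task_data[[1]]$data[[1]]
```

Unlike the lag/lead version, this also handles runs longer than two rows, where the neighbouring sums would no longer be duplicates of each other. It does not rely on `unique()` either, which also drops identical rows that are not adjacent; that is why the output above has 3 rows rather than 4.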