Replace any value in an Arrow Table with another value


Similar to a previous post: Replacing NA/null values in arrow table in R across multiple numeric variables without converting to R dataframe

I'm looking to replace some values in an output arrow table (frequency below) with another value. This time, I want to conditionally replace the values in a range of columns, turning their frequency counts into binary values of 0 or 1 that show whether the variable in a given column is present for that participant (again, without pulling the data into R, because it is large).

However, I have not had much luck doing this within the R/arrow environment itself.

The selection should include all columns that begin with "concepts_" or "terms_" (a little under 950 columns of integer data) but exclude the variables that come from another dataset (Demographics, which is a mix of strings and integers, so changing those columns either produces errors or would be incorrect).
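As a minimal sketch of that selection, the target column names can be computed once and stored in a variable, which keeps later calls shorter. The names in `all_names` are hypothetical stand-ins for `COI_tag$schema$names`; the pattern and the excluded names are the ones used in the code further down.

```r
# Hypothetical column names standing in for COI_tag$schema$names.
all_names <- c("participantID", "concept_a", "term_b",
               "clean_terms", "clean_concepts", "flag1")

# Keep names matching the pattern used in the question...
prefixed <- grep("concept_|term_", all_names, value = TRUE)

# ...then drop the two names that must never be modified.
target_cols <- setdiff(prefixed, c("clean_terms", "clean_concepts"))

print(target_cols)
```

The resulting character vector could then be passed to a selection helper such as `all_of(target_cols)` instead of repeating the full `grepl()` expression in every call.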

This is the only code I have been able to write that produces the frequency arrow table in question; all other calls to variations of mutate, select, and across have thrown errors or produced NA across the board:

frequency <-
  COI_tag |>
  filter(flag1 == 0, flag2 == 0) |>
  select(where(is.numeric)) |>
  group_by(participantID) |>
  mutate(across(everything(), ~ coalesce(.x, 0))) |>
  summarise(across(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)] %in% c("clean_terms","clean_concepts") == FALSE], sum)) |>
  left_join(Demographics) |>
  compute()

However, all my attempts to flip the values > 0 to a simple 1 have resulted in similar errors (including a where and several selects not pictured here):

binary <- frequency |>
mutate(across(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)]  %in% c("clean_terms","clean_concepts") == FALSE] ~ replace(.x, .x > 0, 1)))
Error in `column_select()`:
! Formula shorthand must be wrapped in `where()`.

  # Bad
  data %>% select(... ~ replace(.x, .x > 0, 1))

  # Good
  data %>% select(where(... ~ replace(.x, .x > 0, 1)))
Run `rlang::last_trace()` to see where the error occurred.
Warning messages:
1: Invalid metadata$r 
2: Invalid metadata$r 

binary <- frequency |>
mutate(across(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)]  %in% c("clean_terms","clean_concepts") == FALSE] ~ replace((.), (.) > 0, 1)))
Error in `column_select()`:
! Formula shorthand must be wrapped in `where()`.

  # Bad
  data %>% select(... ~ replace((.), (.) > 0, 1))

  # Good
  data %>% select(where(... ~ replace((.), (.) > 0, 1)))
Run `rlang::last_trace()` to see where the error occurred.
Warning messages:
1: Invalid metadata$r
2: Invalid metadata$r 

frequency |>
mutate(across(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)]  %in% c("clean_terms","clean_concepts") == FALSE] ~ replace(.x, .x > 0, 1)))
Error in `column_select()`:
! Formula shorthand must be wrapped in `where()`.

  # Bad
  data %>% select(... ~ replace(.x, .x > 0, 1))

  # Good
  data %>% select(where(... ~ replace(.x, .x > 0, 1)))
Run `rlang::last_trace()` to see where the error occurred.
Warning messages:
1: Invalid metadata$r 
2: Invalid metadata$r 
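
One reading of the three `where()` errors above: in each call the column vector is followed directly by `~` with no comma, so `across()` receives a single two-sided formula as its column selection instead of a selection plus a function. A minimal sketch of a comma-separated version follows. It runs on a small hypothetical table (`frequency_demo` and its columns are invented for illustration) and uses `if_else()` rather than `replace()`, on the assumption that the installed arrow version supports `across()` and `if_else()` inside `mutate()`. It has not been run against the real data.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for the `frequency` table from the question.
frequency_demo <- arrow_table(
  participantID = c(1L, 2L, 3L),
  concept_a     = c(0L, 3L, 1L),
  term_b        = c(2L, 0L, 0L),
  age           = c(34L, 51L, 27L)
)

# Columns to binarize: pattern and exclusions as in the question.
cols <- setdiff(
  grep("concept_|term_", names(frequency_demo), value = TRUE),
  c("clean_terms", "clean_concepts")
)

# Note the comma between the column selection and the formula.
binary_demo <- frequency_demo |>
  mutate(across(all_of(cols), ~ if_else(.x > 0, 1L, 0L))) |>
  compute()
```

Columns outside `cols` (here `participantID` and `age`) are left untouched, which is the behaviour wanted for the Demographics variables.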

binary <- frequency |>
mutate(across(everything()), case_when(x. > 0 ~ 1, .x == 0 ~ .))
Error in map_lgl(args, ~inherits(., "Expression")) : 
  object 'x.' not found
In addition: Warning messages:
1: Invalid metadata$r 
2: Invalid metadata$r 

frequency |>
mutate(across(is.numeric, ~1 * (. != 0)))
Error: NotImplemented: Function 'multiply_checked' has no kernel matching input types (bool, bool)
In addition: Warning messages:
1: Invalid metadata$r 
2: Invalid metadata$r

frequency |>
mutate(across(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)]  %in% c("clean_terms","clean_concepts") == FALSE]), case_when(. > 0 ~ 1, .x == 0 ~ .))

frequency |>
mutate(across(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)]  %in% c("clean_terms","clean_concepts") == FALSE]), ~ case_when(. > 0 ~ 1,. <= 0 ~ .))
Warning: In ~case_when(. > 0 ~ 1, . <= 0 ~ .) = ~case_when(. > 0 ~ 1, . <= 0 ~ .), only values of size one are recycled; pulling data into R
Error:
! Problem while computing `..2 = ~case_when(. > 0 ~ 1, . <= 0 ~ .)`.
✖ `..2` must be a vector, not a `formula` object.
Run `rlang::last_trace()` to see where the error occurred.
Warning messages:
1: Invalid metadata$r 
2: Invalid metadata$r 
3: Invalid metadata$r 
4: Invalid metadata$r 

I have found several wonderful examples that work well in base R and tidyverse/dplyr, but not on arrow tables like the one I have here (which is what frequency is).

If you have any suggestions on what might be wrong with my call, I would greatly appreciate any and all feedback.

Update

I have also tried the following, unfortunately to no avail:

binary <-
  frequency |>
  mutate(across(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)] %in% c("clean_terms","clean_concepts") == FALSE], ~as.integer(. > 0)), . = 1) |>
  compute()

binary <-
  frequency |>
  mutate(across(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)] %in% c("clean_terms","clean_concepts") == FALSE], ~(. > 0)), . = 1) |>
  compute()

frequency |>
    mutate_at(
        vars(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)] %in% c("clean_terms","clean_concepts") == FALSE]),
        funs(case_when(
            . > 0 ~ 1,
            TRUE ~ .
        ))
    ) |>
  compute()

binary <-
    frequency  |>
    mutate_at(.vars = vars(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)] %in% c("clean_terms","clean_concepts") == FALSE]),
        .funs = funs(. = case_when(
          . > 0 ~ 1
        ))) |>
  compute()