Similar to a previous post: Replacing NA/null values in arrow table in R across multiple numeric variables without converting to R dataframe
I'm looking to convert some values in an output arrow table (frequency below) to another value. This time, however, I want to conditionally replace the values in a range of columns, turning their frequency counts into binarized values of 0 or 1 that show whether the variable for a given column is present for that participant (again, without pulling the data into R, because the data is large).
So far, I have not had much luck doing this within the R/arrow environment itself.
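To make the target concrete, here is a minimal sketch of the intended result in plain base R. The vector `counts` and its values are made up for illustration, and this is ordinary in-memory R, not arrow code:

```r
# Hypothetical frequency counts for a single column (not taken from the real data)
counts <- c(0L, 3L, 1L, 0L, 12L)

# Any count above zero becomes 1; zeros stay 0
binary_toy <- as.integer(counts > 0L)

print(binary_toy)
# [1] 0 1 1 0 1
```

The goal is the same mapping, applied to each selected column of the arrow table without collecting it first.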
The selection is supposed to include all columns that begin with "concepts_" or "terms_" (a little under 950 columns, all integer data) but that do not belong to a list of variables from another dataset (Demographics, whose columns are a mix of strings and integers and would therefore either produce errors or be incorrect to change).
This is the only code I have been able to get working; it produces the frequency arrow table from which this is drawn. All other calls to variations of mutate, select, and across have thrown errors or produced NA across the board:
frequency <-
  COI_tag |>
  filter(flag1 == 0, flag2 == 0) |>
  select(where(is.numeric)) |>
  group_by(participantID) |>
  mutate(across(everything(), ~ coalesce(.x, 0))) |>
  summarise(across(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)] %in% c("clean_terms","clean_concepts") == FALSE],sum)) |>
  left_join(Demographics) |>
  compute()
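The long selection expression inside across() above amounts to "match a prefix, then drop an exclusion list". As a minimal sketch of that idea in base R, assuming made-up column names (`freq_names` and `demo_names` are hypothetical stand-ins for the real schema and the Demographics variables, and the anchored pattern is an assumption based on the description rather than the exact pattern used above):

```r
# Hypothetical column names standing in for the real schema names
freq_names <- c("participantID", "concepts_a", "concepts_b",
                "terms_x", "age", "clean_terms")
# Hypothetical names of the Demographics variables to leave untouched
demo_names <- c("participantID", "age", "terms_x")

# Keep only names that start with one of the two prefixes ...
prefixed <- grep("^(concepts_|terms_)", freq_names, value = TRUE)

# ... and drop any that also appear among the Demographics variables
target_cols <- setdiff(prefixed, demo_names)

print(target_cols)
# [1] "concepts_a" "concepts_b"
```

Computing the names once and storing them in a character vector keeps the selection out of the pipeline itself; in dplyr, a character vector like this is the kind of input that tidyselect's all_of() accepts.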
However, all my attempts to flip the values > 0 to a simple 1 have resulted in similar errors (including a where and several selects not pictured here):
binary <- frequency |>
mutate(across(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)] %in% c("clean_terms","clean_concepts") == FALSE] ~ replace(.x, .x > 0, 1)))
Error in `column_select()`:
! Formula shorthand must be wrapped in `where()`.
# Bad
data %>% select(... ~ replace(.x, .x > 0, 1))
# Good
data %>% select(where(... ~ replace(.x, .x > 0, 1)))
Run `rlang::last_trace()` to see where the error occurred.
Warning messages:
1: Invalid metadata$r
2: Invalid metadata$r
binary <- frequency |>
mutate(across(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)] %in% c("clean_terms","clean_concepts") == FALSE] ~ replace((.), (.) > 0, 1)))
Error in `column_select()`:
! Formula shorthand must be wrapped in `where()`.
# Bad
data %>% select(... ~ replace((.), (.) > 0, 1))
# Good
data %>% select(where(... ~ replace((.), (.) > 0, 1)))
Run `rlang::last_trace()` to see where the error occurred.
Warning messages:
1: Invalid metadata$r
2: Invalid metadata$r
frequency |>
mutate(across(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)] %in% c("clean_terms","clean_concepts") == FALSE] ~ replace(.x, .x > 0, 1)))
Error in `column_select()`:
! Formula shorthand must be wrapped in `where()`.
# Bad
data %>% select(... ~ replace(.x, .x > 0, 1))
# Good
data %>% select(where(... ~ replace(.x, .x > 0, 1)))
Run `rlang::last_trace()` to see where the error occurred.
Warning messages:
1: Invalid metadata$r
2: Invalid metadata$r
binary <- frequency |>
mutate(across(everything()), case_when(x. > 0 ~ 1, .x == 0 ~ .))
Error in map_lgl(args, ~inherits(., "Expression")) :
object 'x.' not found
In addition: Warning messages:
1: Invalid metadata$r
2: Invalid metadata$r
frequency |>
mutate(across(is.numeric, ~1 * (. != 0)))
Error: NotImplemented: Function 'multiply_checked' has no kernel matching input types (bool, bool)
In addition: Warning messages:
1: Invalid metadata$r
2: Invalid metadata$r
frequency |>
mutate(across(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)] %in% c("clean_terms","clean_concepts") == FALSE]), case_when(. > 0 ~ 1, .x == 0 ~ .))
frequency |>
mutate(across(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)] %in% c("clean_terms","clean_concepts") == FALSE]), ~ case_when(. > 0 ~ 1,. <= 0 ~ .))
Warning: In ~case_when(. > 0 ~ 1, . <= 0 ~ .) = ~case_when(. > 0 ~ 1, . <= 0 ~ .), only values of size one are recycled; pulling data into R
Error:
! Problem while computing `..2 = ~case_when(. > 0 ~ 1, . <= 0 ~ .)`.
✖ `..2` must be a vector, not a `formula` object.
Run `rlang::last_trace()` to see where the error occurred.
Warning messages:
1: Invalid metadata$r
2: Invalid metadata$r
3: Invalid metadata$r
4: Invalid metadata$r
I have found several wonderful examples that appear to work well in base R and tidyverse/dplyr, but not on arrow tables like the one I have here (which is what frequency is).
If you have any suggestions on what might be wrong with my call, I would greatly appreciate any and all feedback.
Update
I have also tried the following, unfortunately to no avail:
binary <-
  frequency |>
  mutate(across(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)] %in% c("clean_terms","clean_concepts") == FALSE], ~as.integer(. > 0)), . = 1) |>
  compute()
binary <-
  frequency |>
  mutate(across(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)] %in% c("clean_terms","clean_concepts") == FALSE], ~(. > 0)), . = 1) |>
  compute()
frequency |>
  mutate_at(
    vars(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)] %in% c("clean_terms","clean_concepts") == FALSE]),
    funs(case_when(
      . > 0 ~ 1,
      TRUE ~ .
    ))
  ) |>
  compute()
binary <-
  frequency |>
  mutate_at(.vars = vars(COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)][COI_tag$schema$names[grepl("concept_|term_", COI_tag$schema$names)] %in% c("clean_terms","clean_concepts") == FALSE]),
            .funs = funs(. = case_when(
              . > 0 ~ 1
            ))) |>
  compute()