I am trying to convert my dplyr code to data.table for faster processing, but I can't translate my `ifelse(any(startsWith...` statements. Whatever I try, the result goes to one extreme or the other (every row matches or none does), or, in the case of `Tag`, I get an error saying it doesn't exist. Maybe the problem is with `rowwise()`, but I can't figure it out. Thanks in advance!
Here's my dplyr code:
    df <- df %>%
      rowwise() %>%
      mutate(
        'Position' = coalesce(
          ifelse(any(c_across(starts_with("Tag")) == "goalkeeper"), "Goalkeeper", NA),
          ifelse(any(c_across(starts_with("Tag")) == "striker"), "Striker", NA),
        ),
        Favorite = ifelse(any(c_across(starts_with("Tag")) == "favorite"), TRUE, FALSE),
        across(starts_with("Tag"), ~ifelse(. %in% c("goalkeeper", "striker", "favorite"), NA_character_, .))
      )
My data.table attempts:
    df[, `Position` := coalesce(
      ifelse(any(startsWith(Tag, "goalkeeper")), "Goalkeeper", NA_character_), #tried this
      ifelse(grepl("striker", "^Tag"), "Striker", NA_character_), #and this
    )]
    df[, Favorite := any(startsWith(Tag1, "favorite"))]
    df[, (grep("Tag", names(df), value = TRUE)) :=
      lapply(.SD, function(x) ifelse(x %in% c("goalkeeper", "striker", "favorite"), NA_character_, x)),
      .SDcols = patterns("Tag")]
Data:
| Name | Tag1 | Tag2 | Tag3 |
|---|---|---|---|
| A | goalkeeper | NA | NA |
| B | NA | striker | favorite |
Expected output:
| Name | Position | Favorite |
|---|---|---|
| A | Goalkeeper | FALSE |
| B | Striker | TRUE |
Since you're doing multi-column snapshots row-wise, I don't know that there are awesome ways to go about it, but perhaps this is sufficient?
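A minimal sketch of a row-wise `apply` approach, assuming the sample data from the question. The helper `pick_position` and the variable `tag_cols` are hypothetical names introduced here for illustration, not part of data.table.

```r
library(data.table)

# Sample data from the question
df <- data.table(
  Name = c("A", "B"),
  Tag1 = c("goalkeeper", NA),
  Tag2 = c(NA, "striker"),
  Tag3 = c(NA, "favorite")
)

# Hypothetical helper: first matching position for one row of tags
pick_position <- function(z) {
  if ("goalkeeper" %in% z) "Goalkeeper"
  else if ("striker" %in% z) "Striker"
  else NA_character_
}

tag_cols <- grep("^Tag", names(df), value = TRUE)

# apply() converts .SD (only the Tag columns) to a matrix and walks its rows
df[, c("Position", "Favorite") := list(
  apply(.SD, 1, pick_position),
  apply(.SD, 1, function(z) "favorite" %in% z)
), .SDcols = tag_cols]

df[, .(Name, Position, Favorite)]
```

Using `%in%` rather than `==` keeps the `NA` tags from propagating into the result, so `Favorite` comes out as `FALSE` instead of `NA` for rows without the tag.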
(And you can easily remove the tags.)
The use of `apply` is a little costly in that it causes the frame (`.SD`, which in this case is just the `Tag#` columns) to be converted to a `matrix` internally. It's because of this conversion that the use of `apply` in the context of frame rows can be expensive, rightfully so.
An alternative:
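One way to avoid the matrix conversion is to build the per-row vectors with `Map`, sketched here under the same assumptions about the data. Again, `pick_position` and `tag_cols` are hypothetical names used only for this example.

```r
library(data.table)

# Sample data from the question
df <- data.table(
  Name = c("A", "B"),
  Tag1 = c("goalkeeper", NA),
  Tag2 = c(NA, "striker"),
  Tag3 = c(NA, "favorite")
)

# Hypothetical helper: first matching position for one row of tags
pick_position <- function(z) {
  if ("goalkeeper" %in% z) "Goalkeeper"
  else if ("striker" %in% z) "Striker"
  else NA_character_
}

tag_cols <- grep("^Tag", names(df), value = TRUE)

df[, c("Position", "Favorite") := {
  # Equivalent to Map(c, Tag1, Tag2, Tag3): one character vector per row,
  # built straight from the columns with no intermediate matrix
  rows <- do.call(Map, c(list(f = c), unname(as.list(.SD))))
  list(
    vapply(rows, pick_position, character(1), USE.NAMES = FALSE),
    vapply(rows, function(z) "favorite" %in% z, logical(1), USE.NAMES = FALSE)
  )
}, .SDcols = tag_cols]

df[, .(Name, Position, Favorite)]
```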
The two perform at somewhat the same speed (`median`, `` `itr/sec` ``) but the first has a lower `mem_alloc`, perhaps suggesting that it may be better for larger data. But don't be too hasty benchmarking on small data ...
Expanding it to be a larger dataset, we get these benchmarking results:
The `mem_alloc` is lower for the second (`Map`) implementation, though `median` and `` `itr/sec` `` are a little slower. I don't know which is better in your case.
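To check this on your own data, a comparison along these lines could be run. This is a sketch that assumes the `bench` package is installed (it is skipped otherwise); the function names `via_apply` and `via_map`, the helper `pick_position`, and the replication count of 5000 are all arbitrary choices for illustration.

```r
library(data.table)

# Sample data from the question, repeated to get a larger frame
df <- data.table(
  Name = c("A", "B"),
  Tag1 = c("goalkeeper", NA),
  Tag2 = c(NA, "striker"),
  Tag3 = c(NA, "favorite")
)
big <- rbindlist(replicate(5000, df, simplify = FALSE))
tag_cols <- grep("^Tag", names(big), value = TRUE)

# Hypothetical helper: first matching position for one row of tags
pick_position <- function(z) {
  if ("goalkeeper" %in% z) "Goalkeeper"
  else if ("striker" %in% z) "Striker"
  else NA_character_
}

# Row-wise via apply(): converts .SD to a matrix first
via_apply <- function(dt) {
  dt[, list(
    Position = apply(.SD, 1, pick_position),
    Favorite = apply(.SD, 1, function(z) "favorite" %in% z)
  ), .SDcols = tag_cols]
}

# Row-wise via Map(): builds one vector per row directly from the columns
via_map <- function(dt) {
  dt[, {
    rows <- do.call(Map, c(list(f = c), unname(as.list(.SD))))
    list(
      Position = vapply(rows, pick_position, character(1), USE.NAMES = FALSE),
      Favorite = vapply(rows, function(z) "favorite" %in% z, logical(1), USE.NAMES = FALSE)
    )
  }, .SDcols = tag_cols]
}

# Confirm both give the same answer before timing them
same_result <- isTRUE(all.equal(via_apply(big), via_map(big)))

if (requireNamespace("bench", quietly = TRUE)) {
  print(bench::mark(apply = via_apply(big), Map = via_map(big), check = FALSE))
}
```

The timings and memory figures depend on the number of rows and tag columns, so it is worth running this at a size close to your real data.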