Extract integers from character strings in R


I am working in R and want to extract just the numbers from df1. For example:

df1 <- data.frame(
  column1 = c("Any[12, 15, 20]", "Any[22, 23, 30]"),
  column2 = c("Any[4, 17]", "Any[]"),
  stringsAsFactors = FALSE
)

I want a new data frame that takes the integers within the brackets, multiplies each one by its row number, and keeps the name of the column it came from.

e.g. new_df could look like

Time Channel
12 column1
15 column1
20 column1
44 column1
46 column1
60 column1
4 column2
17 column2

I do not need to preserve any NA values, e.g. when Any[] is empty. Does anyone know whether this is possible? I have ENORMOUS amounts of data in this format, so I cannot really do much manually. Cheers!

I already tried: new_df$Time <- as.integer(df1$column1) and that just gave blanks.

I also tried:

new_df$Time <- str_extract_all(new_df$Time, "\\d+") %>%
  lapply(function(x) as.integer(x)) %>%
  sapply(function(x) ifelse(length(x) > 0, x, NA))

which then returned only the first integer within each bracket, e.g.

Time Channel
12 column1
44 column1
4 column2
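
Why the second attempt keeps only the first integer: ifelse() returns a result with the same length as its test, and length(x) > 0 is a single TRUE, so only the first element of x survives. A minimal base R sketch of the difference, using a made-up vector x in place of the integers extracted from one cell:

```r
# Hypothetical vector standing in for the integers extracted from one cell
x <- c(12L, 15L, 20L)

# ifelse() is vectorised over its test; a length-1 test gives a length-1 result
first_only <- ifelse(length(x) > 0, x, NA)

# A plain if/else returns the whole vector
all_values <- if (length(x) > 0) x else NA_integer_

print(first_only)  # 12
print(all_values)  # 12 15 20
```

So the extraction step itself was fine; the values were lost in the ifelse() call.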

LMc On BEST ANSWER
library(dplyr)
library(purrr)
library(stringr)
library(tidyr)

df1 |>
  # Pull every number out of each cell and multiply it by the row index (.y)
  mutate(across(everything(), \(x) imap(str_extract_all(x, "\\d+"), ~ as.numeric(.x) * .y))) |>
  # Column names go to Channel, the lists of numbers go to Time
  pivot_longer(everything(), cols_vary = "slowest", names_to = "Channel", values_to = "Time") |>
  # One row per number; empty cells such as "Any[]" drop out
  unnest_longer(Time)

How it works

This is similar to the approach you took in your initial attempt, except that I am using purrr::imap instead of lapply. The advantage is that imap gives you access to each list element's name or, for an unnamed list like this one, its position (.y), in addition to the element itself (.x). Here the position is the row number, which makes the multiplication step simple.

str_extract_all extracts all the numbers from a column and returns them in a list:

str_extract_all(df1$column1, "\\d+")
[[1]]
[1] "12" "15" "20"

[[2]]
[1] "22" "23" "30"

imap iterates over this list and does the multiplication:

imap(str_extract_all(df1$column1, "\\d+"), ~ as.numeric(.x) * .y)
[[1]]
[1] 12 15 20

[[2]]
[1] 44 46 60

The remaining two steps reshape the data: pivot_longer stacks the columns into name/value pairs, and unnest_longer expands each list of numbers into one row per number.

Output

  Channel  Time
  <chr>   <dbl>
1 column1    12
2 column1    15
3 column1    20
4 column1    44
5 column1    46
6 column1    60
7 column2     4
8 column2    17
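
For readers without purrr, the index-aware iteration that imap performs can be approximated in base R with Map() over the list and seq_along(). This is only a sketch under that assumption, not part of the answer's code; the names column1, digits and scaled are illustrative:

```r
# Same two example strings as column1 in the question
column1 <- c("Any[12, 15, 20]", "Any[22, 23, 30]")

# Base R counterpart of str_extract_all(): one character vector per cell
digits <- regmatches(column1, gregexpr("[0-9]+", column1))

# Pair each element with its position, as imap does with .x and .y
scaled <- Map(function(x, i) as.numeric(x) * i, digits, seq_along(digits))

print(scaled)
```

The second list element is multiplied by 2 because it sits at position 2, i.e. it came from row 2.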
Gregor Thomas On

This should work. Note that parse_number will issue a warning for rows with no numbers. You could wrap it in suppressWarnings() to silence it.

library(dplyr)
library(tidyr)
library(readr)
df1 |>
  mutate(rn = row_number()) |>
  pivot_longer(-rn, names_to = "channel", values_to = "time") |>
  separate_longer_delim(time, delim = ",") |>
  mutate(time = parse_number(time) * rn) |>
  arrange(channel, rn) |>
  select(-rn) |>
  filter(!is.na(time))
# # A tibble: 8 × 2
#   channel  time
#   <chr>   <dbl>
# 1 column1    12
# 2 column1    15
# 3 column1    20
# 4 column1    44
# 5 column1    46
# 6 column1    60
# 7 column2     4
# 8 column2    17
# Warning message:
# There was 1 warning in `mutate()`.
# ℹ In argument: `time = parse_number(time)`.
# Caused by warning:
# ! 1 parsing failure.
# row col expected actual
#   9  -- a number  Any[] 
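
The warning comes from the empty cell "Any[]", which has no number to parse. As a rough sketch of the suppressWarnings() pattern mentioned above, the snippet below uses base R's as.numeric() as a stand-in for parse_number() (an assumption, so that the example runs without readr); the vector cells is made up for illustration:

```r
# Stand-in for parse_number(): as.numeric() also warns when it cannot parse
cells <- c("12", "Any[]")

# Unwrapped, the conversion raises a warning (captured here as TRUE)
warned <- tryCatch({ as.numeric(cells); FALSE }, warning = function(w) TRUE)

# Wrapped, the same call quietly returns NA for the unparseable cell
quiet <- suppressWarnings(as.numeric(cells))

# Drop the NA, as the filter(!is.na(time)) step does
kept <- quiet[!is.na(quiet)]

print(warned)  # TRUE
print(kept)    # 12
```

Silencing warnings this way hides every warning from the wrapped call, so it is best kept around the single parsing expression rather than the whole pipeline.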
SamR On

Here is a base R solution. First create a list of values. This regex either retrieves the numbers and commas within "Any[]", or it replaces an empty "Any[]" with a blank string.

df_values <- lapply(df1, \(col) gsub("Any\\[(.+)\\]|Any\\[\\]", "\\1", col))

We can then iterate over that list and its names using Map() to get the data in the format that you need, and multiply by the row number.

df_long <- Map(\(x, nm) {
    time <- strsplit(x, ",") |>
        lapply(\(x) as.integer(trimws(x)))

    # Multiply by row number
    time_mult <- unlist(time) * rep(seq_along(time), lengths(time))

    data.frame(
        Time = time_mult,
        Channel = nm
    )
}, df_values, names(df_values)) |>
    do.call(rbind, args = _)

print(df_long, row.names = FALSE)

#  Time Channel
#    12 column1
#    15 column1
#    20 column1
#    44 column1
#    46 column1
#    60 column1
#     4 column2
#    17 column2
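
The row-number step relies on rep() repeating each row index once per extracted value. A small sketch of that idea on its own, with a hand-written list time standing in for one parsed column (illustrative data, including an empty cell). It uses seq_along() to generate the row indices, which also behaves correctly when there is only one row:

```r
# Illustrative parsed column: row 1 has two values, row 2 is empty, row 3 has one
time <- list(c(4L, 17L), integer(0), 5L)

# Repeat each row index once per value found in that row
row_index <- rep(seq_along(time), lengths(time))

# Flatten the values and multiply element-wise by their row index
time_mult <- unlist(time) * row_index

print(row_index)  # 1 1 3
print(time_mult)  # 4 17 15
```

The empty row contributes no values and no indices, so nothing has to be filtered out afterwards.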
s_baldur On

Another base R solution:

extract_ints <- function(df) {
  extract_cell_ints <- function(x, multiplier) {
    as.integer(regmatches(x, gregexpr("[0-9]+", x))[[1L]]) * multiplier
  }
  raw_ints <- lapply(
    df,
    # Using row number as multiplier
    \(cl) Map(extract_cell_ints, x = cl, multiplier = seq_along(cl))
  )
  data.frame(
    Time    = unlist(raw_ints, use.names = FALSE), 
    Channel = rep(names(df), sapply(raw_ints, \(x) sum(lengths(x))))
  )
}

 
extract_ints(df1)

#   Time Channel
# 1   12 column1
# 2   15 column1
# 3   20 column1
# 4   44 column1
# 5   46 column1
# 6   60 column1
# 7    4 column2
# 8   17 column2
ThomasIsCoding On

Try the following base R solution, using str2lang + stack

setNames(
    stack(
        lapply(
            df1,
            \(s) eval(str2lang(sprintf(
                "c(%s)",
                toString(paste0(seq_along(s), "*", gsub("Any\\[(.*?)\\]", "c(\\1)", s)))
            )))
        )
    ),
    c("Time", "Channel")
)

which gives

  Time Channel
1   12 column1
2   15 column1
3   20 column1
4   44 column1
5   46 column1
6   60 column1
7    4 column2
8   17 column2
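
To see what the str2lang call is evaluating, it can help to build the expression text for a single column step by step. A minimal sketch using the column1 strings from the question; the intermediate names s, vec_calls, expr_text and values are illustrative:

```r
s <- c("Any[12, 15, 20]", "Any[22, 23, 30]")

# Turn each cell into the text of a c(...) call
vec_calls <- gsub("Any\\[(.*?)\\]", "c(\\1)", s)

# Prefix each call with its row number and wrap everything in one outer c(...)
expr_text <- sprintf("c(%s)", toString(paste0(seq_along(s), "*", vec_calls)))
print(expr_text)  # "c(1*c(12, 15, 20), 2*c(22, 23, 30))"

# Parse the text into a call and evaluate it
values <- eval(str2lang(expr_text))
print(values)  # 12 15 20 44 46 60
```

An empty cell becomes 2*c(), which evaluates to a zero-length vector and so adds nothing to the result. Because eval() runs whatever text the cells contain, this approach is only suitable for data you trust.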