extract integers from characters in R

Question

105 Views Asked by cmh At 03 January 2024 at 15:57

I am working in R and want to extract just the numbers from df1. For example, I have:

df1 <- data.frame(
  column1 = c("Any[12, 15, 20]", "Any[22, 23, 30]"),
  column2 = c("Any[4, 17]", "Any[]"),
  stringsAsFactors = FALSE
)

I want a new data frame that takes the integers within the brackets, multiplies them by the row number, and keeps the name of the column they came from.

e.g. new_df could look like

Time	Channel
12	column1
15	column1
20	column1
44	column1
46	column1
60	column1
8	column2
34	column2

I do not need to preserve any NA values, e.g. where Any[] is empty. Does anyone know whether this is possible? I have ENORMOUS amounts of data in this format, so I cannot really do much manually. Cheers!

I already tried: new_df$Time <- as.integer(df1$column1) and that just gave blanks.

I also tried:

new_df$Time <- str_extract_all(new_df$Time, "\\d+") %>%
  lapply(function(x) as.integer(x)) %>%
  sapply(function(x) ifelse(length(x) > 0, x, NA))

which then returned only the first integer within each bracket, e.g.

Time	Channel
12	column1
44	column1
8	column2

Gregor Thomas On 03 January 2024 at 16:07

This should work. Note that parse_number() will issue a warning for rows with no numbers. You could wrap it in suppressWarnings() to silence it.

library(dplyr)
library(tidyr)
library(readr)
df1 |>
  mutate(rn = row_number()) |>
  pivot_longer(-rn, names_to = "channel", values_to = "time") |>
  separate_longer_delim(time, delim = ",") |>
  mutate(time = parse_number(time) * rn) |>
  arrange(channel, rn) |>
  select(-rn) |>
  filter(!is.na(time))
# # A tibble: 8 × 2
#   channel  time
#   <chr>   <dbl>
# 1 column1    12
# 2 column1    15
# 3 column1    20
# 4 column1    44
# 5 column1    46
# 6 column1    60
# 7 column2     4
# 8 column2    17
# Warning message:
# There was 1 warning in `mutate()`.
# ℹ In argument: `time = parse_number(time)`.
# Caused by warning:
# ! 1 parsing failure.
# row col expected actual
#   9  -- a number  Any[]

SamR On 03 January 2024 at 16:20

Here is a base R solution. First create a list of values. This regex either retrieves the numbers and commas within "Any[]", or it replaces an empty "Any[]" with a blank string.

df_values <- lapply(df1, \(col) gsub("Any\\[(.+)\\]|Any\\[\\]", "\\1", col))
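
As a quick check of what that pattern returns, here is a minimal sketch on two made-up cells (the names cells and stripped are illustrative, not part of the answer). A non-empty Any[...] should keep only the text between the brackets, while an empty Any[] should become an empty string.

```r
# Hypothetical sample cells: one non-empty, one empty
cells <- c("Any[12, 15, 20]", "Any[]")

# First alternative captures the bracket contents in group 1;
# second alternative matches an empty "Any[]", leaving group 1 empty
stripped <- gsub("Any\\[(.+)\\]|Any\\[\\]", "\\1", cells)
print(stripped)
```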

We can then iterate over that list and its names using Map() to get the data in the format that you need, and multiply by the row number.

df_long <- Map(\(x, nm) {
    time <- strsplit(x, ",") |>
        lapply(\(x) as.integer(trimws(x)))

    # Multiply by row number: repeat each row index once per value in that row
    time_mult <- unlist(time) * rep(seq_along(time), lengths(time))

    data.frame(
        Time = time_mult,
        Channel = nm
    )
}, df_values, names(df_values)) |>
    do.call(rbind, args = _)

print(df_long, row.names = FALSE)

#  Time Channel
#    12 column1
#    15 column1
#    20 column1
#    44 column1
#    46 column1
#    60 column1
#     4 column2
#    17 column2
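
The row-number multiplier inside the Map() step can be looked at in isolation. This is a minimal sketch with hand-written values standing in for one parsed column (the name row_id is illustrative); it uses seq_along(time) for the row index so that a column with a single row is also handled.

```r
# Hypothetical parsed values: one integer vector per row of a column
time <- list(c(12L, 15L, 20L), c(22L, 23L, 30L))

# Repeat each row index once per value found in that row
row_id <- rep(seq_along(time), lengths(time))
print(row_id)

# Flatten the list and multiply element-wise by the matching row index
time_mult <- unlist(time) * row_id
print(time_mult)
```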

s_baldur On 03 January 2024 at 16:36

Another base R solution:

extract_ints <- function(df) {
  extract_cell_ints <- function(x, multiplier) {
    as.integer(regmatches(x, gregexpr("[0-9]+", x))[[1L]]) * multiplier
  }
  raw_ints <- lapply(
    df,
    # Using row number as multiplier
    \(cl) Map(extract_cell_ints, x = cl, multiplier = seq_along(cl))
  )
  data.frame(
    Time    = unlist(raw_ints, use.names = FALSE), 
    Channel = rep(names(df), sapply(raw_ints, \(x) sum(lengths(x))))
  )
}

 
extract_ints(df1)

#   Time Channel
# 1   12 column1
# 2   15 column1
# 3   20 column1
# 4   44 column1
# 5   46 column1
# 6   60 column1
# 7    4 column2
# 8   17 column2
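
To see what the inner helper does for a single cell, here is a minimal sketch (the names cell, row_number, scaled and empty are illustrative). gregexpr() locates every run of digits and regmatches() returns them as strings, which are then converted and scaled.

```r
cell <- "Any[22, 23, 30]"
row_number <- 2L

# Pull out every run of digits as character strings
digits <- regmatches(cell, gregexpr("[0-9]+", cell))[[1L]]

# Convert to integer and multiply by the row number
scaled <- as.integer(digits) * row_number
print(scaled)

# An empty cell has no matches, so it contributes a zero-length vector
empty <- regmatches("Any[]", gregexpr("[0-9]+", "Any[]"))[[1L]]
print(length(empty))
```

Because empty cells yield zero-length vectors, they drop out on their own when the results are unlisted, so no NA filtering is needed.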

ThomasIsCoding On 03 January 2024 at 21:38

Try the following base R solution, using str2lang + stack

setNames(
    stack(
        lapply(
            df1,
            \(s) eval(str2lang(sprintf(
                "c(%s)",
                toString(paste0(seq_along(s), "*", gsub("Any\\[(.*?)\\]", "c(\\1)", s)))
            )))
        )
    ),
    c("Time", "Channel")
)

which gives

  Time Channel
1   12 column1
2   15 column1
3   20 column1
4   44 column1
5   46 column1
6   60 column1
7    4 column2
8   17 column2
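
It may help to look at the text that gets built before it is evaluated. This is a minimal sketch for one column (the names s, pieces, expr_text and values are illustrative): each cell becomes "<row>*c(...)", the pieces are joined into a single c(...) call, and str2lang() plus eval() turn that text into numbers.

```r
s <- c("Any[4, 17]", "Any[]")

# Turn each cell into "<row>*c(...)"
pieces <- paste0(seq_along(s), "*", gsub("Any\\[(.*?)\\]", "c(\\1)", s))

# Join the pieces with ", " and wrap them in one outer c(...)
expr_text <- sprintf("c(%s)", toString(pieces))
print(expr_text)

# Parse the text into a call and evaluate it
values <- eval(str2lang(expr_text))
print(values)
```

Since the cell text is evaluated as R code, this approach is best kept to data whose format you trust.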

LMc On 03 January 2024 at 19:54 (accepted answer)

library(dplyr)
library(purrr)
library(stringr)
library(tidyr)

df1 |>
  mutate(across(everything(), \(x) imap(str_extract_all(x, "\\d+"), ~ as.numeric(.x) * .y))) |>
  pivot_longer(everything(), cols_vary = "slowest", names_to = "Channel", values_to = "Time") |>
  unnest_longer(Time)

How it works

This is similar to the approach you took in your initial attempt, except that I am using purrr::imap instead of lapply. The advantage is that imap gives you access to the list element's name or, for an unnamed list like this one, its position (.y), which is the row number in this case, in addition to the list element itself (.x). This makes the multiplication step simple.

str_extract_all extracts all the numbers from a column and outputs those numbers in a list:

str_extract_all(df1$column1, "\\d+")
[[1]]
[1] "12" "15" "20"

[[2]]
[1] "22" "23" "30"

imap iterates over this list and does the multiplication:

imap(str_extract_all(df1$column1, "\\d+"), ~ as.numeric(.x) * .y)
[[1]]
[1] 12 15 20

[[2]]
[1] 44 46 60

The remaining two pipes reshape the data: pivot_longer puts the column names in Channel and the list of values in Time, and unnest_longer expands each list into one row per value.

Output

  Channel  Time
  <chr>   <dbl>
1 column1    12
2 column1    15
3 column1    20
4 column1    44
5 column1    46
6 column1    60
7 column2     4
8 column2    17
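
As a side note on the attempt in the question: assuming its final step used ifelse(), it would return only the first integer per cell, because ifelse() returns a result the same length as its test and length(x) > 0 is a single value. A minimal base R sketch of the difference (the names first_only and all_values are illustrative):

```r
x <- c(12L, 15L, 20L)

# The test has length 1, so ifelse() returns only the first element of x
first_only <- ifelse(length(x) > 0, x, NA)
print(first_only)

# A plain if/else returns the whole vector
all_values <- if (length(x) > 0) x else NA_integer_
print(all_values)
```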
