Reshaping a Data Frame in R separating strings

Question

82 Views Asked by bvowe At 23 February 2024 at 14:56

I have this data frame in R:

library(dplyr)
library(tidyr)

df <- tibble(
  ID = 1,
  `Zebra fish (one)` = 3,
  `Zebra fish (two)` = 4,
  `Dog-caut (zero)` = 9,
  `Dog-caut (hello there)` = 12
)

I am trying to reshape it with the code below, but the TYPE column always comes up empty. How do I fix it?

# Reshaping the dataframe
long_df <- df %>%
  pivot_longer(
    cols = -ID, 
    names_to = "CATEGORY_TYPE", 
    values_to = "SCORE"
  ) %>%
  separate(CATEGORY_TYPE, into = c("CATEGORY", "TYPE"), sep = " \\(") %>%
  mutate(TYPE = sub("\\)", "", TYPE))

The resulting data frame should look like this:

ID, CATEGORY, TYPE, SCORE
1, Zebra fish, one, 3
1, Zebra fish, two, 4
1, Dog-caut, zero, 9
1, Dog-caut, hello there, 12

The string I wish to separate is in this format:

ID, 'Zebra fish (one)--Hello; blah'
1, 7

And I am hoping to put it in this format:

ID, CATEGORY, TYPE1, TYPE2, VALUE
1, Zebra fish, one, hello, 7

**jpsmith** · Answer 1 · 2024-02-23T15:00:14.263000

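This may be a good case for tidyr::separate_wider_delim(). The example assumes the column names have already been rewritten by data.frame() (for instance, `Zebra fish (one)` becomes `Zebra.fish..one.`), which is why the delimiter is "..":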

df %>%
  pivot_longer(
    cols = -ID, 
    names_to = "CATEGORY_TYPE", 
    values_to = "SCORE"
  ) %>%
  separate_wider_delim(CATEGORY_TYPE, delim = "..", names = c("CATEGORY", "TYPE"))

#     ID CATEGORY   TYPE         SCORE
#   <dbl> <chr>      <chr>        <dbl>
# 1     1 Zebra.fish one.             3
# 2     1 Zebra.fish two.             4
# 3     1 Dog.caut   zero.            9
# 4     1 Dog.caut   hello.there.    12

To clean up the remaining periods, add an extra mutate(). Note that applying it across every column also converts ID and SCORE to character:

df %>%
  pivot_longer(
    cols = -ID, 
    names_to = "CATEGORY_TYPE", 
    values_to = "SCORE"
  ) %>%
  separate_wider_delim(CATEGORY_TYPE, delim = "..", names = c("CATEGORY", "TYPE")) %>%
  mutate(across(everything(), ~trimws(gsub("\\.", " ", .x))))

#   ID    CATEGORY   TYPE        SCORE
#   <chr> <chr>      <chr>       <chr>
# 1 1     Zebra fish one         3    
# 2 1     Zebra fish two         4    
# 3 1     Dog caut   zero        9    
# 4 1     Dog caut   hello there 12

For the updated data you provided, you could try:

have <- data.frame(ID = 1, x = c("Zebra fish (one)--Hello; blah"), VALUE = 7)

have$x <- gsub("(.*);.*", "\\1", have$x) # remove everything after ";"

have %>%
  separate_wider_delim(x, delim = " (", names = c("CATEGORY", "TYPE1")) %>%
  separate_wider_delim(TYPE1, delim = ")--", names = c("TYPE1", "TYPE2")) 

#      ID CATEGORY   TYPE1 TYPE2 VALUE
#   <dbl> <chr>      <chr> <chr> <dbl>
# 1     1 Zebra fish one   Hello     7

**stefan_aus_hannover** · Answer 2 · 2024-02-23T15:04:45.327000

The parentheses are no longer in your column names by the time you pivot. You need to account for this before pivoting, or split on the double period instead:

long_df <- df %>%
  pivot_longer(
    cols = -ID, 
    names_to = "CATEGORY_TYPE", 
    values_to = "SCORE"
  ) %>%
  separate(CATEGORY_TYPE, into = c("CATEGORY", "TYPE"), sep = "\\.\\.") %>%
  mutate(across(everything(), ~ gsub("\\."," ",.x))) %>%
  mutate(TYPE=gsub(" $","",TYPE)) %>%
  separate(TYPE, into = c("TYPE1", "TYPE2"), sep = " ")
long_df[is.na(long_df)] <- ''
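
One way to check the point about the missing parentheses is to look at what base R does to such names. The sketch below assumes the data was at some stage built with data.frame() under its default check.names = TRUE (a tibble() call like the one in the question would keep the names unchanged). `raw_names`, `mangled`, and `df_raw` are illustrative names, not taken from the question.

```r
# Column names as they are written in the question
raw_names <- c("Zebra fish (one)", "Dog-caut (hello there)")

# data.frame() passes names through make.names() when check.names = TRUE,
# replacing every character that is invalid in a syntactic name with "."
mangled <- make.names(raw_names)
print(mangled)

# Passing check.names = FALSE keeps the original names, parentheses included
df_raw <- data.frame(ID = 1, `Zebra fish (one)` = 3, check.names = FALSE)
print(names(df_raw))
```

If the names are kept intact this way, a separator such as " \\(" has something to match, and no clean-up of periods is needed.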

**Onyambu** · Answer 3 · 2024-02-23T15:42:09.263000

df %>%
   pivot_longer(-ID, names_to = c('Category', 'Type'), 
                names_pattern = "([^(]+) \\(([^)]+)", 
                values_to = 'Score')

# A tibble: 4 × 4
     ID Category   Type        Score
  <dbl> <chr>      <chr>       <dbl>
1     1 Zebra fish one             3
2     1 Zebra fish two             4
3     1 Dog-caut   zero            9
4     1 Dog-caut   hello there    12
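
As a rough illustration of what the two capture groups in names_pattern pick up, the same regular expression can be tried on the raw column names with base R. This is only a sketch: `col_names`, `pattern`, `matches`, and `pieces` are names introduced here, and pivot_longer() does its own matching internally.

```r
# Two of the column names from the question
col_names <- c("Zebra fish (one)", "Dog-caut (hello there)")

# Group 1: everything before " (" ; group 2: everything up to the closing ")"
pattern <- "([^(]+) \\(([^)]+)"

# regexec() returns the full match followed by each capture group
matches <- regmatches(col_names, regexec(pattern, col_names))

# Element 1 of each match is the full match, so the groups are at 2 and 3
pieces <- data.frame(
  Category = vapply(matches, function(m) m[2], character(1)),
  Type = vapply(matches, function(m) m[3], character(1)),
  stringsAsFactors = FALSE
)
print(pieces)
```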

**Greg** · Answer 4 · 2024-02-23T20:43:03.137000

Solution

Here's a one-liner that leverages the names_pattern argument to pivot_longer():

library(dplyr)
library(tidyr)


# ...
# Code to generate your data.
# ...


df %>% pivot_longer(!ID,
  values_to = "SCORE",
  names_to = c("name", "CATEGORY", "TYPE1", NA, "TYPE2", "suffix"),
  names_pattern = "((.*) \\((.*)\\)(--(.*); (.*))?)"
)
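
To see how the six capture groups line up with the six entries of names_to, the pattern can be applied to a single name with base R. This is a minimal sketch, assuming one name in the extended format; `nm`, `pattern`, and `groups` are illustrative names, and the label "dropped" stands in for the NA entry that pivot_longer() discards.

```r
# One column name in the extended "--...; ..." format
nm <- "Zebra fish (one)--hello; blah"

# Same pattern as above: an outer group, CATEGORY, TYPE1, and an optional tail
pattern <- "((.*) \\((.*)\\)(--(.*); (.*))?)"

# Element 1 is the full match; capture group k is at position k + 1
groups <- regmatches(nm, regexec(pattern, nm))[[1]]
names(groups) <- c("match", "name", "CATEGORY", "TYPE1",
                   "dropped", "TYPE2", "suffix")
print(groups)
```

The fourth group exists only to make the "--...; ..." tail optional, which is why its slot in names_to is NA.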

Results

Given a df like your sample...

df <- tibble(
  ID = 1,
  `Zebra fish (one)` = 3,
  `Zebra fish (two)` = 4,
  `Dog-caut (zero)` = 9,
  `Dog-caut (hello there)` = 12
)

...we get the following output:

# A tibble: 4 × 7
     ID name                   CATEGORY   TYPE1       TYPE2 suffix SCORE
  <dbl> <chr>                  <chr>      <chr>       <chr> <chr>  <dbl>
1     1 Zebra fish (one)       Zebra fish one         ""    ""         3
2     1 Zebra fish (two)       Zebra fish two         ""    ""         4
3     1 Dog-caut (zero)        Dog-caut   zero        ""    ""         9
4     1 Dog-caut (hello there) Dog-caut   hello there ""    ""        12

But this will also work for a df...

df <- tibble(
  ID = 1,
  `Zebra fish (one)--hello; blah` = 3,
  `Zebra fish (two)--hi; foo` = 4,
  `Dog-caut (zero)--yo; bar` = 9,
  `Dog-caut (hello there)--greetings; baz` = 12
)

...that looks like your second suggestion...

The string I wish to separate is in this format
ID, 'Zebra fish (one)--Hello; blah'
1, 7

...where it yields the following output:

# A tibble: 4 × 7
     ID name                                   CATEGORY   TYPE1       TYPE2     suffix SCORE
  <dbl> <chr>                                  <chr>      <chr>       <chr>     <chr>  <dbl>
1     1 Zebra fish (one)--hello; blah          Zebra fish one         hello     blah       3
2     1 Zebra fish (two)--hi; foo              Zebra fish two         hi        foo        4
3     1 Dog-caut (zero)--yo; bar               Dog-caut   zero        yo        bar        9
4     1 Dog-caut (hello there)--greetings; baz Dog-caut   hello there greetings baz       12
