Reshaping a data frame in R by separating strings


I have this dataframe in R

library(dplyr)
library(tidyr)


df <- tibble(
  ID = 1,
  `Zebra fish (one)` = 3,
  `Zebra fish (two)` = 4,
  `Dog-caut (zero)` = 9,
  `Dog-caut (hello there)` = 12
)

I am trying to reshape it into the layout shown further down, but the TYPE column always comes up empty. How do I fix it?

# Reshaping the dataframe
long_df <- df %>%
  pivot_longer(
    cols = -ID, 
    names_to = "CATEGORY_TYPE", 
    values_to = "SCORE"
  ) %>%
  separate(CATEGORY_TYPE, into = c("CATEGORY", "TYPE"), sep = " \\(") %>%
  mutate(TYPE = sub("\\)", "", TYPE))

The data frame should look like this:

ID, CATEGORY, TYPE, SCORE
1, Zebra fish, one, 3
1, Zebra fish, two, 4
1, Dog-caut, zero, 9
1, Dog-caut, hello there, 12

The string I wish to separate is in this format

ID, 'Zebra fish (one)--Hello; blah'
1, 7

And I am hoping to put it in this format:

ID, CATEGORY, TYPE1, TYPE2, VALUE
1, Zebra fish, one, hello, 7


5
jpsmith On

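This may be a good example of when to use tidyr::separate_wider_delim. Note that this assumes the column names have been converted to syntactic names such as Zebra.fish..one. (spaces, hyphens and parentheses replaced by periods), as the output below shows: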

df %>%
  pivot_longer(
    cols = -ID, 
    names_to = "CATEGORY_TYPE", 
    values_to = "SCORE"
  ) %>%
  separate_wider_delim(CATEGORY_TYPE, delim = "..", names = c("CATEGORY", "TYPE"))

#     ID CATEGORY   TYPE         SCORE
#   <dbl> <chr>      <chr>        <dbl>
# 1     1 Zebra.fish one.             3
# 2     1 Zebra.fish two.             4
# 3     1 Dog.caut   zero.            9
# 4     1 Dog.caut   hello.there.    12

If you want to clean it up by removing the periods, add an extra mutate (note that across(everything()) also converts ID and SCORE to character, as the output shows):

df %>%
  pivot_longer(
    cols = -ID, 
    names_to = "CATEGORY_TYPE", 
    values_to = "SCORE"
  ) %>%
  separate_wider_delim(CATEGORY_TYPE, delim = "..", names = c("CATEGORY", "TYPE")) %>%
  mutate(across(everything(), ~trimws(gsub("\\.", " ", .x))))

#   ID    CATEGORY   TYPE        SCORE
#   <chr> <chr>      <chr>       <chr>
# 1 1     Zebra fish one         3    
# 2 1     Zebra fish two         4    
# 3 1     Dog caut   zero        9    
# 4 1     Dog caut   hello there 12   

For the updated data you provided, you could try:

have <- data.frame(ID = 1, x = c("Zebra fish (one)--Hello; blah"), VALUE = 7)

have$x <- gsub("(.*);.*", "\\1", have$x) # remove everything after ";"

have %>%
  separate_wider_delim(x, delim = " (", names = c("CATEGORY", "TYPE1")) %>%
  separate_wider_delim(TYPE1, delim = ")--", names = c("TYPE1", "TYPE2")) 

#      ID CATEGORY   TYPE1 TYPE2 VALUE
#   <dbl> <chr>      <chr> <chr> <dbl>
# 1     1 Zebra fish one   Hello     7
2
stefan_aus_hannover On

The parentheses are no longer in your column names by the time you pivot: they have been replaced by periods. You need to account for this before pivoting, or split on the double period instead:

long_df <- df %>%
  pivot_longer(
    cols = -ID, 
    names_to = "CATEGORY_TYPE", 
    values_to = "SCORE"
  ) %>%
  separate(CATEGORY_TYPE, into = c("CATEGORY", "TYPE"), sep = "\\.\\.") %>%
  mutate(across(everything(), ~ gsub("\\."," ",.x))) %>%
  mutate(TYPE=gsub(" $","",TYPE)) %>%
  separate(TYPE, into = c("TYPE1", "TYPE2"), sep = " ")
long_df[is.na(long_df)] <- ''
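
A minimal base R sketch of where the periods probably come from, assuming the data was built with data.frame() rather than tibble(). By default data.frame() runs make.names() on the column names (check.names = TRUE), which replaces spaces and parentheses with periods. The objects mangled and kept are illustrative names, not part of the question.

```r
# Default behaviour: check.names = TRUE rewrites non-syntactic names
mangled <- data.frame(ID = 1, `Zebra fish (one)` = 3)
names(mangled)

# Opting out keeps the original names, parentheses included
kept <- data.frame(ID = 1, `Zebra fish (one)` = 3, check.names = FALSE)
names(kept)
```

With the names kept intact (or with tibble(), which does not rewrite them), splitting on " \\(" as in the question has something to match.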
1
Onyambu On
df %>%
   pivot_longer(-ID, names_to = c('Category', 'Type'), 
                names_pattern = "([^(]+) \\(([^)]+)", 
                values_to = 'Score')

# A tibble: 4 × 4
     ID Category   Type        Score
  <dbl> <chr>      <chr>       <dbl>
1     1 Zebra fish one             3
2     1 Zebra fish two             4
3     1 Dog-caut   zero            9
4     1 Dog-caut   hello there    12
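
To see what the two capture groups in that names_pattern pick up, here is a small base R sketch that applies the same regular expression directly to two of the column names. It uses regexec() and regmatches() only for illustration; pivot_longer() does its own matching internally, and nms and parts are made-up names.

```r
nms <- c("Zebra fish (one)", "Dog-caut (hello there)")

# Group 1: everything before " (", group 2: everything up to ")"
pattern <- "([^(]+) \\(([^)]+)"

# Each list element holds the full match followed by the two groups
m <- regmatches(nms, regexec(pattern, nms))

# Keep only the capture groups and arrange them as a two-column matrix
parts <- do.call(rbind, lapply(m, function(x) x[2:3]))
colnames(parts) <- c("Category", "Type")
parts
```

The pattern does not need to consume the closing parenthesis, because only the captured groups are used.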
0
Greg On

Solution

Here's a one-liner that leverages the names_pattern argument to pivot_longer():

library(dplyr)
library(tidyr)


# ...
# Code to generate your data.
# ...


df %>% pivot_longer(!ID,
  values_to = "SCORE",
  names_to = c("name", "CATEGORY", "TYPE1", NA, "TYPE2", "suffix"),
  names_pattern = "((.*) \\((.*)\\)(--(.*); (.*))?)"
)

Results

Given a df like your sample...

df <- tibble(
  ID = 1,
  `Zebra fish (one)` = 3,
  `Zebra fish (two)` = 4,
  `Dog-caut (zero)` = 9,
  `Dog-caut (hello there)` = 12
)

...we get the following output:

# A tibble: 4 × 7
     ID name                   CATEGORY   TYPE1       TYPE2 suffix SCORE
  <dbl> <chr>                  <chr>      <chr>       <chr> <chr>  <dbl>
1     1 Zebra fish (one)       Zebra fish one         ""    ""         3
2     1 Zebra fish (two)       Zebra fish two         ""    ""         4
3     1 Dog-caut (zero)        Dog-caut   zero        ""    ""         9
4     1 Dog-caut (hello there) Dog-caut   hello there ""    ""        12

But this will also work for a df...

df <- tibble(
  ID = 1,
  `Zebra fish (one)--hello; blah` = 3,
  `Zebra fish (two)--hi; foo` = 4,
  `Dog-caut (zero)--yo; bar` = 9,
  `Dog-caut (hello there)--greetings; baz` = 12
)

...that looks like your second suggestion...

The string I wish to separate is in this format

ID, 'Zebra fish (one)--Hello; blah'
1, 7

...where it yields the following output:

# A tibble: 4 × 7
     ID name                                   CATEGORY   TYPE1       TYPE2     suffix SCORE
  <dbl> <chr>                                  <chr>      <chr>       <chr>     <chr>  <dbl>
1     1 Zebra fish (one)--hello; blah          Zebra fish one         hello     blah       3
2     1 Zebra fish (two)--hi; foo              Zebra fish two         hi        foo        4
3     1 Dog-caut (zero)--yo; bar               Dog-caut   zero        yo        bar        9
4     1 Dog-caut (hello there)--greetings; baz Dog-caut   hello there greetings baz       12
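
If the helper columns name and suffix are not wanted, one possible variation is to drop the outer capture group and put NA in names_to for every group that should be discarded. This is only a sketch: the two-column data frame below is made up for illustration, and it relies on tidyr's documented behaviour that an NA entry in names_to discards the corresponding component.

```r
library(tidyr)

# check.names = FALSE keeps the parentheses in the column names
df2 <- data.frame(
  ID = 1,
  `Zebra fish (one)--hello; blah` = 3,
  `Dog-caut (two)--hi; foo` = 4,
  check.names = FALSE
)

out <- pivot_longer(
  df2,
  !ID,
  values_to = "SCORE",
  # Groups: CATEGORY, TYPE1, (optional wrapper), TYPE2, suffix
  names_to = c("CATEGORY", "TYPE1", NA, "TYPE2", NA),
  names_pattern = "(.*) \\((.*)\\)(--(.*); (.*))?"
)
out
```

This leaves ID, CATEGORY, TYPE1, TYPE2 and SCORE, which is close to the layout asked for in the question.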