How can be splitted sentences contained in a cell into different rows in R

194 Views Asked by At

I tried several times and it does not work. How can I split sentences contained in a cell into different rows maintaining the rest of the values?

Example: Dataframe df has 20 columns. Row j, Column i contains some comments which are separated by " | " I want to have a new dataframe df2 which increases the amount of rows depending the number of sentences. This means, if cell j,i has Sentence A | Sentence B

Row j, Column i has Sentence A Row j+1, Column i has Sentence B Columns 1 to i-1 and i+1 to 20 have the same value in rows j and j+1.

I do not know if this has an easy solution.

Thank you very much.

2

There are 2 best solutions below

2
On

Here is a solution using 3 tidyverse packages that accounts for an unknown maximum number of comments

library(dplyr)
library(tidyr)
library(stringr)

# Create function to calculate the max number comments per observation within 
# df$col3 and create a string of unique "names"
cols <- function(x) {
    cmts <- str_count(x, "([|])")
    max_cmts <- max(cmts, na.rm = TRUE) + 1
    features <- c(sprintf("V%02d", seq(1, max_cmts)))
}

# Create the data
df1 <- data.frame(col1 = c("a", "b", "c", "d"),
                  col2 = c(1, 2, 3, 4),
                  col3 = c("fitz|buzz", NA, 
                           "hello world|today is Thursday | its 2:00|another comment|and yet another comment", "fitz"),
                  stringsAsFactors = FALSE)

# Generate the desired output
df2 <- separate(df1, col3, into = cols(x = df1$col3), 
                sep = "([|])", extra = "merge", fill = "right") %>% 
    pivot_longer(cols = cols(x = df1$col3), values_to = "comments", 
                 values_drop_na = TRUE) %>% 
    select(-name)

Which results in

df2
# A tibble: 8 x 3
  col1   col2 comments                 
  <chr> <dbl> <chr>                    
1 a         1 "fitz"                   
2 a         1 "buzz"                   
3 c         3 "hello world"            
4 c         3 "today is Thursday "     
5 c         3 " its 2:00"              
6 c         3 "another comment"        
7 c         3 "and yet another comment"
8 d         4 "fitz" 
2
On

We could use cSplit from splitstackshape

library(splitstackshape)
cSplit(df, 'col3', sep="\\|", "long", fixed = FALSE)
#   col1 col2                col3
#1:    a    1                fitz
#2:    a    1                buzz
#3:    b    2                 foo
#4:    b    2                 bar
#5:    c    3         hello world
#6:    c    3   today is Thursday
#7:    c    3           its 2:00
#8:    d    4                fitz

data

df <- structure(list(col1 = c("a", "b", "c", "d"), col2 = c(1, 2, 3, 
4), col3 = c("fitz|buzz", "foo|bar", "hello world|today is Thursday | its 2:00", 
"fitz")), class = "data.frame", row.names = c(NA, -4L))