How to keep only information inside a complex string in R?


I want to extract a substring from inside a more complex string, and I think a regex can do this. Basically, I want to keep only the text between the \" and \" in Function=\"SMAD5\". Entries with an empty value, Function=\"\", should also stay in the result.

df=structure(1:6, .Label = c("ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";", 
"ID=Gfo_R000002;Source=ENSTGUT00000017468;Function=\"CENPA\";", 
"ID=Gfo_R000003;Source=ENSGALT00000028134;Function=\"C1QL4\";", 
"ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";", "ID=Gfo_R000005;Source=ENSTGUT00000019268;Function=\"\";", 
"ID=Gfo_R000006;Source=ENSTGUT00000019035;Function=\"\";"), class = "factor")

This should look like this:

"SMAD5"
"CENPA"
"C1QL4"
NA
NA
NA

So far, this is what I was able to do:

gsub('.*Function=\"',"",df)

[1] "SMAD5\";" "CENPA\";" "C1QL4\";" "\";"      "\";"      "\";"     

But I'm left with a trailing \"; on every element. How can I remove it in the same line?

I tried this:

gsub('.*Function=\"' & '.\"*',"",test)

But it's giving me this error:

Error in ".*Function=\"" & ".\"*" : 
  operations are possible only for numeric, logical or complex types

There are 3 best solutions below

Best answer

You may use

gsub(".*Function=\"([^\"]*).*","\\1",df)


Details:

  • .* - any zero or more characters, as many as possible, up to the last occurrence of the subpattern that follows
  • Function=\" - the literal substring Function="
  • ([^\"]*) - capturing group 1, matching zero or more characters other than "
  • .* - the rest of the string.

The \1 (written "\\1" inside an R string) is a backreference to Group 1, so the whole match is replaced by the captured text only.
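Note that this returns "" rather than NA for the empty Function=\"\" entries. A minimal sketch of the full round trip, assuming the data is available as a plain character vector (the names x and fun are only illustrative, not from the question):

```r
# Sample data from the question, as a character vector instead of a factor.
x <- c(
  "ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";",
  "ID=Gfo_R000002;Source=ENSTGUT00000017468;Function=\"CENPA\";",
  "ID=Gfo_R000003;Source=ENSGALT00000028134;Function=\"C1QL4\";",
  "ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";",
  "ID=Gfo_R000005;Source=ENSTGUT00000019268;Function=\"\";",
  "ID=Gfo_R000006;Source=ENSTGUT00000019035;Function=\"\";"
)

# Replace the whole string with capture group 1: the text between the quotes.
fun <- gsub(".*Function=\"([^\"]*).*", "\\1", x)

# Empty captures come back as "", so convert them to NA explicitly.
fun[fun == ""] <- NA_character_
fun
```

With a factor such as df from the question, gsub() coerces the input to character first, so the same call should work on it unchanged.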

Answer

The regular expression can be constructed more readably using rebus.

library(rebus)
rx <- 'Function="' %R% 
  capture(zero_or_more(negated_char_class('"')))

Then matching is as mentioned by Wiktor and sandipan.

library(stringr)
library(stringi)

str_match(df, rx)               # character matrix: full match, then group 1
stri_match_first_regex(df, rx)  # the stringi counterpart

# Base R: match the whole string and keep only group 1 via REF1
gsub(any_char(0, Inf) %R% rx %R% any_char(0, Inf), REF1, df)

Answer

With stringr we can capture groups too:

library(stringr)
matches <- str_match(df, ".*\"(.*)\".*")[,2]
ifelse(matches=='', NA, matches)
# [1] "SMAD5" "CENPA" "C1QL4" NA      NA      NA
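
As for the error in the question's second attempt: & is R's logical AND operator, so it cannot be used to join two pattern strings. One way to strip both the prefix and the trailing \"; in a single call is to combine the two patterns with the regex alternation operator |. A minimal sketch, assuming input shaped like the sample data (the names x and res are only illustrative):

```r
# Two representative strings from the question.
x <- c(
  "ID=Gfo_R000001;Source=ENST00000513418;Function=\"SMAD5\";",
  "ID=Gfo_R000004;Source=ENSTGUT00000015300;Function=\"\";"
)

# One pattern with two alternatives:
#   .*Function="   everything up to and including the opening quote
#   ";$            the closing quote and semicolon at the end of the string
# gsub() removes every match of either alternative.
res <- gsub(".*Function=\"|\";$", "", x)
res
```

Empty entries still come back as "" here, so the ifelse() step above is needed if NA is wanted.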