Retaining a character string in the vector that doesn't meet strsplit() criteria


I have different character strings that look like this:

t <- c("probable linoleate 9S-lipoxygenase 5 [Malus domestica]", "PREDICTED:  protein STRUBBELIG-RECEPTOR FAMILY 3 [Malus domestica]")

I want to remove the 'PREDICTED:' from the character string containing it.

My script looks like this:

t <- sapply(strsplit(t, split= ": ", fixed = TRUE), function(x) (x[2]))

But this is the result:

[1] NA "protein STRUBBELIG-RECEPTOR FAMILY 3 [Malus domestica]"

So, for some reason, it replaced t[1] with NA, but correctly performed the operation on t[2]. I tried adding grep() to my call:

t <- sapply(strsplit(t, if(grep('^*.', t), split= ": " else t, fixed = TRUE), function(x) (x[2])))

I also tried writing a loop:

for(i in t){
  if(i == grep('PREDICTED', t[i]) split= ": " else t[i])
}

Any help is greatly appreciated. Thanks!


There is 1 best solution below

Best answer

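Your strsplit() call returns NA for the first string because that string contains no ": ", so the split yields a single element and x[2] does not exist. To remove the literal PREDICTED: prefix, you can use sub() with fixed = TRUE, which treats the pattern as a plain string rather than a regex: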

t <- c("probable linoleate 9S-lipoxygenase 5 [Malus domestica]", "PREDICTED:  protein STRUBBELIG-RECEPTOR FAMILY 3 [Malus domestica]")
sub("PREDICTED:  ", "", t, fixed=TRUE)


If the text before the first colon can be anything, use a regex solution instead:

t <- c("probable linoleate 9S-lipoxygenase 5 [Malus domestica]", "PREDICTED:  protein STRUBBELIG-RECEPTOR FAMILY 3 [Malus domestica]")
sub("^[^:]*:\\s*", "", t)

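Here, ^[^:]*:\\s* matches zero or more characters other than : at the start of the string, then a :, and then zero or more whitespace characters. The replacement happens only once per string because sub() is used, not gsub(). Strings that contain no colon do not match and are returned unchanged.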
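As a quick check of how the regex behaves on other inputs, here is a small sketch. The vector `x` and its three strings are made up for illustration and are not from the question:

```r
# Hypothetical inputs: no colon, one colon, and two colons
x <- c("no colon here",
       "PREDICTED:  protein A",
       "LOW QUALITY PROTEIN: kinase B: isoform 2")

# Remove everything up to and including the first colon, plus trailing whitespace
result <- sub("^[^:]*:\\s*", "", x)
print(result)
# [1] "no colon here"       "protein A"           "kinase B: isoform 2"
```

Note that only the text up to the first colon is dropped, so a later colon inside the description is kept.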

For the vector t above, both the fixed and the regex call return the same output:

[1] "probable linoleate 9S-lipoxygenase 5 [Malus domestica]"
[2] "protein STRUBBELIG-RECEPTOR FAMILY 3 [Malus domestica]"
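If you would rather keep the strsplit() approach from the question, the sketch below shows where the NA comes from and one possible way around it: take the last piece of each split instead of the second. The names `x`, `parts`, and `last_piece` are illustrative, and this assumes the part you want to keep contains no further ": ".

```r
x <- c("probable linoleate 9S-lipoxygenase 5 [Malus domestica]",
       "PREDICTED:  protein STRUBBELIG-RECEPTOR FAMILY 3 [Malus domestica]")

parts <- strsplit(x, split = ": ", fixed = TRUE)

# The first string has no ": ", so it stays in one piece and x[2] would be NA
print(lengths(parts))
# [1] 1 2

# Taking the last piece works whether or not a split happened;
# trimws() drops the leftover space from the double space after the colon
last_piece <- vapply(parts, function(p) p[length(p)], character(1))
cleaned <- trimws(last_piece)
print(cleaned)
```

The sub() calls above are still simpler, since they leave non-matching strings untouched without any indexing.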