How to create a regex expression to get a substring between 2 pipes


I have a dataset that I'm trying to work with where I need to get the text between two pipe delimiters. The length of the text is variable so I can't use length to get it. This is the string:

ENST00000000233.10|ENSG00000004059.11|OTTHUMG000

I want to get the text between the first and second pipes, that is, ENSG00000004059.11. I've tried several different regular expressions, but I can't figure out the correct syntax. What should the regex be?

There are 4 best solutions below

0
On BEST ANSWER

Here is a regex.

x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
sub("^[^\\|]*\\|([^\\|]+)\\|.*$", "\\1", x)
#> [1] "ENSG00000004059.11"

Created on 2022-05-03 by the reprex package (v2.0.1)

Explanation:

  • ^ beginning of string;
  • [^\\|]* not the pipe character zero or more times;
  • \\| the pipe character needs to be escaped since it's a meta-character;
  • ^[^\\|]*\\| the 3 above combined: starting at the beginning of the string, match zero or more non-pipe characters up to and including the first pipe;
  • ([^\\|]+) group match anything but the pipe character at least once;
  • \\|.*$ the second pipe plus anything until the end of the string.

Then replace the whole match with the contents of the 1st (and only) group, "\\1", thus removing everything else.
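Since sub() is vectorized, the same pattern can be applied to a whole column at once. The following is only a sketch: the vector name `ids` and its second and third elements are made up for illustration. It also shows one thing to watch for: sub() returns its input unchanged when the pattern does not match, so non-matching elements are set to NA explicitly here.

```r
# Hypothetical input; only the first element comes from the question.
ids <- c(
  "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000",
  "A|B|C|D",
  "no_pipes_here"
)

# Same idea as above; inside [...] the pipe needs no escaping.
pattern <- "^[^|]*\\|([^|]+)\\|.*$"

# Replace each whole match with the captured second field.
res <- sub(pattern, "\\1", ids)

# sub() leaves non-matching strings untouched, so flag them as NA.
res[!grepl(pattern, ids)] <- NA_character_

res
# elements: "ENSG00000004059.11", "B", NA
```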

0
On

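Try this: \|.*\| or in R \\|.*\\| since the backslash itself has to be escaped inside an R string. (It matches a literal pipe, followed by any character (.) repeated any number of times (*), followed by another literal pipe.) Note that .* is greedy: if the string contains more than two pipes, the match runs to the last one, so use \\|[^|]*\\| in that case.

Then extract the match and wrap it in str_sub to get rid of the pipes if you don't want them, e.g. str_sub(str_extract(MyString, "\\|.*\\|"), 2, -2).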
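A short sketch of this approach, assuming the stringr package is installed. The four-field string `y` is hypothetical and is only there to show how a greedy `.*` behaves when there are more than two pipes.

```r
library(stringr)

x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
y <- "A|B|C|D"  # hypothetical string with more than two pipes

# Extract the match (pipes included), then drop the first and last character.
from_x <- str_sub(str_extract(x, "\\|.*\\|"), 2, -2)
# from_x is "ENSG00000004059.11"

# Greedy .* runs to the LAST pipe, so extra fields come along.
greedy <- str_sub(str_extract(y, "\\|.*\\|"), 2, -2)
# greedy is "B|C"

# [^|]* cannot cross a pipe, so only the second field is returned.
second <- str_sub(str_extract(y, "\\|[^|]*\\|"), 2, -2)
# second is "B"
```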

0
On

Another option is to get the second item after splitting the string on |.

x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"

strsplit(x, "\\|")[[1]][[2]]
# strsplit(x, "[|]")[[1]][[2]]

# [1] "ENSG00000004059.11"

Or with tidyverse:

library(tidyverse)

str_split(x, "\\|") %>% map_chr(`[`, 2)

# [1] "ENSG00000004059.11"
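For a whole vector of strings, the split approach also works in base R without any packages. This is a minimal sketch; the vector `ids` and all but its first element are hypothetical. With fixed = TRUE the pipe is treated as a literal character, so no escaping is needed.

```r
# Hypothetical input; only the first element comes from the question.
ids <- c(
  "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000",
  "A|B|C|D",
  "no_pipes_here"
)

# Split every string on a literal pipe; the result is a list of character vectors.
parts <- strsplit(ids, "|", fixed = TRUE)

# Take the second piece of each; indexing past the end yields NA.
second <- vapply(parts, function(p) p[2], character(1))

second
# elements: "ENSG00000004059.11", "B", NA
```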
0
On

Maybe use lookahead and lookbehind to extract the string that is surrounded by two "|".

The regex means: match one or more characters, as few as possible (.+?), that are preceded by "|" ((?<=\\|)) and followed by "|" ((?=\\|)). The pipes themselves are not part of the match.

library(stringr)

x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"
str_extract(x, "(?<=\\|).+?(?=\\|)")

# [1] "ENSG00000004059.11"
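The same lookaround pattern can also be used without stringr. The sketch below uses base R's regexpr() and regmatches() with perl = TRUE, which is needed because the default regex engine does not support lookbehind. Using [^|]+ in place of .+? is a swapped-in variant, not part of the original answer; it returns the same result for this string.

```r
x <- "ENST00000000233.10|ENSG00000004059.11|OTTHUMG000"

# Locate the first run of non-pipe characters that sits between two pipes.
m <- regexpr("(?<=\\|)[^|]+(?=\\|)", x, perl = TRUE)

# Pull out the matched substring.
hit <- regmatches(x, m)

hit
# [1] "ENSG00000004059.11"
```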