regex in sed removing only the first occurrence from every line


I have the following file I would like to clean up

cat file.txt

MNS:N+    GYPA*01 or GYPA*M   
MNS:M+    GYPA*02 or GYPA*N
MNS:Mc    GYPA*08 or GYP*Mc
MNS:Vw    GYPA*09 or GYPA*Vw
MNS:Mg    GYPA*11 or GYPA*Mg
MNS:Vr    GYPA*12 or GYPA*Vr

My desired output is:

MNS:N+  GYPA*01 or GYPA*M   
MNS:M+  GYPA*02 or GYPA*N
MNS:Mc  GYPA*08 or GYP*Mc
MNS:Vw  GYPA*09 or GYPA*Vw
MNS:Mg  GYPA*11 or GYPA*Mg
MNS:Vr  GYPA*12 or GYPA*Vr

I would like to remove everything between ":" and the first occurrence of "or".

I tried sed 's/MNS:d*?or /MNS:/g', but it removes the second "or" as well.

I tried every option in https://www.geeksforgeeks.org/sed-command-in-linux-unix-with-examples/ to no avail. Should I create alias sed='perl -pe'? It seems that sed does not properly support regex.

5 Answers

3
RavinderSingh13 On BEST ANSWER

perl is more suitable here because the task needs a lazy (non-greedy) match.

perl -pe 's|(:.*?or +)(.*)|:$2|' Input_file

The lazy quantifier in .*?or makes the match stop at the nearest or after the colon (one that is followed by a space) instead of running on to the last or on the line.

0
tshiono On

If the or is certain to occur exactly twice per line, as in the provided example, please try:

sed 's/\(MNS:\).\+ or \(.\+ or .*\)/\1\2/' file.txt

Result:

MNS:N+    GYPA*01 or GYPA*M   
MNS:M+    GYPA*02 or GYPA*N
MNS:Mc    GYPA*08 or GYP*Mc
MNS:Vw    GYPA*09 or GYPA*Vw
MNS:Mg    GYPA*11 or GYPA*Mg
MNS:Vr    GYPA*12 or GYPA*Vr

Otherwise, perl is the better solution because it supports shortest (lazy) matching, as in RavinderSingh13's answer.
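As a rough sketch of the same idea in extended-regex syntax (sed -E, accepted by both GNU and BSD sed, instead of the GNU-only \+): the sample line below is invented for illustration, and strip_first_of_two is a hypothetical helper name.

```shell
#!/bin/sh
# Assumes " or " occurs exactly twice per line, as stated in the answer above.
# The second group must itself contain " or ", which forces the first .+
# to stop before the first " or " even though it is greedy.
strip_first_of_two() {
    printf '%s\n' "$1" | sed -E 's/(MNS:).+ or (.+ or .*)/\1\2/'
}

strip_first_of_two 'MNS:xx or N+ GYPA*01 or GYPA*M'   # MNS:N+ GYPA*01 or GYPA*M
strip_first_of_two 'MNS:N+ GYPA*01'                   # no " or " at all: unchanged
```

A line with fewer than two " or " separators does not match the pattern and passes through untouched.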

0
builder-7000 On

ex supports lazy matching with \{-}:

ex -s '+%s/:\zs.\{-}or //g|wq' input_file

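In the pattern :\zs.\{-}or, \{-} matches as few characters as possible, so the match runs from just after the first : up to the first or that is followed by a space. \zs sets the start of the match after the colon, so the colon itself is kept.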

0
potong On

This might work for you (GNU sed):

sed '/:.*\<or\>/{s/\<or\>/\n/;s/:.*\n */:/}' file

If a line contains : followed by the word or, substitute the first occurrence of the word or with a unique delimiter (e.g. \n), then remove everything between : and that delimiter.
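The two-step marker idea can be sketched as follows. This assumes GNU sed (\< and \> word boundaries and \n in the replacement are GNU extensions); the sample line is invented, and strip_with_marker is a hypothetical helper name. The sketch keeps the colon and trims the spaces that follow the marker.

```shell
#!/bin/sh
# GNU sed assumed. A newline is a safe marker because a single input line
# in the pattern space can never already contain one.
strip_with_marker() {
    # Step 1: replace the first whole word "or" with a newline marker.
    # Step 2: delete from ":" through the marker and any spaces after it,
    #         putting the colon back.
    printf '%s\n' "$1" | sed '/:.*\<or\>/{s/\<or\>/\n/;s/:.*\n */:/}'
}

strip_with_marker 'MNS:xx or N+ GYPA*01 or GYPA*M'   # MNS:N+ GYPA*01 or GYPA*M
```

This emulates a lazy match with greedy regexes only: once the first or has been turned into a unique marker, a greedy .* up to that marker cannot overshoot.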

1
Ed Morton On

Regarding "I would like to remove everything between ":" and the first occurrence of "or"": no, you wouldn't. The first occurrence of or in the 2nd line of the sample input is at the start of orweqqwe. The text immediately after : looks like it could be any set of characters, so couldn't it contain a standalone or, e.g. MNS:2 or eqqwe or M+ GYPA*02 or GYPA*N?

Given that, and the fact that a fixed number of characters apparently has to be removed on every line, this seems to be what you should really be using:

$ sed 's/:.\{14\}/:/' file
MNS:N+    GYPA*01 or GYPA*M
MNS:M+    GYPA*02 or GYPA*N
MNS:Mc    GYPA*08 or GYP*Mc
MNS:Vw    GYPA*09 or GYPA*Vw
MNS:Mg    GYPA*11 or GYPA*Mg
MNS:Vr    GYPA*12 or GYPA*Vr