Regex - capture group which is optionally enclosed in a sequence of characters


I have a file with lines in a JSON-like syntax from which I need to extract text. My regex works well in most cases: it extracts the desired text into the second capture group. However, I noticed that the desired text is sometimes enclosed in tags, which I want to ignore.

Sample file:

    {"title_available" "text1"}
    {"title_value" "<c(20a601)>text2"}
    {"tags"
        {"all" "text3"}
        {"ignore" "text4"}
        {"chargeFactor" "text5 %1%"}
        {"resourceSpeed" "%1% text6"}
    }
    {"rules" "bla-bla-bla\n\n \"BLA\" bla-bla-bla."}
            {"id1" "<c(c3baae)>text7</c>"}

My regex: \s+{\"\S+\" \"(<c\(\S+\)>)?(.+)\"}

Desired output:

text1
text2
text3
text4
text5 %1%
%1% text6
bla-bla-bla\n\n \"BLA\" bla-bla-bla.
text7

Current output:

all ok except:
text7</c>
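
For reference, here is a minimal Python reproduction of the behaviour described above. It assumes Python's `re` module (the question does not name a regex engine), and the helper name `second_group` is made up for illustration. The braces are escaped, which is an adjustment for portability and not part of the pattern as posted.

```python
import re

# The pattern from the question, with literal braces escaped for portability.
ORIGINAL = re.compile(r'\s+\{"\S+" "(<c\(\S+\)>)?(.+)"\}')


def second_group(line):
    """Return capture group 2 for a line, or None if the line does not match."""
    m = ORIGINAL.search(line)
    return m.group(2) if m else None


# Opening tag only: the optional first group consumes it.
print(second_group('    {"title_value" "<c(20a601)>text2"}'))

# Opening and closing tag: the greedy (.+) also swallows the closing tag.
print(second_group('            {"id1" "<c(c3baae)>text7</c>"}'))
```

The greedy `(.+)` runs up to the final `"}` and so takes `</c>` with it, which is where the unwanted suffix comes from.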


I guess I need to use a lookahead somehow with the second capture group, but I couldn't work out how. I'm also not sure whether I should use a capture group to skip the optional leading <c...>. Can someone help with this, please?

P.S. speed or simplicity of the pattern doesn't matter.


There are 2 best solutions below

TheHungryCub On BEST ANSWER

It seems that your regular expression does not exclude the closing tag </c> from the second capture group: the greedy (.+) consumes everything up to the final "}. To fix this, make that group lazy and follow it with an optional non-capturing group for the closing tag.

For example:

\s+{"\S+" "(?:<c\S+>)?(.+?)(?:<\/c>)?"}
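
A minimal sketch of how this pattern could be checked against the sample file from the question. It assumes Python's `re` module, which the question does not specify; `SAMPLE`, `PATTERN` and `extract` are illustrative names, and the braces are escaped here for portability. The pattern is applied line by line, and the text of interest ends up in group 1 because the optional tags are non-capturing.

```python
import re

# Sample lines copied from the question, including their indentation.
SAMPLE = r'''
    {"title_available" "text1"}
    {"title_value" "<c(20a601)>text2"}
    {"tags"
        {"all" "text3"}
        {"ignore" "text4"}
        {"chargeFactor" "text5 %1%"}
        {"resourceSpeed" "%1% text6"}
    }
    {"rules" "bla-bla-bla\n\n \"BLA\" bla-bla-bla."}
            {"id1" "<c(c3baae)>text7</c>"}
'''

# Lazy (.+?) followed by an optional, non-capturing closing tag.
PATTERN = re.compile(r'\s+\{"\S+" "(?:<c\S+>)?(.+?)(?:</c>)?"\}')


def extract(text):
    """Collect group 1 from every line that matches the pattern."""
    results = []
    for line in text.splitlines():
        m = PATTERN.search(line)
        if m:
            results.append(m.group(1))
    return results


for value in extract(SAMPLE):
    print(value)
```

Lines such as `{"tags"` and the lone `}` do not match and are skipped. Note that `\s+` requires at least one whitespace character before the opening brace, so a line with no indentation would be skipped as well.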

Cary Swoveland On

Matching the following regular expression produces the desired result for all your examples. (Note there is no capture group.) As you have not stated requirements, however, I do not know if it is correct for other strings.

(?:\\\"|[^<>\"])+(?=(?:<[^>]*>)?\"}) 


The expression can be broken down as follows.

(?:            # begin a non-capture group
  \\\"         # match '\' followed by '"'
|              # or
  [^<>\"]      # match a character other than '<', '>' and '"'
)+             # end non-capture group and execute it >= 1 times
(?=            # begin a positive lookahead
  (?:          # begin a non-capture group
    <          # match '<'
    [^>]*      # match >= 0 characters other than '>'
    >          # match '>'
  )?           # end non-capture group and make it optional
  \"}          # match '"}'
)              # end positive lookahead
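
As a rough check of the expression broken down above, here is a minimal sketch that runs it over the sample file from the question. It assumes Python's `re` module (the engine is not specified in the question); `SAMPLE` and `LOOKAHEAD` are illustrative names, and the closing brace is escaped for portability. Because the pattern has no capture group, `findall` returns the full matches.

```python
import re

# Sample lines copied from the question, including their indentation.
SAMPLE = r'''
    {"title_available" "text1"}
    {"title_value" "<c(20a601)>text2"}
    {"tags"
        {"all" "text3"}
        {"ignore" "text4"}
        {"chargeFactor" "text5 %1%"}
        {"resourceSpeed" "%1% text6"}
    }
    {"rules" "bla-bla-bla\n\n \"BLA\" bla-bla-bla."}
            {"id1" "<c(c3baae)>text7</c>"}
'''

# A run of escaped quotes or characters other than < > ",
# which must be followed by an optional tag and then "}.
LOOKAHEAD = re.compile(r'(?:\\"|[^<>"])+(?=(?:<[^>]*>)?"\})')

matches = LOOKAHEAD.findall(SAMPLE)
for value in matches:
    print(value)
```

The character class excludes `<` and `>`, so a value that itself contains those characters outside a tag would presumably be cut short. That is in line with the caveat that the expression is only known to work for the examples given.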