Extract substring using regex in Python

Hello, I have this string and I need to extract some substrings from it according to some delimiters:

string = """
1538 a
123
skua456
789
5
g
15563 blu55g
b
456
16453 a
789
5
16524 blu
g
55
1734 a
987
987
55
aasf
552
18278 blu
ttry
"""

And I need to extract exactly these strings:

string1 = """
1538 a
123
skua456
789
5
g
15563 blu55g
"""
string2 = """
16453 a
789
5
16524 blu
"""
string3 = """
1734 a
987
987
55
aasf
552
18278 blu
"""

I have tried several functions: re.findall, re.search, re.match. But I never got the expected result.

For example, the code below returns the whole string:

re.split(r"a(.*)blu", string)[0]

Best answer

You do not need a regex for this; you can collect the lines from a line containing a up to the next line containing blu:

text = "1538 a\n123\nskua456\n789\n5\ng\n15563 blu55g\nb\n456\n16453 a\n789\n5\n16524 blu\ng\n55\n1734 a\n987\n987\n55\naasf\n552\n18278 blu\nttry"
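f = False        # True while inside a block opened by an "a" line
result = []      # finished blocks
block = []       # lines of the block currently being collected
for line in text.splitlines():
    if 'a' in line:            # a line containing "a" opens a block
        f = True
    if f:
        block.append(line)
    if 'blu' in line and f:    # a line containing "blu" closes the open block
        f = False
        result.append("\n".join(block))
        block = []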

print(result)
# => ['1538 a\n123\nskua456\n789\n5\ng\n15563 blu55g', '16453 a\n789\n5\n16524 blu', '1734 a\n987\n987\n55\naasf\n552\n18278 blu']
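
If you need this in more than one place, the same flag-based loop can be wrapped in a function. This is only a sketch: `blocks_between` and its `start`/`end` parameters are hypothetical names, not something from the standard library, and it keeps the same plain substring tests as the loop above.

```python
def blocks_between(text, start="a", end="blu"):
    """Collect runs of lines from a line containing `start`
    up to and including the next line containing `end`."""
    result, block, inside = [], [], False
    for line in text.splitlines():
        if start in line:        # substring test, same as the loop above
            inside = True
        if inside:
            block.append(line)
            if end in line:      # close the block and start over
                inside = False
                result.append("\n".join(block))
                block = []
    return result


# Shortened sample in the same shape as the question's data
sample = "1538 a\n123\n15563 blu55g\nb\n456\n16453 a\n789\n16524 blu\ng"
blocks = blocks_between(sample)
print(blocks)
```

Note that a block that is opened but never closed by an `end` line is dropped, just as in the original loop.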

With a regex, you can use either of these:

import re

print(re.findall(r'(?m)^.*a(?s:.*?)blu.*', text))
print(re.findall(r'(?m)^.*a(?:\n.*)*?\n.*blu.*', text))

The first regex means:

  • (?m)^ - multiline mode on, so ^ matches any line start position
  • .*a - any zero or more chars other than line break chars as many as possible, and then a
  • (?s:.*?) - any zero or more chars including line break chars as few as possible
  • blu.* - blu and then any zero or more chars other than line break chars as many as possible.
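
To see why the lazy quantifier in (?s:.*?) matters, here is a small comparison against a greedy variant. The greedy pattern is not part of the answer; it is only there for contrast, and the sample text is a shortened version of the question's data.

```python
import re

# Shortened sample in the same shape as the question's data
text = "1538 a\n123\n15563 blu55g\nb\n16453 a\n789\n16524 blu\ng"

# Lazy: stops at the first "blu" after each starting line
lazy = re.findall(r'(?m)^.*a(?s:.*?)blu.*', text)

# Greedy (for contrast only): runs on to the last "blu" in the text
greedy = re.findall(r'(?m)^.*a(?s:.*)blu.*', text)

print(len(lazy))    # 2
print(len(greedy))  # 1
```

With the greedy version, everything from the first a line to the last blu line comes back as a single match, which is the "whole string" symptom from the question.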

The second regex matches

  • (?m)^ - start of a line
  • .*a - any zero or more chars other than line break chars as many as possible, and then a
  • (?:\n.*)*? - zero or more lines, as few as possible
  • \n.*blu.* - a newline, any zero or more chars other than line break chars as many as possible, blu and any zero or more chars other than line break chars as many as possible.
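
The two patterns agree on the data in the question, but they are not strictly equivalent. As far as I can tell, the second one needs the a to be the last character of its line (it must be followed directly by a newline) and needs blu to be on a later line, while the first one accepts a anywhere in the starting line. A quick check with made-up inputs (both sample strings are hypothetical, not from the question):

```python
import re

first = re.compile(r'(?m)^.*a(?s:.*?)blu.*')
second = re.compile(r'(?m)^.*a(?:\n.*)*?\n.*blu.*')

# "a" is not the last character of the starting line
sample_mid = "x ab\n1\ny blu"
print(first.findall(sample_mid))    # ['x ab\n1\ny blu']
print(second.findall(sample_mid))   # []

# "a" and "blu" on the same line
sample_same = "a blu"
print(first.findall(sample_same))   # ['a blu']
print(second.findall(sample_same))  # []
```

So if your starting lines may have text after the a, the first pattern (or the plain loop) is the safer choice.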