Multi line string matcher with optional intervening phrase

I would like to grab text that is spread across two lines.

For example:

PO Number Dept.number
4000813852 7

I would like to get PO Number 4000813852. The data is laid out like a table, but in the context of the whole document it appears as normal text.

I have used re.MULTILINE with a pattern like r'PO Number.*\n[0-9]+'.

It works in this case, but it is not a robust solution, because PO Number may appear in the middle of the header row, as in:

Invoice Number PO Number Dept.number
123456666     4000813852  7
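
A minimal sketch of the problem, using the two layouts above. The variable names are only for illustration, and the re.MULTILINE flag is left out because the pattern contains no ^ or $ for it to affect:

```python
import re

pattern = r'PO Number.*\n[0-9]+'

simple = """PO Number Dept.number
4000813852 7"""

shifted = """Invoice Number PO Number Dept.number
123456666     4000813852  7"""

# PO Number is the first column, so the first digits on the next line are correct.
simple_match = re.search(pattern, simple).group(0)
print(simple_match.splitlines()[-1])   # 4000813852

# PO Number is the second column, but the pattern still takes the first digits,
# which belong to the Invoice Number column.
shifted_match = re.search(pattern, shifted).group(0)
print(shifted_match.splitlines()[-1])  # 123456666
```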

There are 2 best solutions below

Best answer

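You can do this with two capture groups and the re.DOTALL flag, which lets . match newlines as well. The expression assumes that the number you are interested in is the only 10-digit number following "PO Number" in your text: because .* is greedy, the second group captures the last run of 10 digits it can find.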

The expression is:

(PO\sNumber).*(\d{10})

Python snippet:

import re

first_string = """PO Number Dept.number
4000813852 7"""

second_string = """Invoice Number PO Number Dept.number
123456666     4000813853  7"""

PO_first = re.search(r'(PO\sNumber).*(\d{10})',first_string,re.DOTALL)
print(PO_first.group(1)+" "+PO_first.group(2))

PO_second = re.search(r'(PO\sNumber).*(\d{10})',second_string,re.DOTALL)
print(PO_second.group(1)+" "+PO_second.group(2))

Output:

PO Number 4000813852
PO Number 4000813853
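
To see why that assumption matters, here is a small sketch with a hypothetical input in which a second 10-digit number (the "Reference" line is made up for this example) follows the PO number. The greedy .* returns the last 10-digit run, while a lazy .*? returns the first one:

```python
import re

# Hypothetical input: another 10-digit number appears after the PO number.
text = """PO Number Dept.number
4000813852 7
Reference 1234567890"""

greedy = re.search(r'PO\sNumber.*(\d{10})', text, re.DOTALL)
lazy = re.search(r'PO\sNumber.*?(\d{10})', text, re.DOTALL)

print(greedy.group(1))  # last 10-digit run: 1234567890
print(lazy.group(1))    # first 10-digit run: 4000813852
```

The lazy variant is not a complete fix either: it takes the first run of at least 10 digits after "PO Number", so it would still pick the wrong value if an earlier column, such as the invoice number, happened to be that long.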

Second answer

With a single regex:

import re

data = """PO Number Dept.number
    4000813852 7
    Invoice Number PO Number Dept.number
    123456666     4000813852  7
    """

# The dot in Dept.number is escaped so that it matches a literal dot only.
re.findall(r"(PO Number)\s*Dept\.number\s*(?:\d+\s+(\d+)|(\d+))\s+\d", data)
Out:
[('PO Number', '', '4000813852'), ('PO Number', '4000813852', '')]

I don't use re.MULTILINE: it only changes how ^ and $ behave, and \s already matches newlines too.
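
Each tuple returned by findall has the PO number in only one of its two number groups, depending on which branch of the alternation matched. A minimal sketch of collapsing them into a flat list (the names pattern and po_numbers are just for this example):

```python
import re

data = """PO Number Dept.number
    4000813852 7
    Invoice Number PO Number Dept.number
    123456666     4000813852  7
    """

pattern = r"(PO Number)\s*Dept\.number\s*(?:\d+\s+(\d+)|(\d+))\s+\d"

# Exactly one of the two number groups is non-empty for each match,
# so `or` picks whichever branch of the alternation captured the value.
po_numbers = [second_col or first_col
              for _label, second_col, first_col in re.findall(pattern, data)]
print(po_numbers)  # ['4000813852', '4000813852']
```

This still only covers the two layouts shown in the question; a header with more columns before PO Number would need another branch.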