regex to find string between repeating markers

I have a string that looks like this:

**** SOURCE#24 ****

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNQ
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y

**** SOURCE#25 ****

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNR
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y

**** SOURCE#26 ****
etc.....

I want a capture group that captures everything after the '[1]' up to the end of the line that starts with [5], selected by the Remote Host Name (e.g. PNR or PNQ). So only lines [1] through [5] around the selected name.

I've been trying lookahead and lookbehind and just can't figure this out. It looks like lookbehind is greedy, so if I search for the PNR section, it won't stop at the first [1] but grabs everything up to the first [1] in the PNQ section.

This is the closest I've got to making it work, but it only works if I search for the PNQ section:

re.search(r'SOURCE#.*?\[1\](.*?PNQ.*?.*?HDPGWRF.*?)\*', buf, flags=re.DOTALL).group(1)

This is after combing through Stack Overflow all afternoon :(

There are 4 best solutions below

The fourth bird On

You might use a pattern without the re.DOTALL flag but with the re.MULTILINE flag:

\bSOURCE#.*\s*^\[1](.*\s*^(?!\[\d+]).*\bPN[QR](?:\n(?!\[\d+]).*)*(?:\n\[\d+].*)*)

The pattern matches:

  • \bSOURCE# Match SOURCE# literally, preceded by a word boundary
  • .* Match the rest of the line
  • \s*^ Match optional whitespace chars until a start of the line
  • \[1] Match [1] literally
  • ( Capture group 1
    • .* match the rest of the line
    • \s*^ Match optional whitespace chars until a start of the line
    • (?!\[\d+]) Negative lookahead, assert that the line does not start with [digits]
    • .*\bPN[QR] Match up to PNQ or PNR on that line (here, at the end of the line)
    • (?:\n(?!\[\d+]).*)* Match all following lines that do not start with [digits]
    • (?:\n\[\d+].*)* Match all following lines that do start with [digits]
  • ) Close group 1

See a regex demo, a demo that will not over-match using only PNR, and a Python demo.
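
From Python, the pattern could be exercised roughly as follows. This is only a sketch under stated assumptions: the helper name block_for_host is made up for illustration, the sample text is copied from the question, and the requested host name is substituted for PN[QR] via re.escape.

```python
import re

# Sample text copied from the question (two blocks, hosts PNQ and PNR).
sample = """\
**** SOURCE#24 ****

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNQ
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y

**** SOURCE#25 ****

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNR
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y

**** SOURCE#26 ****
"""


def block_for_host(text, host):
    """Hypothetical helper: return group 1 of the first block whose
    un-numbered line right after [1] contains `host`, else None."""
    pattern = (
        r"\bSOURCE#.*\s*^\[1](.*\s*^(?!\[\d+]).*\b"
        + re.escape(host)  # assumption: host is a plain name such as PNR
        + r"(?:\n(?!\[\d+]).*)*(?:\n\[\d+].*)*)"
    )
    m = re.search(pattern, text, flags=re.MULTILINE)
    return m.group(1) if m else None


# Prints the text after [1] through the end of the [5] line of the PNR block.
print(block_for_host(sample, "PNR"))
```

Because the host line has to follow the [1] line directly (apart from whitespace), a search for PNR cannot start inside the PNQ block and run forward.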

Timeless On

If you need a Python object holding the numbered items of the filtered blocks, you can try:

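import re

targets = ["PNR", "PNQ"]

# Group 1: block label, group 2: block body, group 3: remote host name.
# The lookahead (?=\*) ends each body just before the next "*", so the
# last block is only found if another marker follows it.
pat = r"\*+\ (SOURCE#\d+) \*+\s+(.+?Remote Host Name : (\w+).+?)(?=\*)"

with open("file.txt") as f:
    text = f.read()

data = {
    # Turn each "[n]  key : value" line of the body into a dict entry.
    m.group(1): dict(re.findall(r"\[\d+\]\s*(.+?)\s*:\s*(.+)", m.group(2)))
    for m in re.finditer(pat, text, flags=re.M | re.S)
    if m.group(3) in targets
}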

Output :

import json; print(json.dumps(data, indent=4))

{
    "SOURCE#24": {
        "Source Location [Local/Remote]": "Remote",
        "Source directory": "HDPGWRF",
        "Directory poll interval": "30",
        "File name format": "ACR_FILEFORMAT",
        "Delete files from source": "y"
    },
    "SOURCE#25": {
        "Source Location [Local/Remote]": "Remote",
        "Source directory": "HDPGWRF",
        "Directory poll interval": "30",
        "File name format": "ACR_FILEFORMAT",
        "Delete files from source": "y"
    }
}

Reilas On

"... I want a capture group that captures everything after the '[1]' up to the ends of the line that starts with [5], based on the Remote Host Name (eg PNR or PNQ). ..."

Try the following match pattern.

(?ms)^\[1\].+?Remote Host Name\s*:\s*(?:PNR|PNQ).+?^\[5\].+?$
  • (?ms) turns on "multi-line" and "single-line" (dot matches newline) modes
  • ^\[1\] matches a line that begins with the text "[1]"
  • .+?Remote Host Name\s*:\s*(?:PNR|PNQ) matches as few characters as possible up to the literal text "Remote Host Name", optional white-space, a ":", optional white-space, then "PNR" or "PNQ"
  • .+?^\[5\] matches all characters up to the first line that begins with "[5]"
  • .+?$ matches all characters up to the first end-of-line

Here is an example, where s is the text.

import re
p = r'(?ms)^\[1\].+?Remote Host Name\s*:\s*(?:PNR|PNQ).+?^\[5\].+?$'
for m in re.finditer(p, s):
    print(m.group(), end='\n\n')

Output

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNQ
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNR
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y
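
If only one host is listed, the lazy .+? after [1] can still start at an earlier block's [1] and run forward to the requested host, which is the over-match described in the question. One possible variation (an assumption on my part, not the pattern given in this answer) guards that first gap so it cannot cross a line starting with [digits]. The helper name blocks_for_host is hypothetical, the sample text is copied from the question, and the sketch still assumes every block ends at [5].

```python
import re

# Sample text copied from the question (two blocks, hosts PNQ and PNR).
sample = """\
**** SOURCE#24 ****

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNQ
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y

**** SOURCE#25 ****

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNR
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y

**** SOURCE#26 ****
"""


def blocks_for_host(text, host):
    """Hypothetical helper: return every [1]..[5] block for one host."""
    head = (
        r"(?ms)^\[1\]"
        # Advance lazily, but never onto a line that starts with [digits],
        # so the match cannot leak out of the block it started in.
        r"(?:(?!^\[\d+\]).)*?"
        r"Remote Host Name\s*:\s*"
    )
    # Assumption: blocks end at [5], as in the answer's original pattern.
    tail = r"\b.+?^\[5\].+?$"
    pattern = head + re.escape(host) + tail
    return [m.group() for m in re.finditer(pattern, text)]


for block in blocks_for_host(sample, "PNR"):
    print(block, end="\n\n")
```

With the guard in place, a search for PNR fails at the PNQ block as soon as it reaches the [2] line, and the engine retries from the next [1].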

musca999 On

Thank you all for your help. While none of the suggestions worked "out of the box" (I tried them all), using clues from each I managed to cobble together something that worked by taking advantage of the fact that each block ended with a ': \d+'. I had also forgotten to mention that the blocks could be any size, from [1] to [x]. The regex I ended up using is:

re.search(r'\*\s*SOURCE.*?\[1\].*(?:PNR)(.*?HDPGWRF.*?:\ \d+)\n\n', buf, re.DOTALL).group(1)

The (?:PNR) was the missing part that kept me busy and that I couldn't figure out.