regex to find string between repeating markers

Question

Asked by musca999 on 17 November 2023 at 21:15 · 69 views

I have a string that looks like this:

**** SOURCE#24 ****

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNQ
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y

**** SOURCE#25 ****

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNR
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y

**** SOURCE#26 ****
etc.....

I want a capture group that captures everything after the '[1]' up to the end of the line that starts with [5], selected by the Remote Host Name (e.g. PNR or PNQ). So only lines [1] through [5] around the selected name.

I've been trying lookahead and lookbehind and just can't figure this out. It looks like lookbehind is greedy, so if I search for the PNR section, it won't stop at the first [1] but grabs everything up to the first [1] in the PNQ section.

This is the closest I've got to making it work, but it only works if I search for the PNQ section:

re.search(r'SOURCE#.*?\[1\](.*?PNQ.*?.*?HDPGWRF.*?)\*', buf, flags=re.DOTALL).group(1)

This after combing through stackoverflow all afternoon :(

There are 4 best solutions below

**The fourth bird** · Answer 1 · 2023-11-17T21:35:38.483000

You might use a pattern without the re.DOTALL flag but with the re.MULTILINE flag:

\bSOURCE#.*\s*^\[1](.*\s*^(?!\[\d+]).*\bPN[QR](?:\n(?!\[\d+]).*)*(?:\n\[\d+].*)*)

The pattern matches:

\bSOURCE# Match literally starting with a word boundary
.* Match the rest of the line
\s*^ Match optional whitespace chars until the start of a line
\[1] Match [1] literally
( Capture group 1
- .* match the rest of the line
- \s*^ Match optional whitespace chars until the start of a line
- (?!\[\d+]) Negative lookahead, assert that the line does not start with [digits]
- .*\bPN[QR] Match PNQ or PNR at the end of the line
- (?:\n(?!\[\d+]).*)* Match all following lines that do not start with [digits]
- (?:\n\[\d+].*)* Match all following lines that do start with [digits]
) Close group 1

See a regex demo, a second demo showing that the pattern does not overmatch when only PNR is used, and a Python demo.
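As a rough way to try the pattern above locally, here is a minimal sketch that runs it against the sample text from the question. The helper name `capture_block` and the use of `re.escape` to plug in a single host name are assumptions added for illustration; they are not part of the original answer.

```python
import re

# Sample text copied from the question.
buf = """\
**** SOURCE#24 ****

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNQ
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y

**** SOURCE#25 ****

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNR
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y

**** SOURCE#26 ****
"""


def capture_block(text, host):
    """Hypothetical helper: return group 1 of the pattern for one host, or None."""
    pattern = (
        r"\bSOURCE#.*\s*^\[1]"                       # marker line, then the [1] line
        r"(.*\s*^(?!\[\d+]).*\b" + re.escape(host) +  # host line must not start with [digits]
        r"(?:\n(?!\[\d+]).*)*"                        # following lines without [digits]
        r"(?:\n\[\d+].*)*)"                           # following lines with [digits]
    )
    # Only MULTILINE is set, so "." never crosses a newline.
    m = re.search(pattern, text, flags=re.MULTILINE)
    return m.group(1) if m else None


captured = capture_block(buf, "PNR")
print(captured)
```

Because `.` cannot cross a newline here, a search for PNR cannot start at the `[1]` of the PNQ block and run on into the next one.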

**Timeless** · Answer 2 · 2023-11-17T21:54:42.477000

In case you need to build a Python object from the numbered items of the selected blocks, you can try:

import re

targets = ["PNR", "PNQ"]

with open("file.txt") as f:
    pat = r"\*+\ (SOURCE#\d+) \*+\s+(.+?Remote Host Name : (\w+).+?)(?=\*)"
    
    data = {
        m.group(1) : dict(re.findall(r"\[\d+\]\s*(.+?)\s*:\s*(.+)", m.group(2)))
        for m in re.finditer(pat, f.read(), flags=re.M|re.S)
        if m.group(3) in targets # or ["PNR", "PNQ"]
    }

Output:

import json; print(json.dumps(data, indent=4))

{
    "SOURCE#24": {
        "Source Location [Local/Remote]": "Remote",
        "Source directory": "HDPGWRF",
        "Directory poll interval": "30",
        "File name format": "ACR_FILEFORMAT",
        "Delete files from source": "y"
    },
    "SOURCE#25": {
        "Source Location [Local/Remote]": "Remote",
        "Source directory": "HDPGWRF",
        "Directory poll interval": "30",
        "File name format": "ACR_FILEFORMAT",
        "Delete files from source": "y"
    }
}

**Reilas** · Answer 3 · 2023-11-18T00:57:27.977000

"... I want a capture group that captures everything after the '[1]' up to the ends of the line that starts with [5], based on the Remote Host Name (eg PNR or PNQ). ..."

Try the following match pattern.

(?ms)^\[1\].+?Remote Host Name\s*:\s*(?:PNR|PNQ).+?^\[5\].+?$

(?ms), turn on "multi-line" and "single-line" modes
^\[1\], match a line that begins with the text, "[1]"
.+?Remote Host Name\s*:\s*(?:PNR|PNQ), match all characters up to the literal text "Remote Host Name", optional white-space characters, a ":", followed by "PNR", or "PNQ".
.+?^\[5\], match all characters, up to the first line that begins with, "[5]"
.+?$, match all characters up to the first, end-of-line

Here is an example, where s is the text.

import re
p = r'(?ms)^\[1\].+?Remote Host Name\s*:\s*(?:PNR|PNQ).+?^\[5\].+?$'
for m in re.finditer(p, s):
    print(m.group(), end='\n\n')

Output

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNQ
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNR
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source
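One thing worth checking with this pattern: if the alternation is narrowed to a single host such as PNR, the lazy `.+?` still begins at the leftmost `[1]` and simply extends until it finds PNR, so the match can span two blocks. The sketch below compares that with a variant that refuses to cross another `[1]` line. The names `LOOSE` and `TIGHT` and the tempered `(?:(?!^\[1\]).)+?` construct are additions for illustration, not part of the original answer.

```python
import re

# Sample text copied from the question.
buf = """\
**** SOURCE#24 ****

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNQ
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y

**** SOURCE#25 ****

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNR
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y

**** SOURCE#26 ****
"""

# The answer's pattern with the alternation reduced to PNR only.
LOOSE = r'(?ms)^\[1\].+?Remote Host Name\s*:\s*PNR.+?^\[5\].+?$'

# Same idea, but each consumed character is first checked so that the
# match cannot run past the start of another line beginning with "[1]".
TIGHT = r'(?ms)^\[1\](?:(?!^\[1\]).)+?Remote Host Name\s*:\s*PNR.+?^\[5\].+?$'

loose = re.search(LOOSE, buf).group()
tight = re.search(TIGHT, buf).group()

# Count how many host lines each match swallowed.
print(loose.count("Remote Host Name"))
print(tight.count("Remote Host Name"))
```

On the sample text the first match covers both the PNQ and PNR blocks, while the second covers only the PNR block.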

**musca999** · Answer 4 · 2023-11-18T11:14:00.903000

Thank you all for your help. While none of the suggestions worked "out of the box" (I tried them all), using clues from each I managed to cobble together something that worked by taking advantage of the fact that each block ended with a ': \d+'. Also, I forgot to mention that the blocks could be any size, from [1] to [x]. The regex I ended up using is:

re.search(r'\*\s*SOURCE.*?\[1\].*(?:PNR)(.*?HDPGWRF.*?:\ \d+)\n\n', buf, re.DOTALL).group(1)

The (?:PNR) was the missing piece that kept me busy and that I couldn't figure out.
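Since the blocks can run from [1] to any [x], a different route is to split the text on the marker lines first and then pick the block by host name. This is a swapped-in technique, not the regex above and not taken from any of the answers. The helper name `block_for_host` is hypothetical, and the sketch assumes every marker line looks like `**** SOURCE#n ****`.

```python
import re

# Sample text copied from the question.
buf = """\
**** SOURCE#24 ****

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNQ
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y

**** SOURCE#25 ****

[1]  Source Location [Local/Remote]          : Remote

 Remote Host Name : PNR
 User Name        : foo

[2]  Source directory                        : HDPGWRF
[3]  Directory poll interval                 : 30
[4]  File name format                        : ACR_FILEFORMAT
[5]  Delete files from source                : y

**** SOURCE#26 ****
"""


def block_for_host(text, host):
    """Hypothetical helper: return the [1]..[x] lines of the block for `host`."""
    # Split on the marker lines; each remaining piece is one block body.
    blocks = re.split(r"^\*+ SOURCE#\d+ \*+[ \t]*$", text, flags=re.MULTILINE)
    for block in blocks:
        found = re.search(r"Remote Host Name\s*:\s*(\S+)", block)
        if found and found.group(1) == host:
            # Keep everything from the line starting with "[1]" onward,
            # dropping the trailing blank lines before the next marker.
            start = re.search(r"^\[1\]", block, flags=re.MULTILINE)
            return block[start.start():].rstrip() if start else None
    return None


print(block_for_host(buf, "PNR"))
```

Because each block is isolated before the host name is checked, a search for one host cannot spill into a neighbouring block, and the number of numbered lines does not matter.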
