Not able to capture with capture groups


Within each text file being worked with, there are two pages of listings. Both are numbered identically, using the same reference numbers (001, 002, etc.). The aim is to separate these two listings and store them separately in an array. I have simplified the problem for testing.

# short listings (an unhelpful text file listing structure. But ain't that life? )
# "001 First in listing one 002 Second in listing one 003 Third in listing one 001 First in listing two 002 Second in listing two 003 Third in listing two"

# read shortListings from file 
$listings = Get-Content -Path C:\test\test6\shortListings.txt
[string[]]$result = $null

# regular expression where I am trying to separate listing 'ones' & listing 'twos' into string array $result 

# $result[0] = "001 First in listing one 002 Second in listing one 003 Third in listing one"
# $result[1] = "001 First in listing two 002 Second in listing two 003 Third in listing two"

$regex = '(?s)(001.*?)((?=001.*?))'

$result = $listings | Select-String -Pattern $regex -AllMatches | ForEach-Object { $_.Matches.Value}

# Okay. That's listing one. But how do I get listing two?
$result[0]

This is close (regex101 example). The listing 'two' string can be accessed via $2, but $1 just gives both listings. I think these are referred to as regex backreferences. I have used them in PowerShell to make replacements, directly referencing the capture groups. For example:

-CReplace "([A-Z])\'.\s*([A-Z][a-z])",'$1 $2'

But I have not used them when accessing the match values on PowerShell objects. If anyone can suggest what's missing here, it would be appreciated.

Additional Information

This demonstration is a simplified version. In reality, this code is part of a PowerShell function, one of many called by a separate script. Also, the real 'listings' are NUM NAME address occupations. There is no 'one' or 'two' to differentiate between them.

The issue is that I feel I'm close to the solution. The following image shows that I can access the second listing using the $ in regex101. My code shows I can access the first listing via Matches[0]. [image: regex101 screenshot]


There are 2 best solutions below

Santiago Squarzon On

Here is one way this could be done, using a combination of -split and Group-Object:

$result = (Get-Content shortListings.txt -Raw) -split '\s*(?=\d{3})' -ne '' |
    Group-Object { [regex]::Match($_, '\w+$').Value } -AsHashTable -AsString

However, with this method the Values of the Hashtable are arrays of strings instead of single strings. You can -join them later:

PS ..\pwsh> $result

Name                 Value
----                 -----
one                  {001 First in listing one, 002 Second...
two                  {001 First in listing two, 002 Second...

PS ..\pwsh> $result['one']

001 First in listing one
002 Second in listing one
003 Third in listing one

PS ..\pwsh> $result['one'] -join ' '

001 First in listing one 002 Second in listing one 003 Third in listing one
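
To end up with the two-element string array described in the question, the joined values can be collected in key order. This is a minimal sketch under stated assumptions: the sample string stands in for the contents of shortListings.txt, and the names $text, $entries and $groups are illustrative. It also relies on each entry ending in a distinguishing word, which holds for the simplified test data only.

```powershell
# Sample text standing in for the file contents (assumed to be one run of text)
$text = '001 First in listing one 002 Second in listing one 003 Third in listing one 001 First in listing two 002 Second in listing two 003 Third in listing two'

# Split before every three-digit reference number and drop the empty pieces
$entries = $text -split '\s*(?=\d{3})' -ne ''

# Group the entries by their trailing word ('one' or 'two')
$groups = $entries | Group-Object { [regex]::Match($_, '\w+$').Value } -AsHashTable -AsString

# Join each group's entries into one string; sort the keys for a stable order
[string[]]$result = $groups.Keys | Sort-Object | ForEach-Object { $groups[$_] -join ' ' }

$result[0]
$result[1]
```

Sorting the keys is a design choice made here because a Hashtable does not guarantee enumeration order.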
Dave On
# read shortListings from file 
$listings = Get-Content -Path C:\test\test6\shortListings.txt

# regex produces 3 groups: 0 = complete listing, 1 = listings in one, 2 = listings in two
$regex = '(?s)(001.*?)(?=001.*?)(001.*?$)'

# access group matches using array $line
[string[]]$line = $null
$line = $listings | Select-String -AllMatches -Pattern $regex | ForEach-Object {$_.Matches.groups}

# index 0 is the complete listing, so start at 1 to print only the capture groups
for ($i = 1; $i -lt $line.Count; $i++) {
    Write-Host "group: $i $($line[$i])"
}

Output:

group: 1 001 First in listing one 002 Second in listing one 003 Third in listing one 
group: 2 001 First in listing two 002 Second in listing two 003 Third in listing two
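
The capture groups can also be read by index from a single match object, which is the match-object counterpart to $1 and $2 in -replace. The following is a minimal sketch rather than the exact code above: the sample string stands in for the file contents, the lookahead is dropped because the second group already starts at 001, and $text and $m are illustrative names.

```powershell
# Sample text standing in for the file contents (assumed to be one run of text)
$text = '001 First in listing one 002 Second in listing one 003 Third in listing one 001 First in listing two 002 Second in listing two 003 Third in listing two'

# Group 1 lazily captures from the first 001 up to the next 001; group 2 captures the rest
$m = [regex]::Match($text, '(?s)(001.*?)\s*(001.*)$')

# Groups[0] is the whole match; Groups[1] and Groups[2] are the two capture groups
[string[]]$result = $m.Groups[1].Value, $m.Groups[2].Value

$result[0]
$result[1]
```

When the text comes from a file, Get-Content -Raw returns it as a single string. Any trailing newline would then end up in the second group, so a Trim() on the input may be needed. The pattern also assumes that 001 appears only as the first reference number of each listing.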