REGEX for overlapped/intermingled rows

66 Views Asked by At

I have this really irritating problem resulting from a dodgy PDF digitisation.

Anyway, a series of rows with different columns, ideally, would be represented like this:

Code  Cost  Quantity
ABC  45.00  4
FED  60.00  5
GHK  30.00  5

With regex it is easy to split these into rows and then get each individual column.

However, I find one particularly annoying bit of text always comes out like this instead:

Code  Cost  Quantity
ABC FED GHK   45.00 60.00 30.00  4 5 5

I cannot for the life of me figure out how to get the regex to separate out each of these overlapped rows, as in the first example. Positive lookaheads can get me some of the way, but what typically happens is that ill get ABC 45.00 4 and then FED 45.00 4, the lookaheads I've built don't iterate through all of the separate columns.

My suspicion is I could use a named pattern or something, matching the first set:

(?>(?<match1>((?>\s|\b)\w{3}\s).+\s+\s(\d+\.\d{2})\s.*\s+\s(\d{1})\s.*))

and then somehow Reuse that capture group, iterating it.

Sticking a positive look ahead only iterates the first group, so I'm obviously doing something stupid:

https://regex101.com/r/Uxx8bZ/1

In theory I could separate out the rows some other way (e.g. for every big space, that's one column) but it seems like this should be possible.

Help appreciated!

1

There are 1 best solutions below

5
On

In the example data, the accompanying data is 3 times whitespace chars followed by non whitespace chars to the right.

If that structure is always the same, you can capture the uppercase chars, and capture the other 2 fields inside a lookahead assertion.

([A-Z]+)(?=\s+\S+\s+\S+\s+(\d+(?:\.\d+)?)\s+\S+\s+\S+\s+(\d+))
  • ([A-Z]+) Capture 1+ times a char A-Z in group 1
  • (?= Positive lookahead, assert to the right
    • \s+\S+\s+\S+\s+(\d+(?:\.\d+)?) After 3 fields capture 1+ digits with an optional decimal part in group 2
    • \s+\S+\s+\S+\s+(\d+) After 3 fields, capture 1+ digits in group 3
  • ) Close lookahead

See a regex demo

An example using re.findall to retieve the capture group values:

import re

pattern = r"([A-Z]+)(?=\s+\S+\s+\S+\s+(\d+(?:\.\d+)?)\s+\S+\s+\S+\s+(\d+))"
s = r"ABC FED GHK   45.00 60.00 30.00  4 5 5"
print(re.findall(pattern, s))

Output

[('ABC', '45.00', '4'), ('FED', '60.00', '5'), ('GHK', '30.00', '5')]