I have this really irritating problem resulting from a dodgy PDF digitisation.
Anyway, a series of rows with different columns, ideally, would be represented like this:
Code Cost Quantity
ABC 45.00 4
FED 60.00 5
GHK 30.00 5
With regex it is easy to split these into rows and then get each individual column.
However, I find one particularly annoying bit of text always comes out like this instead:
Code Cost Quantity
ABC FED GHK 45.00 60.00 30.00 4 5 5
I cannot for the life of me figure out how to get the regex to separate out each of these overlapped rows, as in the first example. Positive lookaheads can get me some of the way, but what typically happens is that ill get ABC 45.00 4 and then FED 45.00 4, the lookaheads I've built don't iterate through all of the separate columns.
My suspicion is I could use a named pattern or something, matching the first set:
(?>(?<match1>((?>\s|\b)\w{3}\s).+\s+\s(\d+\.\d{2})\s.*\s+\s(\d{1})\s.*))
and then somehow Reuse that capture group, iterating it.
Sticking a positive look ahead only iterates the first group, so I'm obviously doing something stupid:
https://regex101.com/r/Uxx8bZ/1
In theory I could separate out the rows some other way (e.g. for every big space, that's one column) but it seems like this should be possible.
Help appreciated!
In the example data, the accompanying data is 3 times whitespace chars followed by non whitespace chars to the right.
If that structure is always the same, you can capture the uppercase chars, and capture the other 2 fields inside a lookahead assertion.
([A-Z]+)
Capture 1+ times a char A-Z in group 1(?=
Positive lookahead, assert to the right\s+\S+\s+\S+\s+(\d+(?:\.\d+)?)
After 3 fields capture 1+ digits with an optional decimal part in group 2\s+\S+\s+\S+\s+(\d+)
After 3 fields, capture 1+ digits in group 3)
Close lookaheadSee a regex demo
An example using re.findall to retieve the capture group values:
Output