I used the PDFPlumber library to extract all the lines in my PDF. A sample extracted line looks like this:
Total Return Transportation $16.01
The goal is to put all of these into a data frame. How do I use regex to group this line so that I may isolate the charge type and dollar amount?
Currently, I have:
import re

# text holds the string extracted from the PDF
totals = re.compile(r"(\ATotal) ([\w]+) ([\w]*)")
for line in text.split("\n"):
    line2 = totals.search(line)
    if line2:
        print(line)
        print(line2.group(1))
    else:
        pass
Group 1 returns "Total", Group 2 returns "Return" and Group 3 "Transportation" but I'm unable to make a group that retrieves the dollar amount. Any suggestions?
Note: Dollar amounts over $1000 contain a "," that might need to be included in the regex syntax
You could use a pattern with 4 capture groups.
Note that you can write [\w] as just \w. Using \w* matches optional word characters, and could possibly also match an empty string.
You can match the word characters 1+ times instead, and use a pattern for the dollar amount that matches 1-3 digits at the left, followed by optional repetitions of a comma and 3 digits, and an optional decimal part.
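To illustrate the empty-string point, here is a minimal sketch comparing \w* with \w+. The sample line is hypothetical (it has only one word after "Total") and the names loose and strict are only for illustration.

```python
import re

# Same shape as the pattern in the question, with the third group optional (\w*)
loose = re.compile(r"\ATotal (\w+) (\w*)")
# Same pattern, but the third group requires at least one word character (\w+)
strict = re.compile(r"\ATotal (\w+) (\w+)")

# Hypothetical line with only one word between "Total" and the amount
line = "Total Return $16.01"

loose_match = loose.search(line)
strict_match = strict.search(line)

# \w* is allowed to match nothing, so the second group here is an empty string
print(repr(loose_match.group(2)))
# \w+ needs a word character where "$" sits, so there is no match at all
print(strict_match)
```

With \w* the line still matches but yields an empty group, which would then end up as an empty value in the data frame; with \w+ the line is skipped instead.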
\A                         Start of string
(Total)                    Capture Total in group 1 and match a space
(\w+)                      Capture 1+ word chars in group 2 and match a space
(\w+)                      Capture 1+ word chars in group 3 and match a space
(                          Capture group 4
\$\d{1,3}                  Match $ and 1-3 digits
(?:,\d{3})*(?:\.\d+)?      Optionally repeat a comma and 3 digits, and optionally match . and 1+ digits
)                          Close group 4
(?!\S)                     Assert a whitespace boundary to the right to prevent a partial match
See a regex demo and a Python demo.
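Putting the pieces above together, a minimal sketch could look like the following. The sample text, the second and third lines in it, and the column names charge_type and amount are assumptions for illustration; in the real script, text would be the string extracted from the PDF.

```python
import re

# Pattern assembled from the parts described above
totals = re.compile(
    r"\A(Total) (\w+) (\w+) (\$\d{1,3}(?:,\d{3})*(?:\.\d+)?)(?!\S)"
)

# Hypothetical stand-in for the text extracted from the PDF
text = (
    "Total Return Transportation $16.01\n"
    "Total Outbound Shipping $1,234.50\n"
    "Subtotal $5.00"
)

rows = []
for line in text.split("\n"):
    m = totals.search(line)
    if m:
        # Groups 2 and 3 together form the charge type
        charge_type = f"{m.group(2)} {m.group(3)}"
        # Group 4 is the amount; drop "$" and "," before converting to a number
        amount = float(m.group(4).replace("$", "").replace(",", ""))
        rows.append({"charge_type": charge_type, "amount": amount})

print(rows)
```

The resulting list of dictionaries can be passed to pandas.DataFrame(rows) to get one row per charge. Note that this sketch assumes exactly two words between "Total" and the amount, as in the sample line; charge types with a different number of words would need a more flexible pattern.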