I used the PDFPlumber library to extract all the lines in my PDF. A sample extracted line looks like this:
Total Return Transportation $16.01
The goal is to put all of these into a data frame. How do I use regex to group this line so that I may isolate the charge type and dollar amount?
Currently, I have:
totals = re.compile(r"(\ATotal) ([\w]+) ([\w]*)")
for line in text.split("\n"):
    line2 = totals.search(line)
    if line2:
        print(line)
        print(line2.group(1))
    else:
        pass
Group 1 returns "Total", Group 2 returns "Return" and Group 3 "Transportation" but I'm unable to make a group that retrieves the dollar amount. Any suggestions?
Note: Dollar amounts over $1000 contain a "," that might need to be included in the regex syntax
You could use a pattern with 4 capture groups:
\A(Total) (\w+) (\w+) (\$\d{1,3}(?:,\d{3})*(?:\.\d+)?)(?!\S)
Note that you can write [\w] as just \w.
Using \w* matches optional word characters, so it could also match an empty string.
Instead, you can match the word characters 1+ times, and use a pattern for the dollar amount that matches 1-3 digits at the left, followed by optional repetitions of a comma and 3 digits, and an optional decimal part.
\A    Start of string
(Total)    Capture Total in group 1 and match a space
(\w+)    Capture 1+ word chars in group 2 and match a space
(\w+)    Capture 1+ word chars in group 3 and match a space
(    Capture group 4
\$\d{1,3}    Match $ and 1-3 digits
(?:,\d{3})*(?:\.\d+)?    Optionally repeat a comma and 3 digits, and optionally match . and 1+ digits
)    Close group 4
(?!\S)    Assert a whitespace boundary to the right to prevent a partial match
See a regex demo and a Python demo.
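As a rough sketch of how the breakdown above could be applied, the snippet below assembles the pieces into one pattern and collects the charge type and dollar amount for each matching line. The sample text, the second and third sample lines, and the names rows, charge_type and amount are made up for illustration; in your case text would come from PDFPlumber. It also assumes you want groups 2 and 3 joined into a single charge type and the amount converted to a float.

```python
import re

# Pattern assembled from the parts described above: four capture groups,
# with group 4 holding the dollar amount (commas and decimals allowed).
totals = re.compile(
    r"\A(Total) (\w+) (\w+) (\$\d{1,3}(?:,\d{3})*(?:\.\d+)?)(?!\S)"
)

# Hypothetical sample text standing in for the extracted PDF lines.
text = (
    "Total Return Transportation $16.01\n"
    "Total Fuel Surcharge $1,234.50\n"
    "Subtotal Misc $5.00"
)

rows = []
for line in text.split("\n"):
    m = totals.search(line)
    if m:
        # Join groups 2 and 3 into one charge type (an assumption about
        # how you want the columns laid out).
        charge_type = f"{m.group(2)} {m.group(3)}"
        # Strip "$" and "," so the amount can be stored as a number.
        amount = float(m.group(4).replace("$", "").replace(",", ""))
        rows.append({"charge_type": charge_type, "amount": amount})

print(rows)
```

If pandas is installed, the list of dicts can then be passed straight to pandas.DataFrame(rows) to get the data frame. Lines that do not start with "Total" followed by exactly two words and an amount are skipped, so charge types with one or three words would need a looser pattern.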
Output