How to go about isolating dollar amounts using Regex?

I used the PDFPlumber library to extract all the lines in my PDF. A sample extracted line looks like this:

Total Return Transportation $16.01

The goal is to put all of these into a data frame. How do I use regex to group this line so that I may isolate the charge type and dollar amount?

Currently, I have:

import re

# `text` holds the page text extracted with PDFPlumber
totals = re.compile(r"(\ATotal) ([\w]+) ([\w]*)")
for line in text.split("\n"):
    line2 = totals.search(line)
    if line2:
        print(line)
        print(line2.group(1))

Group 1 returns "Total", Group 2 returns "Return" and Group 3 "Transportation" but I'm unable to make a group that retrieves the dollar amount. Any suggestions?

Note: Dollar amounts over $1000 contain a "," that might need to be included in the regex syntax

There are 2 best solutions below

0
On

You could use a pattern with 4 capture groups.

Note that you can write [\w] as just \w.

\w* matches zero or more word characters, so it can also match an empty string.

Instead, you can match one or more word characters with \w+, and match the dollar amount as $ followed by 1-3 digits, optional repetitions of a comma and 3 digits, and an optional decimal part.

\A(Total) (\w+) (\w+) (\$\d{1,3}(?:,\d{3})*(?:\.\d+)?)(?!\S)
  • \A Start of string
  • (Total) Capture Total in group 1 and match a space
  • (\w+) Capture 1+ word chars in group 2 and match a space
  • (\w+) Capture 1+ word chars in group 3 and match a space
  • ( Capture group 4
    • \$\d{1,3} Match $ and 1-3 digits
    • (?:,\d{3})*(?:\.\d+)? Optionally repeat a comma and 3 digits, then optionally match . and 1+ digits
  • ) Close group 4
  • (?!\S) Assert a whitespace boundary to the right to prevent a partial match
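
The breakdown above can also be written directly into the pattern with re.VERBOSE, which ignores unescaped whitespace and allows inline comments. This is only a sketch of the same pattern; the name TOTAL_LINE is made up for illustration, and literal spaces are written as [ ] because verbose mode would otherwise drop them.

```python
import re

# Same pattern as above, spread over several lines with comments.
TOTAL_LINE = re.compile(
    r"""
    \A(Total)[ ]        # group 1: the word Total, then a space
    (\w+)[ ]            # group 2: 1+ word chars, then a space
    (\w+)[ ]            # group 3: 1+ word chars, then a space
    (                   # group 4: the dollar amount
        \$\d{1,3}       #   a dollar sign and 1-3 digits
        (?:,\d{3})*     #   zero or more repetitions of a comma and 3 digits
        (?:\.\d+)?      #   optional dot and 1+ digits
    )
    (?!\S)              # no non-whitespace character directly to the right
    """,
    re.VERBOSE,
)

grouped = TOTAL_LINE.match("Total Return Transportation $1,006.01")
ungrouped = TOTAL_LINE.match("Total Return Transportation $1612.01")

print(grouped.groups())
print(ungrouped)  # None: four digits without a comma are rejected
```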

import re
 
strings = [
    "Total Return Transportation $16.01",
    "Total Return Transportation $123,899,116.01",
    "Total Return Transportation $1612.01"  # not matched: 4 digits without a comma
]
 
pattern = r"\A(Total) (\w+) (\w+) (\$\d{1,3}(?:,\d{3})*(?:\.\d+)?)(?!\S)"
 
for s in strings:
    match = re.match(pattern, s)
    if match:
        print(match.group(4))

Output

$16.01
$123,899,116.01
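
Since the goal in the question is a data frame, here is one possible way to turn the matches into rows. It is a sketch under some assumptions: the helper name parse_totals and the keys charge_type and amount are invented for illustration, the charge type is assumed to be groups 2 and 3 joined by a space, and the amount is converted to Decimal after stripping the $ and the commas.

```python
import re
from decimal import Decimal

PATTERN = re.compile(r"\A(Total) (\w+) (\w+) (\$\d{1,3}(?:,\d{3})*(?:\.\d+)?)(?!\S)")


def parse_totals(text):
    """Return one dict per matching line (hypothetical helper)."""
    rows = []
    for line in text.split("\n"):
        match = PATTERN.match(line)
        if match:
            # Assumption: the charge type is the two words after "Total".
            charge_type = f"{match.group(2)} {match.group(3)}"
            # Drop the dollar sign and thousands separators before converting.
            amount = Decimal(match.group(4).lstrip("$").replace(",", ""))
            rows.append({"charge_type": charge_type, "amount": amount})
    return rows


sample = (
    "Total Return Transportation $16.01\n"
    "Total Return Transportation $1,006.01\n"
    "Some unrelated line"
)
rows = parse_totals(sample)
print(rows)
```

A list of dicts like rows can be passed to pandas.DataFrame(rows) to get one column per key.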
1
On

Just change your regex like so:

>>> import re
>>> totals = re.compile(r"(\ATotal) ([\w]+) ([\w]*) ([\$ ]+?(\d+([,\.\d]+)?))")
>>> totals.search("Total Return Transportation $16.01").group(4)
'$16.01'
>>> totals.search("Total Return Transportation $1,006.01").group(4)
'$1,006.01'
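
As a hedged aside on this pattern: its nested group 5 holds the amount without the dollar sign, and the character class [,\.\d] accepts commas and dots in any order. The sketch below compares it with the stricter amount pattern from the first answer; the malformed input string is made up for illustration.

```python
import re

# Pattern from this answer, unchanged.
loose = re.compile(r"(\ATotal) ([\w]+) ([\w]*) ([\$ ]+?(\d+([,\.\d]+)?))")
# Stricter pattern from the first answer, for comparison.
strict = re.compile(r"\A(Total) (\w+) (\w+) (\$\d{1,3}(?:,\d{3})*(?:\.\d+)?)(?!\S)")

m = loose.search("Total Return Transportation $1,006.01")
with_sign = m.group(4)      # amount including the dollar sign
without_sign = m.group(5)   # nested group: amount without the dollar sign

# Hypothetical malformed amount, not taken from the question.
malformed = "Total Return Transportation $1,,0..1"
loose_hit = loose.search(malformed)
strict_hit = strict.search(malformed)

print(with_sign, without_sign)
print(loose_hit.group(4), strict_hit)
```

Whether the looser behaviour matters depends on how clean the extracted PDF text is.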