Python regex: capturing multiple matches as separate observations


I am trying to create the variables location, contract items, contract code, and federal aid using regex on the following text:

    PAGE    1

                BID OPENING DATE    07/25/18    FROM 0.2 MILES WEST OF ICE HOUSE        07/26/18 CONTRACT NUMBER    03-2F1304   ROAD TO 0.015 MILES WEST OF CONTRACT CODE 'A '

            LOCATION    03-ED-50-39.5/48.7  DIVISION HIGHWAY ROAD   44 CONTRACT ITEMS

        INSTALL SANDTRAPS AND PULLOUTS  FEDERAL AID ACNH-P050-(146)E

PAGE    1

                    BID OPENING DATE    07/25/18    IN EL DORADO COUNTY AT VARIOUS          07/26/18 CONTRACT NUMBER     03-2H6804  LOCATIONS ALONG ROUTES 49 AND 193   CONTRACT CODE 'C ' LOCATION 03-ED-0999-VAR          13 CONTRACT ITEMS



        TREE REMOVAL    FEDERAL AID NONE

PAGE    1

                BID OPENING DATE    07/25/18    IN LOS ANGELES, INGLEWOOD AND       07/26/18 CONTRACT NUMBER    07-296304   CULVER CITY, FROM I-105 TO PORT CONTRACT CODE 'B '

            LOCATION    07-LA-405-R21.5/26.3    ROAD UNDERCROSSING  55 CONTRACT ITEMS



        ROADWAY SAFETY IMPROVEMENT  FEDERAL AID ACIM-405-3(056)E

This text comes from one Word file; I'll be looping my code over multiple doc files. The text above contains three sets of location, contract items, contract code, and federal aid. But when I use regex to create the variables, only the first instance of each is captured.

The code I have right now is:

# imports
import os
import pandas as pd
import re
import docx2txt
import textract
import antiword

all_bod = []
all_cn = []
all_location = []
all_fedaid = []
all_contractcode = []
all_contractitems = []
all_file = []

text = """    PAGE    1

            BID OPENING DATE    07/25/18    FROM 0.2 MILES WEST OF ICE HOUSE        07/26/18 CONTRACT NUMBER    03-2F1304   ROAD TO 0.015 MILES WEST OF CONTRACT CODE 'A '

        LOCATION    03-ED-50-39.5/48.7  DIVISION HIGHWAY ROAD   44 CONTRACT ITEMS

    INSTALL SANDTRAPS AND PULLOUTS  FEDERAL AID ACNH-P050-(146)E

PAGE    1

                BID OPENING DATE    07/25/18    IN EL DORADO COUNTY AT VARIOUS          07/26/18 CONTRACT NUMBER     03-2H6804  LOCATIONS ALONG ROUTES 49 AND 193   CONTRACT CODE 'C ' LOCATION 03-ED-0999-VAR          13 CONTRACT ITEMS



    TREE REMOVAL    FEDERAL AID NONE

    PAGE    1

            BID OPENING DATE    07/25/18    IN LOS ANGELES, INGLEWOOD AND       07/26/18 CONTRACT NUMBER    07-296304   CULVER CITY, FROM I-105 TO PORT CONTRACT CODE 'B '

        LOCATION    07-LA-405-R21.5/26.3    ROAD UNDERCROSSING  55 CONTRACT ITEMS



    ROADWAY SAFETY IMPROVEMENT  FEDERAL AID ACIM-405-3(056)E"""

bod1 = re.search('BID OPENING DATE \s+ (\d+\/\d+\/\d+)', text)
bod2 = re.search('BID OPENING DATE\n\n(\d+\/\d+\/\d+)', text)
    
if not(bod1 is None):
    bod = bod1.group(1)
elif not(bod2 is None):
    bod = bod2.group(1)
else:
    bod = 'NA'
    
all_bod.append(bod)
    
# creating contract number
cn1 = re.search('CONTRACT NUMBER\n+(.*)', text)
cn2 = re.search('CONTRACT NUMBER\s+(.........)', text)
    
if not(cn1 is None):
   cn = cn1.group(1)
elif not(cn2 is None):
   cn = cn2.group(1)
else:
   cn = 'NA'
    
all_cn.append(cn)
    
# location
    
location1 = re.search('LOCATION \s+\S+', text)
location2 = re.search('LOCATION \n+\S+', text)
    
if not(location1 is None):
    location = location1.group(0)
elif not(location2 is None):
    location = location2.group(0)
else:
    location = 'NA'
    
all_location.append(location)
    
# federal aid
    
fedaid = re.search('FEDERAL AID\s+\S+', text)
fedaid = fedaid.group(0)
    
all_fedaid.append(fedaid)
    
# contract code
    
contractcode = re.search('CONTRACT CODE\s+\S+', text)
contractcode = contractcode.group(0)
    
all_contractcode.append(contractcode)
    
# contract items
    
contractitems = re.search('\d+ CONTRACT ITEMS', text)
contractitems = contractitems.group(0)
    
all_contractitems.append(contractitems)

This code parses only the first instance of each variable in the text:

contract-number location contract-items contract-code federal-aid
03-2F1304 03-ED-50-39.5/48.7 44 A ACNH-P050-(146)E

But I am trying to find a way to get every instance, each as a separate observation:

contract-number location contract-items contract-code federal-aid
03-2F1304 03-ED-50-39.5/48.7 44 A ACNH-P050-(146)E
03-2H6804 03-ED-0999-VAR 13 C NONE
07-296304 07-LA-405-R21.5/26.3 55 B ACIM-405-3(056)E

The all_* lists in the code are for looping over multiple Word files; we can ignore them if we want :).

Any leads would be super helpful. Thanks so much!
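
A note on why only one row comes back: `re.search` returns the first match and stops, whereas `re.findall` and `re.finditer` scan the whole string. A minimal sketch of the difference, using a shortened, made-up `sample` string in place of the full text (the names `sample`, `pattern`, `first_only`, and `all_values` are illustrative only):

```python
import re

# Shortened stand-in for the bid text (illustrative, not the real file)
sample = (
    "FEDERAL AID ACNH-P050-(146)E\n"
    "FEDERAL AID NONE\n"
    "FEDERAL AID ACIM-405-3(056)E"
)

pattern = r"FEDERAL AID\s+(\S+)"

# re.search stops at the first occurrence
first_only = re.search(pattern, sample).group(1)

# re.findall returns the captured group for every occurrence
all_values = re.findall(pattern, sample)

print(first_only)
print(all_values)
```

With a single capturing group, `re.findall` returns a list of the captured strings, which is usually enough when only the values are needed.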

Answer (by pinky):

import re
import pandas as pd

data = []
df = pd.DataFrame()

regex_contract_number =r"(?:CONTRACT NUMBER\s+(?P<contract_number>\S+?)\s)"
regex_location = r"(?:LOCATION\s+(?P<location>\S+))"
regex_contract_items = r"(?:(?P<contract_items>\d+)\sCONTRACT ITEMS)"
# no trailing \s here, so a value at the very end of the text still matches
regex_federal_aid = r"(?:FEDERAL AID\s+(?P<federal_aid>\S+))"
regex_contract_code =r"(?:CONTRACT CODE\s+\'(?P<contract_code>\S+?)\s)"
regexes = [regex_contract_number,regex_location,regex_contract_items,regex_federal_aid,regex_contract_code]

for regex in regexes:
    for match in re.finditer(regex, text):
        data.append(match.groupdict())
    df = pd.concat([df, pd.DataFrame(data)], axis=1)
    data = []

df
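
One caveat with building each column independently: if a field is missing from one record, the columns may no longer line up row by row. A possible alternative, sketched here under the assumption that every record starts with a `PAGE` line (as in the sample text), is to split the text into records first and search each record separately. The names `FIELDS`, `parse_records`, and `sample` are made up for illustration, and the patterns are adapted from the ones above rather than tested on real files.

```python
import re

# Hypothetical field patterns, adapted from the regexes in the answer
FIELDS = {
    "contract_number": r"CONTRACT NUMBER\s+(\S+)",
    "location": r"LOCATION\s+(\S+)",
    "contract_items": r"(\d+)\s+CONTRACT ITEMS",
    "contract_code": r"CONTRACT CODE\s+'(\w+)",
    "federal_aid": r"FEDERAL AID\s+(\S+)",
}


def parse_records(text):
    """Split text on PAGE headers and extract one dict per record."""
    rows = []
    # Assumption: each record begins with a header like "PAGE    1"
    for record in re.split(r"\bPAGE\s+\d+", text):
        if not record.strip():
            continue  # skip the empty chunk before the first header
        row = {}
        for name, pattern in FIELDS.items():
            match = re.search(pattern, record)
            # A field missing from this record becomes "NA"
            row[name] = match.group(1) if match else "NA"
        rows.append(row)
    return rows


# Shortened version of the question's text (two of the three records)
sample = """PAGE    1
BID OPENING DATE 07/25/18 CONTRACT NUMBER    03-2F1304   ROAD CONTRACT CODE 'A '
LOCATION    03-ED-50-39.5/48.7  DIVISION HIGHWAY ROAD   44 CONTRACT ITEMS
INSTALL SANDTRAPS AND PULLOUTS  FEDERAL AID ACNH-P050-(146)E
PAGE    1
CONTRACT NUMBER     03-2H6804  LOCATIONS ALONG ROUTES 49 AND 193   CONTRACT CODE 'C ' LOCATION 03-ED-0999-VAR          13 CONTRACT ITEMS
TREE REMOVAL    FEDERAL AID NONE
"""

rows = parse_records(sample)
for row in rows:
    print(row)
```

If pandas is available, `pd.DataFrame(rows)` should turn the list of dicts into one row per contract. When looping over several files, the lists returned by `parse_records` can be concatenated before building the frame.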
