Trouble splitting out the city, state and zip from website

I can't extract the city, state, and zip that follow the mailing address after the "br" tags. Every other field on the page extracts without any problem.

        element = soup.find(lambda tag: tag.name=='th' and 'Mailing Address:' in tag.text)
        if not element:
            print(f"Mailing Address not found for {parcel_number}.")
            return None
        mailing_address = element.find_next_sibling().decode_contents()
        address_lines = mailing_address.split('<br>')
        address = address_lines[0].strip() # first line is the address
        city_state_zip_br = element.find_next_sibling('br') # find the br tag containing the city, state, and zip
        if city_state_zip_br:
            city_state_zip = city_state_zip_br.next_sibling.strip()
            parts = city_state_zip.split(', ')
            if len(parts) < 3:
                city = ""
                state = ""
                zip_code = ""
            else:
                city = parts[0].strip()
                state, zip_code = parts[1].strip().split(' ')
                zip_code = zip_code.strip()
        else:
            city = ""
            state = ""
            zip_code = ""

Here is the HTML. The values are padded with spaces, so the city, state, and zip sit far to the right on the last line.

<tr><th>Parcel Number:</th><td>1207250000015003</td></tr>
    <tr><th>Type:</th><td>Real</td></tr>
    <tr><th>Property Class:</th><td>2                  </td></tr>
    <tr class="active form-table-title"><th colspan="2">Location</th></tr>
    <tr><th>Address:</th><td>11875 HIGHWAY 43 N AXIS, AL 36505 </td></tr>
    <tr class="active form-table-title">
        <th colspan="2">
            Owner
        </th>
    </tr>
    <tr><th>Name:</th><td>TOWER LOT 1 LLC                                                                                                                                                                                                                                                </td></tr>
    <tr><th>Mailing Address:</th><td>P O BOX 336                                                                                                                                                                                                                                                    <br>                                                                                                                                                                                                                                                               <br>BIRMINGHAM                                       , AL                  35201-0336   
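In the markup above, the `<br>` tags sit inside the `<td>`, so they are children of that cell rather than siblings of the `<th>`, and `element.find_next_sibling('br')` has nothing to find. BeautifulSoup also usually serializes the tag as `<br/>`, so a literal `split('<br>')` on the `decode_contents()` string may not split at all. Below is a minimal sketch of one possible workaround that uses only the standard library on the cell's inner HTML. The helper name `split_on_br` and the shortened sample string are assumptions for illustration, not part of the original code.

```python
import re
from typing import List

# Matches <br>, <br/> and <br /> in any letter case.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def split_on_br(cell_html: str) -> List[str]:
    """Split a cell's inner HTML on <br> tags (hypothetical helper).

    Runs of whitespace inside each piece are collapsed to one space,
    and pieces that end up empty are dropped.
    """
    pieces = BR_TAG.split(cell_html)
    cleaned = (" ".join(piece.split()) for piece in pieces)
    return [piece for piece in cleaned if piece]


# Shortened version of the padded mailing-address cell shown above.
cell = "P O BOX 336      <br>        <br>BIRMINGHAM       , AL        35201-0336   "
lines = split_on_br(cell)
print(lines)  # ['P O BOX 336', 'BIRMINGHAM , AL 35201-0336']
```

With the cell split this way, `lines[0]` is the street or PO box line and `lines[-1]` is the city, state, and zip line, however much padding the page adds.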

With the new code below, the state and zip come out perfectly, but the city is pulling in the street address along with it. Any suggestions?

The printout looks like this: city: 10163 KALI OKA RD EIGHT MILE state: AL zip_code: 36613-8790

New code:

        mailing_address_lines = [line.strip() for line in mailing_address.split('\n') if line.strip()]
        address = mailing_address_lines[0] # first line is the address
        print(f"address: {address}")
        city_state_zip_pattern = r'^(.+?),\s+([A-Z]{2})\s+(\d{5}(?:-\d{4})?)$'
        city_state_zip_match = re.match(city_state_zip_pattern, mailing_address_lines[-1])
        if city_state_zip_match:
            city = city_state_zip_match.group(1)
            state = city_state_zip_match.group(2)
            zip_code = city_state_zip_match.group(3)
        else:
            city_state_zip = mailing_address_lines[-1]
            address_parts = address.split(city_state_zip)
            if len(address_parts) > 1:
                city = address_parts[1].strip()
            else:
                city = ""
            city_state_zip_parts = city_state_zip.split()
            state = city_state_zip_parts[-2]
            zip_code = city_state_zip_parts[-1]
            address = address_parts[0].strip()
        print(f"address: {address}")
        print(f"city: {city}")
        print(f"state: {state}")
        print(f"zip_code: {zip_code}")
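One plausible cause of the street address leaking into the city: the cell text contains no `'\n'` characters, so `split('\n')` leaves everything on one line, and a pattern anchored at `^` captures all text before the comma. Here is a hedged sketch of a pattern that tolerates the padding around the comma and is meant to be applied only to the last line of the address. The function name `parse_city_state_zip` and the combined sample line are assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# City is everything before the comma; padding on either side is ignored.
CITY_STATE_ZIP = re.compile(
    r"^\s*(?P<city>[^,]+?)\s*,\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)\s*$"
)


def parse_city_state_zip(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (city, state, zip) for a 'CITY , ST 12345-6789' line, else None."""
    match = CITY_STATE_ZIP.match(line)
    if match is None:
        return None
    return match.group("city"), match.group("state"), match.group("zip")


# Applied to the last line only, the city comes out clean.
print(parse_city_state_zip("BIRMINGHAM            , AL          35201-0336   "))
# ('BIRMINGHAM', 'AL', '35201-0336')

# Applied to an unsplit line (assumed shape), the street is swept into the city.
print(parse_city_state_zip("10163 KALI OKA RD EIGHT MILE , AL 36613-8790"))
# ('10163 KALI OKA RD EIGHT MILE', 'AL', '36613-8790')
```

The second call reproduces the reported printout, which suggests the fix belongs in how the cell is split into lines rather than in the pattern itself.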