Extract concatenated p tags and table tag separately into a list

I have a number of p tags interleaved with table tags, which I am retrieving in document order into a list, content_items. I want to join consecutive p tags and, once a table is found, append what I have collected so far and then add the table as a separate item in the list. I am able to collect the tables, but for some reason I am unable to collect and join the p tags that come before each table. Code so far:

from bs4 import BeautifulSoup, NavigableString
import html2text
import json

converter = html2text.HTML2Text()
soup = BeautifulSoup(data3, 'html.parser')
content_items = []  # List to store the content items

for tag in soup.descendants:
    content_dict = {'Title': "35.23.060 - DR Zone Standards", 'Content': ''}
    
    if tag.name == "p":
        content_dict['Content'] += converter.handle(str(tag))
 
    elif tag.name == "table":
        if content_dict['Content']:
            content_items.append(content_dict)
        content_dict['Content'] = converter.handle(str(tag))
        content_items.append(content_dict)     

# Print the extracted data
print(json.dumps(content_items, indent=4))

Answered by Mahesh Prajapati

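The problem lies in the placement of the content_dict initialization inside the loop. With your current code, the dictionary is re-created on every iteration, so the paragraph text collected in one iteration is discarded before the next, and 'Content' is always empty by the time a table is reached. You should move the dictionary initialization outside the loop, before it starts, and only start a new dictionary after a table has been appended. The table also needs a dictionary of its own: appending content_dict and then overwriting its 'Content' would change the item that is already in the list. Finally, append whatever is left once the loop ends, otherwise paragraphs that follow the last table are lost.

One caveat: soup.descendants also yields p tags nested inside a table, so if your table cells contain paragraphs, their text will be collected a second time.

from bs4 import BeautifulSoup
import html2text
import json

TITLE = "35.23.060 - DR Zone Standards"

converter = html2text.HTML2Text()
soup = BeautifulSoup(data3, 'html.parser')
content_items = []  # List to store the content items

# Create the dictionary once, before the loop, so paragraph text accumulates
content_dict = {'Title': TITLE, 'Content': ''}

for tag in soup.descendants:
    if tag.name == "p":
        content_dict['Content'] += converter.handle(str(tag))

    elif tag.name == "table":
        # Append the paragraphs collected so far, if any
        if content_dict['Content']:
            content_items.append(content_dict)
        # Store the table as its own, separate item
        content_items.append({'Title': TITLE, 'Content': converter.handle(str(tag))})
        # Start a fresh dictionary for the paragraphs after this table
        content_dict = {'Title': TITLE, 'Content': ''}

# Append any paragraphs that follow the last table
if content_dict['Content']:
    content_items.append(content_dict)

# Print the extracted data
print(json.dumps(content_items, indent=4))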
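
If it helps to see the grouping logic on its own, here is a minimal sketch that leaves out the HTML parsing entirely. It assumes the tags have already been turned into a list of (kind, text) pairs in document order, for example (tag.name, converter.handle(str(tag))). The function name group_blocks and the sample data are made up for illustration and are not part of BeautifulSoup or html2text. It uses itertools.groupby from the standard library to merge each run of consecutive paragraphs while keeping every table as a separate item.

```python
from itertools import groupby

TITLE = "35.23.060 - DR Zone Standards"


def group_blocks(blocks, title=TITLE):
    """Merge runs of consecutive paragraphs; keep each table as its own item.

    `blocks` is a hypothetical list of (kind, text) pairs in document order,
    where kind is either "p" or "table".
    """
    items = []
    # groupby yields one group per run of adjacent pairs with the same kind
    for kind, run in groupby(blocks, key=lambda block: block[0]):
        texts = [text for _, text in run]
        if kind == "p":
            # A run of paragraphs becomes a single joined item
            items.append({"Title": title, "Content": "".join(texts)})
        else:
            # Every table stays separate, even when two tables are adjacent
            items.extend({"Title": title, "Content": text} for text in texts)
    return items


# Made-up sample data standing in for the converted tags
blocks = [
    ("p", "First paragraph.\n\n"),
    ("p", "Second paragraph.\n\n"),
    ("table", "| a | b |\n"),
    ("p", "Third paragraph.\n\n"),
]
items = group_blocks(blocks)
```

With the sample data, items holds three dictionaries: the first two paragraphs joined together, then the table, then the trailing paragraph. Because each run is closed by groupby itself, there is no separate step after the loop for paragraphs that follow the last table.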