Extracting Headings/Chapters and related paragraphs separately from PDF file in Python 3.7


My task is to extract the content of a PDF file chapter by chapter so that I can store it in a database. So far I have tried splitting with a regex, but that only gives me the chapter numbers and does not split the text into chapters. Next I converted the PDF to HTML and parsed it with the BeautifulSoup library. Now I have each page in a list, but how can I split further when a heading/chapter appears inside a page? For example, if the heading is 1.1.2, I want its content saved as a separate list entry. The same goes for 1.2.1, 2.3.1, 4.5, and so on.

from tika import parser
from bs4 import BeautifulSoup

# ext and text_path are defined earlier (extension and path of the input file)
if ext == ".pdf":
    file_data = []
    # Parse the PDF with Tika, requesting XHTML output
    raw_xml = parser.from_file(text_path, xmlContent=True)
    xhtml_data = BeautifulSoup(raw_xml['content'], features="lxml")
    print(xhtml_data.prettify())
    # Each <div class="page"> element holds one page of the PDF
    for page, content in enumerate(xhtml_data.find_all('div', attrs={'class': 'page'})):
        # Convert the page's XHTML back to plain text with Tika
        parsed_content = parser.from_buffer(str(content))
        # Add the page text (content can be None for an empty page)
        text = (parsed_content['content'] or '').strip()
        file_data.append(text)
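
One possible way to get from page text to per-heading content is sketched below. It is only a sketch, and it assumes that every heading starts a line with a dotted section number (1.1.2, 4.5, ...) followed by the heading text. The names HEADING_RE and split_by_headings are hypothetical helpers, not part of Tika or BeautifulSoup.

```python
import re

# Assumed heading shape: a dotted section number at the start of a line,
# followed by the heading text, e.g. "1.1.2 Scope" or "4.5 Results".
HEADING_RE = re.compile(r"^(\d+(?:\.\d+)+)[ \t]+(\S.*)$", re.MULTILINE)


def split_by_headings(text):
    """Return a list of (number, title, body) tuples, one per heading found."""
    matches = list(HEADING_RE.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        # The body runs from the end of this heading line to the next heading
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():body_end].strip()
        sections.append((match.group(1), match.group(2).strip(), body))
    return sections


# Made-up sample text standing in for the joined page texts
sample = "1.1.2 Scope\nFirst paragraph.\n\n1.2.1 Method\nSecond paragraph.\n"
for number, title, body in split_by_headings(sample):
    print(number, title, body, sep=" | ")
```

Since a chapter can run across a page break, it is probably better to join the pages first, e.g. split_by_headings("\n".join(file_data)), rather than splitting each page on its own. Text before the first heading is dropped by this sketch, and a body line that happens to start with something like "1.5 million" would be mistaken for a heading, so the pattern may need tightening for a real document.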
