My task is to extract chapter-wise content from a PDF file so that I can store each chapter separately in a database. So far I have tried regex-based splitting, but that only gave me the chapter numbers and did not split the chapter content. Next, I converted the PDF to HTML and parsed it with the BeautifulSoup library. Now I have each page as an item in a list, but how can I split further when a heading/chapter starts inside a page? For example, if the heading is 1.1.2, I want its content saved as a separate list item. The same goes for 1.2.1, 2.3.1, 4.5, and so on.
from tika import parser
from bs4 import BeautifulSoup

# ext and text_path come from the surrounding code (not shown)
if ext == ".pdf":
    file_data = []
    # Parse the PDF with Tika and request XHTML output
    raw_xml = parser.from_file(text_path, xmlContent=True)
    xhtml_data = BeautifulSoup(raw_xml['content'], features="lxml")
    print(xhtml_data.prettify())
    # Each PDF page is a <div class="page"> in the XHTML
    for page, content in enumerate(xhtml_data.find_all('div', attrs={'class': 'page'})):
        # Convert the page's XHTML back to plain text with Tika
        parsed_content = parser.from_buffer(str(content))
        # Add pages (content can be None for an empty page)
        text = (parsed_content['content'] or '').strip()
        file_data.append(text)
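
To make the goal concrete, here is a rough sketch of the kind of split I am after. It is not a tested solution for real PDFs. It assumes that every heading sits on its own line, starts with a dotted number such as 1.1.2 or 4.5, and is followed by a title beginning with a capital letter. The names `HEADING_RE`, `split_by_headings` and `sample_pages` are made up for illustration; `sample_pages` stands in for `file_data`.

```python
import re

# Assumed heading shape: a line that starts with a dotted number such as
# "1", "1.1.2" or "4.5", followed by a title that begins with a capital letter.
HEADING_RE = re.compile(
    r"^[ \t]*(\d+(?:\.\d+)*)\.?[ \t]+([A-Z][^\n]*)$", re.MULTILINE
)


def split_by_headings(text):
    """Return a list of (number, title, body) tuples, one per heading found."""
    matches = list(HEADING_RE.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        # The body runs from the end of this heading line
        # to the start of the next heading (or the end of the text).
        body_end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():body_end].strip()
        sections.append((match.group(1), match.group(2).strip(), body))
    return sections


# Hypothetical stand-in for file_data: one plain-text string per page.
sample_pages = [
    "1.1.2 Scope\nThis part covers scope.\nIt continues",
    "onto the next page.\n1.2.1 Goals\nGoals text.",
]

# Join the pages first so a section that spans a page break stays together.
sections = split_by_headings("\n".join(sample_pages))
for number, title, body in sections:
    print(number, "|", title, "|", repr(body))
```

Any text before the first heading is dropped by this sketch, and a numbered list item such as "1. Item" would be mistaken for a heading, so the pattern would need tightening for a real document.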