I am trying to read the contents of a Word document that has a three-column layout. It contains text, images, and tables. However, when I read the file using the following code, the table in the second column and the table in the third column come out in the wrong order.

from docx import Document
from docx.oxml.table import CT_Tbl
from docx.oxml.text.paragraph import CT_P
from docx.table import Table
from docx.text.paragraph import Paragraph


def extract_text_and_images_from_word(filepath, output_file):
    document = Document(filepath)
    with open(output_file, 'w', encoding='utf-8') as f:
        # Walk the direct children of the body in the order they appear in the XML
        for element in document.element.body:
            if isinstance(element, CT_Tbl):  # Check if the element is a table
                table = Table(element, document)
                for row in table.rows:
                    for cell in row.cells:
                        for paragraph in cell.paragraphs:
                            cell_text = paragraph.text.strip()
                            f.write(cell_text + '\n')
            elif isinstance(element, CT_P):  # Check if the element is a paragraph
                # Wrap the XML element directly. Its position among the body
                # children is not a valid index into document.paragraphs,
                # because the body also contains tables.
                paragraph = Paragraph(element, document)
                paragraph_text = paragraph.text.strip()
                f.write(paragraph_text + '\n')

I even checked document.xml after renaming the Word file to .zip and extracting it, and the two tables appear out of order in the XML as well. My understanding is that Word saves the XML by converting the document back to a single-column layout, which is causing the problem. Any help with this would be greatly appreciated.
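
For reference, the manual document.xml check described above can be scripted. This is only a minimal sketch using the standard library: the helper names (`read_document_xml`, `body_outline`, `column_counts`) are made up for illustration, and it assumes the main part lives at the usual `word/document.xml` path inside the archive. It lists the top-level body elements in stored order and reads the column count from the section properties, since the column layout is stored as a section setting (`w:cols`) while the body content itself is one linear sequence.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_document_xml(docx_path):
    """Read the main document part from a .docx (which is a zip archive)."""
    with zipfile.ZipFile(docx_path) as zf:
        return zf.read("word/document.xml")


def body_outline(document_xml):
    """Return (kind, text preview) for each top-level body child, in XML order."""
    root = ET.fromstring(document_xml)
    body = root.find(f"{{{W}}}body")
    outline = []
    for child in body:
        kind = child.tag.rsplit("}", 1)[-1]  # e.g. 'p', 'tbl', 'sectPr'
        # Concatenate all text runs found anywhere inside this element
        text = "".join(t.text or "" for t in child.iter(f"{{{W}}}t"))
        outline.append((kind, text[:40]))
    return outline


def column_counts(document_xml):
    """Return the w:num attribute of every w:cols element (may be None if unset)."""
    root = ET.fromstring(document_xml)
    return [cols.get(f"{{{W}}}num") for cols in root.iter(f"{{{W}}}cols")]
```

Calling `body_outline(read_document_xml("input.docx"))` (the file name here is a placeholder) shows where each table sits in the stored sequence, which can then be compared with the order in which the tables appear on the page.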