Can I selectively extract text from the table using Python-docx?

20 Views Asked by MaxK At 05 February 2024 at 10:21

I have a table in a .docx file that I want to extract into a pandas DataFrame. It has three columns and numerous rows, and looks like this:

| Column A | Column B | Column C |
| -------- | -------- | -------- |
| Cell 1   | Cell 2   |          |
| Cell 3   | Cell 4   |          |

When I extract the text and print it, the cells do not line up with their columns. It looks as if the blank cells in Column C are skipped, so Cell 3 ends up under Column C and Cell 4 under Column A. I have recreated below what the result looks like after extraction.

| Column A | Column B | Column C |
| -------- | -------- | -------- |
| Cell 1   | Cell 2   | Cell 3   |
| Cell 4   | Cell 5   |          |

In fact, Column C is not technically blank: each cell contains a checkbox object, but it appears to have no string value.

Is there a way to fix this, for example by adding a line that extracts text from Columns A and B only? Any working solution would be very helpful.

I found two code examples for extracting text from a table, but both have the same issue.

/Coding 1/

import pandas as pd 
from docx import Document
def read_docx_table(document, table_num=1,nheader=1):
    table = document.tables[table_num-1]
    data = [[cell.text for cell in row.cells] for row in table.rows]
    df = pd.DataFrame(data)
    if nheader == 1:
        df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)
    elif nheader ==2:
        outside_col, inside_col = df.iloc[0], df.iloc[1]
        hier_index = pd.MultiIndex.from_tuples(list(zip(outside_col,inside_col)))
        df = pd.DataFrame(data,columns=hier_index).drop(df.index[[0,1]]).reset_index(drop=True)
    elif nheader > 2: 
        print ("more than two headers not currently supported")
        df = pd.DataFrame()
    return df 

doc = Document("OneDrive - /Downloads/Screen.docx")
table_num=1
nheader=1
df = read_docx_table(doc,table_num,nheader)
df.head(20)

/Coding 2/

import pandas as pd
from docx.api import Document

document = Document("OneDrive /Downloads/Screen.docx")
table = document.tables[0]

data = []

keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)

    if i == 0:
        keys = tuple(text)
        continue
    row_data = dict(zip(keys, text))
    data.append(row_data)

print(data)
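
For reference, the kind of selective extraction I am asking about might look roughly like the sketch below. It is only a sketch under assumptions: `extract_columns` and `col_indices` are hypothetical names of mine, not part of python-docx, and it assumes that Columns A and B are still at positions 0 and 1 of `row.cells` in every row. If the checkbox cells make python-docx return shifted or missing cells before those positions, this alone would not fix the alignment. Rows that come back with fewer cells than expected are padded with empty strings instead of borrowing values from the next row.

```python
def extract_columns(table, col_indices=(0, 1)):
    """Read only the cells at col_indices from each row of a table.

    `table` is expected to behave like a python-docx Table: it has `.rows`,
    each row has `.cells`, and each cell has `.text`.
    The first row is treated as the header row.
    """
    picked = []
    for row in table.rows:
        cells = row.cells
        # Take only the requested positions; pad with "" if the row is short
        picked.append(
            [cells[i].text.strip() if i < len(cells) else "" for i in col_indices]
        )
    if not picked:
        return []
    keys, body = picked[0], picked[1:]
    # One dict per data row, keyed by the header text of the kept columns
    return [dict(zip(keys, values)) for values in body]


# Hypothetical usage with python-docx and pandas (path shortened as in my code above):
# from docx import Document
# import pandas as pd
# table = Document("Screen.docx").tables[0]
# df = pd.DataFrame(extract_columns(table, col_indices=(0, 1)))
```

Returning a list of dicts keeps the function independent of pandas, and the result can be passed straight to `pd.DataFrame`.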
