I have a table in a .docx file that I want to extract into a pandas DataFrame. It has three columns and many rows, and looks like this:

| Column A | Column B | Column C |
| -------- | -------- | -------- |
| Cell 1 | Cell 2 | |
| Cell 3 | Cell 4 | |

When I extract the text and print it, the cells do not line up with their columns. It looks as though the blank cells in Column C are skipped, so Cell 3 ends up under Column C and Cell 4 under Column A. I have recreated below what the result looks like after extraction:

| Column A | Column B | Column C |
| -------- | -------- | -------- |
| Cell 1 | Cell 2 | Cell 3 |
| Cell 4 | Cell 5 | |

In fact, Column C is not technically blank: each cell contains a checkbox object, but it appears to have no string value.
Is there a way to fix this, for example by adding a line that extracts text from Columns A and B only? Any effective solution would be very helpful.
I found two code examples for extracting text from the table, but both have the same issue.
/Coding 1/
import pandas as pd
from docx import Document


def read_docx_table(document, table_num=1, nheader=1):
    # table_num is 1-based; nheader is the number of header rows (0, 1, or 2)
    table = document.tables[table_num - 1]
    data = [[cell.text for cell in row.cells] for row in table.rows]
    df = pd.DataFrame(data)
    if nheader == 1:
        # Promote the first row to column names
        df = df.rename(columns=df.iloc[0]).drop(df.index[0]).reset_index(drop=True)
    elif nheader == 2:
        # Build a two-level column index from the first two rows
        outside_col, inside_col = df.iloc[0], df.iloc[1]
        hier_index = pd.MultiIndex.from_tuples(list(zip(outside_col, inside_col)))
        df = pd.DataFrame(data, columns=hier_index).drop(df.index[[0, 1]]).reset_index(drop=True)
    elif nheader > 2:
        print("more than two headers not currently supported")
        df = pd.DataFrame()
    return df


doc = Document("OneDrive - /Downloads/Screen.docx")
table_num = 1
nheader = 1
df = read_docx_table(doc, table_num, nheader)
df.head(20)
/Coding 2/
import pandas as pd
from docx.api import Document

document = Document("OneDrive /Downloads/Screen.docx")
table = document.tables[0]
data = []
keys = None
for i, row in enumerate(table.rows):
    text = (cell.text for cell in row.cells)
    if i == 0:
        # The first row holds the column names
        keys = tuple(text)
        continue
    # Pair each header with the cell text in the same position
    row_data = dict(zip(keys, text))
    data.append(row_data)
print(data)