I'm using tabula
in order to concat
all tables
in the following pdf file
To be a one table within excel
format.
Here's my code:
from tabula import read_pdf
import pandas as pd
allin = []
for page in range(1, 115):
table = read_pdf("goal.pdf", pages=page,
pandas_options={'header': None})[0]
allin.append(table)
new = pd.concat(allin)
new.to_excel("out.xlsx", index=False)
Also i tried the following as well:
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='all', pandas_options={'header': None})
new = pd.concat(table, ignore_index=True)
new.to_excel("out.xlsx", index=False)
Current output: check
But the issue which am facing that from page# 91
i start to see the data not formatted correctly within the excel
file.
I've debug the page individually and i couldn't figure out why it's formatted wrongly especially it's within same format.
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='91', pandas_options={'header': None})[0]
print(table)
Example:
from tabula import read_pdf
import pandas as pd
table = read_pdf("goal.pdf", pages='90-91', pandas_options={'header': None})
new = pd.concat(table, ignore_index=True)
new.to_excel("out.xlsx", index=False)
Here I've ran the code for two pages 90 and 91.
starting from row# 48 you will see the difference here
Where you will notice the issue that name and address placed into one cell. And city and state placed into one call as well
I digged in source code and it has option columns and you can manually define column boundaries. When you set
columns
then you have to useguess=False
.tabula-py
uses program tabula-java and in its documentation I found that it needs values in percents or points (not pixels). So I used program inkscape to measure boundaries in points.Result:
EDIT:
It may need also option
area
(also in points) to skip headers. Or you will have to remove first row on first page.I didn't check all rows but it may need some changes in column boundaries.
EDIT:
Few rows make problem - probably because text in
City
is too long.Result: