Python: from pdf to pandas table

169 Views Asked by At

I have a table on a pdf that looks like this:

enter image description here

I want to transform it into a pandas dataframe. This is what I tried :

import tabula

df = tabula.read_pdf(filename, pages=pages)[0]

If I am going to print the result this is what I am going to have:

Pos    Name    Surname    DateOfBirth   Address
1      James    Brown     1923-01-02    1313 E Main St, 
Nan    Nan      Nan       Nan           Portage MI 49024-
Nan    Nan      Nan       Nan           2001
2      Abram    Red       1934-07-15    1313 E Main St,
Nan    Nan      Nan       Nan           Portage MI 49024-
Nan    Nan      Nan       Nan           2001

And so on..

How can I obtain the desired output? In other words, how can I say that "1313 E Main St, Portage MI 49024-2001" belongs to only one row ?

1

There are 1 best solutions below

0
Timeless On

I can't reproduce the issue with three different libraries :

:

# https://github.com/chezou/tabula-py
from tabula.io import read_pdf

df = read_pdf("file.pdf", pages=[1])[0]

:

# https://github.com/jsvine/pdfplumber
import pdfplumber

pdf = pdfplumber.open("file.pdf")

header, *data = pdf.pages[0].extract_tables()[0]

df = pd.DataFrame(data, columns=header)

:

#https://github.com/pymupdf/PyMuPDF
import fitz

pdf = fitz.open("file.pdf")

df = pdf[0].find_tables().tables[0].to_pandas()

Output :

print(df)

  POS   NAME SURNAME DATE OF BIRTH                                    ADDRESS
0   1  James   Brown    1923-01-02  1313 E Main St,\nPortage MI 49024-\n2001.
1   2  Abram     Red    1934-07-15  1313 E Main St,\nPortage MI 49024-\n2001.
2   3    Leo    Pink    1965-06-12  1313 E Main St,\nPortage MI 49024-\n2001.
3   4   Dane    Blue    1977-04-14  1313 E Main St,\nPortage MI 49024-\n2001.

[4 rows x 5 columns]

NB : They all give the same output except for tabula (where \n is replaced by \r).

PDF used (file.pdf) :

enter image description here

Generated with :

import pandas as pd

df = pd.DataFrame(
    {'POS': [1, 2, 3, 4],
     'NAME': ['James', 'Abram', 'Leo', 'Dane'],
    'SURNAME': ['Brown', 'Red', 'Pink', 'Blue'],
    'DATE OF BIRTH': ['1923-01-02', '1934-07-15', '1965-06-12', '1977-04-14'],
    'ADDRESS': ['1313 E Main St,\nPortage MI 49024-\n2001.',
    '1313 E Main St,\nPortage MI 49024-\n2001.',
    '1313 E Main St,\nPortage MI 49024-\n2001.',
    '1313 E Main St,\nPortage MI 49024-\n2001.']
    }
)

from xlsxwriter import utility
with pd.ExcelWriter("file.xlsx") as writer :
    df.to_excel(writer, index=False)
    idx_col = df.columns.get_loc("ADDRESS")
    wb = writer.book; ws = writer.sheets["Sheet1"]
    wrap = wb.add_format({"text_wrap": True})
    ws.set_column(0, 4, 15, wrap)
    borders = wb.add_format(
        {"bottom": 1, "top": 1, "left": 1, "right": 1}
    );ws.conditional_format(utility.xl_range(
        0, 0, len(df), len(df.columns)-1),
    {"type": "no_errors", "format": borders})
    ws.autofit()
    
import xlwings as xw

with xw.Book("file.xlsx") as wb:
    wb.sheets.active.to_pdf("file.pdf")