Extract tables from multi-column pdf using Python


I have a PDF with the following two-column layout:

Lorem ipsum dolor sit amet, consectetur        |Table 2                                        | 
adipiscing elit. Praesent in tortor consequat, |+---------------------------------------------+|
rutrum dolor fringilla, gravida felis.         ||              |               |              ||
Suspendisse quis condimentum diam, ut congue   ||              |               |              ||
quam.                                          |+---------------------------------------------+|
                                               ||              |               |              ||
Table 1                                        ||              |               |              ||
+---------------------------------------------+|+---------------------------------------------+|
|              |               |              ||Lorem ipsum dolor sit amet, consectetur        |
|              |               |              ||adipiscing elit. Praesent in tortor consequat, |
|              |               |              ||rutrum dolor fringilla, gravida felis.         |
|              |               |              ||Suspendisse quis condimentum diam, ut congue   |
+---------------------------------------------+|quam.                                          |
                                               |                                               |
Lorem ipsum dolor sit amet, consectetur        |                                               |
                                               |                                               |

I am trying to extract the two tables labelled Table 1 and Table 2. This is my current code:

import tabula
df = tabula.read_pdf("path_to_pdf")

However, it treats the whole page as a single two-column table instead of returning Table 1 and Table 2 separately.

Current output: one table with two columns, where the first column is the left column of the page and the second is the right column of the page.

Desired output: two tables with three columns each (Table 1 and Table 2).

There is 1 best solution below

Have you tried the "multiple_tables" argument?

# With multiple_tables=True, read_pdf returns a list of DataFrames, one per detected table
dfs = tabula.read_pdf(file_path, multiple_tables=True)

As noted in the tabula-py docs:

https://tabula-py.readthedocs.io/en/latest/faq.html#i-want-to-extract-multiple-tables-from-a-document
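
If the page is still read as one table, a possible workaround is to restrict extraction to each page column separately. The sketch below is untested against the PDF in the question. It assumes the two page columns split the page roughly in half and that the tables have ruled borders (hence lattice=True). The helper names column_areas and extract_tables_by_column are hypothetical; area, relative_area, lattice, pages and multiple_tables are existing read_pdf options.

```python
def column_areas(n_columns=2):
    """Return one [top, left, bottom, right] area per page column, in percent.

    Assumption: the page columns are of equal width.
    """
    width = 100 / n_columns
    return [[0, i * width, 100, (i + 1) * width] for i in range(n_columns)]


def extract_tables_by_column(pdf_path, pages=1, n_columns=2):
    """Run tabula on each page column separately (hypothetical helper)."""
    # Imported lazily so column_areas can be used without tabula-py/Java installed.
    import tabula

    return tabula.read_pdf(
        pdf_path,
        pages=pages,
        area=column_areas(n_columns),  # a list of areas, one per page column
        relative_area=True,            # interpret the area values as percentages
        lattice=True,                  # assumption: tables are drawn with ruling lines
        multiple_tables=True,
    )
```

Calling extract_tables_by_column("path_to_pdf") should then give a list of DataFrames to inspect. The split percentages will likely need adjusting to the real page layout, and stream=True may work better than lattice=True if the tables have no visible borders.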