I have a PDF in the following format:
Lorem ipsum dolor sit amet, consectetur |Table 2 |
adipiscing elit. Praesent in tortor consequat, |+---------------------------------------------+|
rutrum dolor fringilla, gravida felis. || | | ||
Suspendisse quis condimentum diam, ut congue || | | ||
quam. |+---------------------------------------------+|
|| | | ||
Table 1 || | | ||
+---------------------------------------------+|+---------------------------------------------+|
| | | ||Lorem ipsum dolor sit amet, consectetur |
| | | ||adipiscing elit. Praesent in tortor consequat, |
| | | ||rutrum dolor fringilla, gravida felis. |
| | | ||Suspendisse quis condimentum diam, ut congue |
+---------------------------------------------+|quam. |
| |
Lorem ipsum dolor sit amet, consectetur | |
| |
I am trying to extract the two tables, named Table 1 and Table 2. This is the code I have right now:
import tabula
df = tabula.read_pdf("path_to_pdf")
but it recognises the whole page as a single table with two columns instead of returning the two tables, Table 1 and Table 2.
Current output: one table with two columns, where the first column is the left column of the page and the second column is the right column of the page.
Desired output: two tables with three columns each, Table 1 and Table 2.
Have you tried the "multiple_tables" argument?
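Here is a minimal sketch of what that call could look like. The helper names `read_options` and `extract_tables` are made up for illustration. Setting `lattice=True` is my assumption, based on the tables in your layout being drawn with border lines, so drop it if that does not hold for your PDF.

```python
def read_options(lattice=True):
    """Keyword arguments to pass to tabula.read_pdf (a sketch; tune for your PDF)."""
    return {
        "pages": "all",           # scan every page of the document
        "multiple_tables": True,  # ask for a list of DataFrames, one per detected table
        "lattice": lattice,       # assumption: tables have ruling lines around the cells
    }


def extract_tables(pdf_path):
    """Return the tables tabula detects in pdf_path as a list of DataFrames."""
    # Imported here so the helper above can be used without tabula-py/Java installed.
    import tabula

    return tabula.read_pdf(pdf_path, **read_options())
```

Calling `extract_tables("path_to_pdf")` should then give you a list rather than a single DataFrame, and you can check `len(tables)` and each table's `shape` to see whether Table 1 and Table 2 came back separately. If the two page columns still get merged, the `area` argument of `read_pdf` can restrict extraction to one region of the page at a time.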
As noted in the Tabula Python Docs: