I'm trying to write a program to convert multi-page PDFs to plain text in bulk (think many-page textbooks). If I run them through PyPDF2, pages with 2 columns are read incorrectly: the text from both columns gets interleaved line by line.
The best solution I have found is to use OCRmyPDF to convert scanned PDFs into text PDFs, then use tabulizer::extract_text() in R (via this solution). tabulizer is an R wrapper of Tabula; Tabula has a Python wrapper, but this particular function is based on Apache PDFBox, which does not have a current Python wrapper. The only Python solution I can find is to run tesseract with both the 1-column and 2-column options and select the output with more semantic information, but this is incredibly slow.
In general, extracting text from a multi-column PDF is a challenging task. If you want to extract plain text, you might want to use the pdftohtml command under Linux: https://manpages.ubuntu.com/manpages/trusty/man1/pdftohtml.1.html
It does a decent job of extracting multi-column text.
There exist ports of this utility for Windows.
If you are satisfied with the result, you can automate the execution of the shell command from Python.
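A minimal sketch of that automation, assuming pdftohtml (from poppler-utils) is on your PATH and that you want one HTML file per PDF in a directory; the flag choices (-i to skip images, -s and -noframes for a single self-contained HTML page) are a starting point, not the only reasonable ones:

```python
import subprocess
from pathlib import Path


def build_pdftohtml_cmd(pdf_path, out_path):
    """Build the pdftohtml command line for one file.

    -i        ignore images
    -s        generate a single document including all pages
    -noframes produce one HTML file instead of a frameset
    """
    return ["pdftohtml", "-i", "-s", "-noframes", str(pdf_path), str(out_path)]


def convert_directory(pdf_dir, out_dir):
    """Run pdftohtml on every PDF in pdf_dir, writing HTML into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        out = out_dir / (pdf.stem + ".html")
        # check=True raises CalledProcessError if pdftohtml fails on a file
        subprocess.run(build_pdftohtml_cmd(pdf, out), check=True)


if __name__ == "__main__":
    convert_directory("textbooks", "textbooks_html")
```

From there you can strip the HTML tags (e.g. with BeautifulSoup) to get plain text, with the columns already in reading order.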