I'm trying to write a program to convert multi-page PDFs to plain text in bulk (think many-page textbooks). If I run them through PyPDF2, pages with 2 columns are read incorrectly: the text from both columns gets interleaved line by line.
The best solution I have found is to use OCRmyPDF to convert scanned PDFs into text PDFs, then use tabulizer::extract_text() in R (via this solution). tabulizer is an R wrapper of Tabula; Tabula has a Python wrapper, but this particular function is based on Apache PDFBox, which does not have a current Python wrapper. The only Python solution I can find is to run tesseract with both the 1-column and 2-column options and select the output with more semantic information, but this is incredibly slow.
In general, extracting text from a multi-column PDF is a challenging task. If you want to extract plain text, you might want to use the pdftohtml command under Linux: https://manpages.ubuntu.com/manpages/trusty/man1/pdftohtml.1.html
It does a decent job of extracting multi-column text.
There exist ports of this utility for Windows.
If you are satisfied with the result, you can automate the execution of the shell command from Python.
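A minimal sketch of that automation, assuming pdftohtml (from poppler-utils) is on your PATH and that you want one HTML file per PDF in a directory; the flag choices (-i to skip images, -s and -noframes for a single self-contained HTML page) are a starting point, not the only reasonable ones:

```python
import subprocess
from pathlib import Path


def build_pdftohtml_cmd(pdf_path, out_path):
    """Build the pdftohtml command line for one file.

    -i        ignore images
    -s        generate a single document including all pages
    -noframes produce one HTML file instead of a frameset
    """
    return ["pdftohtml", "-i", "-s", "-noframes", str(pdf_path), str(out_path)]


def convert_directory(pdf_dir, out_dir):
    """Run pdftohtml on every PDF in pdf_dir, writing HTML into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        out = out_dir / (pdf.stem + ".html")
        # check=True raises CalledProcessError if pdftohtml fails on a file
        subprocess.run(build_pdftohtml_cmd(pdf, out), check=True)


if __name__ == "__main__":
    convert_directory("textbooks", "textbooks_html")
```

From there you can strip the HTML tags (e.g. with BeautifulSoup) to get plain text, with the columns already in reading order.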