I need to extract the same rectangular area (in the same position) on different pages in a PDF file with several hundred pages.
I am running Linux and have found a way to do this manually using Tesseract with the gImageReader front-end; I am now looking for a way to automate the process.
The information I need to extract is Hindi text (written in Devanagari), so extracting it as text (without Hindi OCR) would probably yield bad results. Extracting it as an image would also be fine: I could then OCR the collected images with Tesseract in a separate step.
What I am looking for is a way to copy the same area from different pages of a PDF and output the results to another file (another PDF or an image file, for example).
I have seen other similar questions posted, but they are asking specifically to extract text, which is not necessarily needed in this case.
If there is a way to do this by converting the PDF to image files, that would also be interesting.
PS: I am now looking at doing this in the terminal (using Gimp), along the lines of what Dmitri Z is proposing.
For those interested in a GUI, I have found Phatch for Linux, which is great for batch processing images, as well as (batch) cropping PDF files directly.
If someone knows of a way to extract 2 different rectangular areas from 1 image, that would be helpful.
The solution consists of 2 steps:
1) Convert the PDF to images. The most common tool for that is ImageMagick. You can use it as a command-line tool as well as through one of its API bindings, for example from Python. There is also a C++ API (Magick++), but unfortunately I don't have much experience with it.
You might need to install Ghostscript, which ImageMagick relies on for reading PDF files.
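As a minimal sketch of step 1, the Python snippet below builds and runs an ImageMagick command that renders each PDF page to its own PNG. It assumes the ImageMagick 6 `convert` binary is on the PATH (version 7 calls it `magick`) and that Ghostscript is installed; the file names, the 300 dpi default, and the helper names `build_rasterize_command` and `rasterize` are placeholders made up for this example.

```python
import subprocess


def build_rasterize_command(pdf_path, out_pattern, density=300):
    """Return the argv list that renders every PDF page to its own image.

    out_pattern should contain a printf-style counter such as "page-%03d.png"
    so that ImageMagick writes one numbered file per page.
    """
    return [
        "convert",                 # ImageMagick 6 name; version 7 uses "magick"
        "-density", str(density),  # placed before the input so it applies while the PDF is read
        pdf_path,
        out_pattern,
    ]


def rasterize(pdf_path, out_pattern, density=300):
    """Run the conversion; raises CalledProcessError if ImageMagick fails."""
    subprocess.run(build_rasterize_command(pdf_path, out_pattern, density), check=True)


if __name__ == "__main__":
    # "input.pdf" is a placeholder file name.
    print(" ".join(build_rasterize_command("input.pdf", "page-%03d.png")))
```

A higher density gives OCR more pixels per glyph at the cost of larger files, so the value is worth tuning on a few sample pages first.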
2) Extract the region of interest (ROI) from each image. You can use ImageMagick here as well: its `-crop` operator, which takes a geometry of the form WIDTHxHEIGHT+X+Y, would be an option.
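A hedged sketch of the cropping step: the functions below format a rectangle as an ImageMagick geometry string and build one `convert ... -crop ... +repage` command per region, which also covers the case of taking two different areas from one image. The pixel coordinates, file names, and helper names are illustrative assumptions, and `convert` is again assumed to be the ImageMagick 6 binary.

```python
import subprocess


def crop_geometry(width, height, x, y):
    """Format a rectangle as an ImageMagick geometry string: WxH+X+Y."""
    if width <= 0 or height <= 0 or x < 0 or y < 0:
        raise ValueError("expected positive size and non-negative offsets")
    return f"{width}x{height}+{x}+{y}"


def build_crop_commands(image_path, regions, out_prefix):
    """Return one argv list per (width, height, x, y) region of the same image."""
    commands = []
    for index, (width, height, x, y) in enumerate(regions, start=1):
        commands.append([
            "convert", image_path,
            "-crop", crop_geometry(width, height, x, y),
            "+repage",  # drop the virtual-canvas offset left behind by -crop
            f"{out_prefix}-roi{index}.png",
        ])
    return commands


def crop_regions(image_path, regions, out_prefix):
    """Run every crop command; raises CalledProcessError on failure."""
    for command in build_crop_commands(image_path, regions, out_prefix):
        subprocess.run(command, check=True)
```

For example, `crop_regions("page-000.png", [(400, 120, 50, 200), (300, 80, 600, 900)], "page-000")` would write `page-000-roi1.png` and `page-000-roi2.png`; looping over all page images applies the same rectangles to every page.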
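Another option would be to use OpenCV. In C++ this is pretty easy: construct a `cv::Rect` for the region and use it to take a sub-matrix of the `cv::Mat` that holds the image.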
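The ROI idea is the same in any binding. As a rough illustration in Python rather than C++ (a choice made for this example, not something the answer prescribes), the sketch below crops a rectangle out of a row-major image stored as a plain list of rows, so it runs without OpenCV installed. The function name `extract_roi` and the toy image are hypothetical.

```python
def extract_roi(image, x, y, width, height):
    """Return the sub-rectangle of a row-major image given as a list of rows.

    The rectangle is clamped to the image bounds, so a region hanging over the
    edge yields a smaller crop instead of raising an error.
    With OpenCV's Python bindings the image is a NumPy array, and for an
    in-bounds rectangle the equivalent crop is image[y:y + height, x:x + width].
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    x0, y0 = max(0, x), max(0, y)
    x1, y1 = min(cols, x + width), min(rows, y + height)
    return [row[x0:x1] for row in image[y0:y1]]


# Tiny 4x5 "image" whose pixel values encode their own position (row * 10 + col).
image = [[r * 10 + c for c in range(5)] for r in range(4)]
roi = extract_roi(image, x=1, y=2, width=3, height=2)
```

Because the rectangle is fixed, the same `(x, y, width, height)` tuple can be reused for every page image once it has been measured on one page.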