I need to extract the same rectangular area (in the same position) on different pages in a PDF file with several hundred pages.
I am running Linux, and have found a way to do this manually using Tesseract and the front-end gImageReader, and am looking for a way to automate this process.
The information I need to extract is Hindi text (written in Devanagari), so extracting the data as text (without Hindi OCR) would probably yield bad results. Extracting it as an image would also be fine, since I could then OCR the collected images with Tesseract in a separate step.
So what I am looking for is a way to copy the same area from different pages of a PDF and output the results to another file (another PDF or an image file, for example).
I have seen similar questions posted, but they ask specifically about extracting text, which is not necessarily needed in this case.
If there is a way to do this by converting the PDF to image files, that would also be interesting.
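As a rough sketch of the image route: `pdftoppm` (from poppler-utils) can render every page of a PDF and keep only a fixed rectangle, using its `-x`, `-y`, `-W` and `-H` options. The Python helper below only builds that command line. The function name, the file names and the coordinates are made up for illustration, and the rectangle is given in pixels at the chosen resolution, so it has to be measured at that same resolution.

```python
import subprocess


def pdftoppm_crop_cmd(pdf, prefix, x, y, width, height, dpi=300):
    """Build a pdftoppm command that renders each page of `pdf` at `dpi`
    and keeps only the rectangle whose top-left corner is (x, y).

    All coordinates are in pixels at the given resolution (hypothetical
    values; measure them on a page rendered at the same dpi).
    """
    return [
        "pdftoppm",
        "-r", str(dpi),                          # rendering resolution
        "-x", str(x), "-y", str(y),              # top-left corner of the crop
        "-W", str(width), "-H", str(height),     # size of the crop
        "-png",                                  # one PNG per page
        pdf, prefix,                             # output: prefix-<page>.png
    ]


def run(cmd):
    """Run the command; requires pdftoppm to be installed."""
    subprocess.run(cmd, check=True)


# Example values only; nothing is executed here.
cmd = pdftoppm_crop_cmd("input.pdf", "crop", x=200, y=350, width=1200, height=400)
print(" ".join(cmd))
```

The resulting PNG files, one per page, could then be passed to Tesseract in a separate step.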
PS: I am now looking at doing this in the terminal (using GIMP), along the lines of what Dmitri Z is proposing.
For those interested in a GUI, I have found Phatch for Linux, which is great for batch processing images, as well as for (batch) cropping PDF files directly.
If someone knows of a way to extract 2 different rectangular areas from 1 image, that would be helpful.
You can crop two (or more) regions in the same ImageMagick command as follows:
or
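One form such a command can take, given here as an assumption and not necessarily the exact command this answer intended, is to clone the input image once per region inside parentheses, crop each clone, and then delete the uncropped original. The Python sketch below assembles that argument list. The function name and the geometries are hypothetical, and on ImageMagick 7 the executable is `magick` instead of `convert`.

```python
def multi_crop_cmd(src, regions, out_pattern="region_%d.png", exe="convert"):
    """Build one ImageMagick command that crops several regions from `src`.

    `regions` is a list of (width, height, x, y) tuples in pixels.
    Each region becomes: ( -clone 0 -crop WxH+X+Y +repage )
    The final `-delete 0` drops the uncropped original, so only the
    crops are written, numbered via the %d in `out_pattern`.
    """
    cmd = [exe, src]
    for w, h, x, y in regions:
        cmd += ["(", "-clone", "0",
                "-crop", f"{w}x{h}+{x}+{y}",
                "+repage",          # reset the virtual canvas after cropping
                ")"]
    cmd += ["-delete", "0", out_pattern]
    return cmd


# Hypothetical geometries for two rectangles on the same page image.
cmd = multi_crop_cmd("page.png", [(100, 50, 10, 20), (200, 80, 300, 400)])
print(" ".join(cmd))
```

Passed as a list to `subprocess.run`, the parentheses need no escaping. Typed directly into a shell, they would have to be written as `\(` and `\)`.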