Extract the same (rectangular) area from multiple pages of a PDF


I need to extract the same rectangular area (in the same position) on different pages in a PDF file with several hundred pages.

I am running Linux, and have found a way to do this manually using Tesseract and the front-end gImageReader, and am looking for a way to automate this process.

The information I need to extract is Hindi text (written in Devanagari), so extracting the data as text (without Hindi OCR) would probably yield bad results. If there is a way to extract it as an image, that would also be OK; I could then OCR the collected data in Tesseract in a separate step.

So what I am looking for is a way to copy the same area from different pages of a PDF and output the copies to another file (another PDF or an image file, for example).

I have seen other similar questions posted, but they are asking specifically to extract text, which is not necessarily needed in this case.

If there is a way to do this by converting the PDF to image files, that would also be interesting.

PS: I am now looking at doing this in the terminal (using Gimp), along the lines of what Dmitri Z is proposing.

For those interested in a GUI, I have found Phatch for Linux, which is great for batch processing images, as well as for (batch) cropping PDF files directly.

If someone knows of a way to extract 2 different rectangular areas from 1 image, that would be helpful.


There are 2 best solutions below


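You can crop two (or more) regions in the same ImageMagick command as follows, where W1xH1+X1+Y1 and W2xH2+X2+Y2 are the size and top-left offset of each region in pixels, and out1 and out2 are the output files: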

convert image +write mpr:img +delete \
\( mpr:img -crop W1xH1+X1+Y1 +repage +write out1 \) \
\( mpr:img -crop W2xH2+X2+Y2 +repage +write out2 \) \
null:

or

convert image \
\( -clone 0 -crop W1xH1+X1+Y1 +repage +write out1 \) \
\( -clone 0 -crop W2xH2+X2+Y2 +repage +write out2 \) \
null:
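
To apply the same two crops to every page image, the second form can be wrapped in a loop. This is only a sketch: the function names region_name and crop_two_regions, the output naming scheme, and the geometries in the usage line are assumptions made for illustration, not part of ImageMagick.

```shell
#!/bin/sh
# Hypothetical helper: output file name for one region of one page image,
# e.g. pages/page-007.png and region 2 give page-007_region2.png
region_name() {
    base=$(basename "$1")
    printf '%s_region%s.png\n' "${base%.*}" "$2"
}

# Hypothetical helper: crop the same two regions from every image passed in.
# $1 and $2 are geometries of the form WxH+X+Y, the remaining arguments are images.
crop_two_regions() {
    geom1=$1; geom2=$2; shift 2
    for img in "$@"; do
        convert "$img" \
            \( -clone 0 -crop "$geom1" +repage +write "$(region_name "$img" 1)" \) \
            \( -clone 0 -crop "$geom2" +repage +write "$(region_name "$img" 2)" \) \
            null:
    done
}
```

It could then be called as crop_two_regions 400x120+50+80 400x120+50+600 pages/*.png, which writes two cropped files per page into the current directory.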

The solution consists of 2 steps:

1) Convert the PDF to images

The most common tool for that is ImageMagick. You can use it as a command-line tool

$ convert foo.pdf foo.png

or through one of its APIs, for example from Python. There is also a C++ API (Magick++), but unfortunately I don't have much experience with it.

You might need to install Ghostscript, which ImageMagick uses to read PDF files.
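
For OCR it usually helps to render at a higher resolution than the default, and to get one numbered file per page. A rough sketch, assuming 300 dpi is enough for the text size involved; the function names page_pattern and pdf_to_pages are made up for this example:

```shell
#!/bin/sh
# Hypothetical helper: output pattern for a PDF, e.g. foo.pdf gives foo-%03d.png
# (ImageMagick replaces %03d with the zero-padded page index).
page_pattern() {
    base=$(basename "$1" .pdf)
    printf '%s-%%03d.png\n' "$base"
}

# Hypothetical helper: render every page of a PDF to its own PNG.
# $1 is the PDF, $2 is the resolution in dpi (assumed default: 300).
pdf_to_pages() {
    pdf=$1; dpi=${2:-300}
    # -density must come before the input file so it applies while rendering
    convert -density "$dpi" "$pdf" "$(page_pattern "$pdf")"
}
```

Calling pdf_to_pages foo.pdf should produce foo-000.png, foo-001.png and so on. With several hundred pages this can use a lot of memory, so it may be worth rendering in smaller page ranges, e.g. "foo.pdf[0-49]".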

2) Extract the region of interest (ROI) from each image

You can use ImageMagick here as well. The relevant option is

-extract WxH+X+Y

where WxH is the size of the region and +X+Y is the offset of its top-left corner, both in pixels. For example:

convert -extract 640x480+1280+960 bigImage.png extractedImage.png
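
The geometry is in pixels, so it depends on the resolution the page was rendered at. If the rectangle was measured in PDF points (1/72 inch), it can be scaled to pixels first. A small sketch, assuming x and y are measured from the top-left corner of the page; pt_to_px and geometry_px are hypothetical helper names:

```shell
#!/bin/sh
# Convert a length in PDF points to pixels at a given dpi, rounded to nearest.
# $1 is the length in points, $2 is the dpi.
pt_to_px() {
    echo $(( ($1 * $2 + 36) / 72 ))
}

# Build a WxH+X+Y geometry in pixels from a rectangle given in points.
# Arguments: dpi, x, y, width, height (x and y assumed to be from the top-left).
geometry_px() {
    dpi=$1; x=$2; y=$3; w=$4; h=$5
    printf '%sx%s+%s+%s\n' \
        "$(pt_to_px "$w" "$dpi")" "$(pt_to_px "$h" "$dpi")" \
        "$(pt_to_px "$x" "$dpi")" "$(pt_to_px "$y" "$dpi")"
}
```

For a rectangle 216 pt wide and 36 pt high, placed 72 pt from the left and 144 pt from the top, geometry_px 300 72 144 216 36 gives 900x150+300+600, which can be passed to -extract or -crop for a page rendered at 300 dpi.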

Another option is OpenCV. In C++ it is pretty easy:

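#include <opencv2/opencv.hpp>
int main() {
    cv::Mat image = cv::imread("yourimage.png");
    int x = 10, y = 20, w = 100, h = 100;  // top-left corner and size of the ROI, in pixels
    cv::imwrite("roiImage.png", image(cv::Rect(x, y, w, h)));  // the file extension selects the output format
}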