How to read a multi-page PDF as images in Leptonica

Tesseract uses Leptonica to load the images on which it performs OCR:

#include <cstdio>
#include <cstdlib>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>
int main() {
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Initialize tesseract-ocr with English, without specifying tessdata path
    if (api->Init(NULL, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }
    // Open input image with leptonica library
    Pix *image = pixRead("./test1dld.png");
    api->SetImage(image);
    ...

However, to read in a batch of tests, the easiest approach would be to run them through a copier's document feeder and have the machine email the result as a single PDF file in which each page is a bitmap. The Leptonica documentation mentions converting to PDF, but I can't find any way to read from a PDF at all, much less one page at a time.

Can anyone point me to an API call that lets me read the pages of a bitmap PDF file one by one as individual bitmaps? Preferably a C API, not a shell command.

There is 1 best solution below

Leptonica is an image library, not a document (PDF) reader. Yes, it can create PDF files, but reading them is a different story.

You will need another library to extract the images from the PDF. For Python I would suggest trying PyMuPDF; for C++ you can look at poppler or qpdf. For C, I am not sure whether there is a (free) solution.