How to get text for multi-page TIFF using Tesseract capi?

I am using the Tesseract C API (capi) from Python via ctypes. Everything works well except for multi-page TIFFs: I only get the text from the last page instead of the text from every page.

This is what I'm doing:

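from ctypes import POINTER, c_bool, c_char_p, c_int, c_void_p, create_string_buffer, string_at

path = b"multipage.tiff"  # bytes, because the C API expects a char*
self.tesseract.TessBaseAPIProcessPages.argtypes = [POINTER(TessBaseAPI), c_char_p, c_char_p, c_int, POINTER(TessResultRenderer)]
self.tesseract.TessBaseAPIProcessPages.restype = c_bool
# Without an explicit restype, ctypes treats the returned char* as a C int, which can truncate it on 64-bit systems
self.tesseract.TessBaseAPIGetUTF8Text.restype = c_void_p
success = self.tesseract.TessBaseAPIProcessPages(self.api, create_string_buffer(path), None, 0, None)
ocr_r = self.tesseract.TessBaseAPIGetUTF8Text(self.api)
result = string_at(ocr_r)  # contains text only from the last page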

Has anyone come across this before, or does anyone know how to resolve it?

I opened an issue with the Tesseract project, but apparently this is not a bug in the Tesseract command line or API: the command line works fine and returns the text for all pages.

Perhaps something else should be called instead of self.tesseract.TessBaseAPIGetUTF8Text(api) to get all the text?

There is 1 solution below

This worked for me:

from PIL import Image
from pytesseract import image_to_string

path = "multipage.tiff"

# Open the TIFF and OCR each frame (page) in turn
image = Image.open(path)
parsing = ""
for frame in range(image.n_frames):
    image.seek(frame)  # select page number `frame`
    parsing += image_to_string(image)
    parsing += '\n'

Pillow stores the number of pages in image.n_frames, so you just have to seek to each frame in turn and run OCR on it. Hope it helps.
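
If you would rather stay with the C API from the question, one option that might work is to pass a text renderer to TessBaseAPIProcessPages instead of None, so that every page is written out as it is processed. Below is a minimal sketch, not a tested recipe. It assumes your libtesseract build exports TessTextRendererCreate and TessDeleteResultRenderer, and that the text renderer writes its output to outputbase + ".txt". The helper name ocr_all_pages and its parameters are made up for illustration; lib is the library loaded with ctypes and api is an already initialised handle.

```python
import ctypes
import os
import tempfile


def ocr_all_pages(lib, api, tiff_path):
    """Return the text of every page by letting a text renderer collect it.

    lib: libtesseract loaded via ctypes (assumption: it exports the
         renderer functions used below).
    api: an initialised TessBaseAPI handle.
    """
    # Opaque handles are declared as void* to keep the sketch short.
    lib.TessTextRendererCreate.argtypes = [ctypes.c_char_p]
    lib.TessTextRendererCreate.restype = ctypes.c_void_p
    lib.TessBaseAPIProcessPages.argtypes = [
        ctypes.c_void_p, ctypes.c_char_p, ctypes.c_char_p,
        ctypes.c_int, ctypes.c_void_p,
    ]
    lib.TessBaseAPIProcessPages.restype = ctypes.c_int
    lib.TessDeleteResultRenderer.argtypes = [ctypes.c_void_p]
    lib.TessDeleteResultRenderer.restype = None

    with tempfile.TemporaryDirectory() as tmp:
        outputbase = os.path.join(tmp, "ocr")
        renderer = lib.TessTextRendererCreate(os.fsencode(outputbase))
        if not renderer:
            raise RuntimeError("could not create text renderer")
        try:
            ok = lib.TessBaseAPIProcessPages(
                api, os.fsencode(tiff_path), None, 0, renderer
            )
        finally:
            # Deleting the renderer also closes its output file.
            lib.TessDeleteResultRenderer(renderer)
        if not ok:
            raise RuntimeError("TessBaseAPIProcessPages failed")
        # Assumption: the renderer wrote its text to outputbase + ".txt".
        with open(outputbase + ".txt", encoding="utf-8") as f:
            return f.read()
```

With the names from the question, the call would look like text = ocr_all_pages(self.tesseract, self.api, "multipage.tiff"). The temporary directory is only there so the renderer's output file is cleaned up afterwards.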