I need to stream images from a scanner to a PDF document. Ultimately I want to OCR the images and save the text to the PDF as well, but I'm more concerned with getting the streaming working first. Right now I'm trying the different Python wrappers for tesseract, but I'm open to other languages if the Python wrappers are too limited (Java?).
While this topic may seem to have been addresses ad nauseam most of the discussion centers on passing a list of a directory full of files. The challenge here is that I'm dealing with documents with upwards of 10,000 images so buffering them all entirely in memory first or building the PDF entirely in memory is not possible, and I'm trying to avoid having to write 10,000 images to a tmp directory and then come back and touch them again with a 2nd pass. From a simplified high level the following semantics is what I'm looking for.
# pseudocode
f = pdfOutputStream.open("myPDFFileName", "wb")
for image in scannerImages.next():
f.appendImageToPDF(image)
f.finalize()
f.close()
Is that even possible, and if so, what would my options be?