Converting a Tesseract-produced searchable PDF from 8-bit color depth back to 1-bit (tess4j)


I have PDFs with 1-bit color depth (approx. 30 KB each) as input for OCR processing with tess4j 5.0.0. After processing, each PDF is 120-130 KB and is saved with 8-bit color depth, which is probably the main cause of the file size increase.

I would like to know whether there is a way to set the color depth in Tesseract or its associated libraries, or whether there is another way to handle this.

ITesseract instance = new Tesseract();
instance.setDatapath("/path/to/tessdata");
instance.setPageSegMode(ITessAPI.TessPageSegMode.PSM_SINGLE_COLUMN);
List<ITesseract.RenderedFormat> formats = new ArrayList<>(Arrays.asList(ITesseract.RenderedFormat.PDF));
instance.createDocumentsWithResults(inputPdf.getPath(), "/path/to/result", formats, ITessAPI.TessPageIteratorLevel.RIL_WORD);

Any help greatly appreciated.

Best answer

Eventually, I came up with a workaround: you can choose the output type through RenderedFormat. I changed it from PDF to PDF_TEXTONLY, which produced a PDF (~7 KB) with the text in the right position but without the original scan image.

List<ITesseract.RenderedFormat> formats = new ArrayList<>(Arrays.asList(ITesseract.RenderedFormat.PDF_TEXTONLY));

Then I used PDFBox to render each page of the original PDF to an image. You can specify the DPI, which also helps reduce the file size.

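// PDFBox 2.x API; ImageIOUtil comes from the pdfbox-tools artifact.
// try-with-resources closes the document even if rendering fails.
try (PDDocument document = PDDocument.load(inputPdf)) {
    PDFRenderer pdfRenderer = new PDFRenderer(document);
    for (int page = 0; page < document.getNumberOfPages(); ++page) {
        // ImageType.BINARY renders the page as a 1-bit black-and-white image
        BufferedImage bim = pdfRenderer.renderImageWithDPI(page, 300, ImageType.BINARY);
        ImageIOUtil.writeImage(bim, "/path/to/pics/picture_" + page + ".png", 300);
    }
}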

Then just add each image to the text-only PDF as a background under the text, like a watermark (see "How to insert a image under the text as a pdf background using iText?"). This reduced the size from 120-130 KB to 60 KB at 300 DPI (even less at lower DPI), which is good for an OCR-processed PDF whose original was 30 KB. I know this is not the best solution, and I'll be happy for any other contribution or answer.
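
For anyone who wants to check how much the bit depth alone contributes before wiring this into tess4j, here is a minimal sketch that uses only JDK classes. It is not part of the workaround above: the class BitDepthDemo and its methods are hypothetical names, the "scan" is a synthetic page generated in code, and the threshold of 128 is an assumption. It converts an 8-bit grayscale image to 1-bit and reports the PNG-encoded size of each.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import javax.imageio.ImageIO;

// Hypothetical demo class; not part of tess4j or PDFBox.
final class BitDepthDemo {

    // Builds a synthetic 8-bit grayscale "scan": a white page with black
    // horizontal bars standing in for lines of text.
    static BufferedImage samplePage() {
        BufferedImage gray = new BufferedImage(600, 800, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster raster = gray.getRaster();
        for (int y = 0; y < 800; y++) {
            boolean inBar = y >= 40 && y < 760 && (y - 40) % 20 < 6;
            for (int x = 0; x < 600; x++) {
                boolean black = inBar && x >= 40 && x < 560;
                raster.setSample(x, y, 0, black ? 0 : 255);
            }
        }
        return gray;
    }

    // Thresholds a grayscale image (band 0) into a 1-bit image.
    // The default TYPE_BYTE_BINARY palette maps index 0 to black and 1 to white.
    static BufferedImage toOneBit(BufferedImage gray) {
        int w = gray.getWidth();
        int h = gray.getHeight();
        BufferedImage bw = new BufferedImage(w, h, BufferedImage.TYPE_BYTE_BINARY);
        WritableRaster in = gray.getRaster();
        WritableRaster out = bw.getRaster();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // Assumed threshold: samples of 128 and above become white
                out.setSample(x, y, 0, in.getSample(x, y, 0) >= 128 ? 1 : 0);
            }
        }
        return bw;
    }

    // Returns the number of bytes the image occupies when encoded as PNG.
    static int pngSize(BufferedImage img) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            ImageIO.write(img, "png", buffer);
            return buffer.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        BufferedImage gray = samplePage();
        BufferedImage bw = toOneBit(gray);
        System.out.println("8-bit: " + gray.getColorModel().getPixelSize()
                + " bits/pixel, PNG bytes: " + pngSize(gray));
        System.out.println("1-bit: " + bw.getColorModel().getPixelSize()
                + " bits/pixel, PNG bytes: " + pngSize(bw));
    }
}
```

On a real scan the difference depends on the content and on how the image ends up compressed inside the PDF, so treat the printed numbers as a rough indication only.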