Tesseract / Tess4j memory leak


We are trying to use Tesseract with Tess4j for OCR text extraction.

With continuous use of Tesseract over a period of time, we notice the RAM used by the application gradually increasing. During this time the heap still has free space. We monitored the off-heap memory using jconsole, and it also looks normal. But the resident set size (RSS) of the application process keeps increasing.

My guess is that Tesseract leaks memory when it allocates native memory for OCR, but I'm not sure. Any ideas on how to investigate this further would be appreciated.
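To make the comparison described above repeatable, the heap, JVM non-heap, and process RSS figures can be logged side by side while OCR runs. The following is only a minimal sketch, assuming a Linux host where `/proc/self/status` is readable; the class and method names (`MemorySnapshot`, `parseVmRssKb`, `currentRssKb`) are hypothetical. The JMX non-heap figure covers JVM-managed pools such as metaspace and the code cache, so memory allocated by a native library would show up only in the RSS number.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class MemorySnapshot {

    /** Extracts the VmRSS value in kB from the lines of /proc/<pid>/status, or -1 if absent. */
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // The line looks like "VmRSS:     123456 kB"
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    /** Reads the resident set size of the current process (assumption: Linux). */
    static long currentRssKb() throws IOException {
        Path status = Paths.get("/proc/self/status");
        if (!Files.isReadable(status)) {
            return -1; // not Linux, or /proc is unavailable
        }
        return parseVmRssKb(Files.readAllLines(status));
    }

    public static void main(String[] args) throws IOException {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        long heapKb = mem.getHeapMemoryUsage().getUsed() / 1024;
        long nonHeapKb = mem.getNonHeapMemoryUsage().getUsed() / 1024;
        // Call this periodically (for example after every N OCR requests) and compare the trends.
        System.out.printf("heap=%d kB, non-heap=%d kB, rss=%d kB%n",
                heapKb, nonHeapKb, currentRssKb());
    }
}
```

If RSS climbs while both JVM figures stay flat, that is consistent with growth in native allocations rather than in Java objects.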


There are 2 best solutions below


I had the same issue for the last few days. I resolved it by removing tess4j and using Tika 1.27 with tesseract. I used an ExecutorService to run 3 threads at a time, which kept memory within limits.

    byte[] fileBytes; // image bytes
    Future<String> future = executorService.submit(() -> {
        TesseractOCRConfig config = new TesseractOCRConfig();
        config.setLanguage("kor+eng");
        config.setEnableImageProcessing(1);
        config.setPreserveInterwordSpacing(true);
        ParseContext context = new ParseContext();
        context.set(TesseractOCRConfig.class, config);

        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler();
        Metadata metadata = new Metadata();
        parser.parse(new ByteArrayInputStream(fileBytes), handler, metadata, context);
        return handler.toString();
    });

    fileBody = future.get(120, TimeUnit.SECONDS);
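The snippet above does not show how `executorService` is created. One way to get the "3 threads at a time" behaviour is a fixed-size thread pool; the following is only a sketch under that assumption, with a hypothetical class `BoundedOcrRunner` and placeholder jobs standing in for the Tika call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedOcrRunner {

    /** Runs the jobs on a pool of poolSize threads; a job that fails or times out yields null. */
    static List<String> runAll(List<Callable<String>> jobs, int poolSize, long timeoutSeconds)
            throws InterruptedException {
        // At most poolSize jobs run concurrently; the rest wait in the pool's queue.
        ExecutorService executorService = Executors.newFixedThreadPool(poolSize);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Callable<String> job : jobs) {
                futures.add(executorService.submit(job));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                try {
                    // The timeout is measured from this call, not from when the job started.
                    results.add(future.get(timeoutSeconds, TimeUnit.SECONDS));
                } catch (TimeoutException | ExecutionException e) {
                    future.cancel(true); // interrupt the worker if it is still running
                    results.add(null);
                }
            }
            return results;
        } finally {
            executorService.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        List<Callable<String>> jobs = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            final int id = i;
            jobs.add(() -> "text-" + id); // placeholder for the real OCR call
        }
        System.out.println(runAll(jobs, 3, 120));
    }
}
```

In a long-running service the pool would normally be created once and reused rather than rebuilt per batch; it is created inside `runAll` here only to keep the sketch self-contained.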

While the code given above works, I later simplified it by spawning a process that calls tesseract directly.

    protected String doOcr(byte[] fileBytes, int timeout, String language) {
        String text = null;
        File inputFile = null;
        File outputFile = null;
        try {
            inputFile = File.createTempFile("tesseract-input", ".png");
            String outputPath = inputFile.getAbsolutePath() + "-output";
            outputFile = new File(outputPath + ".txt");
            try (FileOutputStream fos = new FileOutputStream(inputFile)) {
                fos.write(fileBytes);
            }

            String[] commandCreate = { "tesseract", inputFile.getAbsolutePath(), outputPath, "-l", language, "--psm", "1", "-c", "preserve_interword_spaces=1" };

            runCommand(commandCreate, timeout);
            if (outputFile.exists()) {
                try (FileInputStream fis = new FileInputStream(outputFile)) {
                    text = IOUtils.toString(fis, Constants.UTF_8);
                }
            }
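        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // restore the interrupt flag for callers
            logger.warn("interrupted while waiting for tesseract to read image file body");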
        } catch (Exception e) {
            logger.error(String.format("Cannot read image file body, error : %s", e.getMessage()), e);          
        } finally {
            if (null != inputFile && inputFile.exists()) {
                inputFile.delete();
            }
            if (null != outputFile && outputFile.exists()) {
                outputFile.delete();
            }
        }       
        return text;
    }

    protected void runCommand(String[] command, int timeout) throws IOException, InterruptedException {
        logger.info("command : " + StringUtils.join(command, " "));
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.inheritIO();
        builder.environment().put("OMP_THREAD_LIMIT", "1"); /* default tesseract uses 4 threads per image. set to 1 */
        Process p = builder.start();
        boolean finished = p.waitFor(timeout, TimeUnit.SECONDS);
        if (!finished) {
            logger.warn("task not finished");
        }
        p.destroyForcibly();
    }

For those who are stuck and don't want to change their code or Maven library: I solved this by setting my Tesseract reader object to null after reading and then forcing garbage collection with System.gc(). Example:

    TessReader reader = new TessReader(); // custom class that executes doOCR()
    String content = reader.getContent();
    reader = null;
    System.gc();