We are using Apache Tika 1.13 for content extraction and have allocated 6 GB of heap to the application. With a heavy PDF of around 2 GB, CPU usage climbs to 95-100% and the application becomes unresponsive.
Here is our code to extract content from a PDF file:
import java.io.File;
import java.io.InputStream;
import java.net.URL;
import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public String extractPDFContent(String filename) throws Exception {
    Parser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    // Keep the extracted text in page order.
    PDFParserConfig config = new PDFParserConfig();
    config.setSortByPosition(true);
    context.set(PDFParserConfig.class, config);

    // Lift the length limits so no content is truncated.
    Tika tika = new Tika();
    tika.setMaxStringLength(-1);
    ContentHandler handler = new BodyContentHandler(-1);

    URL url = new File(filename).toURI().toURL();
    try (InputStream input = TikaInputStream.get(url)) {
        parser.parse(input, handler, metadata, context);
    } catch (Exception ex) {
        log.error("Exception: " + ex.getMessage(), ex);
        throw ex;
    }
    return handler.toString();
}
As we need all of the content, we passed -1 in both places below to disable the length limits:
tika.setMaxStringLength(-1);
new BodyContentHandler(-1)
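For context, one variant that would at least avoid holding the full text in a single in-memory String is to back the BodyContentHandler with a Writer, so the body text streams to disk as parsing proceeds. A minimal, self-contained sketch (file names are placeholders):

import java.io.InputStream;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class WriterBackedExtraction {
    public static void main(String[] args) throws Exception {
        // BodyContentHandler(Writer) pushes text to the writer as parsing
        // proceeds, so the full document text never sits on the heap at once.
        try (Writer out = Files.newBufferedWriter(Paths.get("extracted.txt"));
             InputStream in = Files.newInputStream(Paths.get("sample.pdf"))) {
            new AutoDetectParser().parse(in, new BodyContentHandler(out), new Metadata(), new ParseContext());
        }
    }
}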
Please refer to the following queries:
1. Is there any way to load the document partially and extract content more efficiently?
2. The option provided in Streaming_the_plain_text_in_chunks only chunks the output (see the chunk-handler sketch below). Is there any other streaming option?
3. How can we parse a large PDF document efficiently (in CPU and memory) with Tika?
4. Is there any config to process via a temp file, the way PDFBox does with the following (see the PDFBox sketch at the end)?
   PDDocument.load(new File("sample.pdf"), MemoryUsageSetting.setupTempFileOnly())
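Regarding query 2, our understanding of that option is a ContentHandlerDecorator that receives the plain text piecewise instead of buffering the whole document. A minimal sketch, with System.out.print standing in for a real chunk consumer:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.SAXException;

public class ChunkedExtraction {
    public static void main(String[] args) throws Exception {
        // Receives extracted text chunk by chunk as SAX character events.
        ContentHandlerDecorator handler = new ContentHandlerDecorator() {
            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                // Placeholder consumer: write each chunk onward (file, index, queue, ...)
                System.out.print(new String(ch, start, length));
            }
        };
        try (InputStream in = Files.newInputStream(Paths.get("sample.pdf"))) {
            new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());
        }
    }
}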
We are also planning to upgrade Tika.
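Regarding query 4, this is the PDFBox-only approach from that snippet, which buffers the parsed document in a temp file rather than heap memory. A minimal sketch (sample.pdf is a placeholder; the sort option mirrors our Tika config):

import java.io.File;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfBoxTempFileDemo {
    public static void main(String[] args) throws Exception {
        // Back the parsed document with a temp file instead of main memory.
        try (PDDocument doc = PDDocument.load(new File("sample.pdf"),
                MemoryUsageSetting.setupTempFileOnly())) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition(true); // same ordering as our Tika config
            System.out.println(stripper.getText(doc).length());
        }
    }
}

Thanks in advance.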