High CPU usage while parsing a PDF document with Apache Tika


We are using Apache Tika 1.13 for content extraction.

We allocated 6 GB of heap to the application.

With a heavy PDF of around 2 GB, CPU usage climbs to 95–100% and the application becomes unresponsive.

Code to extract content from a PDF file:

public String extractPDFContent(String filename) throws Exception
    {
        ParseContext context = new ParseContext();
        Parser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();
        File file = new File(filename);
        // Note: this Tika facade is configured but never used below;
        // setMaxStringLength() does not affect parser.parse().
        Tika tika = new Tika();
        tika.setMaxStringLength(-1);
        URL url = file.toURI().toURL();
        PDFParserConfig config = new PDFParserConfig();
        config.setSortByPosition(true);
        context.set(PDFParserConfig.class, config);
        ContentHandler handler = null;
        try (InputStream input = TikaInputStream.get(url)) {
            // -1 disables the write limit, so the whole body is buffered in memory
            handler = new BodyContentHandler(-1);
            parser.parse(input, handler, metadata, context);
        } catch (Exception ex) {
            log.error("Exception: " + ex.getMessage(), ex);
            throw ex;
        }
        return handler.toString();
    }
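For comparison, here is a variant we could try that streams the extracted text to a `Writer` instead of buffering the whole body in a `String`. This is only a sketch: `BodyContentHandler(Writer)` is a standard Tika constructor, but the class name, output file, and `sample.pdf` path below are placeholders.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.Writer;

import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class StreamingExtract {

    // Writes body text to `out` as it is parsed, instead of accumulating
    // the entire document in an in-memory StringBuilder.
    public static void extractToWriter(String filename, Writer out) throws Exception {
        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        Metadata metadata = new Metadata();
        ContentHandler handler = new BodyContentHandler(out);
        try (InputStream input = TikaInputStream.get(new File(filename))) {
            parser.parse(input, handler, metadata, context);
        }
    }

    public static void main(String[] args) throws Exception {
        // "sample.pdf" and "extracted.txt" are placeholder paths
        try (Writer out = new BufferedWriter(new FileWriter("extracted.txt"))) {
            extractToWriter("sample.pdf", out);
        }
    }
}
```

This avoids holding the full extracted text on the heap, though it does not by itself address the CPU cost of parsing.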

Since we need the full content, we passed -1 in both places:

tika.setMaxStringLength(-1);
new BodyContentHandler(-1)

Please refer to the following questions:

  1. Is there any way to load the document only partially and extract its content more efficiently?

  2. The option described in Streaming_the_plain_text_in_chunks only chunks the output. Is there any other streaming option?

  3. How can we parse a large PDF document efficiently (in CPU and memory) with Tika?

  4. Is there any config to use a temp file for processing, as PDFBox does with

    PDDocument.load(new File("sample.pdf"), MemoryUsageSetting.setupTempFileOnly())
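On question 4: if we recall correctly, newer Tika releases (1.15 and later) added a `PDFParserConfig` setting that maps onto PDFBox's memory-usage handling. A sketch of what we would try after upgrading; the 512 MB threshold is just an example value, and this should be verified against the javadoc of the target Tika version:

```java
PDFParserConfig config = new PDFParserConfig();
// Caps how much of the PDF PDFBox keeps in heap; beyond the limit it
// spills to temp files, similar to MemoryUsageSetting.setupMixed(...).
// (Assumption: available in Tika 1.15+, not in our current 1.13.)
config.setMaxMainMemoryBytes(512L * 1024 * 1024);
// Sorting by position adds text-ordering work; if reading order is not
// critical, leaving it off may reduce CPU.
config.setSortByPosition(false);
ParseContext context = new ParseContext();
context.set(PDFParserConfig.class, config);
```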

We are also planning to upgrade Tika. Thanks in advance.
