We are using Apache Tika 1.13 for content extraction and have allocated 6 GB of heap to the application. With a heavy PDF of around 2 GB, CPU usage climbs to 95-100% and the application becomes unresponsive.
Here is our code to extract content from a PDF file:
import java.io.File;
import java.io.InputStream;
import java.net.URL;
import org.apache.tika.Tika;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public String extractPDFContent(String filename) throws Exception {
    Parser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    ParseContext context = new ParseContext();

    // Keep the extracted text in page order.
    PDFParserConfig config = new PDFParserConfig();
    config.setSortByPosition(true);
    context.set(PDFParserConfig.class, config);

    // Lift the length limits so no content is truncated.
    Tika tika = new Tika();
    tika.setMaxStringLength(-1);
    ContentHandler handler = new BodyContentHandler(-1);

    URL url = new File(filename).toURI().toURL();
    try (InputStream input = TikaInputStream.get(url)) {
        parser.parse(input, handler, metadata, context);
    } catch (Exception ex) {
        log.error("Exception: " + ex.getMessage(), ex);
        throw ex;
    }
    return handler.toString();
}
As we need all of the content, we passed -1 in both places below to disable the length limits:
tika.setMaxStringLength(-1);
new BodyContentHandler(-1)
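For context, one variant that would at least avoid holding the full text in a single in-memory String is to back the BodyContentHandler with a Writer, so the body text streams to disk as parsing proceeds. A minimal, self-contained sketch (file names are placeholders):

import java.io.InputStream;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class WriterBackedExtraction {
    public static void main(String[] args) throws Exception {
        // BodyContentHandler(Writer) pushes text to the writer as parsing
        // proceeds, so the full document text never sits on the heap at once.
        try (Writer out = Files.newBufferedWriter(Paths.get("extracted.txt"));
             InputStream in = Files.newInputStream(Paths.get("sample.pdf"))) {
            new AutoDetectParser().parse(in, new BodyContentHandler(out), new Metadata(), new ParseContext());
        }
    }
}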
Please refer to the following queries:
1. Is there any way to load the document partially and extract content more efficiently?
2. The option provided in Streaming_the_plain_text_in_chunks only chunks the output (see the chunk-handler sketch below). Is there any other streaming option?
3. How can we parse a large PDF document efficiently (in CPU and memory) with Tika?
4. Is there any config to process via a temp file, the way PDFBox does with the following (see the PDFBox sketch at the end)?
   PDDocument.load(new File("sample.pdf"), MemoryUsageSetting.setupTempFileOnly())
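Regarding query 2, our understanding of that option is a ContentHandlerDecorator that receives the plain text piecewise instead of buffering the whole document. A minimal sketch, with System.out.print standing in for a real chunk consumer:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.ContentHandlerDecorator;
import org.xml.sax.SAXException;

public class ChunkedExtraction {
    public static void main(String[] args) throws Exception {
        // Receives extracted text chunk by chunk as SAX character events.
        ContentHandlerDecorator handler = new ContentHandlerDecorator() {
            @Override
            public void characters(char[] ch, int start, int length) throws SAXException {
                // Placeholder consumer: write each chunk onward (file, index, queue, ...)
                System.out.print(new String(ch, start, length));
            }
        };
        try (InputStream in = Files.newInputStream(Paths.get("sample.pdf"))) {
            new AutoDetectParser().parse(in, handler, new Metadata(), new ParseContext());
        }
    }
}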
We are also planning to upgrade Tika.
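Regarding query 4, this is the PDFBox-only approach from that snippet, which buffers the parsed document in a temp file rather than heap memory. A minimal sketch (sample.pdf is a placeholder; the sort option mirrors our Tika config):

import java.io.File;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfBoxTempFileDemo {
    public static void main(String[] args) throws Exception {
        // Back the parsed document with a temp file instead of main memory.
        try (PDDocument doc = PDDocument.load(new File("sample.pdf"),
                MemoryUsageSetting.setupTempFileOnly())) {
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setSortByPosition(true); // same ordering as our Tika config
            System.out.println(stripper.getText(doc).length());
        }
    }
}

Thanks in advance.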