Tika unable to parse after detecting mime-type


I have previously parsed all kinds of files with Tika by calling tika.parseToString() without setting any custom configuration or metadata. I now need to filter which files get parsed based on their MIME type.

I can find the MIME type with tika.detect(new BufferedInputStream(inputStream), new Metadata());, but when I call tika.parseToString() afterwards, Tika uses EmptyParser and the detected content type is "application/octet-stream". This is the default, meaning that Tika could not determine what type of file it is. I have tried setting the content type in Metadata before parsing the file, but this leads to org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException. From what I've read, this means that the file is malformed, but the same files are parsed successfully when I skip the MIME type check beforehand.

Does detect() do something with the InputStream, making the parser unable to parse the files?

I'm using the same Tika instance for both checking the MIME type and parsing (version 1.13).

Best answer

My issue was caused by passing the InputStream to the parse method directly. detect() marks and resets the stream it is given, which a plain InputStream does not support. Wrapping the InputStream in a TikaInputStream (TikaInputStream stream = TikaInputStream.get(new BufferedInputStream(inputStream));) solved the issue.
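
To illustrate the mechanism, here is a minimal sketch that uses only the JDK, not Tika itself. peekHeader is a hypothetical stand-in for detect() (it marks, reads a few bytes, and resets), and rawStream is a made-up helper that simulates a stream without mark/reset support. The class and method names are assumptions for the example, and it only approximates what Tika does internally.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Stdlib-only illustration; no Tika classes are used here.
class MarkResetDemo {

    // Hypothetical stand-in for detect(): peek at the first bytes, then rewind.
    static String peekHeader(InputStream in, int length) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(length);
        try {
            // readNBytes(int) requires Java 11 or later.
            return new String(in.readNBytes(length), StandardCharsets.US_ASCII);
        } finally {
            in.reset();
        }
    }

    // Simulates a raw stream that does not support mark/reset.
    static InputStream rawStream(String content) {
        ByteArrayInputStream source =
                new ByteArrayInputStream(content.getBytes(StandardCharsets.US_ASCII));
        return new InputStream() {
            @Override
            public int read() {
                return source.read();
            }
        };
    }

    // Problem pattern: peek through a throwaway wrapper, then read the raw stream.
    // The wrapper has already pulled the bytes into its own buffer.
    static String leftAfterThrowawayWrapper(String content) {
        try {
            InputStream raw = rawStream(content);
            peekHeader(new BufferedInputStream(raw), 4);
            return new String(raw.readAllBytes(), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Working pattern: wrap once and use the same wrapper for peeking and reading.
    static String leftAfterSharedWrapper(String content) {
        try {
            InputStream wrapped = new BufferedInputStream(rawStream(content));
            peekHeader(wrapped, 4);
            return new String(wrapped.readAllBytes(), StandardCharsets.US_ASCII);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String content = "%PDF-1.4 sample body";
        System.out.println("throwaway wrapper: [" + leftAfterThrowawayWrapper(content) + "]");
        System.out.println("shared wrapper:    [" + leftAfterSharedWrapper(content) + "]");
    }
}
```

With Tika the same shape would presumably be to create the TikaInputStream once and pass that one object to both tika.detect(stream) and tika.parseToString(stream), rather than wrapping for detection and then handing the original InputStream to the parser.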