Skip all parsers that do not support text extraction in tika-server


I want to skip all files from which no text can be extracted.

Why? I need to index file content.

I could just call the /parse REST endpoint and check whether the result contains any content. But /parse invokes a parser for every file (even video or audio), which wastes time.

I want to skip parsing entirely if the file's mimeType does not support text extraction.

Options:

  1. Analyze all mimeTypes and all parsers in Tika. Then call /meta before /parse, and call /parse only if the mimeType can have content in the result.

But it is hard to analyze all such mimeTypes, and after a Tika update the code can change so that some mimeTypes become supported or unsupported for content extraction. So every update would require re-analyzing the full Tika code.

  2. Call /meta and check whether Content-Encoding exists. But can I use this header for such a case? Are there other headers I could use?

  3. Maybe the Tika documentation has a page that lists all mimeTypes or all parsers that can extract text?

  4. Are there any other methods supported in tika-server for such a case?
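
To make the first option concrete, here is a rough sketch of the "detect first, parse only if worthwhile" check. It makes several assumptions that should be verified against your Tika version: the server runs at http://localhost:9998, it exposes a PUT /detect/stream endpoint that returns the MIME type as plain text, and the `SKIP_PREFIXES` denylist is my own illustrative guess rather than anything derived from Tika's parser registry. The names `is_text_candidate` and `detect_mime_type` are hypothetical helpers.

```python
import urllib.request

# Assumption: top-level media types unlikely to yield body text.
# Illustrative only; not derived from Tika's actual parser list.
SKIP_PREFIXES = ("audio/", "video/")


def is_text_candidate(mime_type, skip_prefixes=SKIP_PREFIXES):
    """Return True unless the MIME type falls under a skipped prefix."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    base = mime_type.split(";", 1)[0].strip().lower()
    if not base:
        return False
    return not base.startswith(skip_prefixes)


def detect_mime_type(data, server="http://localhost:9998"):
    """Ask tika-server for the MIME type of raw bytes.

    Assumed endpoint: PUT /detect/stream returning the type as plain text.
    """
    req = urllib.request.Request(
        server + "/detect/stream", data=data, method="PUT"
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8").strip()
```

With these helpers, the indexing loop would call `detect_mime_type` on each file and send the file to the parsing endpoint only when `is_text_candidate` returns True. The maintenance concern from option 1 still applies: the denylist has to be revisited whenever Tika changes which types it can extract text from.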
