I want to skip all files from which it is impossible to extract any text.
Why? I need to index file content.
I could just call the /parse REST endpoint and check whether there is any content in the result. But /parse runs the parsers on every file (even video or audio), which wastes time.
I want to skip parsing entirely if the mimeType does not support text extraction.
Options I see:
- Analyze all mimeTypes and all parsers in Tika. Then call /meta before /parse, and call /parse only if the mimeType can have content in the result. But it is hard to analyze all such mimeTypes, and after a Tika update the code can change so that some mimeTypes become supported or unsupported for content extraction. So every update would require analyzing the full Tika code again.
- Call /meta and check whether the Content-Encoding header exists. But can I use this header for such a case? Are there other headers I could use?
Maybe the Tika documentation has a page that lists all mimeTypes or all parsers that can extract text?
Are there any other methods supported in tika-server for such a case?
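To make the first option concrete, here is a minimal sketch of what I have in mind: detect the type first, and call the parser only when the type looks textual. Everything specific in it is an assumption on my side: the server address, the use of PUT /detect/stream for detection and PUT /tika for extraction (swap in /meta and /parse if those fit better), and above all the `TEXTUAL_PREFIXES` allowlist, which is hypothetical and is exactly the part I do not know how to build reliably.

```python
import urllib.request

TIKA_URL = "http://localhost:9998"  # assumption: tika-server on its default port

# Hypothetical allowlist; a real one would have to come from Tika itself.
TEXTUAL_PREFIXES = (
    "text/",
    "application/pdf",
    "application/msword",
    "application/vnd.openxmlformats-officedocument",
)


def is_textual(mime_type, prefixes=TEXTUAL_PREFIXES):
    """Return True if the MIME type starts with one of the allowed prefixes."""
    # Drop parameters such as "; charset=UTF-8" before comparing.
    base = mime_type.split(";", 1)[0].strip().lower()
    return base.startswith(tuple(prefixes))


def put(path, data, accept="text/plain"):
    """PUT raw bytes to tika-server and return the response body as text."""
    req = urllib.request.Request(
        TIKA_URL + path, data=data, method="PUT", headers={"Accept": accept}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")


def extract_if_textual(data, detect=None, parse=None):
    """Detect the MIME type first; run the parser only for textual types."""
    # Assumed endpoints; both callables can be replaced by the caller.
    detect = detect or (lambda d: put("/detect/stream", d))
    parse = parse or (lambda d: put("/tika", d))

    mime_type = detect(data)
    if not is_textual(mime_type):
        return None  # skipped: the parser is never called
    text = parse(data).strip()
    return text or None  # None also when the parser found no text
```

With this, `extract_if_textual(open("movie.mp4", "rb").read())` would return `None` after the detection call alone, without sending the file to the parser. The `detect` and `parse` arguments are only there so that the endpoints can be swapped or stubbed.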