I'm using Apache Tika to OCR files. It works fine with PDF files, but there is a problem with DjVu files. Tika seems to have supported DjVu since version 1.14. Any ideas how to resolve this?
D:\java -jar tika-app-1.18.jar -eUTF-8 test.djvu
It returns:
sep 05, 2018 6:38:59 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.
sep 05, 2018 6:38:59 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"
>
<head>
<meta name="X-Parsed-By" content="org.apache.tika.parser.EmptyParser"/>
<meta name="resourceName" content="test.djvu"/>
<meta name="Content-Length" content="23038658"/>
<meta name="Content-Type" content="image/vnd.djvu"/>
<title/>
</head>
<body/></html>
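I have just checked the current (1.26) sources. Since 1.14, Apache Tika has been able to detect the DjVu header and report that the file is a DjVu document. That is exactly what it did here: your output shows Content-Type image/vnd.djvu, and X-Parsed-By is org.apache.tika.parser.EmptyParser, which means no parser handled the content.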
The other warnings in your output (J2KImageReader, sqlite-jdbc) are unrelated to DjVu.
Apache Tika has no parser for DjVu, so it cannot do anything beyond file type detection. Nothing about DjVu support has changed since 1.14. In practice, Tika is of no use for extracting text from DjVu files, and one may consider the format unsupported.
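To illustrate how little "detecting the DjVu header" involves, here is a minimal sketch in plain Java that checks whether a file starts with the DjVu signature bytes "AT&TFORM". It is not Tika's actual code; the class name DjvuHeaderCheck and the method looksLikeDjvu are made up for this example. It shows why detection alone yields a Content-Type but no text: nothing past the first few bytes is read.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Apache Tika.
class DjvuHeaderCheck {
    // DjVu files begin with the ASCII bytes "AT&TFORM".
    private static final byte[] MAGIC =
            "AT&TFORM".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the given leading bytes start with the DjVu signature.
    static boolean looksLikeDjvu(byte[] head) {
        return head.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, MAGIC.length), MAGIC);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DjvuHeaderCheck.java <file>");
            return;
        }
        byte[] head = new byte[MAGIC.length];
        int read;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // Read up to MAGIC.length bytes; the file may be shorter.
            read = in.readNBytes(head, 0, head.length);
        }
        boolean djvu = looksLikeDjvu(Arrays.copyOf(head, read));
        System.out.println(djvu ? "image/vnd.djvu" : "not a DjVu header");
    }
}
```

With Java 11 or later this can be run directly as `java DjvuHeaderCheck.java test.djvu`. Getting the text layer or running OCR would need a real DjVu decoder, which is the part Tika does not provide.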