Extracting text from DjVu with Apache Tika


I'm using Apache Tika to OCR files. It works fine with PDF files, but not with DjVu files. Tika seems to have supported DjVu since version 1.14. Any ideas on how to resolve this?

D:\java -jar tika-app-1.18.jar -eUTF-8 test.djvu

This returns:

sep 05, 2018 6:38:59 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

sep 05, 2018 6:38:59 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.

    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"
    >
    <head>
    <meta name="X-Parsed-By" content="org.apache.tika.parser.EmptyParser"/>
    <meta name="resourceName" content="test.djvu"/>
    <meta name="Content-Length" content="23038658"/>
    <meta name="Content-Type" content="image/vnd.djvu"/>
    <title/>
    </head>
    <body/></html>

There is 1 solution below


I have just checked the current (1.26) sources. It seems that since 1.14 Apache Tika has been able to detect the DjVu header and report that the file is a DjVu document. That is exactly what it did:

    <meta name="resourceName" content="test.djvu"/>
    <meta name="Content-Length" content="23038658"/>
    <meta name="Content-Type" content="image/vnd.djvu"/>

The other warnings in your output are unrelated to DjVu.
Apache Tika has no parser for DjVu, so it cannot do anything beyond file type detection. That is why the output reports org.apache.tika.parser.EmptyParser and an empty body. Nothing regarding DjVu support has changed since 1.14. So Apache Tika is of no use for extracting text from DjVu, and one may consider the format unsupported.
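
To illustrate why detection alone yields no text, here is a minimal sketch of signature-based detection in plain Java. It assumes the DjVu signature is the ASCII bytes `AT&TFORM` at the start of the file. The class and method names (`DjvuMagic`, `looksLikeDjvu`) are hypothetical, and this is not Tika's actual implementation, only an approximation of what header detection amounts to.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class DjvuMagic {
    // Assumption: DjVu files start with the ASCII bytes "AT&TFORM".
    private static final byte[] MAGIC =
            "AT&TFORM".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes start with the assumed DjVu signature. */
    public static boolean looksLikeDjvu(byte[] head) {
        return head.length >= MAGIC.length
                && Arrays.equals(Arrays.copyOf(head, MAGIC.length), MAGIC);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DjvuMagic <file>");
            return;
        }
        byte[] head = new byte[MAGIC.length];
        int n = 0;
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            // Read until the buffer is full or the stream ends.
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    break;
                }
                n += r;
            }
        }
        // Only the file type is identified; no page text is decoded here.
        System.out.println(looksLikeDjvu(Arrays.copyOf(head, n))
                ? "image/vnd.djvu (type detected, no text extracted)"
                : "not recognized as DjVu");
    }
}
```

A check like this only inspects the first few bytes, which is enough to fill in `Content-Type` but says nothing about the page contents. Extracting text would require a parser that decodes the DjVu structure itself.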