Tika failing to parse CFF font


We are running Tika as a command-line process with the following command:

java -Dlog4j2.formatMsgNoLookups=true -Xms512m -Xmx16384m -jar /tika/tika-app-2.9.1.jar --config=/tika/tika-ocr-config.xml -t test.pdf

test.pdf contains some embedded CFF fonts, and Tika logs the following error, apparently because of them:

ERROR [main] 17:05:36,671 org.apache.pdfbox.pdmodel.font.PDCIDFontType0 Can't read the embedded CFF font QLVBNN+HiddenHorzOCR java.io.EOFException: null

The tika-ocr-config.xml is very basic, containing just the following parser configuration:

<parser class="org.apache.tika.parser.pdf.PDFParser">
    <params>
        <param name="extractInlineImages" type="bool">false</param>
        <param name="ocrStrategy" type="string">no_ocr</param>
        <!-- whether or not to add processing to detect angles and extract text accordingly PDFBOX-4371 -->
        <param name="detectAngles" type="bool">true</param>
    </params>
</parser>

I know that Apache PDFBox (which Tika uses for PDF parsing) has a FontBox package for parsing CFF fonts:

https://pdfbox.apache.org/docs/2.0.11/javadocs/org/apache/fontbox/cff/package-summary.html

If anyone knows how to configure Tika to use this CFF package, or how to ignore CFF fonts so that the above error does not appear, it would be a great help to us.
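
To make the "ignore" option concrete: the message is a log entry from the PDFBox class org.apache.pdfbox.pdmodel.font.PDCIDFontType0, so one possible workaround is to silence that single logger. The following is only a sketch, assuming that tika-app 2.9.1 logs through Log4j 2 and will honor an external configuration file. The file name log4j2-quiet.xml and the pattern layout are placeholders chosen for illustration, not taken from Tika. This hides the message only; it does not repair the embedded font.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical file: /tika/log4j2-quiet.xml (name and location are assumptions) -->
<Configuration status="WARN">
  <Appenders>
    <!-- Send log output to stderr so it stays separate from the extracted text on stdout -->
    <Console name="stderr" target="SYSTEM_ERR">
      <PatternLayout pattern="%-5p [%t] %d{HH:mm:ss,SSS} %c %m%n"/>
    </Console>
  </Appenders>
  <Loggers>
    <!-- Turn off only the logger that reports the unreadable embedded CFF font -->
    <Logger name="org.apache.pdfbox.pdmodel.font.PDCIDFontType0" level="off"/>
    <!-- Everything else keeps logging as usual -->
    <Root level="info">
      <AppenderRef ref="stderr"/>
    </Root>
  </Loggers>
</Configuration>
```

If the assumption holds, adding -Dlog4j2.configurationFile=/tika/log4j2-quiet.xml to the java command above (before -jar) should select this file. Whether the extracted text is still complete for pages that use the affected font would need to be checked separately.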

Regards
Rupam 

