Apache Tika SQLite3 parser


I cannot parse SQLite files using the Docker image apache/tika:latest-full.
I am only sending requests to the service, not writing Java code.

I start the container like so:

docker run -d -p 127.0.0.1:9998:9998 -v `pwd`/tika-config.xml:/tika-config.xml -v `pwd`/home/user/path/jars/:/tika-extras apache/tika:latest-full --config tika-config.xml

/home/user/path/jars/ contains slf4j-api-1.7.36.jar and sqlite-jdbc-3.44.1.0.jar.
Mounting this directory into the container at /tika-extras should add the jar files to Tika's classpath, according to the guidance I found.
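One thing that may be worth ruling out: in the docker run command above, `pwd`/home/user/path/jars/ expands to the current directory followed by /home/user/path/jars/, which is a different location from /home/user/path/jars/ unless the command is run from /. When the host side of a -v bind mount does not exist, Docker creates it as an empty directory, so the container would see an empty /tika-extras. A minimal sketch for checking a host path before starting the container (the function name has_jars is made up for illustration):

```shell
# has_jars DIR: succeed only if DIR exists and holds at least one .jar file.
# (has_jars is a hypothetical helper name, not part of Docker or Tika.)
has_jars() {
    dir=$1
    [ -d "$dir" ] || return 1
    for f in "$dir"/*.jar; do
        # If the glob matched nothing, $f is the literal pattern and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}

# Usage: compare the path the docker command mounts with the intended one.
#   has_jars "$(pwd)/home/user/path/jars" && echo "mounted path has jars"
#   has_jars /home/user/path/jars && echo "intended path has jars"
```

If only the second check succeeds, the mount source in the docker run command is probably not the directory that holds the jars.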

I also include these jars in my local CLASSPATH.

I pass in a config file with these lines to include the desired SQLite3 parser:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.sqlite3">
    </parser>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>

My requests always come back with EmptyParser being used.

curl -T sample.sqlite http://localhost:9998/tika

returns:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    
    <head>
        
        <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.EmptyParser"/>
        
        <meta name="Content-Length" content="51250176"/>
        
        <meta name="Content-Type" content="application/x-sqlite3"/>
        
        <title>&#0;</title>
        
    </head>
    
    <body/>
</html>
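To compare runs quickly while changing the configuration, the parser name can be pulled out of the response. This is a minimal sketch, assuming the response keeps each meta tag on its own line as in the output above; parsed_by is a hypothetical helper name.

```shell
# parsed_by: read Tika's XHTML output on stdin and print the value of each
# X-TIKA:Parsed-By meta tag, one per line.
# (parsed_by is a hypothetical helper name used only for this sketch.)
parsed_by() {
    sed -n 's/.*<meta name="X-TIKA:Parsed-By" content="\([^"]*\)".*/\1/p'
}

# Usage against the server from the question:
#   curl -s -T sample.sqlite http://localhost:9998/tika | parsed_by
```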

When I query the /parsers endpoint, I don't see a SQLite parser listed. This must be the root of the issue, but I am not sure what to try next.
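The /parsers check can be scripted so it is easy to repeat after each change. A minimal sketch, assuming the parser's class is named org.apache.tika.parser.sqlite3.SQLite3Parser and that the endpoint prints class names as plain text; lists_parser is a hypothetical helper name.

```shell
# lists_parser NAME: read the /parsers output on stdin and succeed if any
# line mentions NAME. (lists_parser is a hypothetical helper name.)
lists_parser() {
    grep -q -F -- "$1"
}

# Usage:
#   curl -s http://localhost:9998/parsers | lists_parser SQLite3Parser \
#       && echo "listed" || echo "not listed"
```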
Tika SQLite3Parser

Update 1:
I added tika-parser-sqlite3-package-2.9.1.jar, tika-parser-sqlite3-package-2.9.1-tests.jar, and tika-parser-sqlite3-package-2.9.1-sources.jar from the Maven repository to the extras folder, and removed the other jars.

I found that I had to copy these jars into the container manually. For some reason the volume mount in the docker run command isn't making them available. The config file is being passed through, though, so I must be doing something wrong with the jar mount.

I also changed the config to this:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.sqlite3">
            <mime>application/x-sqlite3</mime>
        </parser>
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
        </parser>
    </parsers>
</properties>

The file is still being parsed by the EmptyParser.

Update 2:
CLASSPATH was empty in the container. I set it as follows, but that didn't work either:
export CLASSPATH=:/tika-extras/tika-parser-sqlite3-package-2.9.1.jar
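Setting CLASSPATH in a shell inside the container may have no effect for two reasons: an export in a new shell does not change the environment of the server process that is already running, and the java launcher ignores the CLASSPATH variable when it is started with -cp or -classpath. Whether the image's entrypoint passes -cp is an assumption here, not something confirmed in this post. A minimal sketch for reading the classpath the running JVM was actually started with, assuming the server is PID 1 in the container and the container is named tika (both hypothetical):

```shell
# classpath_arg: read a NUL-separated command line (the format of
# /proc/<pid>/cmdline) on stdin and print the argument that follows
# -cp, -classpath, or --class-path, if there is one.
# (classpath_arg is a hypothetical helper name.)
classpath_arg() {
    tr '\0' '\n' | awk '
        prev == "-cp" || prev == "-classpath" || prev == "--class-path" { print; exit }
        { prev = $0 }
    '
}

# Usage (container name "tika" is an assumption):
#   docker exec tika cat /proc/1/cmdline | classpath_arg
```

If the printed classpath already includes /tika-extras/*, the jars only need to be present in that directory when the container starts.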
