I cannot parse SQL files using the Docker image apache/tika:latest-full.
I am only sending requests to the service, not writing Java code.
I start the container like so:
docker run -d -p 127.0.0.1:9998:9998 -v `pwd`/tika-config.xml:/tika-config.xml -v `pwd`/home/user/path/jars/:/tika-extras apache/tika:latest-full --config tika-config.xml
The directory /home/user/path/jars/ contains slf4j-api-1.7.36.jar and sqlite-jdbc-3.44.1.0.jar.
Mounting this directory into the container at /tika-extras is supposed to add the jar files to Tika's classpath, according to the guidance I followed.
I also include these jars in my local CLASSPATH.
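As a sanity check on the docker run command above, here is a minimal sketch that prints the host paths the two -v options expand to. It assumes it is run from the same directory as the docker run command; the variable names config_src and jars_src are only illustrative. Because `pwd` is prepended, the jars source resolves to the current directory followed by /home/user/path/jars/, so that combined path has to exist on the host and contain the jars for them to appear under /tika-extras.

```shell
# Reproduce the shell expansion used in the two -v options.
# config_src and jars_src are illustrative names, not part of Tika or Docker.
config_src="$(pwd)/tika-config.xml"
jars_src="$(pwd)/home/user/path/jars/"

echo "config mount source: $config_src"
echo "jars mount source:   $jars_src"

# Report whether the expanded jars path actually exists on the host.
if [ -d "$jars_src" ]; then
  echo "jars directory exists on the host"
else
  echo "jars directory NOT found on the host"
fi
```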
I pass in a config file with these lines to include the desired SQLite parser:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
<parser class="org.apache.tika.parser.sqlite3">
</parser>
<parser class="org.apache.tika.parser.DefaultParser">
<parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
</parser>
</parsers>
</properties>
My requests always come back with EmptyParser being used.
curl -T sample.sqlite http://localhost:9998/tika
returns:
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.EmptyParser"/>
<meta name="Content-Length" content="51250176"/>
<meta name="Content-Type" content="application/x-sqlite3"/>
<title>�</title>
</head>
<body/>
</html>
When I query the /parsers endpoint, I don't see a SQL parser listed. This must be the root of the issue, but I am not sure what to try next.
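The /parsers check can be scripted so it is easy to repeat after each change. This is only a sketch: the helper name lists_sqlite_parser and the TIKA_URL variable are my own, and it assumes the server is reachable at the same address used in the curl call above. It simply searches the listing for the string "sqlite3" and makes no assumption about the exact output format.

```shell
# Succeeds if the parser listing read from stdin mentions a SQLite parser.
# lists_sqlite_parser is a hypothetical helper name.
lists_sqlite_parser() {
  grep -qi 'sqlite3'
}

# Assumed server address, taken from the curl call in the question.
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Fetch the parser listing and report whether a SQLite parser appears in it.
if curl -s --max-time 5 "$TIKA_URL/parsers" | lists_sqlite_parser; then
  echo "SQLite parser is listed"
else
  echo "SQLite parser is not listed (or the server is not reachable)"
fi
```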
Tika SQLite3Parser
Update 1:
I added tika-parser-sqlite3-package-2.9.1.jar, tika-parser-sqlite3-package-2.9.1-tests.jar, and tika-parser-sqlite3-package-2.9.1-sources.jar from the Maven repository to the extras folder, and removed the other jars.
I found that I had to copy these into the container manually. For some reason the volume mount in my docker run command isn't making them available. The config file is getting passed in, though, so I must be doing something wrong with the jars mount.
I also changed the config to this:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
<parser class="org.apache.tika.parser.sqlite3">
<mime>application/x-sqlite3</mime>
<parser class="org.apache.tika.parser.DefaultParser">
<parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
</parser>
</parsers>
</properties>
The file is still being parsed by the EmptyParser.
Update 2:
CLASSPATH was empty in the container. I set it as follows, but that didn't work either.
export CLASSPATH=:/tika-extras/tika-parser-sqlite3-package-2.9.1.jar
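To double-check what that value contains, the sketch below splits a classpath string on ':' and reports whether each entry exists on disk. The function name check_classpath and the variable classpath_value are illustrative. Note that the leading ':' produces an empty entry, and that this only inspects the string: it assumes nothing about whether the already-running Tika server process actually reads the variable, since an export in a new shell does not change the environment of a JVM that was started earlier.

```shell
# Value copied from the export above (kept in a separate, illustrative variable).
classpath_value=":/tika-extras/tika-parser-sqlite3-package-2.9.1.jar"

# Print one line per classpath entry, noting whether it exists on disk.
# check_classpath is a hypothetical helper name.
check_classpath() {
  old_ifs=$IFS
  IFS=:
  for entry in $1; do
    if [ -z "$entry" ]; then
      echo "(empty entry)"
    elif [ -e "$entry" ]; then
      echo "found: $entry"
    else
      echo "missing: $entry"
    fi
  done
  IFS=$old_ifs
}

check_classpath "$classpath_value"
```

Run inside the container, a "missing" line for the jar would suggest the file is not where the classpath says it is.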