I am trying to load the latest-truthy.nt dataset from Wikidata with Apache Jena Fuseki, but the import fails with the error below. I was inspired by the success report from BITPlan, where the same import did work.
Error log:
14:36:16 INFO loader :: Add: 198.500.000 latest-truthy.nt (Batch: 453.309 / Avg: 213.382)
14:36:17 ERROR riot :: [line: 198884173, col: 87] Bad IRI: <https://[email protected]> Code: 58/PROHIBITED_COMPONENT_PRESENT in USER: A component that is prohibited by the scheme is present.
org.apache.jena.riot.RiotException: [line: 198884173, col: 87] Bad IRI: <https://[email protected]> Code: 58/PROHIBITED_COMPONENT_PRESENT in USER: A component that is prohibited by the scheme is present.
at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:146)
at org.apache.jena.riot.system.ParserProfileStd.internalMakeIRI(ParserProfileStd.java:112)
at org.apache.jena.riot.system.ParserProfileStd.resolveIRI(ParserProfileStd.java:85)
at org.apache.jena.riot.system.ParserProfileStd.createURI(ParserProfileStd.java:187)
at org.apache.jena.riot.system.ParserProfileStd.create(ParserProfileStd.java:259)
at org.apache.jena.riot.lang.LangNTriples.tokenAsNode(LangNTriples.java:70)
at org.apache.jena.riot.lang.LangNTuple.parseTriple(LangNTuple.java:109)
at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:61)
at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:53)
at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
at org.apache.jena.riot.RDFParser.read(RDFParser.java:357)
at org.apache.jena.riot.RDFParser.parseURI(RDFParser.java:323)
at org.apache.jena.riot.RDFParser.parse(RDFParser.java:298)
at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:550)
at org.apache.jena.tdb2.loader.base.LoaderOps.inputFile(LoaderOps.java:107)
at org.apache.jena.tdb2.loader.base.LoaderBase.loadOne(LoaderBase.java:125)
at org.apache.jena.tdb2.loader.base.LoaderBase.lambda$load$0(LoaderBase.java:102)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
at org.apache.jena.tdb2.loader.base.LoaderBase.load(LoaderBase.java:99)
at tdb2.tdbloader.lambda$execBulkLoad$4(tdbloader.java:196)
at org.apache.jena.atlas.lib.Timer.time(Timer.java:85)
at tdb2.tdbloader.execBulkLoad(tdbloader.java:194)
at tdb2.tdbloader.loadQuads(tdbloader.java:175)
at tdb2.tdbloader.exec(tdbloader.java:136)
at org.apache.jena.cmd.CmdMain.mainMethod(CmdMain.java:92)
at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:58)
at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:45)
at tdb2.tdbloader.main(tdbloader.java:64)
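The useful part of this log is the line and column that RIOT reports. As a minimal sketch (Python, not part of Jena), the position can be pulled out of tdb2-err.log with a regular expression. The message layout is assumed to be the one shown above, and `RiotIriError` and `find_bad_iri_errors` are hypothetical names chosen for illustration.

```python
import re
from typing import Iterable, List, NamedTuple


class RiotIriError(NamedTuple):
    """Position and details of one 'Bad IRI' message (hypothetical helper type)."""
    line: int
    col: int
    iri: str
    code: str


# Assumption: the message layout matches the log excerpt above, i.e.
#   ERROR riot :: [line: N, col: M] Bad IRI: <...> Code: NN/NAME in COMPONENT: ...
# Whitespace is matched loosely because log formatting may differ between setups.
RIOT_BAD_IRI = re.compile(
    r"ERROR\s+riot\s+::\s+\[line:\s*(\d+),\s*col:\s*(\d+)\s*\]\s+"
    r"Bad IRI:\s+<([^>]*)>\s+Code:\s+(\S+)"
)


def find_bad_iri_errors(log_lines: Iterable[str]) -> List[RiotIriError]:
    """Return every 'Bad IRI' error found in the given log lines."""
    errors = []
    for text in log_lines:
        match = RIOT_BAD_IRI.search(text)
        if match:
            errors.append(
                RiotIriError(
                    line=int(match.group(1)),
                    col=int(match.group(2)),
                    iri=match.group(3),
                    code=match.group(4),
                )
            )
    return errors
```

Because the loader stops at the first exception, a run like the one above yields a single entry per attempt. The "org.apache.jena.riot.RiotException" line repeats the same message and is deliberately not matched, so the error is not counted twice.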
Script to import:
@ECHO off
cd apache-jena-4.0.0
echo start import on %DATE% %TIME%
tdb2_tdbloader --loader=parallel --loc "C:\fuseki\data" "F:\latest-truthy.nt" > tdb2-out.log 2> tdb2-err.log
echo finish import on %DATE% %TIME%
pause
File structure:
- C:/fuseki/
-- apache-jena-4.0.0/
-- apache-jena-fuseki-4.0.0/
-- data/
-- startfusekidb.bat
-- wikidata2fuseki.bat
- F:/
-- latest-truthy.nt
Is this an issue with Fuseki? I can't open the .nt file myself to remove the offending line. Are there any flags I can use so that tdbloader skips validation for this import?
I am also asking this in the IRC channel of Wikidata to see if they might be able to help me.
UPDATE: Someone on IRC answered and told me that the dataset contains a whole lot of errors (see Errors in Wikidata). So I now need to find a way to skip the lines that cause errors and continue loading. But the Fuseki TDB2 Commands documentation doesn't show anything helpful.
Running the command with -h outputs the following, which seems to indicate that no option for skipping exists?
c:\fuseki\apache-jena-4.0.0\bin>tdb2_tdbloader -h
tdbloader--loader= [--desc DATASET | --loc DIR] FILE ...
Location
--loc=DIR Location (a directory)
--tdb= Assembler description file
--graph=IRI Act on a named graph
--loader= Loader to use: 'basic', 'phased' (default), 'sequential', 'parallel' or 'light'
--syntax=LANG Syntax of data from stdin
Symbol definition
--set Set a configuration symbol to a value
--mem=FILE Execute on an in-memory TDB database (for testing)
--desc= Assembler description file
General
-v --verbose Verbose
-q --quiet Run with minimal output
--debug Output information for debugging
--help
--version Version information
--strict Operate in strict SPARQL mode (no extensions of any kind)
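Since the help output above lists no option for skipping bad lines, one possible workaround is to pre-filter the dump before handing it to the loader. The Python sketch below is only an approximation: it assumes that the "in USER" error is triggered by http(s) IRIs carrying a `user@` part before the host, and it does not reproduce Jena's other IRI checks. The function names are hypothetical, and a literal that happens to contain such an IRI in angle brackets would be dropped as well.

```python
import re

# Assumption: an http(s) IRI with userinfo ("user@host") is what the loader
# rejects with PROHIBITED_COMPONENT_PRESENT in USER. The authority part ends
# at the first "/", "?" or "#", so an "@" later in the path is not matched.
USERINFO_IRI = re.compile(r"<https?://[^/?#>\s]*@[^>\s]*>", re.IGNORECASE)


def has_userinfo_iri(line: str) -> bool:
    """True if the line contains an http(s) IRI with a user@ component."""
    return USERINFO_IRI.search(line) is not None


def filter_ntriples(src_path: str, dst_path: str) -> int:
    """Copy src_path to dst_path line by line, skipping suspicious lines.

    Returns the number of skipped lines. The file is streamed, so memory
    use stays flat even for very large dumps.
    """
    skipped = 0
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            if has_userinfo_iri(line):
                skipped += 1
            else:
                dst.write(line)
    return skipped
```

Calling `filter_ntriples` with the path of the original dump and the path of a new output file writes a cleaned copy, which could then be passed to tdb2_tdbloader instead of the original. Other kinds of bad IRIs may still stop the load, so more than one pass may be needed.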
@NLxDoDge - thanks for pointing to my BITPlan success story. Indeed, Wikidata .nt dumps may contain triples that are incompatible with the import in Jena 4.1. I ran into a similar problem today with the dump of human settlements at https://wdumps.toolforge.org/dump/1607.
A single triple:
would spoil the show giving the error:
I simply edited the 1.2 GB wdump-1607.nt file with vim, where you can jump to the line number with the
and then save the file.
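For files that are too large to edit comfortably, the same fix (removing the offending line by its number) can also be done in a single streaming pass. This is a minimal Python sketch, not what I used: it assumes the line number reported by RIOT is 1-based, and `drop_lines` and `drop_lines_in_file` are hypothetical helper names.

```python
from typing import Iterable, Iterator, Set


def drop_lines(lines: Iterable[str], bad_line_numbers: Set[int]) -> Iterator[str]:
    """Yield all lines except those whose 1-based number is in bad_line_numbers."""
    for number, line in enumerate(lines, start=1):
        if number not in bad_line_numbers:
            yield line


def drop_lines_in_file(src_path: str, dst_path: str, bad_line_numbers: Set[int]) -> None:
    """Stream src_path into dst_path, leaving out the given line numbers.

    The input is never loaded into memory as a whole, so this also works
    for multi-gigabyte N-Triples files.
    """
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.writelines(drop_lines(src, bad_line_numbers))
```

Note that line numbers shift once a line has been removed, so a line number from a later loader run refers to the filtered file, not to the original dump.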
You might want to try out your environment with this small 100 MB dump file, which expands to 1.2 GB, before attempting the full Wikidata import, which in the end needs more than 2 TB of SSD disk space to work.
Please find below the scripts I used for importing the dump and starting the fuseki server.
You should get a result like
script to run fuseki
script to load data