Apache Jena text search over multiple fields is very slow

I'm trying to query a local Apache Jena person database (running on Apache Jena Fuseki 3.17.0) with this SPARQL query:

PREFIX gndo: <https://d-nb.info/standards/elementset/gnd#>
PREFIX text: <http://jena.apache.org/text#>

SELECT *
WHERE {
    ?y text:query (gndo:surname "Einstein") .
    ?y text:query (gndo:forename "Albert") .
}

It is very slow (> 10 s), even though both fields are indexed. This is my Fuseki configuration:

[] rdf:type fuseki:Server ;
   fuseki:services (
        :myservice
           ) .

:myservice rdf:type fuseki:Service ;
    fuseki:name "persondata" ;
    fuseki:serviceQuery "query" ;
    fuseki:serviceUpdate "update" ;
    fuseki:serviceUpload "upload" ;
    fuseki:serviceReadWriteGraphStore "data" ;
    fuseki:dataset :text_dataset ;
.

text:TextDataset rdfs:subClassOf ja:RDFDataset .

:text_dataset rdf:type text:TextDataset ;
    text:dataset :geodata ;
    text:index <#indexLucene>;
.

:geodata rdf:type tdb:DatasetTDB ;
    tdb:location "data/dataforTDB" ;
.

<#indexLucene> a text:TextIndexLucene ;
    text:directory <file:data/lucenePersonIndex> ;
    text:entityMap <#entMap> ;
    text:analyzer [ a text:StandardAnalyzer ] ;
.

<#entMap> a text:EntityMap ;
    text:defaultField "forename" ;
    text:langField "lang" ;
    text:entityField "uri" ;
    text:map (
        [ text:field "forename" ;
          text:predicate gndo:forename ]
        [ text:field "surname" ;
          text:predicate gndo:surname ]
    ) .

If I query only one of the two fields, like this:

PREFIX gndo: <https://d-nb.info/standards/elementset/gnd#>
PREFIX text: <http://jena.apache.org/text#>

SELECT *
WHERE {
    ?y text:query (gndo:surname "Einstein") .
}

The search is very quick, as expected (< 5 ms). What am I doing wrong? As far as I can tell, I'm following the documentation here: https://jena.apache.org/documentation/query/text-query.html

So is it simply this slow when done like that? That seems illogical to me, since an AND query should be quicker than an OR query (in another query I'm combining two text queries with a UNION, and that is still very quick).

If it's in any way relevant, I'm sending the query over python3 by using the SPARQLWrapper package.
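For reference, here is a minimal sketch of how the two queries can be sent and timed from Python. It is not my exact code: it uses only the standard library instead of SPARQLWrapper so that it is self-contained, and the endpoint URL is an assumption based on Fuseki's default port 3030 and the service name and query endpoint from the config above. The helper names (build_request, timed_query, compare) are made up for this illustration.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumption: Fuseki on its default port 3030, service "persondata",
# query endpoint "query" (names taken from the service config).
ENDPOINT = "http://localhost:3030/persondata/query"

PREFIXES = """\
PREFIX gndo: <https://d-nb.info/standards/elementset/gnd#>
PREFIX text: <http://jena.apache.org/text#>
"""

# The slow query: two text:query patterns on the same subject.
BOTH_FIELDS = PREFIXES + """
SELECT *
WHERE {
    ?y text:query (gndo:surname "Einstein") .
    ?y text:query (gndo:forename "Albert") .
}
"""

# The fast query: a single text:query pattern.
ONE_FIELD = PREFIXES + """
SELECT *
WHERE {
    ?y text:query (gndo:surname "Einstein") .
}
"""


def build_request(query, endpoint=ENDPOINT):
    """Build a SPARQL-protocol POST request that asks for JSON results."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )


def timed_query(query, endpoint=ENDPOINT):
    """Send the query; return (elapsed seconds, list of result bindings)."""
    start = time.perf_counter()
    with urllib.request.urlopen(build_request(query, endpoint)) as response:
        results = json.load(response)
    elapsed = time.perf_counter() - start
    return elapsed, results["results"]["bindings"]


def compare(endpoint=ENDPOINT):
    """Run both queries against a live server and print their timings."""
    for label, query in [("one field", ONE_FIELD), ("both fields", BOTH_FIELDS)]:
        elapsed, rows = timed_query(query, endpoint)
        print(f"{label}: {len(rows)} rows in {elapsed:.3f}s")
```

Calling compare() while the Fuseki server is running prints the row count and wall-clock time for each query, which is how the two cases can be compared side by side. The timings include HTTP and JSON parsing overhead, so they are only a rough measure.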

Any help is much appreciated, I'm very new to working with Lucene, Apache Jena and Fuseki, thanks in advance!

For completeness, here is how the fields are organized in the RDF database:

<https://d-nb.info/gnd/100000096> gndo:variantNameForThePerson "La Peirie, Ambroise";
  gndo:variantNameEntityForThePerson _:node1einj54cqx16589985 .

_:node1einj54cqx16589985 gndo:forename "Ambroise";
  gndo:surname "La Peirie" .

I have since found this passage in the Apache Jena documentation:

In principle it should be possible to extend Jena to allow for creating documents with multiple searchable fields by extending org.apache.jena.sparql.core.DatasetChangesBatched such as with org.apache.jena.query.text.TextDocProducerEntities; however, this form of extension is not currently (Jena 3.13.1) functional.

I'm not sure I understand this correctly. Does it refer to my problem? Does it mean that it's simply not possible to run an AND query over two indexed fields at the moment?
