Why is a SPARQL endpoint on a local server not returning the full set of entities for a given type?


Introduction

I have Freebase loaded into Virtuoso Open-Source. The dataset is located on the following server: http://SERVER_SPARQL:8890/sparql. I want to extract the descriptions of all entities that have the type common.topic.


C++ class

To access the database I am using curlpp and C++. I am pretty sure the class is not the problem, but I'll post it anyway.

#include "SparqlQuery.h"

#include <curlpp/cURLpp.hpp>
#include <curlpp/Easy.hpp>
#include <curlpp/Options.hpp>
#include <curlpp/Exception.hpp>
#include <iostream>
#include <sstream>
#include <fstream>
#include <string>

using namespace std;

SparqlQuery::SparqlQuery(const string & query, const string & url):_query(query),_url(url){}

size_t  WriteCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
    ((std::string*)userp)->append((char*)contents, size * nmemb);
    return size * nmemb;
}

string SparqlQuery::retrieveInformations()
{
    CURL *curl;
    CURLcode res;
    curl = curl_easy_init();
    cout << _url << endl;
    string readBuffer = "";
    if(curl)
    {
        //convert string query to URL format
        char * parameters = curl_easy_escape(curl, _query.c_str(), (int)_query.size());

        //Format query according to the sparql endpoint site
        string query = "query=";
        string tothis(parameters);
        string buffer = query + tothis;

        curl_free(parameters);

        //Launch query and retrieve informations in form of an xml structure
        curl_easy_setopt(curl, CURLOPT_URL, _url.c_str());
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, buffer.c_str());
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &readBuffer);
        res = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
    }
    return readBuffer;
}

Number of entities

When I use the following query, it returns 43453748 entities of type common.topic, which is fine.

void loadData()
{
  cout <<"loading data" << endl;
  string serverURL = "http://SERVER_SPARQL:8890/sparql";
  string query = "select count(*) where {?head <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdf.basekb.com/ns/common.topic>.}";

  SparqlQuery sparql(query, serverURL);
  string result = sparql.retrieveInformations();
  cout << result << endl;
}

Problem

When I use the following query, it doesn't output more than 10,000 entities. Even if I run the query from a browser, it does not return the full set of entities. Do you have any idea why it doesn't return the full results? It's as if there is a limit on the number of results.

void Freebase::loadData()
{
  cout <<"loading data" << endl;
  string serverURL = "http://SERVER_SPARQL:8890/sparql";
  string query = "select ?head, ?description where { ?head <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdf.basekb.com/ns/common.topic>.?head <http://rdf.basekb.com/ns/common.topic.description>  ?description.}";

  SparqlQuery sparql(query, serverURL);
  string result = sparql.retrieveInformations();
  cout << result << endl;
}

Modifications

I tried to modify the class by adding the should-sponge option, but now I get the following error:

Virtuoso 22023 Error SR078: The result set is too long, must limit result for at most 2097151 rows (DAMN IT!!!!!!!!!!!!!!!)

using namespace std;

SparqlQuery::SparqlQuery(const string & query, const string & url):_query(query),_url(url){}

size_t  WriteCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
    ((std::string*)userp)->append((char*)contents, size * nmemb);
    return size * nmemb;
}

string SparqlQuery::retrieveInformations()
{
    CURL *curl;
    CURLcode res;
    curl = curl_easy_init();
    cout << _url << endl;
    string readBuffer = "";
    if(curl)
    {
        //convert string query to URL format
        char * parameters = curl_easy_escape(curl, _query.c_str(), (int)_query.size());
        string val = "grab-all-seealso";
        char * sponge_parameters = curl_easy_escape(curl, val.c_str(), (int)val.size());

        //Format query according to the sparql endpoint site
        string should_sponge= "should-sponge=";
        string last(sponge_parameters);
        string buffer1 = should_sponge + last;

        string query = "query=";
        string tothis(parameters);
        string buffer = query + tothis +"&"+ buffer1;
        cout << buffer << endl;

        curl_free(parameters);
        curl_free(sponge_parameters);

        //Launch query and retrieve informations in form of an xml structure
        curl_easy_setopt(curl, CURLOPT_URL, _url.c_str());
        curl_easy_setopt(curl, CURLOPT_POSTFIELDS, buffer.c_str());
        curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
        curl_easy_setopt(curl, CURLOPT_WRITEDATA, &readBuffer);
        res = curl_easy_perform(curl);
        curl_easy_cleanup(curl);
    }
    return readBuffer;
}

Update

As suggested, I used LIMIT and OFFSET. According to this post, it's not necessary to use ORDER BY, which in fact suits me better: it's much faster and I am not getting a time limit exception.

for(int offset = 1000; ;offset += 1000 )
  {
    string query = "PREFIX basekb:<http://rdf.basekb.com/ns/> PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> select distinct ?head ?label where { ?head rdf:type basekb:common.topic. ?head rdfs:label ?label}LIMIT 1000 OFFSET " + std::to_string(offset) ;
    SparqlQuery sparql(query, _serverURL);
    string result = sparql.retrieveInformations();
    cout << result << endl;
  }

I just have one more question: how will I know when I have gone past the last offset? I hope I won't get duplicate results.


There are 3 best solutions below

3
On BEST ANSWER

To a first approximation, all entities with a common.topic.description property will be of type common.topic, so a simpler way to do this for many applications would be to just use grep on the source RDF file to look for all the lines with common.topic.description triples.

This will give you an answer in minutes instead of hours.

If you want to stick with SPARQL, try adding LIMIT and OFFSET clauses to your query and iterating to get all the results (as mentioned by Joshua Taylor, but buried in a comment).
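A rough sketch of that iteration, including a stop condition: keep requesting pages until one comes back with fewer rows than the page size. The names `countRows`, `fetchAllPages` and `runQuery` are hypothetical, and the row count assumes the endpoint answers in the SPARQL XML results format, where each row is a `<result>` element.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Count <result> elements in a SPARQL XML result document.
// The search string "<result>" does not match the enclosing "<results ...>" tag.
std::size_t countRows(const std::string& xml)
{
    const std::string tag = "<result>";
    std::size_t rows = 0;
    for (std::size_t pos = xml.find(tag); pos != std::string::npos;
         pos = xml.find(tag, pos + tag.size())) {
        ++rows;
    }
    return rows;
}

// Request pages starting at OFFSET 0 and stop after the first short page.
// `runQuery` is a hypothetical callback that sends one query string to the
// endpoint and returns the raw response body. Returns the total row count.
std::size_t fetchAllPages(const std::string& baseQuery, std::size_t pageSize,
                          const std::function<std::string(const std::string&)>& runQuery)
{
    std::size_t total = 0;
    for (std::size_t offset = 0; ; offset += pageSize) {
        const std::string query = baseQuery + " LIMIT " + std::to_string(pageSize) +
                                  " OFFSET " + std::to_string(offset);
        const std::size_t rows = countRows(runQuery(query));
        total += rows;
        if (rows < pageSize) {
            break;  // short or empty page: nothing left to fetch
        }
    }
    return total;
}
```

With the class from the question, `runQuery` could be a lambda that builds a SparqlQuery from the query string and the server URL and returns retrieveInformations(). Two caveats: an error response also contains zero `<result>` elements, so it ends the loop just like the last page does, and without ORDER BY the SPARQL specification does not guarantee a stable order between requests, so duplicates or gaps across pages are possible in principle.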

3
On

You are restricting your SPARQL query with:

LIMIT 10

0
On

When using SPARQL over HTTP, Virtuoso seems to impose a lot of restrictions on your queries and on how the result sets are constructed. One that I've frequently heard of but can't really explain is that the maximum result set size (despite virtuoso.ini settings) seems to be 1M rows.
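For reference, the virtuoso.ini settings in question live, as far as I know, in the [SPARQL] section. The values below are the ones I remember from a stock open-source install, so treat them as an assumption and check your own file. A ResultSetMaxRows of 10000 would explain the 10,000-row cap described in the question.

```ini
[SPARQL]
ResultSetMaxRows           = 10000
MaxQueryCostEstimationTime = 400   ; in seconds
MaxQueryExecutionTime      = 60    ; in seconds
```

Changes to this file normally take effect only after the server is restarted.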

There are a couple of sadly less "open" ways to run SPARQL queries on a Virtuoso endpoint that don't fall under these restrictions.

What I'd probably try first, if you don't need to do this programmatically over and over again and if you control the server, is to run the query in the isql/isql-vt command-line interface. You can run SPARQL queries from there like this:

sparql select * {?s ?p ?o} limit 10;

There also seems to be a CSV variable that you can set to get "CSV" like output, but be aware that the encoding is weird.

Other options are the ODBC/JDBC connectors e.g., like the one the DWSLab is using.