Introduction
I have Freebase loaded into Virtuoso Open-Source. The database is served from the following endpoint: http://SERVER_SPARQL:8890/sparql. I want to extract the descriptions of all the entities that have the type common.topic.
C++ class
To access the database I am using curlpp and C++. I am fairly sure the class is not the problem, but I'll post it anyway.
#include "SparqlQuery.h"
#include <curlpp/cURLpp.hpp>
#include <curlpp/Easy.hpp>
#include <curlpp/Options.hpp>
#include <curlpp/Exception.hpp>
#include <sstream>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;
SparqlQuery::SparqlQuery(const string & query, const string & url):_query(query),_url(url){}
size_t WriteCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
((std::string*)userp)->append((char*)contents, size * nmemb);
return size * nmemb;
}
string SparqlQuery::retrieveInformations()
{
CURL *curl;
CURLcode res;
curl = curl_easy_init();
cout << _url << endl;
string readBuffer = "";
if(curl)
{
//convert string query to URL format
char * parameters = curl_easy_escape(curl, _query.c_str(), (int)_query.size());
//Format query according to the sparql endpoint site
string query = "query=";
string tothis(parameters);
string buffer = query + tothis;
curl_free(parameters);
//Launch query and retrieve informations in form of an xml structure
curl_easy_setopt(curl, CURLOPT_URL, _url.c_str());
curl_easy_setopt(curl, CURLOPT_POSTFIELDS, buffer.c_str());
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
curl_easy_setopt(curl, CURLOPT_WRITEDATA, &readBuffer);
res = curl_easy_perform(curl);
curl_easy_cleanup(curl);
}
return readBuffer;
}
Number of entities
When I use the following query, it returns 43453748 entities of type common.topic, which is fine.
void loadData()
{
cout <<"loading data" << endl;
string serverURL = "http://SERVER_SPARQL:8890/sparql";
string query = "select count(*) where {?head <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdf.basekb.com/ns/common.topic>.}";
SparqlQuery sparql(query, serverURL);
string result = sparql.retrieveInformations();
cout << result << endl;
}
Problem
When I use the following query, it does not return more than 10,000 entities. Even if I run the query from a browser, it does not return the full set of entities. Do you have any idea why the full results are not returned? It is as if there were a limit on the number of results.
void Freebase::loadData()
{
cout <<"loading data" << endl;
string serverURL = "http://SERVER_SPARQL:8890/sparql";
string query = "select ?head, ?description where { ?head <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://rdf.basekb.com/ns/common.topic>.?head <http://rdf.basekb.com/ns/common.topic.description> ?description.}";
SparqlQuery sparql(query, serverURL);
string result = sparql.retrieveInformations();
cout << result << endl;
}
Modifications
I tried to modify the class by adding the should-sponge option, but now I get the following error:
Virtuoso 22023 Error SR078: The result set is too long, must limit result for at most 2097151 rows
using namespace std;
SparqlQuery::SparqlQuery(const string & query, const string & url):_query(query),_url(url){}
size_t WriteCallback(void *contents, size_t size, size_t nmemb, void *userp)
{
((std::string*)userp)->append((char*)contents, size * nmemb);
return size * nmemb;
}
string SparqlQuery::retrieveInformations()
{
CURL *curl;
CURLcode res;
curl = curl_easy_init();
cout << _url << endl;
string readBuffer = "";
if(curl)
{
//convert string query to URL format
char * parameters = curl_easy_escape(curl, _query.c_str(), (int)_query.size());
string val = "grab-all-seealso";
char * sponge_parameters = curl_easy_escape(curl, val.c_str(), (int)val.size());
//Format query according to the sparql endpoint site
string should_sponge= "should-sponge=";
string last(sponge_parameters);
string buffer1 = should_sponge + last;
string query = "query=";
string tothis(parameters);
string buffer = query + tothis +"&"+ buffer1;
cout << buffer << endl;
curl_free(parameters);
curl_free(sponge_parameters);
//Launch query and retrieve informations in form of an xml structure
curl_easy_setopt(curl, CURLOPT_URL, _url.c_str());
curl_easy_setopt(curl, CURLOPT_POSTFIELDS, buffer.c_str());
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, WriteCallback);
curl_easy_setopt(curl, CURLOPT_WRITEDATA, &readBuffer);
res = curl_easy_perform(curl);
curl_easy_cleanup(curl);
}
return readBuffer;
}
Update
As suggested, I used LIMIT and OFFSET. According to this post, it is not necessary to use ORDER BY, which in fact suits me better: it is much faster and I am not getting a timeout exception.
for(int offset = 0; ;offset += 1000 )
{
string query = "PREFIX basekb:<http://rdf.basekb.com/ns/> PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#> PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#> select distinct ?head ?label where { ?head rdf:type basekb:common.topic. ?head rdfs:label ?label}LIMIT 1000 OFFSET " + std::to_string(offset) ;
SparqlQuery sparql(query, _serverURL);
string result = sparql.retrieveInformations();
cout << result << endl;
}
I just have one more question, please. How will I know when I have gone past the last offset? I hope that I won't be getting duplicate results.
To a first approximation, all entities with a common.topic.description property will be of type common.topic, so a simpler way to do this for many applications would be to just use grep on the source RDF file to look for all the lines with common.topic.description triples. This will give you an answer in minutes instead of hours.
If you want to stick with SPARQL, try adding LIMIT and OFFSET clauses to your query and iterating to get all the results (as mentioned by Joshua Taylor, but buried in a comment).
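As for knowing when to stop iterating: one option is to stop as soon as a page comes back with fewer rows than the LIMIT. The sketch below assumes the endpoint returns SPARQL XML results in which each row is a plain <result> element without attributes. fetchPage is a hypothetical callback that would wrap SparqlQuery::retrieveInformations() and append the LIMIT and OFFSET clauses. Note that the SPARQL specification only promises predictable pages when ORDER BY is used, so without it duplicates or gaps between pages are possible in principle.

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Counts <result> elements in a SPARQL XML response by substring search.
// Assumption: the element is written exactly as "<result>", no attributes.
std::size_t countResults(const std::string& xml)
{
    const std::string tag = "<result>";
    std::size_t count = 0;
    for (std::size_t pos = xml.find(tag); pos != std::string::npos;
         pos = xml.find(tag, pos + tag.size())) {
        ++count;
    }
    return count;
}

// Pages through the result set, starting at offset 0, and returns the
// total number of rows seen. fetchPage(offset, limit) is hypothetical:
// it should run the query with "LIMIT <limit> OFFSET <offset>" and
// return the raw response body.
std::size_t fetchAllPages(
    const std::function<std::string(std::size_t, std::size_t)>& fetchPage,
    std::size_t limit)
{
    std::size_t total = 0;
    for (std::size_t offset = 0; ; offset += limit) {
        const std::size_t rows = countResults(fetchPage(offset, limit));
        total += rows;
        if (rows < limit)  // short or empty page: nothing is left
            break;
    }
    return total;
}
```

When the total happens to be an exact multiple of the limit, the loop issues one extra request, which returns an empty page and ends the iteration.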