I need to get Wikidata artifacts (instance-types, redirects and disambiguations) for a project.
As the original Wikidata endpoint has time constraints when it comes to querying, I have come across Virtuoso Wikidata endpoint.
The problem I have is that if I try to get for example the redirects with this query, it only returns 100,000 results at most:
PREFIX owl: http://www.w3.org/2002/07/owl#
CONSTRUCT {?resource owl:sameAs ?resource2}
WHERE
{
?resource owl:sameAs ?resource2
}
I’m writing to ask if you know of any way to get more than 100,000 results. I would like to be able to achieve the maximum number of possible results.
Once the results are obtained, I must have 3 files (or as few files as possible) in the Ntriples format: wikidata_intance_types.nt
, wikidata_redirecions.nt
and wikidata_disambiguations.nt
.
Thank you very much in advance.
All the best,
Jose Manuel
Please recognize that in both cases (Wikidata itself, and the Virtuoso instance provided by OpenLink Software, my employer), you are querying against a shared resource, and various limits should be expected.
You should space your queries out over time, and consider smaller chunks than the 100,000 limit you've run into -- perhaps 50,000 at a time, waiting for each query to finish retrieving results, plus another second or ten, before issuing the next query.
Most of the guidance in this article about working with the DBpedia public SPARQL endpoint is relevant for any public SPARQL endpoint, especially those powered by Virtuoso. Specific settings on other endpoints will vary, but if you try to be friendly — by limiting the rate of your queries; limiting the size of partial result sets when using
ORDER BY
,LIMIT
, andOFFSET
to step through to get a full result set for a query that overflows the instance's maximum result set size; and the like — you'll be far more successful.