Filter, subset and download Wikidata


Is there an easy way to filter data in Wikidata and download a portion of its claims?

For example, say I want a list of all humans who are currently alive and have an active Twitter profile.

I would like to download a file containing their Q-ids, names and Twitter usernames (https://www.wikidata.org/wiki/Property:P2002).

I expect there to be hundreds of thousands of results, if not millions.

What is the best way to obtain this information?

I am not sure whether the results of a SPARQL query can be collected into a file.

I also looked at the MediaWiki API, but I am not sure whether it allows accessing multiple entities in one go.

Thanks!


1 Answer


Wikidata currently has around 190,000 Twitter IDs linked to people. You can easily get them all using the SPARQL query service web interface at https://query.wikidata.org/ (with a LIMIT you can remove or increase). In the dropdown on the right, choose "SPARQL endpoint" for a direct link to the full result (no limit, roughly a 35 MB .csv).
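If you prefer to script the download instead of using the web interface, the same query can be sent straight to the endpoint. Here is a minimal sketch, assuming Python with the requests library and treating "alive" as "no date of death (P570)"; the output filename and User-Agent string are placeholders:

```python
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

# Humans (P31 = Q5) with a Twitter username (P2002); "alive" is approximated
# here as "no date of death (P570)".  Raise or drop the LIMIT once the query
# works for you.
QUERY = """
SELECT ?person ?personLabel ?twitter WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P2002 ?twitter .
  FILTER NOT EXISTS { ?person wdt:P570 ?dateOfDeath . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 1000
"""

# Asking for text/csv makes the endpoint return a file you can save as-is.
response = requests.get(
    SPARQL_ENDPOINT,
    params={"query": QUERY},
    headers={
        "Accept": "text/csv",
        # Use a descriptive User-Agent of your own; this one is a placeholder.
        "User-Agent": "twitter-id-export/0.1 (example@example.org)",
    },
    timeout=300,
)
response.raise_for_status()

with open("twitter_ids.csv", "w", encoding="utf-8") as out:
    out.write(response.text)
```

The SERVICE wikibase:label line is what provides the names; dropping it (and the ?personLabel column) makes the query noticeably cheaper if Q-ids and usernames are all you need.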

But in case you run into timeouts with more complicated queries, you can first try paging with LIMIT and OFFSET (there is a sketch of that near the end of this answer), or use one of the following:

Wikibase Dump Filter is a CLI tool that downloads the full Wikidata dump but filters the stream as it comes in according to your needs. You can put together much the same thing yourself with some creative piping (a rough sketch closes this answer), and it tends to work better than one would expect.

wdumps.toolforge.org does more or less the same thing, but on the server side, and then lets you download the filtered data.

The linked data interface also works rather well for "simple query, high volume" access needs: a triple-pattern query on the P2002 predicate gives all Twitter IDs (326,000+), and you can read them in pages as fast as you can generate GET requests (set an appropriate Accept header to get JSON). A sketch follows below.
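As a sketch of that paged access, assuming the interface meant here is the Linked Data Fragments endpoint at https://query.wikidata.org/bigdata/ldf with its predicate/page parameters (the output filenames are placeholders):

```python
import json
import requests

LDF_ENDPOINT = "https://query.wikidata.org/bigdata/ldf"
TWITTER_PREDICATE = "http://www.wikidata.org/prop/direct/P2002"

def fetch_page(page):
    """Fetch one page of '?s wdt:P2002 ?o' triples as JSON-LD."""
    response = requests.get(
        LDF_ENDPOINT,
        params={"predicate": TWITTER_PREDICATE, "page": page},
        headers={"Accept": "application/ld+json"},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()

# Grab the first few pages; in practice you would keep going until a page
# no longer contains data triples.
for page in range(1, 4):
    data = fetch_page(page)
    # The exact JSON-LD layout is easiest to inspect from a saved file.
    with open(f"p2002_page_{page}.jsonld", "w", encoding="utf-8") as out:
        json.dump(data, out)
```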
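For completeness, here is the LIMIT/OFFSET paging mentioned above, again as a rough Python sketch. Note that without an ORDER BY the slices are not guaranteed to be stable between requests, and the ORDER BY itself adds cost, so treat this as a fallback:

```python
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
PAGE_SIZE = 50000  # tune to whatever the endpoint serves without timing out

# Same selection as before, fetched one slice at a time.
QUERY_TEMPLATE = """
SELECT ?person ?twitter WHERE {{
  ?person wdt:P31 wd:Q5 ;
          wdt:P2002 ?twitter .
  FILTER NOT EXISTS {{ ?person wdt:P570 ?dateOfDeath . }}
}}
ORDER BY ?person
LIMIT {limit} OFFSET {offset}
"""

def fetch_slice(offset):
    response = requests.get(
        SPARQL_ENDPOINT,
        params={"query": QUERY_TEMPLATE.format(limit=PAGE_SIZE, offset=offset)},
        headers={"Accept": "text/csv", "User-Agent": "twitter-id-export/0.1"},
        timeout=300,
    )
    response.raise_for_status()
    lines = response.text.splitlines()
    # Every slice repeats the CSV header; keep it only once.
    return lines if offset == 0 else lines[1:]

rows, offset = [], 0
while True:
    chunk = fetch_slice(offset)
    if not chunk:
        break
    rows.extend(chunk)
    offset += PAGE_SIZE

with open("twitter_ids_paged.csv", "w", encoding="utf-8") as out:
    out.write("\n".join(rows) + "\n")
```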
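And here is the "creative piping" idea behind the Wikibase Dump Filter option, sketched directly in Python. It assumes you have already downloaded the compressed JSON dump (latest-all.json.bz2, one entity per line inside a single large array) and only approximates what the real tool does:

```python
import bz2
import json

def iter_entities(path):
    """Stream entities from the JSON dump: a huge array, one entity per line."""
    with bz2.open(path, "rt", encoding="utf-8") as dump:
        for line in dump:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)

def claim_values(entity, prop):
    """Yield the main values of every statement for a given property."""
    for statement in entity.get("claims", {}).get(prop, []):
        datavalue = statement.get("mainsnak", {}).get("datavalue")
        if datavalue:
            yield datavalue["value"]

with open("twitter_ids_from_dump.tsv", "w", encoding="utf-8") as out:
    for entity in iter_entities("latest-all.json.bz2"):
        is_human = any(isinstance(v, dict) and v.get("id") == "Q5"
                       for v in claim_values(entity, "P31"))
        is_alive = "P570" not in entity.get("claims", {})  # no date of death
        twitter = [v for v in claim_values(entity, "P2002") if isinstance(v, str)]
        if is_human and is_alive and twitter:
            label = entity.get("labels", {}).get("en", {}).get("value", "")
            out.write(f"{entity['id']}\t{label}\t{twitter[0]}\n")
```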