How to get all nouns in a certain language from Wiktionary using SPARQL


I'm trying to query Wiktionary with SPARQL to get all terms that are nouns in a given language (for example, German). For each noun, the output should contain:

  • the string of the noun
  • the grammatical gender (genus): masculine, feminine, or neuter

I am using the SPARQL endpoint http://wiktionary.dbpedia.org/sparql. I found the example query below, but I couldn't figure out how to adapt it to get the information I want.

PREFIX terms:<http://wiktionary.dbpedia.org/terms/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc:<http://purl.org/dc/elements/1.1/>
SELECT ?sword ?slang ?spos ?ssense ?twordRes ?tword ?tlang
FROM <http://wiktionary.dbpedia.org>
WHERE {
    ?swordRes terms:hasTranslation ?twordRes .
    ?swordRes rdfs:label ?sword .
    ?swordRes dc:language ?slang .
    ?swordRes terms:hasPoS ?spos .
    OPTIONAL { ?swordRes terms:hasMeaning ?ssense . }
    OPTIONAL { 
           ?twordBaseRes terms:hasLangUsage ?twordRes . 
           ?twordBaseRes rdfs:label ?tword .
    }
    OPTIONAL { ?twordRes dc:language ?tlang . }
}

On BEST ANSWER

First of all, you want to select all term senses that are nouns. As you can see in the result of the example query, this information is captured by the terms:hasPoS relation. So, to query specifically for all nouns, we could do this:

PREFIX terms: <http://wiktionary.dbpedia.org/terms/>
SELECT ?term
WHERE { 
     ?term terms:hasPoS terms:Noun . 
}
LIMIT 100 

Result

The next thing you want is only nouns of a certain language. This seems to be covered by the dc:language relation, so we add an additional constraint on that relation. Let's say we want all English nouns:

PREFIX terms: <http://wiktionary.dbpedia.org/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT ?term
WHERE { 
    ?term terms:hasPoS terms:Noun ;
          dc:language terms:English . 
}
LIMIT 100 

Result

We are now selecting what you want, but not yet in the format you want: the query above returns only the identifier of the term sense, not the string value of the actual term. As the output of the example query shows, the string value is captured by the rdfs:label property, so we add that:

PREFIX terms: <http://wiktionary.dbpedia.org/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT ?term ?termLabel
WHERE { 
    ?term terms:hasPoS terms:Noun ;
          dc:language terms:English ;
          rdfs:label ?termLabel .
}
LIMIT 100

Result

If you look at this query's result, you'll see something odd about the languages: although we selected English, we also get back labels with a different language tag (e.g. '@ru'). To remove these results, we can restrict the query further so that only English labels are returned:

PREFIX terms: <http://wiktionary.dbpedia.org/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT ?term ?termLabel
WHERE { 
    ?term terms:hasPoS terms:Noun ;
          dc:language terms:English ;
          rdfs:label ?termLabel .
    FILTER(langMatches(lang(?termLabel), "en"))
}
LIMIT 100

Result
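
The queries above differ only in the language resource and the label language tag, so they can be generated from one template. Here is a minimal Python sketch of that idea. The function name build_noun_query and its parameters are hypothetical, and it assumes that every language is modelled as terms:<Name>, the way terms:English is in the queries above. It only builds the query text; it does not contact the endpoint.

```python
import re

# Template mirroring the final query above; doubled braces are literal braces.
QUERY_TEMPLATE = """\
PREFIX terms: <http://wiktionary.dbpedia.org/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?term ?termLabel
WHERE {{
    ?term terms:hasPoS terms:Noun ;
          dc:language terms:{language} ;
          rdfs:label ?termLabel .
    FILTER(langMatches(lang(?termLabel), "{lang_tag}"))
}}
LIMIT {limit}
"""


def build_noun_query(language, lang_tag, limit=100):
    """Return the noun query for one language, e.g. ("German", "de").

    Assumption: the language is available as terms:<language> in the data.
    """
    # Reject anything that could break out of the query text.
    if not re.fullmatch(r"[A-Za-z]+", language):
        raise ValueError("language must be a plain name such as 'German'")
    if not re.fullmatch(r"[A-Za-z]{2,3}(-[A-Za-z0-9]+)*", lang_tag):
        raise ValueError("lang_tag must look like 'de' or 'de-AT'")
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return QUERY_TEMPLATE.format(language=language, lang_tag=lang_tag, limit=limit)
```

Calling build_noun_query("German", "de") produces the German variant of the query, which can then be pasted into the endpoint's query form.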

Finally, the gender/genus. Here I'm not really sure. Looking at some example resources in the Wiktionary data (for example, the entry for dog), I'd say this information is not actually present in the data.

On

The answer from Jeen is great as a start. Here's an option for getting the gender.

English doesn't serve well as an example language since it does not have grammatical gender. Let's take German:

PREFIX terms: <http://wiktionary.dbpedia.org/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX rdfs:<http://www.w3.org/2000/01/rdf-schema#>

SELECT ?term ?termLabel
WHERE { 
    ?term terms:hasPoS terms:Noun ;
          dc:language terms:German ;
          rdfs:label ?termLabel .
    FILTER(langMatches(lang(?termLabel), "de"))
}
LIMIT 100

Result

(The result contains many exact duplicates, and I don't know why they are there. Replacing SELECT with SELECT DISTINCT is the standard SPARQL way to drop identical result rows.)

Take the German term "Eierkopf" instead of the English "dog". We can follow the term link to http://wiktionary.dbpedia.org/resource/Eierkopf-German-Noun, where we see the link to the German Wiktionary, http://de.wiktionary.org/wiki/Eierkopf (we could also have guessed that URL without first fetching anything from wiktionary.dbpedia.org).
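
Guessing the page URL from the resource URI could look roughly like the following Python sketch. It assumes that resource names always follow the <lemma>-<Language>-<PoS> layout seen in the Eierkopf example, which I have not verified for the whole dataset; the function name wiktionary_page_url is hypothetical.

```python
from urllib.parse import quote, unquote

RESOURCE_PREFIX = "http://wiktionary.dbpedia.org/resource/"


def wiktionary_page_url(resource_uri, wiki_lang="de"):
    """Derive the Wiktionary page URL for a wiktionary.dbpedia.org resource."""
    if not resource_uri.startswith(RESOURCE_PREFIX):
        raise ValueError("not a wiktionary.dbpedia.org resource URI")
    local_name = unquote(resource_uri[len(RESOURCE_PREFIX):])
    # Assumed layout: <lemma>-<Language>-<PoS>. Splitting from the right
    # keeps hyphens that belong to the lemma itself.
    parts = local_name.rsplit("-", 2)
    if len(parts) != 3 or not parts[0]:
        raise ValueError("unexpected resource name: " + local_name)
    lemma = parts[0]
    # MediaWiki page titles use underscores in place of spaces.
    title = quote(lemma.replace(" ", "_"))
    return f"https://{wiki_lang}.wiktionary.org/wiki/{title}"
```

For the resource above this yields the Eierkopf page on de.wiktionary.org, so the intermediate request to wiktionary.dbpedia.org can be skipped.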

There, the genus can be extracted from the text "Substantiv, m" (m for masculine).

The options for German are:

<em title="Genus: Maskulinum (grammatikalisches Geschlecht: männlich)">m</em>
<em title="Genus: Femininum (grammatikal. Geschlecht: weiblich)">f</em>
<em title="Genus: Neutrum (grammatikal. Geschlecht: sächlich)">n</em>
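
Pulling the marker out of the page could be sketched in Python as below. This is only an illustration under the assumption that the page keeps the exact <em title="Genus: ..."> markup shown above; a regular expression on HTML is fragile and will break if the markup changes. The names extract_genera and GENUS_NAMES are hypothetical.

```python
import re

# Matches the <em title="Genus: ...">m</em> markers listed above.
GENUS_PATTERN = re.compile(r'<em\s+title="Genus:[^"]*">\s*([mfn])\s*</em>')

GENUS_NAMES = {"m": "masculine", "f": "feminine", "n": "neuter"}


def extract_genera(html):
    """Return the genera found in html, in page order, without repeats."""
    found = []
    for marker in GENUS_PATTERN.findall(html):
        name = GENUS_NAMES[marker]
        if name not in found:
            found.append(name)
    return found
```

The function returns a list rather than a single value because a page can carry more than one marker; an empty list means no marker was found.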

If a noun has different genders depending on region or dialect, the official gender appears in the HTML as above, and a comment appears below it. Example:

https://de.wiktionary.org/wiki/Butter

So, besides the SPARQL query, this approach requires one or two web page requests per word and some HTML content extraction.

On

I know that Wikidata is not Wiktionary, but you can now get, e.g., all German nouns in the Wikidata lexeme namespace via a query to the Wikidata Query Service. For instance,

SELECT
  ?lexeme ?lemma
WITH {      
  SELECT 
    ?lexeme
  WHERE {
    ?lexeme wikibase:lexicalCategory wd:Q1084 ;
            dct:language wd:Q188 .
  }
  GROUP BY ?lexeme
} AS %lexemes
WHERE {
  INCLUDE %lexemes
  ?lexeme wikibase:lemma ?lemma
}

This query currently returns "164624 results in 6767 ms"
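
Wikidata lexemes can also carry the grammatical gender, which was the missing piece in the original question. The Python sketch below holds a variant of the query that asks for it and a small helper that reads the standard SPARQL JSON results format. It assumes that P5185 is the "grammatical gender" property (please verify this on Wikidata before relying on it), and the names GERMAN_NOUNS_WITH_GENDER and lemma_gender_pairs are hypothetical. The sketch makes no network request; it only processes a response that has already been downloaded and decoded.

```python
# Variant of the query above that also asks for the gender label.
GERMAN_NOUNS_WITH_GENDER = """\
SELECT ?lexeme ?lemma ?genderLabel
WHERE {
  ?lexeme wikibase:lexicalCategory wd:Q1084 ;   # noun
          dct:language wd:Q188 ;                # German
          wikibase:lemma ?lemma .
  # Assumption: P5185 is the "grammatical gender" property.
  OPTIONAL { ?lexeme wdt:P5185 ?gender . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""


def lemma_gender_pairs(payload):
    """Turn a decoded SPARQL JSON result into (lemma, gender) tuples.

    gender is None for rows where the OPTIONAL part did not match.
    """
    pairs = []
    for binding in payload["results"]["bindings"]:
        lemma = binding["lemma"]["value"]
        gender = binding.get("genderLabel", {}).get("value")
        pairs.append((lemma, gender))
    return pairs
```

A lexeme with more than one gender statement would show up as several rows, one per gender, so the pairs may need grouping by lemma afterwards.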