I would like to get:
- a list of nouns in a specific language
- the case declension table for a word in a Slavic language.
I was hoping to be able to send something like an HTTP GET request with parameters:
https://cs.wiktionary.org/wiki/wordlist/nouns
And for the declension table, I was also hoping for an HTTP request that returns a JSON object, e.g.:
https://cs.wiktionary.org/wiki/program/declension
Expected response:
{
  "word": "program",
  "singular_declension": {
    "nominative": "program",
    "genitive": "programu",
    "dative": "programu",
    "accusative": "program",
    "vocative": "programe",
    "locative": "programu",
    "instrumental": "programem"
  },
  "plural_declension": {
    "nominative": "programy",
    "genitive": "programů",
    "dative": "programům",
    "accusative": "programy",
    "vocative": "programy",
    "locative": "programech",
    "instrumental": "programy"
  }
}
Unfortunately, I cannot find any endpoints for that in the official API specs or documentation: https://www.mediawiki.org/wiki/API:Main_page
How can I get those results? Or do I have to resort to web scraping and extracting this info from the HTML pages?
Action API
Technically, it's possible to use the Action API to retrieve the HTML containing the desired table. Yet this approach is hardly scalable.
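A minimal sketch of that flow, assuming the standard `action=parse` module: list the page's sections first, then request the rendered HTML of a single section by its index. The helper names are hypothetical, the host is simply the one from the question, and nothing is actually sent here; the functions only build URLs and read an already-decoded response.

```python
from urllib.parse import urlencode

# Assumed endpoint: the Czech edition from the question; swap the host for another edition.
API_URL = "https://cs.wiktionary.org/w/api.php"


def sections_url(page):
    """URL that lists all sections of a page (action=parse, prop=sections)."""
    params = {"action": "parse", "page": page, "prop": "sections", "format": "json"}
    return API_URL + "?" + urlencode(params)


def section_html_url(page, index):
    """URL that returns the rendered HTML of one section, selected by its index."""
    params = {"action": "parse", "page": page, "section": index,
              "prop": "text", "format": "json"}
    return API_URL + "?" + urlencode(params)


def find_section_index(response, title):
    """Return the index of the first section whose heading equals `title`, else None."""
    for section in response["parse"]["sections"]:
        if section["line"] == title:
            return section["index"]
    return None
```

That is at least two requests per word, followed by HTML parsing, which is part of why this route scales poorly.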
English Wiktionary
It's hard to say for certain, but English Wiktionary seems to contain most of the lemmas of other languages. program in Czech is also there, with declensions. Furthermore, the section name (Declension) is standard across all languages, and the HTML table carries a lot of metadata for each word form, which makes it much easier to parse.
Yet before implementing parsing, it is usually a good idea to check for existing solutions.
Wiktextract and Kaikki
These two projects are both based on English Wiktionary dumps. Wiktextract provides a CLI and a Python interface for such dumps. Kaikki holds the most recent extracted version and provides an HTTP interface to it. Here are a few examples:
- All Czech nouns (be careful, it's 46.6 MB)
- The description of program, which has a slightly different structure from the desired one but is easily convertible
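As a rough sketch of that conversion, assuming the Wiktextract entry layout in which each entry has `word`, `pos`, and a `forms` list whose items carry `form` and `tags`: the function names are hypothetical, and the sample entry is hand-written for illustration rather than copied from Kaikki.

```python
import json

CASES = ("nominative", "genitive", "dative", "accusative",
         "vocative", "locative", "instrumental")


def to_declension(entry):
    """Convert one Wiktextract-style entry into the structure from the question."""
    result = {"word": entry["word"],
              "singular_declension": {},
              "plural_declension": {}}
    for form in entry.get("forms", []):
        tags = set(form.get("tags", []))
        for number in ("singular", "plural"):
            if number not in tags:
                continue
            for case in CASES:
                if case in tags:
                    # Keep the first form seen for each case/number pair.
                    result[number + "_declension"].setdefault(case, form["form"])
    return result


def nouns(lines):
    """Yield the words of all noun entries from JSON Lines input (one entry per line)."""
    for line in lines:
        if line.strip():
            entry = json.loads(line)
            if entry.get("pos") == "noun":
                yield entry["word"]


# Hand-written sample entry (an assumption, not real Kaikki output).
sample = {"word": "program", "pos": "noun", "forms": [
    {"form": "programu", "tags": ["genitive", "singular"]},
    {"form": "programy", "tags": ["nominative", "plural"]},
]}
print(json.dumps(to_declension(sample), ensure_ascii=False))
```

Forms without a number tag or a case tag are skipped, so bookkeeping entries in `forms` do not end up in the output.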
Sample code
If the above projects cannot be used for some reason, here is some sample parsing code in Python.
First, we need to detect language sections somehow, and a lookup dictionary seems to be a good option:
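Such a lookup could be as small as a mapping from language codes to the names English Wiktionary uses as top-level section headings. The entries and the helper name below are illustrative assumptions, not the original dictionary.

```python
# Language code -> heading of the language section on English Wiktionary.
# Illustrative subset only; extend as needed.
LANGUAGE_SECTIONS = {
    "cs": "Czech",
    "pl": "Polish",
    "sk": "Slovak",
    "ru": "Russian",
}


def language_section(code):
    """Return the section heading for a language code, or raise for unknown codes."""
    try:
        return LANGUAGE_SECTIONS[code]
    except KeyError:
        raise ValueError(f"no section name known for language code {code!r}") from None
```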
Section finder:
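One possible finder, sketched under the assumption that the section list comes from an Action API `action=parse&prop=sections` response, where language headings sit at level 2 and each item has `level`, `line`, and `index`. The function name is hypothetical. Scoping the search to the language section matters because several languages on the same page can each have their own Declension section.

```python
def find_declension_index(sections, language, title="Declension"):
    """Return the index of the `title` section nested under the `language` section.

    `sections` is the list found under parse.sections in the API response.
    Returns None if the language or the section is missing.
    """
    inside = False
    for section in sections:
        if int(section["level"]) == 2:
            # A new language section starts; remember whether it is the wanted one.
            inside = section["line"] == language
        elif inside and section["line"] == title:
            return section["index"]
    return None
```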
Processing:
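A sketch of the processing step using only the standard library. It assumes, and this is an assumption about the rendered markup rather than a documented contract, that each word form in the table is wrapped in a `<span>` whose classes include `form-of` plus a marker such as `gen|s-form-of` encoding case and number. The class name `FormCollector` and the sample snippet are made up for illustration.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collect (marker, text) pairs from <span> elements marked as word forms."""

    SUFFIX = "-form-of"

    def __init__(self):
        super().__init__()
        self.forms = []    # collected (marker, form text) pairs
        self._stack = []   # one [marker or None, text parts] entry per open <span>

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        classes = (dict(attrs).get("class") or "").split()
        marker = None
        if "form-of" in classes:
            # Assumed convention: a class like "gen|s-form-of" carries the grammatical tags.
            marker = next((c[:-len(self.SUFFIX)] for c in classes
                           if c.endswith(self.SUFFIX)), None)
        self._stack.append([marker, []])

    def handle_data(self, data):
        # Text inside nested non-span tags (such as links) still belongs to the open span.
        if self._stack and self._stack[-1][0] is not None:
            self._stack[-1][1].append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._stack:
            marker, parts = self._stack.pop()
            if marker is not None:
                self.forms.append((marker, "".join(parts).strip()))


def collect_forms(html):
    """Parse an HTML fragment and return the list of (marker, form) pairs."""
    parser = FormCollector()
    parser.feed(html)
    parser.close()
    return parser.forms
```

The markers can then be mapped to full case and number names to build the structure from the question.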
Result: