How can I extract personal names from this JSON file (wiktionary dump)?


This link contains all 'proper names' in Wiktionary, in all languages. That includes personal names like Tatiana, Zadie or Richard. However, it also includes names of countries, towns, rivers, and so on.

I want to extract all records which are personal names.

The records I want to extract either have the string "given name" in them or the string "surname" (some have both).

For example the name Fabian:

{"pos": "name", "wikipedia": ["Fabian (name)"], "head_templates": [{"name": "en-proper noun", "args": {}, "expansion": "Fabian"}], "etymology_text": "From Latin Fabiānus (“belonging to Fabius”), derived from Fabius + -ānus.", "etymology_templates": [{"name": "der", "args": {"1": "en", "2": "la", "3": "Fabiānus", "4": "", "5": "belonging to Fabius"}, "expansion": "Latin Fabiānus (“belonging to Fabius”)"}, {"name": "m", "args": {"1": "la", "2": "Fabius"}, "expansion": "Fabius"}, {"name": "m", "args": {"1": "la", "2": "-ānus"}, "expansion": "-ānus"}], "sounds": [{"ipa": "/ˈfeɪbi.ən/"}, {"audio": "LL-Q1860 (eng)-Vealhurl-Fabian.wav", "text": "Audio (Southern England)", "tags": ["Southern-England"], "ogg_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/2/2b/LL-Q1860_%28eng%29-Vealhurl-Fabian.wav/LL-Q1860_%28eng%29-Vealhurl-Fabian.wav.ogg", "mp3_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/2/2b/LL-Q1860_%28eng%29-Vealhurl-Fabian.wav/LL-Q1860_%28eng%29-Vealhurl-Fabian.wav.mp3"}], "word": "Fabian", "lang": "English", "lang_code": "en", "senses": [{"links": [["given name", "given name"]], "raw_glosses": ["(rare) A male given name from Latin."], "glosses": ["A male given name from Latin."], "tags": ["rare"], "id": "Fabian-en-name-XC4~mcw6", "categories": [{"name": "English given names", "kind": "topical", "parents": ["Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}, {"name": "English male given names", "kind": "topical", "parents": ["Male given names", "Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}], "translations": [{"lang": "Aragonese", "code": "an", "sense": "male given name", "tags": ["masculine"], "word": "Fabián", "_dis1": "96 4"}, {"lang": "Catalan", "code": "ca", "sense": "male given name", "word": "Fabià", "_dis1": "96 4"}, {"lang": "Faroese", "code": "fo", 
"sense": "male given name", "tags": ["masculine"], "word": "Fabian", "_dis1": "96 4"}, {"lang": "French", "code": "fr", "sense": "male given name", "word": "Fabien", "_dis1": "96 4"}, {"lang": "Galician", "code": "gl", "sense": "male given name", "word": "Fabián", "_dis1": "96 4"}, {"lang": "Galician", "code": "gl", "sense": "male given name", "word": "Fabiano", "_dis1": "96 4"}, {"lang": "Galician", "code": "gl", "sense": "male given name", "word": "Fabio", "_dis1": "96 4"}, {"lang": "German", "code": "de", "sense": "male given name", "tags": ["masculine"], "word": "Fabian", "_dis1": "96 4"}, {"lang": "Hungarian", "code": "hu", "sense": "male given name", "word": "Fábián", "_dis1": "96 4"}, {"lang": "Italian", "code": "it", "sense": "male given name", "word": "Fabiano", "_dis1": "96 4"}, {"lang": "Polish", "code": "pl", "sense": "male given name", "tags": ["masculine", "person"], "word": "Fabian", "_dis1": "96 4"}, {"lang": "Portuguese", "code": "pt", "sense": "male given name", "word": "Fabiano", "_dis1": "96 4"}, {"lang": "Spanish", "code": "es", "sense": "male given name", "word": "Fabián", "_dis1": "96 4"}, {"lang": "Swedish", "code": "sv", "sense": "male given name", "word": "Fabian", "_dis1": "96 4"}]}, {"links": [["surname", "surname"]], "glosses": ["A surname."], "id": "Fabian-en-name-EMUC1F3L", "categories": [{"name": "English surnames", "kind": "other", "parents": [], "source": "w"}]}]}
{"pos": "name", "head_templates": [{"name": "head", "args": {"1": "fo", "2": "proper noun", "g": "m"}, "expansion": "Fabian m"}], "inflection_templates": [{"name": "fo-decl-proper-noun-s-indef", "args": {"1": "Fabian", "2": "Fabian", "3": "Fabiani", "4": "Fabians"}}], "forms": [{"form": "", "source": "declension", "tags": ["table-tags"]}, {"form": "fo-decl-proper-noun-s-indef", "source": "declension", "tags": ["inflection-template"]}, {"form": "Fabian", "tags": ["indefinite", "nominative"], "source": "declension"}, {"form": "Fabian", "tags": ["accusative", "indefinite"], "source": "declension"}, {"form": "Fabiani", "tags": ["dative", "indefinite"], "source": "declension"}, {"form": "Fabians", "tags": ["genitive", "indefinite"], "source": "declension"}], "word": "Fabian", "lang": "Faroese", "lang_code": "fo", "senses": [{"links": [["given name", "given name"]], "glosses": ["a male given name"], "tags": ["masculine"], "id": "Fabian-fo-name-h8YdwBAs", "categories": [{"name": "Faroese given names", "kind": "topical", "parents": ["Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}, {"name": "Faroese male given names", "kind": "topical", "parents": ["Male given names", "Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}]}]}
{"pos": "name", "head_templates": [{"name": "head", "args": {"1": "de", "2": "proper noun", "g": "m"}, "expansion": "Fabian m"}], "etymology_text": "Borrowed from Latin Fabiānus (“belonging to Fabius”).", "etymology_templates": [{"name": "glossary", "args": {"1": "loanword", "2": "Borrowed"}, "expansion": "Borrowed"}, {"name": "bor", "args": {"1": "de", "2": "la", "3": "Fabiānus", "4": "", "5": "belonging to Fabius", "lit": "", "pos": "", "tr": "", "ts": "", "id": "", "sc": "", "g": "", "g2": "", "g3": "", "nocat": "", "sort": ""}, "expansion": "Latin Fabiānus (“belonging to Fabius”)"}, {"name": "bor+", "args": {"1": "de", "2": "la", "3": "Fabiānus", "4": "", "5": "belonging to Fabius"}, "expansion": "Borrowed from Latin Fabiānus (“belonging to Fabius”)"}], "sounds": [{"ipa": "/ˈfaːbian/"}, {"audio": "De-Fabian.ogg", "text": "Audio", "ogg_url": "https://upload.wikimedia.org/wikipedia/commons/c/c9/De-Fabian.ogg", "mp3_url": "https://upload.wikimedia.org/wikipedia/commons/transcoded/c/c9/De-Fabian.ogg/De-Fabian.ogg.mp3"}], "word": "Fabian", "lang": "German", "lang_code": "de", "senses": [{"links": [["given name", "given name"]], "glosses": ["a male given name"], "tags": ["masculine"], "id": "Fabian-de-name-h8YdwBAs", "categories": [{"name": "German given names", "kind": "topical", "parents": ["Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}, {"name": "German male given names", "kind": "topical", "parents": ["Male given names", "Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}]}]}
{"pos": "name", "head_templates": [{"name": "head", "args": {"1": "oc", "2": "proper noun", "head": "", "g": "m", "g2": ""}, "expansion": "Fabian m"}, {"name": "oc-proper noun", "args": {"1": "m"}, "expansion": "Fabian m"}], "word": "Fabian", "lang": "Occitan", "lang_code": "oc", "senses": [{"links": [["given name", "given name"], ["Fabian", "Fabian#English"]], "raw_glosses": ["(Gascony) a male given name, equivalent to English Fabian"], "glosses": ["a male given name, equivalent to English Fabian"], "tags": ["Gascony", "masculine"], "id": "Fabian-oc-name-VtvZQ6Yw", "categories": [{"name": "Gascon", "kind": "other", "parents": [], "source": "w"}, {"name": "Occitan given names", "kind": "topical", "parents": ["Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}, {"name": "Occitan male given names", "kind": "topical", "parents": ["Male given names", "Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}]}]}
{"pos": "name", "head_templates": [{"name": "pl-proper noun", "args": {"1": "m-pr"}, "expansion": "Fabian m pers"}], "inflection_templates": [{"name": "pl-decl-noun-m-pr", "args": {"nomp": "Fabianowie"}}], "forms": [{"form": "", "source": "declension", "tags": ["table-tags"]}, {"form": "pl-decl-noun-m-pr", "source": "declension", "tags": ["inflection-template"]}, {"form": "Fabian", "tags": ["nominative", "singular"], "source": "declension"}, {"form": "Fabianowie", "tags": ["nominative", "plural"], "source": "declension"}, {"form": "Fabiana", "tags": ["genitive", "singular"], "source": "declension"}, {"form": "Fabianów", "tags": ["genitive", "plural"], "source": "declension"}, {"form": "Fabianowi", "tags": ["dative", "singular"], "source": "declension"}, {"form": "Fabianom", "tags": ["dative", "plural"], "source": "declension"}, {"form": "Fabiana", "tags": ["accusative", "singular"], "source": "declension"}, {"form": "Fabianów", "tags": ["accusative", "plural"], "source": "declension"}, {"form": "Fabianem", "tags": ["instrumental", "singular"], "source": "declension"}, {"form": "Fabianami", "tags": ["instrumental", "plural"], "source": "declension"}, {"form": "Fabianie", "tags": ["locative", "singular"], "source": "declension"}, {"form": "Fabianach", "tags": ["locative", "plural"], "source": "declension"}, {"form": "Fabianie", "tags": ["singular", "vocative"], "source": "declension"}, {"form": "Fabianowie", "tags": ["plural", "vocative"], "source": "declension"}], "etymology_text": "Borrowed from Latin Fabianus.", "etymology_templates": [{"name": "glossary", "args": {"1": "loanword", "2": "Borrowed"}, "expansion": "Borrowed"}, {"name": "bor", "args": {"1": "pl", "2": "la", "3": "Fabianus", "4": "", "5": "", "lit": "", "pos": "", "tr": "", "ts": "", "id": "", "sc": "", "g": "", "g2": "", "g3": "", "nocat": "", "sort": ""}, "expansion": "Latin Fabianus"}, {"name": "bor+", "args": {"1": "pl", "2": "la", "3": "Fabianus"}, "expansion": "Borrowed from Latin Fabianus"}], 
"sounds": [{"ipa": "/ˈfa.bjan/"}, {"rhymes": "-abjan"}], "hyphenation": ["Fa‧bian"], "word": "Fabian", "lang": "Polish", "lang_code": "pl", "senses": [{"links": [["given name", "given name"], ["Fabian", "Fabian#English"]], "glosses": ["a male given name, equivalent to English Fabian"], "tags": ["masculine", "person"], "id": "Fabian-pl-name-VtvZQ6Yw", "categories": [{"name": "Polish given names", "kind": "topical", "parents": ["Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}, {"name": "Polish male given names", "kind": "topical", "parents": ["Male given names", "Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}]}]}
{"pos": "name", "head_templates": [{"name": "head", "args": {"1": "sv", "2": "proper noun", "head": "", "g": "c", "3": "genitive", "4": "Fabians"}, "expansion": "Fabian c (genitive Fabians)"}, {"name": "sv-proper noun", "args": {"1": "c"}, "expansion": "Fabian c (genitive Fabians)"}], "forms": [{"form": "Fabians", "tags": ["genitive"]}], "word": "Fabian", "lang": "Swedish", "lang_code": "sv", "senses": [{"links": [["given name", "given name"]], "glosses": ["a male given name"], "tags": ["common-gender"], "id": "Fabian-sv-name-h8YdwBAs", "categories": [{"name": "Swedish given names", "kind": "topical", "parents": ["Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}, {"name": "Swedish male given names", "kind": "topical", "parents": ["Male given names", "Given names", "Names", "All topics", "Proper nouns", "Terms by semantic function", "Fundamental", "Nouns", "Lemmas"], "source": "w"}]}]}

As a human I can see that Fabian, the first record in the file linked, goes from lines 2 to 7. Line 8 is a new record. But I can't work out a regex pattern that will allow me to extract the whole of records like Fabian, which are personal names.

Can you help?

Best answer:

Given that the input data is in JSON format, with one JSON object per line, it's best to parse it as such, using ConvertFrom-Json, which allows you to filter by the properties of the JSON objects using Where-Object:

# Assumes an input file named "names.json" in the current directory.
$personalNameObjects = 
  [System.IO.File]::ReadLines((Convert-Path -LiteralPath names.json)) | 
  ConvertFrom-Json | 
  Where-Object { $_.senses.links -match '(?:given |sur)name' }

$personalNameObjects now contains [pscustomobject] instances representing those input JSON objects whose .senses.links property values contain either given name or surname. The match is by substring, because there are variations, such as a plural s or a suffix such as #English. Further filtering, such as by entry type, may be needed.
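To see the filter at work on a small scale, here is a minimal sketch that runs it against three inline sample lines instead of the file. The records are heavily abbreviated, and the "Danube" entry is a hypothetical stand-in for a non-personal name. Each line is parsed on its own via ForEach-Object; that is an assumption made here so the sketch does not depend on how a given PowerShell version treats multiple input strings piped to ConvertFrom-Json.

```powershell
# Three abbreviated, hypothetical records in the same one-object-per-line layout.
$sampleLines = @(
  '{"word": "Fabian", "lang": "English", "senses": [{"links": [["given name", "given name"]]}, {"links": [["surname", "surname"]]}]}'
  '{"word": "Danube", "lang": "English", "senses": [{"links": [["river", "river"], ["Europe", "Europe"]]}]}'
  '{"word": "Fabian", "lang": "Polish", "senses": [{"links": [["given name", "given name"], ["Fabian", "Fabian#English"]]}]}'
)

# Parse each line as its own JSON document, then apply the same filter as above.
$sampleMatches = @(
  $sampleLines |
    ForEach-Object { $_ | ConvertFrom-Json } |
    Where-Object { $_.senses.links -match '(?:given |sur)name' }
)

# List the records that passed the filter as "word (language)".
$sampleMatches | ForEach-Object { '{0} ({1})' -f $_.word, $_.lang }
```

With these samples, the two Fabian entries should pass and the Danube entry should be dropped, because none of its link values contain "given name" or "surname".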

To get just the unique names themselves - assuming they're stored in the .word property - use:

$personalNameObjects | ForEach-Object word | Sort-Object -Unique

Note:

  • Given the size of the input file (almost 1 GB), [System.IO.File]::ReadLines() is used to improve reading performance; Get-Content -LiteralPath names.json works too, but would be noticeably slower.

    • Because .NET's working directory usually differs from PowerShell's, Convert-Path is used to pass the input file's full path.
  • If needed, you can later convert the filtered parsed-from-JSON objects back to JSON using ConvertTo-Json; be sure to use a sufficiently large -Depth argument to prevent inadvertent truncation (see this post for background).
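The truncation risk mentioned in the last point can be illustrated with a minimal sketch. It relies on ConvertTo-Json's default serialization depth of 2; the sample record is abbreviated and hypothetical, and the value 10 for -Depth is simply a generous choice for this record, not a recommendation for the full data.

```powershell
# One abbreviated, hypothetical record, parsed the same way as the real data.
$record = '{"word": "Fabian", "senses": [{"glosses": ["A surname."], "categories": [{"name": "English surnames"}]}]}' |
  ConvertFrom-Json

# Default depth: the nested "categories" data lies too deep and is flattened to a string.
# (Recent PowerShell versions may also emit a truncation warning here.)
$shallowJson = $record | ConvertTo-Json -Compress

# Explicit, generous depth: the nested structure is preserved.
$deepJson = $record | ConvertTo-Json -Compress -Depth 10

# Round-trip both results and compare what is left of the category name.
$shallowName = ($shallowJson | ConvertFrom-Json).senses.categories.name
$deepName    = ($deepJson | ConvertFrom-Json).senses.categories.name
"default depth: '$shallowName'"
"-Depth 10:     '$deepName'"
```

Only the -Depth 10 round trip should still yield the category name. Applying ConvertTo-Json -Compress to each object individually produces one JSON object per line, which matches the layout of the input file.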