Parse elements and sub-elements from wikitext with Python 3

I'm trying to parse some wikitext. Here's an example of the text I need to parse:

== title ==
=== subtopic ===
*text_1
**text dependent on text_1
**text_2 dependent on text_1
*text_2
**text dependent on text_2
=== other subtopic ===
*text_2
**text dependent on text_2
== other title ==
...

The structure here is not that complicated:
title: I believe there's at least one title in the whole document
subtopics: optional
elements: there has to be at least one per topic/subtopic
sub-elements: optional and can be repeated

In case sub-elements are repeated I intend to unify them using \ln.

What I want is to parse this into dictionaries with the following structure:

{
"title": "title",
"subtopic": "subtopic",
"main_text": "text_1",
"sub_text": "text dependent on text_1 \ln text_2 dependent on text_1"}

Do you know a Pythonic way, or have any ideas, to parse this into what I want? I'd really appreciate your help.

PS. Here's the complete file I'm trying to parse and extract the quotes from: Woody Allen

There is 1 best solution below

You said "quotes" but you linked Wikipedia. Did you mean Wikiquote?

Anyway, you shouldn't need to parse wikitext yourself. The MediaWiki parse API already does what you're after, and you can access it with a Python client.

For instance, here is the list of sections (i.e. quoted works) on his Wikiquote article, from https://en.wikiquote.org/w/api.php?action=parse&page=Woody_Allen&prop=sections :

{
    "parse": {
        "title": "Woody Allen",
        "pageid": 80,
        "sections": [
            {
                "toclevel": 1,
                "level": "2",
                "line": "Quotes",
                "number": "1",
                "index": "1",
                "fromtitle": "Woody_Allen",
                "byteoffset": 657,
                "anchor": "Quotes"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Getting Even</i> (1971)",
                "number": "1.1",
                "index": "2",
                "fromtitle": "Woody_Allen",
                "byteoffset": 11322,
                "anchor": "Getting_Even_.281971.29"
            },
            {
                "toclevel": 3,
                "level": "4",
                "line": "<i>My Philosophy</i>",
                "number": "1.1.1",
                "index": "3",
                "fromtitle": "Woody_Allen",
                "byteoffset": 11471,
                "anchor": "My_Philosophy"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Everything You Always Wanted to Know About Sex* (*But Were Afraid to Ask)</i> (1972)",
                "number": "1.2",
                "index": "4",
                "fromtitle": "Woody_Allen",
                "byteoffset": 11814,
                "anchor": "Everything_You_Always_Wanted_to_Know_About_Sex.2A_.28.2ABut_Were_Afraid_to_Ask.29_.281972.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Sleeper</i> (1973)",
                "number": "1.3",
                "index": "5",
                "fromtitle": "Woody_Allen",
                "byteoffset": 12364,
                "anchor": "Sleeper_.281973.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Love and Death</i> (1975)",
                "number": "1.4",
                "index": "6",
                "fromtitle": "Woody_Allen",
                "byteoffset": 12858,
                "anchor": "Love_and_Death_.281975.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Without Feathers</i> (1975)",
                "number": "1.5",
                "index": "7",
                "fromtitle": "Woody_Allen",
                "byteoffset": 14090,
                "anchor": "Without_Feathers_.281975.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Annie Hall</i> (1977)",
                "number": "1.6",
                "index": "8",
                "fromtitle": "Woody_Allen",
                "byteoffset": 16485,
                "anchor": "Annie_Hall_.281977.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Side Effects</i> (1980)",
                "number": "1.7",
                "index": "9",
                "fromtitle": "Woody_Allen",
                "byteoffset": 16899,
                "anchor": "Side_Effects_.281980.29"
            },
            {
                "toclevel": 3,
                "level": "4",
                "line": "My Apology",
                "number": "1.7.1",
                "index": "10",
                "fromtitle": "Woody_Allen",
                "byteoffset": 17529,
                "anchor": "My_Apology"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Manhattan Murder Mystery</i> (1993)",
                "number": "1.8",
                "index": "11",
                "fromtitle": "Woody_Allen",
                "byteoffset": 18579,
                "anchor": "Manhattan_Murder_Mystery_.281993.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Don't Drink the Water</i> (1994)",
                "number": "1.9",
                "index": "12",
                "fromtitle": "Woody_Allen",
                "byteoffset": 18960,
                "anchor": "Don.27t_Drink_the_Water_.281994.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Deconstructing Harry</i> (1997)",
                "number": "1.10",
                "index": "13",
                "fromtitle": "Woody_Allen",
                "byteoffset": 19228,
                "anchor": "Deconstructing_Harry_.281997.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Standup Comic</i> (1999)",
                "number": "1.11",
                "index": "14",
                "fromtitle": "Woody_Allen",
                "byteoffset": 21289,
                "anchor": "Standup_Comic_.281999.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Mere Anarchy</i> (2007)",
                "number": "1.12",
                "index": "15",
                "fromtitle": "Woody_Allen",
                "byteoffset": 22463,
                "anchor": "Mere_Anarchy_.282007.29"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Attributed",
                "number": "2",
                "index": "16",
                "fromtitle": "Woody_Allen",
                "byteoffset": 24181,
                "anchor": "Attributed"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Others",
                "number": "3",
                "index": "17",
                "fromtitle": "Woody_Allen",
                "byteoffset": 25045,
                "anchor": "Others"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Quotes about Allen",
                "number": "4",
                "index": "18",
                "fromtitle": "Woody_Allen",
                "byteoffset": 27525,
                "anchor": "Quotes_about_Allen"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "External links",
                "number": "5",
                "index": "19",
                "fromtitle": "Woody_Allen",
                "byteoffset": 29106,
                "anchor": "External_links"
            }
        ]
    }
}
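
As a rough sketch of how you could consume that from Python, the snippet below fetches the same section list using only the standard library and turns each entry into its title/subtopic path, based on the toclevel field. The function names (fetch_sections, section_paths), the User-Agent string and the added format=json parameter are my own choices for illustration, not something the API requires you to use as written.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikiquote.org/w/api.php"


def fetch_sections(page):
    """Return the "sections" list of action=parse for the given page title."""
    query = urlencode({
        "action": "parse",
        "page": page,
        "prop": "sections",
        "format": "json",  # assumption: ask for plain JSON explicitly
    })
    # Placeholder User-Agent; replace it with something describing your script.
    request = Request(f"{API_URL}?{query}",
                      headers={"User-Agent": "wikiquote-sections-example/0.1"})
    with urlopen(request) as response:
        return json.load(response)["parse"]["sections"]


def section_paths(sections):
    """Yield (index, path) pairs, where path lists the headings from the
    top-level title down to the section itself."""
    stack = []
    for section in sections:
        depth = section["toclevel"]
        # Drop headings at the same or a deeper level than this one.
        del stack[depth - 1:]
        # The "line" field may contain inline HTML such as <i>...</i>.
        stack.append(re.sub(r"<[^>]+>", "", section["line"]))
        yield section["index"], list(stack)


if __name__ == "__main__":
    for index, path in section_paths(fetch_sections("Woody_Allen")):
        print(index, " / ".join(path))
```

Each index can then be passed back to action=parse through its section parameter to request only that section's content, so the title and subtopic of every quote are known before you look at the list items.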