I'm trying to parse some wikitext. Here's an example of the text I need to parse:
== title ==
=== subtopic ===
*text_1
**text dependent on text_1
**text_2 dependent on text_1
*text_2
**text dependent on text_2
=== other subtopic ===
*text_2
**text dependent on text_2
== other title ==
...
The structure here is not that complicated:
title: I believe there's at least one title in the whole document
subtopics: optional
elements: there has to be at least one per topic/subtopic
sub-elements: optional and can be repeated
In case sub-elements are repeated, I intend to join them using \ln.
What I want to do is parse this into dictionaries with the following structure:
{
"title": "title",
"subtopic": "subtopic",
"main_text": "text_1",
"sub_text": "text dependent on text_1 \ln text_2 dependent on text_1"
}
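To make the target concrete, here is a naive line-based sketch of the mapping I have in mind. It assumes only the simplified markup shown above (== and === headings, * and ** bullets) and would likely break on real wikitext with templates, links or deeper nesting. The names parse_outline, HEADING and sep are just placeholders I made up, and I'm assuming \ln simply means a newline separator.

```python
import re

# Matches "== title ==" (level 2) and "=== subtopic ===" (level 3) only.
HEADING = re.compile(r"^(={2,3})\s*(.*?)\s*\1\s*$")


def parse_outline(wikitext, sep="\n"):
    """Return one dict per top-level bullet (*), joining its ** children with sep."""
    records = []
    title = subtopic = None
    current = None  # the record that "**" lines currently attach to

    for raw in wikitext.splitlines():
        line = raw.strip()
        heading = HEADING.match(line)
        if heading:
            if len(heading.group(1)) == 2:
                # A new title resets the subtopic.
                title, subtopic = heading.group(2), None
            else:
                subtopic = heading.group(2)
            current = None
        elif line.startswith("**"):
            # Sub-element: only meaningful if an element precedes it.
            if current is not None:
                current["_subs"].append(line[2:].strip())
        elif line.startswith("*"):
            current = {
                "title": title,
                "subtopic": subtopic,
                "main_text": line[1:].strip(),
                "_subs": [],
            }
            records.append(current)

    # Collapse the collected sub-elements into a single string.
    for record in records:
        record["sub_text"] = sep.join(record.pop("_subs"))
    return records


if __name__ == "__main__":
    sample = """== title ==
=== subtopic ===
*text_1
**text dependent on text_1
**text_2 dependent on text_1
*text_2
**text dependent on text_2
"""
    for record in parse_outline(sample):
        print(record)
```

Elements without sub-elements end up with an empty "sub_text", and elements directly under a title get "subtopic": None.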
Do you know any pythonic way or ideas to parse this into what I want? I will really appreciate your time.
PS. Here's the complete file I'm trying to parse and extract the quotes from: Woody Allen
You said "quotes" but you linked Wikipedia. Did you mean Wikiquote?
Anyway, you must not parse wikitext yourself. Your aim is fulfilled by the parse API, which you can access with a Python client. For instance, here is the list of sections (i.e. quoted works) on his Wikiquote article, https://en.wikiquote.org/w/api.php?action=parse&page=Woody_Allen&prop=sections :