How can I convert this XML file at this address into a pandas dataframe?
I have downloaded the XML as a file and called it '058com.xml'
and run the code below, though the last column of the resulting dataframe is a mess of data arranged as multiple OrderedDict. The XML structure seems complex and is beyond my knowledge.
json_normalize
documentation left me confused.
How can I improve the code to fully flatten the XML ?
import pandas as pd
import xmltodict
rawdata = '058com.xml'
with open(rawdata) as fd:
doc = xmltodict.parse(fd.read(), encoding='ISO-8859-1', process_namespaces=False)
pd.json_normalize(doc['Election']['Departement']['Communes']['Commune'])
Ideally the dataframe should look like ID's, names for geographic entities and vote results and names of election candidates.
The final dataframe should contain a lot of columns when fully flatten and is expected to be very close of the CSV below. I pasted the headers and the first line in the form of a .csv
(semi-colon separated) as a resentative sample of what the dataframe should look like
Code du département;Libellé du département;Code de la commune;Libellé de la commune;Etat saisie;Inscrits;Abstentions;% Abs/Ins;Votants;% Vot/Ins;Blancs;% Blancs/Ins;% Blancs/Vot;Nuls;% Nuls/Ins;% Nuls/Vot;Exprimés;% Exp/Ins;% Exp/Vot;N°Panneau;Sexe;Nom;Prénom;Voix;% Voix/Ins;% Voix/Exp
01;Ain;001;L'Abergement-Clémenciat;Complet;645;108;16,74;537;83,26;16;2,48;2,98;1;0,16;0,19;520;80,62;96,83;1;F;ARTHAUD;Nathalie;3;0,47;0,58;2;M;ROUSSEL;Fabien;6;0,93;1,15;3;M;MACRON;Emmanuel;150;23,26;28,85;4;M;LASSALLE;Jean;18;2,79;3,46;5;F;LE PEN;Marine;149;23,10;28,65;6;M;ZEMMOUR;Éric;43;6,67;8,27;7;M;MÉLENCHON;Jean-Luc;66;10,23;12,69;8;F;HIDALGO;Anne;5;0,78;0,96;9;M;JADOT;Yannick;30;4,65;5,77;10;F;PÉCRESSE;Valérie;26;4,03;5,00;11;M;POUTOU;Philippe;3;0,47;0,58;12;M;DUPONT-AIGNAN;Nicolas;21;3,26;4,04
Since the URL really contains two data sections under each
<Tour>
, specifically<Mentions>
(which appear to be aggregate vote data) and<Candidats>
(which are granular person-level data) (pardon my French), consider building two separate data frames using the new IO method,pandas.read_xml
, which supports XSLT 1.0 (via the third-partylxml
package). No migration to dictionaries for JSON handling.As a special purpose language written in XML, XSLT can transform your nested structure to flatter format for migration to data frame. Specifically, each stylesheet drills down to the most granular node and then by the
ancestor
axis pulls higher level information as sibling columns.Mentions (save as .xsl, a special .xml file or embed as string in Python)
Python (read directly from URL)
Output
Candidats (save as .xsl, a special .xml file or embed as string in Python)
Python (read directly from URL)
Output
You can join resulting data frames using their shared
Communes
nodes:<CodSubCom>
and<LibSubCom>
but may have topivot_table
on the aggregate data for a one-to-many merge. Below demonstrates with Nombre aggregate:In forthcoming pandas 1.5,
read_xml
will supportdtypes
to allow conversion after XSLT transformation in this case.