I have a 5.5 GB XML file that needs to be parsed and inserted into MongoDB. The XML contains several different classes (groups). Sample groups:
<parent class="class1">
<p key="value"/>
</parent>
<parent class="class2">
<p key="value"/>
<p key1="value1"/>
<p key2="value2"/>
</parent>
Elements of the same class are scattered throughout the file (not sequential or grouped together). Each class should be inserted into a separate MongoDB collection.
Current solution:
- I used the lxml library from Python to parse the XML file, since it is memory-efficient.
- I loop through the entire ~5 GB file and group the elements by their unique class.
- The consolidated data per class is stored in a separate collection in MongoDB
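A minimal sketch of the parse-and-group steps above, not the exact code in use. It assumes the `parent` elements sit under a single root element (named `root` here purely for illustration) and uses the standard library's `xml.etree.ElementTree.iterparse` so that it runs without extra dependencies; `lxml.etree.iterparse` accepts the same `events` argument. The function name `group_by_class` and the `"p"` key for the children are hypothetical choices.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def group_by_class(source):
    """Stream `source` and return {class_name: [document, ...]}."""
    groups = defaultdict(list)
    # "end" events fire once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "parent":
            continue  # <p> children are read when their parent closes
        doc = dict(elem.attrib)  # parent attributes, e.g. {"class": "class1"}
        doc["p"] = [dict(child.attrib) for child in elem.findall("p")]
        groups[elem.get("class")].append(doc)
        elem.clear()  # drop the children so the tree does not keep growing
    return dict(groups)


# Small in-memory stand-in for the real 5.5 GB file (assumed root wrapper).
sample = b"""<root>
<parent class="class1">
<p key="value"/>
</parent>
<parent class="class2">
<p key="value"/>
<p key1="value1"/>
<p key2="value2"/>
</parent>
</root>"""

grouped = group_by_class(io.BytesIO(sample))
```

Each list in `grouped` could then be handed to PyMongo with something like `db[class_name].insert_many(docs)`. Note that this sketch still keeps every document in memory until the end of the file.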
The whole process takes approximately 1.5 hours: parsing takes ~1 hour 23 minutes and the MongoDB insertion ~7 minutes.
I want each class to have its own collection in MongoDB. The data I want is the parent's attributes (e.g. class='class1') plus its child elements (p), combined into one document per group (parent + children).
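Since the classes are interleaved in the file, one possible way to route each document to its class's collection without holding everything in memory is to buffer per class and flush in batches. The sketch below is an assumption-laden illustration: `BatchedWriter`, `BATCH_SIZE`, and the document layout are hypothetical, and `db` is expected to behave like a PyMongo `Database` (indexing by name yields a collection that has `insert_many`).

```python
from collections import defaultdict

BATCH_SIZE = 1000  # hypothetical value; tune for the real workload


class BatchedWriter:
    """Buffer documents per class and write them in batches."""

    def __init__(self, db, batch_size=BATCH_SIZE):
        self.db = db  # assumed to be a pymongo Database or a look-alike
        self.batch_size = batch_size
        self.buffers = defaultdict(list)

    def add(self, class_name, document):
        buf = self.buffers[class_name]
        buf.append(document)
        if len(buf) >= self.batch_size:
            self._flush(class_name)

    def _flush(self, class_name):
        buf = self.buffers[class_name]
        if buf:
            # One collection per class, named after the class attribute.
            self.db[class_name].insert_many(buf, ordered=False)
            self.buffers[class_name] = []

    def close(self):
        # Write whatever is left in every buffer at the end of the file.
        for class_name in list(self.buffers):
            self._flush(class_name)
```

With a real connection this might be used as `writer = BatchedWriter(MongoClient()["mydb"])`, calling `writer.add(doc["class"], doc)` for each parsed group and `writer.close()` once parsing finishes; the database name `mydb` is a placeholder.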
Are there any in-memory libraries that I can use to speed up the overall processing time?