I have an XML file of the following form:
<myData>
<myElement>
<myGroupID>ID1</myGroupID>
<myGroupValue>value1</myGroupValue>
</myElement>
<myElement>
<myGroupID>ID2</myGroupID>
<myGroupValue>value2</myGroupValue>
</myElement>
<myElement>
<myGroupID>ID3</myGroupID>
<myGroupValue>value3</myGroupValue>
</myElement>
<myElement>
<myGroupID>ID4</myGroupID>
<myGroupValue>value4</myGroupValue>
</myElement>
<myElement>
<myGroupID>ID1</myGroupID>
<myGroupValue>value1</myGroupValue>
</myElement>
<myElement>
<myGroupID>ID2</myGroupID>
<myGroupValue>value2</myGroupValue>
</myElement>
<myElement>
<myGroupID>ID3</myGroupID>
<myGroupValue>value3</myGroupValue>
</myElement>
<myElement>
<myGroupID>ID4</myGroupID>
<myGroupValue>value4</myGroupValue>
</myElement>
</myData>
The total number of myElement tags in the file can be 2-4 million, and each element contains other tags as well.
As can be seen, the myGroupID and myGroupValue tags have duplicate values across different elements.
My requirement is to get the distinct values of the myGroupID and myGroupValue tags.
I was trying to use the StAX parser with the Iterator API (the event-based approach). What I have learnt is that I'll have to go through all the tags to check whether event.getLocalName is myGroupID or myGroupValue, and if so, I'll have to apply my own logic to check whether the part of the file parsed so far already contains the same value as the current element.
But with this approach, I'm unnecessarily iterating through other tags (other than myGroupID and myGroupValue), which seems like a waste of time.
Any idea how we can jump directly to tags with specific names within an element?
I should mention that I had no prior knowledge of StAX parsing and only got the chance to learn it today, and that I have to use Java for this parsing.
Thanks in advance for your kind suggestions.
Update:
Thanks everyone for your valuable inputs. As of now, I'm using the StAX Iterator API to address the requirement, and it seems to be working pretty fast. Moreover, the memory used by the code is also acceptable (~3 MB), whereas the total size of the file I'm parsing is 55 MB. So it's all going well for me.
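For reference, the StAX iterator approach described above could be sketched roughly as follows. This is only an illustration under assumptions: the class name DistinctGroupsStax and the choice to keep IDs and values in two separate sets are hypothetical, and the element names are taken from the sample XML. With the Iterator API, the element name is read via getName().getLocalPart() on the start-element event.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: collects distinct myGroupID and myGroupValue texts.
class DistinctGroupsStax {
    // A Set drops duplicates; LinkedHashSet also keeps first-seen order.
    final Set<String> ids = new LinkedHashSet<>();
    final Set<String> values = new LinkedHashSet<>();

    static DistinctGroupsStax parse(InputStream in) {
        DistinctGroupsStax result = new DistinctGroupsStax();
        try {
            XMLEventReader reader =
                    XMLInputFactory.newInstance().createXMLEventReader(in);
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (!event.isStartElement()) {
                    continue; // every event is still visited, only cheaply skipped
                }
                String name = event.asStartElement().getName().getLocalPart();
                if ("myGroupID".equals(name)) {
                    // getElementText() reads up to the matching end tag
                    result.ids.add(reader.getElementText().trim());
                } else if ("myGroupValue".equals(name)) {
                    result.values.add(reader.getElementText().trim());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<myData>"
                + "<myElement><myGroupID>ID1</myGroupID><myGroupValue>value1</myGroupValue></myElement>"
                + "<myElement><myGroupID>ID1</myGroupID><myGroupValue>value1</myGroupValue></myElement>"
                + "</myData>";
        DistinctGroupsStax r =
                parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        System.out.println(r.ids + " " + r.values);
    }
}
```

The parser still walks through every tag, since a streaming parser cannot jump ahead, but the per-event work for uninteresting tags is a single name comparison, and the Set lookup replaces any manual scan of previously seen values.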
Just one thing is still bothering me: the XML contains leading and trailing spaces and the '-' character. Any suggestions on how we can get rid of them when we're not parsing a file, but directly parsing the XML coming from the input stream of an HTTPConnection?
I do not have the choice of getting a better XML here (without the leading and trailing spaces and the '-' character), as the XML I'm receiving is actually the response of a service from another system, and they are not ready to modify their code to address our system's requirements.
Why not use SAX? http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/
In startElement(), set an inElement flag to true when "myElement" opens. Analogously, in endElement(), when the closing tag of "myElement" is found, you should switch inElement to false and store (or otherwise process) the groupId and groupValue taken from the current element. This is the best way to go and is pretty fast, even faster than StAX, while still having minimal memory requirements.
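A minimal sketch of such a handler, under some assumptions: the class names DistinctGroupsSax and GroupHandler are hypothetical, the parser is the default non-namespace-aware one (so qName carries the tag name), and each distinct ID/value combination is stored as an "id=value" string purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class DistinctGroupsSax {

    // Hypothetical handler: tracks whether we are inside a myElement.
    static class GroupHandler extends DefaultHandler {
        final Set<String> distinctPairs = new LinkedHashSet<>();
        private final StringBuilder text = new StringBuilder();
        private boolean inElement = false;
        private String groupId;
        private String groupValue;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if ("myElement".equals(qName)) {
                inElement = true;
                groupId = null;
                groupValue = null;
            }
            text.setLength(0); // start collecting text for this tag
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per text node, so append
            if (inElement) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (!inElement) {
                return;
            }
            if ("myGroupID".equals(qName)) {
                groupId = text.toString().trim();
            } else if ("myGroupValue".equals(qName)) {
                groupValue = text.toString().trim();
            } else if ("myElement".equals(qName)) {
                inElement = false;
                if (groupId != null) {
                    // Assumed storage format; the Set drops duplicates
                    distinctPairs.add(groupId + "=" + groupValue);
                }
            }
        }
    }

    static Set<String> distinctPairs(InputStream in) {
        GroupHandler handler = new GroupHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        } catch (Exception e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return handler.distinctPairs;
    }

    public static void main(String[] args) {
        String xml = "<myData>"
                + "<myElement><myGroupID>ID1</myGroupID><myGroupValue>value1</myGroupValue></myElement>"
                + "<myElement><myGroupID>ID1</myGroupID><myGroupValue>value1</myGroupValue></myElement>"
                + "</myData>";
        System.out.println(distinctPairs(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

Since distinctPairs takes any InputStream, the stream obtained from an HTTP connection can be passed in directly without first writing the response to a file.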