Parse huge XML file to get distinct values from child tags: need suggestions for the best approach

I have an XML file of the following form.

 <myData>
    <myElement>
            <myGroupID>ID1</myGroupID>
            <myGroupValue>value1</myGroupValue>
    </myElement>
    <myElement>
            <myGroupID>ID2</myGroupID>
            <myGroupValue>value2</myGroupValue>
    </myElement>
    <myElement>
            <myGroupID>ID3</myGroupID>
            <myGroupValue>value3</myGroupValue>
    </myElement>
    <myElement>
            <myGroupID>ID4</myGroupID>
            <myGroupValue>value4</myGroupValue>
    </myElement>
    <myElement>
            <myGroupID>ID1</myGroupID>
            <myGroupValue>value1</myGroupValue>
    </myElement>
    <myElement>
            <myGroupID>ID2</myGroupID>
            <myGroupValue>value2</myGroupValue>
    </myElement>
    <myElement>
            <myGroupID>ID3</myGroupID>
            <myGroupValue>value3</myGroupValue>
    </myElement>
    <myElement>
            <myGroupID>ID4</myGroupID>
            <myGroupValue>value4</myGroupValue>
    </myElement>
 </myData>

The total number of myElement tags in the file can be 2 to 4 million, and each element also contains other tags. As can be seen, the myGroupID and myGroupValue tags repeat the same values across different elements.

My requirement is to get distinct values of myGroupID and myGroupValue tags.

I was trying to use the StAX parser with the Iterator API (the event-based approach). As far as I understand, I have to go through all the tags, check whether event.getLocalName is myGroupID or myGroupValue, and if so, check whether the part of the file parsed so far already contains the same value as the current element.
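The approach described above can be sketched roughly as follows. This is only a minimal illustration, not a definitive implementation: the class name `StaxDistinctValues` and the method `distinctValues` are hypothetical, and the sketch assumes the target tags contain text only. A `Set` per tag name replaces the manual "have I seen this value already?" check.

```java
import java.io.InputStream;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, named only for this sketch.
final class StaxDistinctValues {

    // Collects the distinct text values of each requested tag in a single pass.
    static Map<String, Set<String>> distinctValues(InputStream in, Set<String> tagNames)
            throws XMLStreamException {
        Map<String, Set<String>> result = new HashMap<>();
        for (String tag : tagNames) {
            result.put(tag, new LinkedHashSet<>()); // a Set silently drops duplicates
        }
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    String name = event.asStartElement().getName().getLocalPart();
                    Set<String> values = result.get(name);
                    if (values != null) {
                        // Assumes a text-only element; reads up to its end tag.
                        values.add(reader.getElementText());
                    }
                }
            }
        } finally {
            reader.close();
        }
        return result;
    }
}
```

Every event is still visited, because a streaming parser has to read the whole document in order, but only the matching tags do any real work.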

But with this approach I'm unnecessarily iterating through the other tags (everything other than myGroupID and myGroupValue), which seems like a waste of time.

Any idea how we can directly jump to tags with specific names within an element?

I should mention that I had zero knowledge of StAX parsing until I got the chance to learn it today, and that I have to use Java for this parsing.

Thanks in advance for your kind suggestions.

Update:

Thanks everyone for your valuable inputs. As of now, I'm using the StAX Iterator API to address the requirement and it seems to be working pretty fast. Moreover, the memory used by the code is also acceptable at about 3 MB, whereas the file I'm parsing is 55 MB in size. So it's all going well for me.

Just one thing is still bothering me: the XML contains leading and trailing spaces and the '-' character. Any suggestions on how we can get rid of them when we're not parsing a file, but directly parsing the XML coming from the input stream of an HTTPConnection?

I do not have the choice of getting a better XML here [without the leading and trailing spaces and the '-' character], as the XML I'm receiving is actually the response of a service from another system and they are not ready to modify their code to address our system's requirements.
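One possible way to handle this on the receiving side, sketched under a loud assumption: the stray spaces and '-' characters sit at the edges of the text values, not inside the markup itself. If they appear in the markup (for example before the XML declaration), the stream would need preprocessing before it reaches the parser, and this sketch does not cover that. The class `ValueCleaner` and its method `clean` are hypothetical names.

```java
// Hypothetical helper, named only for this sketch.
final class ValueCleaner {

    // Strips whitespace and '-' characters from both ends of a text value.
    // Characters in the middle of the value (e.g. "ID-1") are left untouched.
    static String clean(String raw) {
        int start = 0;
        int end = raw.length();
        while (start < end && isJunk(raw.charAt(start))) {
            start++;
        }
        while (end > start && isJunk(raw.charAt(end - 1))) {
            end--;
        }
        return raw.substring(start, end);
    }

    // Assumption: only whitespace and '-' count as unwanted edge characters.
    private static boolean isJunk(char c) {
        return c == '-' || Character.isWhitespace(c);
    }
}
```

Each value would be passed through `clean` before being stored. Since the StAX and SAX parsers both accept an `InputStream`, the response stream can be handed to the parser directly without writing it to a file first.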

There is 1 best solution below

Why not use SAX? http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/

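@Override
public void startElement(String uri, String localName, String qName, Attributes attributes) {
    if (qName.equalsIgnoreCase("myElement")) {
        // A new <myElement> starts: set the flag and prepare a new element.
        inElement = true;
    } else if (qName.equalsIgnoreCase("myGroupID") && inElement) {
        // do stuff, e.g. start collecting the text of <myGroupID>
    } else if (qName.equalsIgnoreCase("myGroupValue") && inElement) {
        // do stuff, e.g. start collecting the text of <myGroupValue>
    }
}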

Similarly, in endElement(), when the closing tag of "myElement" is found, switch inElement back to false and store, or otherwise process, the groupId and groupValue taken from the current element. I think this is the best way to go, and it is pretty fast, comparable to or even faster than StAX, while still needing minimal memory.
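Put together, the handler described above might look roughly like the sketch below. It is an illustration under assumptions rather than the exact code intended here: the class name `DistinctGroupHandler`, its fields, and the `parse` helper are hypothetical, and it compares tag names case-sensitively with `equals` because XML names are case-sensitive.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, named only for this sketch.
final class DistinctGroupHandler extends DefaultHandler {
    final Set<String> groupIds = new LinkedHashSet<>();
    final Set<String> groupValues = new LinkedHashSet<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inElement;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        if (qName.equals("myElement")) {
            inElement = true;
        }
        text.setLength(0); // start collecting text for the tag that just opened
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver one text node in several chunks, so append.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("myElement")) {
            inElement = false;
        } else if (inElement && qName.equals("myGroupID")) {
            groupIds.add(text.toString()); // the Set keeps only distinct values
        } else if (inElement && qName.equals("myGroupValue")) {
            groupValues.add(text.toString());
        }
    }

    // Convenience entry point: parses the stream and returns the filled handler.
    static DistinctGroupHandler parse(InputStream in)
            throws ParserConfigurationException, SAXException, IOException {
        DistinctGroupHandler handler = new DistinctGroupHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(in, handler);
        return handler;
    }
}
```

After `parse` returns, `groupIds` and `groupValues` hold the distinct values in first-seen order. Memory use grows with the number of distinct values, not with the size of the file.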