In Java, parse XML for only a few tags, or read data from a single specified tag in a huge XML file (5 GB)


How can I read a single tag, say balance, from a huge XML file (say 5 GB)? I don't need any other data from the XML. Is the StAX approach the right choice? Consider this sample XML:

<tag1>
<tag2>
<tag3>
<tag4>
.
.
.
.
.
.
.
<balance>12121</balance>
.
.
.
.
.
.
</tag4>
</tag3>
</tag2>
</tag1>

Thanks in advance.

There are 3 best solutions below

0
On

If you are processing huge XML files, you need to use a SAX parser instead of a DOM parser. Look at this tutorial and at the Oracle page.

DOM parsers read the whole document and build a tree structure in memory. The tree can be navigated much like a map of maps, so it is very easy to check whether an element exists, and so on. But that approach does not work well for very large files, because the whole document has to fit in memory.

SAX parsers, on the other hand, are event-driven: you implement callback methods that are invoked when the parser reads a start tag, then its text value, and so on. Parsing is a single pass over the input and does not use much memory.
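As a rough illustration of the callback style described above, here is a minimal sketch using the SAX parser that ships with the JDK. The class name SaxFirstTag, the method findFirst, and the trick of throwing a private exception to stop parsing early are my own choices for the example, not something from the question. It assumes the wanted element contains only text and that you want its first occurrence.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named for this example only.
public class SaxFirstTag {

    // Collects the text of the first element whose name matches "wanted".
    private static final class Handler extends DefaultHandler {
        private final String wanted;
        private final StringBuilder text = new StringBuilder();
        private boolean inside;
        String result; // stays null until the element has been fully read

        Handler(String wanted) {
            this.wanted = wanted;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (qName.equals(wanted)) {
                inside = true;
                text.setLength(0);
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times for one text node, so append.
            if (inside) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if (inside && qName.equals(wanted)) {
                result = text.toString();
                // Abort the parse so the rest of the file is never read.
                throw new SAXException("stop: element found");
            }
        }
    }

    /** Returns the text of the first tagName element, or null if there is none. */
    public static String findFirst(InputStream in, String tagName) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        Handler handler = new Handler(tagName);
        try {
            parser.parse(in, handler);
        } catch (SAXException e) {
            if (handler.result == null) {
                throw e; // a genuine parse error, not our early exit
            }
        }
        return handler.result;
    }

    public static void main(String[] args) throws Exception {
        // Small in-memory sample; for a real file pass a FileInputStream instead.
        String xml = "<tag1><tag2><balance>12121</balance></tag2></tag1>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(findFirst(in, "balance"));
    }
}
```

Memory use stays flat because only the text of the one matching element is buffered; everything else is discarded as the events go by.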

2
On

A lot depends on how easy it is to locate the element you are looking for. If you want the only (or first) element with the name "balance", then it's pretty easy using either SAX or StAX. (StAX is probably a bit easier, but don't use the StAX parser that comes with the JDK; use Woodstox instead.)
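For the simple "first element named balance" case, a StAX version might look roughly like the sketch below. It is written against the standard javax.xml.stream API only; the class name StaxFirstTag and the method findFirst are made up for the example. As far as I know Woodstox implements the same API, so with its jar on the classpath XMLInputFactory.newInstance() would normally pick it up without code changes, but treat that as an assumption to verify for your setup.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, named for this example only.
public class StaxFirstTag {

    /** Returns the text of the first tagName element, or null if there is none. */
    public static String findFirst(InputStream in, String tagName) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(in);
        try {
            // Pull events one at a time; nothing is kept in memory between events.
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(tagName)) {
                    // getElementText() expects a text-only element and
                    // throws XMLStreamException if it has child elements.
                    return reader.getElementText();
                }
            }
            return null;
        } finally {
            reader.close(); // closes the reader, not the underlying stream
        }
    }

    public static void main(String[] args) throws Exception {
        // Small in-memory sample; for a real file pass a FileInputStream instead.
        String xml = "<tag1><tag2><balance>12121</balance></tag2></tag1>";
        InputStream in = new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8));
        System.out.println(findFirst(in, "balance"));
    }
}
```

Because the caller drives the loop, stopping early is just a return statement, which is the main reason StAX tends to feel simpler than SAX for this kind of lookup.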

If it's harder to identify the element you want, then using an XSLT/XPath/XQuery engine with streaming capability would be better. For example, Saxon XQuery would allow you to write:

saxon:stream('big-file.xml')//balance[@account-nr='012345' and @date='2015-08-25']

But products that offer streaming tend to cost money.

0
On

Oracle's XQuery processor for Java can also stream in this case. Here is the section in the documentation about streaming: http://docs.oracle.com/database/121/ADXDK/adx_j_xqj.htm#ADXDK190

For example:

declare variable $mydata external;
$mydata//balance

You can also use fn:doc if you set up an entity resolver. For example:

doc("mydata.xml")//balance

This will also stream. See example 7-14: http://docs.oracle.com/database/121/ADXDK/adx_j_xqj.htm#ADXDK112