Faster XPath expressions to execute queries from multiple XMLs


I have the following two XML documents, and the problem statement is as follows.

  1. Parse XML 1. If any node_x has a child element with 'a' in its name (as in value_a_0) and that child contains a specific number, parse XML 2, go to node_x-1 in every abc_x, and compare the contents of value_x-1_0/1/2/3 with certain entities.

  2. If any node_x has a child element with 'b' in its name (as in value_b_0) and that child contains a specific number (say 'm'), parse XML 2, go to node_x+1 in every abc_x, and compare the contents of value_x-1_0/1/2/3 with 'm'.

Example: for every value_a_0 in record1, check whether it contains 5. Where it does, which is the case for node_1 and node_9, go to record2/node_0 and record2/node_8 and check whether the contents of value_0_0/1/2/3 contain 5. The remaining cases work similarly.

I was wondering what the best practice for solving this would be. Is there a hash-table approach in XPath 3.0?

First XML

<record1>
    <node_1>
        <value_a_0>5</value_a_0>
        <value_b_1>0</value_b_1>
        <value_c_2>10</value_c_2>
        <value_d_3>8</value_d_3>
    </node_1>
   .................................
   .................................

    <node_9>
        <value_a_0>5</value_a_0>
        <value_b_1>99</value_b_1>
        <value_c_2>53</value_c_2>
        <value_d_3>5</value_d_3>
  </node_9>
</record1>

Second XML

<record2>
  <abc_0>
        <node_0>
            <value_0_0>5</value_0_0>
            <value_0_1>0</value_0_1>
            <value_0_2>150</value_0_2>
            <value_0_3>81</value_0_3>
        </node_0>
        <node_1>
            <value_1_0>55</value_1_0>
            <value_1_1>30</value_1_1>
            <value_1_2>150</value_1_2>
            <value_1_3>81</value_1_3>
        </node_1>
       .................................
       .................................

        <node_63>
            <value_63_0>1</value_63_0>
            <value_63_1>99</value_63_1>
            <value_63_2>53</value_63_2>
            <value_63_3>5</value_63_3>
      </node_63>
   </abc_0>
   ================================================
   <abc_99>
        <node_0>
            <value_0_0>555</value_0_0>
            <value_0_1>1810</value_0_1>
            <value_0_2>140</value_0_2>
            <value_0_3>80</value_0_3>
        </node_0>            
        <node_1>
            <value_1_0>555</value_1_0>
            <value_1_1>1810</value_1_1>
            <value_1_2>140</value_1_2>
            <value_1_3>80</value_1_3>
        </node_1>
        <node_2>
            <value_2_0>5</value_2_0>
            <value_2_1>60</value_2_1>
            <value_2_2>10</value_2_2>
            <value_2_3>83</value_2_3>
        </node_2>
       .................................
       .................................

        <node_63>
            <value_63_0>1</value_63_0>
            <value_63_1>49</value_63_1>
            <value_63_2>23</value_63_2>
            <value_63_3>35</value_63_3>
       </node_63>
   </abc_99>
</record2>

Martin Honnen (best answer)

This looks like a task that can be partially solved by grouping. As in your previous examples, though, the element names are poorly chosen: they differ only by index values that should be part of an element or attribute value rather than of the element name, which makes it harder to write succinct code:

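(: $doc2 is assumed to be bound to the second document (record2) :)
let $abc-elements := $doc2/record2/*
for $node-element in record1/*
for $index in (1 to count($node-element[1]/*))
for $index-element in $node-element/*[position() = $index]
(: group the value elements by child position and by their content :)
group by $index, $group-value := $index-element
(: keep only groups with more than one member :)
where tail($index-element)
return
    <group index="{$index}" value="{$group-value}">
    {
        (: numeric suffix of each parent node_N minus 1, e.g. node_9 gives "8" :)
        let $suffixes := $index-element/../string((xs:integer(substring-after(local-name(), '_')) - 1)),
            $relevant-abc-node-elements := $abc-elements/*[substring-after(local-name(), '_') = $suffixes]
        (: of those nodes, return the ones with a child equal to the group value :)
        return $relevant-abc-node-elements[* = $group-value]
    }
    </group>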

https://xqueryfiddle.liberty-development.net/nbUY4kA

Michael Kay

First I would say that using structured element names like this is pretty poor XML design. That's relevant because when you do a join query in XPath or XQuery you're very dependent on the optimizer to find a fast execution path (e.g. a hash join), and the "weirder" your query is, the less likely the optimizer is to find a fast execution strategy.

I often start by converting "weird" XML into something more sanitary. For example, in this case I would transform <value_a_0>5</value_a_0> into <value cat="a" seq="0">5</value>. That makes it easier to write your query and easier for the optimizer to recognize it, and the transformation phase is reusable, so you can apply it before any operation on the XML, not just this one.
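
A normalization step along these lines could be sketched in XQuery 3.1 as follows. This is only an illustration, not code from the answer: the function name local:normalize and the n attribute on node are assumptions, and the sketch targets the value_<cat>_<seq> naming used in record1 (record2 puts the node index in that position instead).

```xquery
xquery version "3.1";

(: Hypothetical helper: rewrites <node_1> as <node n="1"> and
   <value_a_0>5</value_a_0> as <value cat="a" seq="0">5</value>.
   The attribute names n, cat and seq are illustrative choices. :)
declare function local:normalize($record as element()) as element() {
  element { node-name($record) } {
    for $node in $record/*
    return
      <node n="{ substring-after(local-name($node), '_') }">{
        for $value in $node/*
        (: 'value_a_0' splits into ('value', 'a', '0') :)
        let $parts := tokenize(local-name($value), '_')
        return
          <value cat="{ $parts[2] }" seq="{ $parts[3] }">{ string($value) }</value>
      }</node>
  }
};

(: Small inline sample standing in for the first document :)
local:normalize(
  <record1>
    <node_1>
      <value_a_0>5</value_a_0>
      <value_b_1>0</value_b_1>
    </node_1>
  </record1>
)
```

With the category and indexes exposed as attributes, the condition "child with 'a' in its name contains 5" becomes a plain predicate such as node[value[@cat = 'a'] = 5], with no string manipulation of element names.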

If you're looking for better than O(n*m) performance on a join query, you need to look at the capabilities of your chosen XPath engine. Saxon-EE for example will do such optimizations, Saxon-HE won't. You're generally more likely to find advanced optimization in an XQuery engine than an XPath engine.
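
Where the engine does not optimize the join, one option is to build the lookup table explicitly with a map. Maps belong to XPath/XQuery 3.1 rather than 3.0, so this needs a 3.1 processor. The sketch below rests on several assumptions: tiny inline samples stand in for the two documents, all variable names are hypothetical, and only the 'a' rule (go to node_x-1) with the number 5 is shown.

```xquery
xquery version "3.1";

(: Inline stand-ins for the two documents; real data would come
   from doc() or external variables. :)
declare variable $record1 :=
  <record1>
    <node_1><value_a_0>5</value_a_0></node_1>
    <node_9><value_a_0>5</value_a_0></node_9>
  </record1>;

declare variable $record2 :=
  <record2>
    <abc_0>
      <node_0><value_0_0>5</value_0_0></node_0>
      <node_8><value_8_0>7</value_8_0></node_8>
    </abc_0>
  </record2>;

(: Build the index once: numeric suffix N maps to every node_N
   element across all abc_x. Entries with the same key are combined
   into one sequence. :)
declare variable $by-suffix as map(*) :=
  map:merge(
    for $n in $record2/*/*
    return map:entry(xs:integer(substring-after(local-name($n), '_')), $n),
    map { 'duplicates': 'combine' }
  );

(: For each node_x whose value_a_* child equals 5, look up node_(x-1)
   in the map instead of scanning record2 again. :)
for $node in $record1/*[*[starts-with(local-name(), 'value_a_')] = 5]
let $target := xs:integer(substring-after(local-name($node), '_')) - 1
for $candidate in $by-suffix($target)
return
  <match from="{ local-name($node) }"
         abc="{ local-name($candidate/..) }"
         node="{ local-name($candidate) }"
         contains-5="{ $candidate/* = 5 }"/>
```

The map is built in a single pass over record2, so each lookup from record1 no longer depends on the size of record2. For the inline sample, node_1 leads to node_0 (which contains 5) and node_9 leads to node_8 (which does not).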

As for the detail of your query, I got lost in the requirement statement where you start talking about abc_x. I'm not sure what that refers to.