I have large XML files ("ONIX" standard) I'd like to split. Basic structure is:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE ONIXmessage SYSTEM "http://www.editeur.org/onix/2.1/short/onix-international.dtd">
<!-- DOCTYPE is not always present and might look differently -->
<ONIXmessage> <!-- sometimes with an attribute -->
<header>
...
</header> <!-- up to this line every out-file should be identical to source -->
<product> ... </product>
<product> ... </product>
...
<product> ... </product>
</ONIXmessage>
What I want to do is split this file into n smaller files of approximately the same size. For this I'd count the number of <product> nodes, divide that count by n, and clone the nodes into n new XML files. I have searched a lot, and this task seems to be harder than I thought.
- What I have not been able to solve so far is how to create a new XML document with an identical XML declaration, doctype, root element and <header> node, but without the <product>s. I could do this with regex, but I'd rather use XML tools.
- What would be the smartest way to transfer a number of <product> nodes to a new XML document? Object notation, like $xml.ONIXmessage.product | % { copy... }, XPath() queries (can you select n nodes with XPath()?) and CloneNode(), or XMLReader/XMLWriter?
- The content of the nodes should be identical regarding formatting and encoding. How can this be ensured?
I'd be very grateful for some nudges in the right direction!
One way is to:
Example:
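A minimal sketch of one straightforward way to do this in PowerShell, not necessarily the script this answer originally showed. For every output file it reloads the source and deletes the <product> nodes that do not belong in that file. The function name Split-XmlByReload and its parameters are hypothetical. It assumes that the <product> elements are direct children of the root, that both paths are full paths, and that the DTD should not be downloaded. XmlDocument may write the DOCTYPE back with an empty internal subset ("[]"), so it is worth checking the output on a sample file.

```powershell
# Hypothetical helper: name and parameters are illustrative only.
function Split-XmlByReload {
    param(
        [string]$Path,          # full path to the source file
        [int]$Count,            # number of output files (n)
        [string]$OutputPrefix   # full path prefix; files become <prefix>1.xml, <prefix>2.xml, ...
    )

    # local-name() keeps the query working if the root declares a default namespace
    $xpath = "*[local-name()='product']"

    for ($i = 0; $i -lt $Count; $i++) {
        $doc = New-Object System.Xml.XmlDocument
        $doc.PreserveWhitespace = $true   # keep whitespace nodes exactly as in the source
        $doc.XmlResolver = $null          # do not fetch the DTD named in the DOCTYPE
        $doc.Load($Path)                  # one full read per output file

        $root = $doc.DocumentElement
        $products = @($root.SelectNodes($xpath))   # snapshot, safe to modify the tree afterwards
        $perFile = [int][math]::Ceiling($products.Count / $Count)
        $first = $i * $perFile
        $last = $first + $perFile - 1
        if ($first -ge $products.Count) { break }  # nothing left for this file

        # Delete every product outside this file's range, plus the whitespace after it
        for ($k = 0; $k -lt $products.Count; $k++) {
            if ($k -lt $first -or $k -gt $last) {
                $next = $products[$k].NextSibling
                if ($next -is [System.Xml.XmlWhitespace]) { [void]$root.RemoveChild($next) }
                [void]$root.RemoveChild($products[$k])
            }
        }

        $doc.Save("$OutputPrefix$($i + 1).xml")
    }
}
```

Everything that is not a <product> (XML declaration, DOCTYPE, root element with its attributes, <header>) is never touched, so it ends up in every output file. The cost is one full read and many deletions per output file.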
UPDATE: Since performance was important here, I created a new version of the script that uses a foreach loop and an XML template for the copies, which removes 99% of the read and delete operations. The concept is still the same, but it is executed in a different way.
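The template idea can be sketched roughly like this; it is an illustration under stated assumptions, not the exact script that was benchmarked. The source is read once, all <product> nodes are detached once to leave a template, and a single foreach loop imports each product into a fresh copy of that template. The function name Split-XmlByTemplate and its parameters are hypothetical. It assumes that <product> elements are direct children of the root and that both paths are full paths. As with any XmlDocument round trip, the DOCTYPE may be written back with an empty internal subset ("[]"), so compare the output against a sample.

```powershell
# Hypothetical helper: name and parameters are illustrative only.
function Split-XmlByTemplate {
    param(
        [string]$Path,          # full path to the source file
        [int]$Count,            # number of output files (n)
        [string]$OutputPrefix   # full path prefix; files become <prefix>1.xml, <prefix>2.xml, ...
    )

    $xml = New-Object System.Xml.XmlDocument
    $xml.PreserveWhitespace = $true   # keep whitespace inside nodes as it is in the source
    $xml.XmlResolver = $null          # do not fetch the DTD named in the DOCTYPE
    $xml.Load($Path)                  # the only read of the source file

    $root = $xml.DocumentElement
    $products = @($root.SelectNodes("*[local-name()='product']"))
    if ($products.Count -eq 0) { return }
    $perFile = [int][math]::Ceiling($products.Count / $Count)

    # Detach every product (and the whitespace after it) once; the rest is the template
    foreach ($p in $products) {
        $next = $p.NextSibling
        if ($next -is [System.Xml.XmlWhitespace]) { [void]$root.RemoveChild($next) }
        [void]$root.RemoveChild($p)
    }
    $templateXml = $xml.OuterXml   # declaration, DOCTYPE, root and header as one string

    $doc = $null
    $fileNo = 0
    $inFile = 0
    foreach ($p in $products) {
        if ($null -eq $doc) {
            # Start a new output document from the template string
            $doc = New-Object System.Xml.XmlDocument
            $doc.PreserveWhitespace = $true
            $doc.XmlResolver = $null
            $doc.LoadXml($templateXml)
            $fileNo++
        }
        # ImportNode makes a deep copy owned by the target document
        [void]$doc.DocumentElement.AppendChild($doc.ImportNode($p, $true))
        [void]$doc.DocumentElement.AppendChild($doc.CreateWhitespace("`n"))
        $inFile++
        if ($inFile -eq $perFile) {
            $doc.Save("$OutputPrefix$fileNo.xml")
            $doc = $null
            $inFile = 0
        }
    }
    # Write the last, possibly shorter, file
    if ($null -ne $doc) { $doc.Save("$OutputPrefix$fileNo.xml") }
}
```

Each product is removed from the source exactly once and imported exactly once, so the work grows with the number of products rather than with products times output files.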
Benchmark: