XSLT streaming with xsl:iterate correct way


I wanted to process a 161 MB database, but Saxon 9 HE on Java ran out of memory at 300 MB of RAM and the .NET version at 1700 MB, so I need to use streaming. I am trying the XMLSpy demo, but I still don't understand the child/parent logic of the XPath expressions. I am on Windows XP SP3 32-bit with 4 GB of RAM.

    <xsl:iterate select="db_entry">
        <xsl:apply-templates select="db_entry"/>
    </xsl:iterate>

What is the correct way to stream this with xsl:iterate, or is xsl:for-each sufficient? There are nearly 60000 entries in this database. I mean, how do I write this correctly, given that a db_entry within a db_entry does not work?

EDIT:

<xsl:template match="databank_export">
<xsl:iterate select="db_entry">
    <xsl:apply-templates select="public_data"/>
    <xsl:text> |</xsl:text>
    <xsl:apply-templates select="text_data"/>
    <xsl:text> |</xsl:text>
    <xsl:apply-templates select="research_data"/>
    <xsl:text>&#10;</xsl:text>
</xsl:iterate>
</xsl:template>

I replaced the db_entry xsl:template with xsl:iterate, but then XMLSpy can't load the big file, so it appears that streaming doesn't work. Am I doing it right, or is this a limitation of the program or of the demo?

2nd EDIT: Here is my entire XSLT code:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:math="http://www.w3.org/2005/xpath-functions/math" xmlns:array="http://www.w3.org/2005/xpath-functions/array" xmlns:map="http://www.w3.org/2005/xpath-functions/map" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:err="http://www.w3.org/2005/xqt-errors" exclude-result-prefixes="array fn map math xhtml xs err" version="3.0">
    <xsl:output method="text" encoding="UTF-8" indent="yes"/>
    <xsl:mode streamable="yes"/>
    <!--
    <xsl:template match="databank_export">
    -->
    <xsl:template match="/">
        <xsl:apply-templates select="databank_export/copy-of(db_entry)" mode="entry"/>
    </xsl:template>
    <xsl:template match="db_entry" mode="entry">
        <xsl:value-of select="public_data, text_data, research_data" separator=" |"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:template>
    <xsl:template match="public_data">
        <xsl:value-of select="sflname"/>
        <xsl:text>; </xsl:text>
        <xsl:apply-templates select="bdata"/>
        <xsl:text>; </xsl:text>
        <xsl:value-of select="gender"/>
        <xsl:text>; PHOTO : |</xsl:text>
        <xsl:value-of select="name, gender, rating, datatype/@sdatatype, datatype/@sdatasource, bdata/sbdate, bdata/sbdate/@ccalendar" separator=" - "/>
        <xsl:text>|</xsl:text>
        <xsl:value-of select="bdata/sbtime, bdata/sbtime/@sbtime_ampm, bdata/sbtime/@ctimetype, bdata/sbtime/@stimetype, bdata/sbtime/@stmerid, bdata/sbtime/@ctzauto, bdata/sbtime/@jd_ut, bdata/sbtime/@sznabbr, bdata/sbtime/@time_unknown, bdata/sbtime/@itimeaac, bdata/sbtime/@stimeaac" separator=" "/>
        <xsl:text>|</xsl:text>
        <xsl:value-of select="bdata/place, bdata/country, bdata/country/@sctr" separator=", "/>
        <xsl:text>, </xsl:text>
        <xsl:value-of select="bdata/place/@slati, bdata/place/@slong" separator=" "/>
        <xsl:text>|</xsl:text>
        <xsl:value-of select="scollector, seditor, biographer" separator=" "/>
        <xsl:text>|</xsl:text>
        <xsl:value-of select="screationdate, slasteditdate" separator=" "/>
    </xsl:template>
    <xsl:template match="bdata">
        <xsl:value-of select="sbdate/@iday, sbdate/@imonth, sbdate/@iyear" separator="."/>
        <xsl:text>; </xsl:text>
        <xsl:value-of select="sbtime"/>
        <xsl:text>; </xsl:text>
        <xsl:analyze-string select="sbtime/@stmerid" regex="([hm]{{1}})([0-9]{{1,2}})([ew]{{1}})([0-9]{{0,2}})">
            <xsl:matching-substring>
                <xsl:choose>
                    <xsl:when test="regex-group(3) = 'e'">
                        <xsl:text>+</xsl:text>
                    </xsl:when>
                    <xsl:when test="regex-group(3) = 'w'">
                        <xsl:text>-</xsl:text>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:text>+</xsl:text>
                    </xsl:otherwise>
                </xsl:choose>
                <xsl:choose>
                    <xsl:when test="regex-group(1) = 'h'">
                        <xsl:number value="regex-group(2)" format="01"/>
                    </xsl:when>
                    <xsl:when test="regex-group(1) = 'm'">
                        <xsl:text>00:</xsl:text>
                        <xsl:number value="regex-group(2)" format="01"/>
                        <xsl:text>:</xsl:text>
                        <xsl:number value="regex-group(4)" format="01"/>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:text>+1</xsl:text>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:matching-substring>
        </xsl:analyze-string>
        <xsl:text>; </xsl:text>
        <xsl:value-of select="place, country" separator=","/>
        <xsl:text>; </xsl:text>
        <xsl:value-of select="place/@slati, place/@slong" separator="; "/>
    </xsl:template>
    <xsl:template match="text_data">
        <xsl:value-of select="shortbiography, wikipedia_link, db_link, sourcenotes" separator="|"/>
    </xsl:template>
    <xsl:template match="research_data">
        <xsl:apply-templates select="categories"/>
        <xsl:text>|</xsl:text>
        <xsl:apply-templates select="relationships"/>
        <xsl:text>|</xsl:text>
        <xsl:value-of select="events/@count"/>
        <xsl:text>|</xsl:text>
        <xsl:apply-templates select="events"/>
    </xsl:template>
    <xsl:template match="categories">
        <xsl:iterate select="category">
            <xsl:value-of select="@cat_id, @db_id, @catnotes" separator=" "/>
            <xsl:text> - </xsl:text>
            <xsl:value-of select="text()"/>
            <xsl:text> |</xsl:text>
        </xsl:iterate>
    </xsl:template>
    <xsl:template match="relationships">
        <xsl:iterate select="relationship">
            <xsl:value-of select="@rel_id, @rel_db_id, @db_id, @relcat" separator=" "/>
            <xsl:text> - </xsl:text>
            <xsl:value-of select="@relnotes, text()" separator=" - "/>
            <xsl:text> |</xsl:text>
        </xsl:iterate>
    </xsl:template>
    <xsl:template match="events">
        <xsl:iterate select="event">
            <xsl:value-of select="@sevcode, @evn_id, @db_id, @evnotes" separator=" "/>
            <xsl:text> |</xsl:text>
            <xsl:apply-templates select="event_data"/>
            <xsl:text> |</xsl:text>
        </xsl:iterate>
    </xsl:template>
    <xsl:template match="event">
        <xsl:apply-templates select="event_data"/>
    </xsl:template>
    <xsl:template match="event_data">
        <xsl:value-of select="sbdate, sbdate/@ccalendar, sbdate_dmy" separator=" "/>
    </xsl:template>

</xsl:stylesheet>

It works with a small sample file but not with the whole 161 MB file.

Best regards.

There are 2 best solutions below

Best answer

Martin has answered quite a lot of the questions, but let me add a few words.

Your example code

<xsl:iterate select="db_entry">
    <xsl:apply-templates select="db_entry"/>
</xsl:iterate>

seems to be a beginner's mistake: unless db_entry actually contains another db_entry element as a child, this should be

<xsl:iterate select="db_entry">
    <xsl:apply-templates select="."/>
</xsl:iterate>

The difference between xsl:iterate and xsl:for-each is that with xsl:for-each, each item in the input sequence is processed independently of the others: there is no defined order of processing, and there is no way that the processing of one item can affect the way subsequent items are processed. With xsl:iterate, the items are processed in order, and (by using xsl:next-iteration) you can set variables/parameters when processing an item, which are available for use when processing the next item.

This difference has nothing directly to do with streaming; however xsl:iterate was introduced because there were use cases (such as computing a running total on a bank account) that were very hard to make streamable without such a construct.
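As a rough illustration of that running-total case, here is a minimal sketch of xsl:iterate carrying state from one item to the next. The account and transaction element names and the amount attribute are hypothetical, not taken from the question's data; the sketch only shows the parameter passing and makes no claim about streamability.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="3.0">

    <xsl:output method="text"/>

    <!-- Hypothetical input: <account><transaction amount="10.00"/>...</account> -->
    <xsl:template match="account">
        <xsl:iterate select="transaction">
            <!-- $balance carries the running total into each iteration -->
            <xsl:param name="balance" as="xs:decimal" select="0"/>
            <!-- Evaluated once, after the last transaction has been processed -->
            <xsl:on-completion>
                <xsl:text>final balance: </xsl:text>
                <xsl:value-of select="$balance"/>
                <xsl:text>&#10;</xsl:text>
            </xsl:on-completion>
            <xsl:variable name="new-balance" as="xs:decimal"
                select="$balance + xs:decimal(@amount)"/>
            <xsl:value-of select="$new-balance"/>
            <xsl:text>&#10;</xsl:text>
            <!-- Hand the updated total to the next transaction -->
            <xsl:next-iteration>
                <xsl:with-param name="balance" select="$new-balance"/>
            </xsl:next-iteration>
        </xsl:iterate>
    </xsl:template>

</xsl:stylesheet>
```

With xsl:for-each there is no equivalent of the balance parameter, which is why a running total is awkward to express there.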

Your edited code:

<xsl:iterate select="db_entry">
    <xsl:apply-templates select="public_data"/>
    <xsl:text> |</xsl:text>
    <xsl:apply-templates select="text_data"/>
    <xsl:text> |</xsl:text>
    <xsl:apply-templates select="research_data"/>
    <xsl:text>&#10;</xsl:text>
</xsl:iterate>

could equally well be written using xsl:for-each, because the processing of an item doesn't depend in any way on the processing of previous items. Either way, however, it wouldn't satisfy the streaming rules, because you are making three "downward selections" within the iteration body, and you are only allowed one. The simplest workaround to this, as Martin has illustrated, is to make a copy of each db_entry (as a tree in memory) and then you can operate on this copy without any streaming constraints.
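A minimal sketch of that copy-based workaround written with xsl:for-each, assuming each db_entry is small enough to hold in memory. The mode name entry follows Martin's answer; the single placeholder template for the three children is an assumption standing in for the real templates. It has not been run against the real data, and it needs a streaming processor such as Saxon-EE to actually stream.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">

    <!-- The unnamed mode reads the large input with streaming -->
    <xsl:mode streamable="yes"/>

    <xsl:output method="text"/>

    <xsl:template match="databank_export">
        <!-- copy-of(.) turns each streamed db_entry into an in-memory tree -->
        <xsl:for-each select="db_entry/copy-of(.)">
            <!-- On the copy, several downward selections are allowed -->
            <xsl:apply-templates select="public_data" mode="entry"/>
            <xsl:text> |</xsl:text>
            <xsl:apply-templates select="text_data" mode="entry"/>
            <xsl:text> |</xsl:text>
            <xsl:apply-templates select="research_data" mode="entry"/>
            <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
    </xsl:template>

    <!-- Placeholder: real code would have one template per child in mode entry -->
    <xsl:template match="public_data | text_data | research_data" mode="entry">
        <xsl:value-of select="."/>
    </xsl:template>

</xsl:stylesheet>
```

The templates in the entry mode are ordinary, non-streamable templates, so they can navigate freely within the copied db_entry.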

Another workaround, if you know that the three child elements occur in the order you are processing them, is to replace:

    <xsl:apply-templates select="public_data"/>
    <xsl:text> |</xsl:text>
    <xsl:apply-templates select="text_data"/>
    <xsl:text> |</xsl:text>
    <xsl:apply-templates select="research_data"/>
    <xsl:text>&#10;</xsl:text>

by

        <xsl:for-each select="*[
            self::public_data or self::text_data or self::research_data]">
          <xsl:if test="position() ne 1"> |</xsl:if>
          <xsl:apply-templates select="."/>
        </xsl:for-each>
        <xsl:text>&#10;</xsl:text>

(Note the little trick of putting a vertical bar before every entry except the first, rather than putting it after every entry except the last. That's because when you're streaming, you don't know when you're about to reach the end. Little things like this become very important when you're trying to make your code streamable.)

As Martin says, Altova RaptorXML does not support streaming: you will need to use Saxon-EE for this.

Second answer

To give you a simple example: if you have lots of db_entry elements, and a single entry can safely be pulled into memory without causing memory problems, you can use copy-of on those elements and process each copy traditionally in a separate mode, while the main mode uses streaming:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="xs math"
    version="3.0">

    <xsl:mode streamable="yes"/>

    <xsl:output method="text"/>

    <xsl:template match="/">
        <xsl:apply-templates select="databank_export/copy-of(db_entry)" mode="entry"/>
    </xsl:template>

    <xsl:template match="db_entry" mode="entry">
        <xsl:value-of select="public_data, text_data, research_data" separator=" |"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:template>

</xsl:stylesheet>

That transforms e.g.

<?xml version="1.0" encoding="UTF-8"?>
<databank_export>
   <db_entry>
      <public_data>public data 1</public_data>
      <text_data>text 1</text_data>
      <research_data>research 1</research_data>
   </db_entry>
   <db_entry>
      <public_data>public data 2</public_data>
      <text_data>text 2</text_data>
      <research_data>research 2</research_data>
   </db_entry>
   <db_entry>
      <public_data>public data 3</public_data>
      <text_data>text 3</text_data>
      <research_data>research 3</research_data>
   </db_entry>
   <db_entry>
      <public_data>public data 4</public_data>
      <text_data>text 4</text_data>
      <research_data>research 4</research_data>
   </db_entry>
   <db_entry>
      <public_data>public data 5</public_data>
      <text_data>text 5</text_data>
      <research_data>research 5</research_data>
   </db_entry>
   <db_entry>
      <public_data>public data 6</public_data>
      <text_data>text 6</text_data>
      <research_data>research 6</research_data>
   </db_entry>
   <db_entry>
      <public_data>public data 7</public_data>
      <text_data>text 7</text_data>
      <research_data>research 7</research_data>
   </db_entry>
   <db_entry>
      <public_data>public data 8</public_data>
      <text_data>text 8</text_data>
      <research_data>research 8</research_data>
   </db_entry>
   <db_entry>
      <public_data>public data 9</public_data>
      <text_data>text 9</text_data>
      <research_data>research 9</research_data>
   </db_entry>
   <db_entry>
      <public_data>public data 10</public_data>
      <text_data>text 10</text_data>
      <research_data>research 10</research_data>
   </db_entry>
</databank_export>

into

public data 1 |text 1 |research 1
public data 2 |text 2 |research 2
public data 3 |text 3 |research 3
public data 4 |text 4 |research 4
public data 5 |text 5 |research 5
public data 6 |text 6 |research 6
public data 7 |text 7 |research 7
public data 8 |text 8 |research 8
public data 9 |text 9 |research 9
public data 10 |text 10 |research 10

and Saxon 9.8 EE would use streaming.

With xsl:iterate you could use

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="xs math"
    version="3.0">

    <xsl:mode streamable="yes"/>

    <xsl:output method="text"/>

    <xsl:template match="/">
        <xsl:iterate select="databank_export/db_entry">
            <xsl:apply-templates select="copy-of()" mode="entry"/>
        </xsl:iterate>
    </xsl:template>

    <xsl:template match="db_entry" mode="entry">
        <xsl:value-of select="public_data, text_data, research_data" separator=" |"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:template>

</xsl:stylesheet>

As for your extended stylesheet, still assuming that it is acceptable to pull a single db_entry into memory and process it normally in a mode that does not use streaming, I think you want something along the lines of

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:math="http://www.w3.org/2005/xpath-functions/math" xmlns:array="http://www.w3.org/2005/xpath-functions/array" xmlns:map="http://www.w3.org/2005/xpath-functions/map" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:err="http://www.w3.org/2005/xqt-errors" exclude-result-prefixes="array fn map math xhtml xs err" version="3.0"
    default-mode="entry">
    <xsl:output method="text" encoding="UTF-8" indent="yes"/>
    <xsl:mode name="start" streamable="yes"/>
    <!--
    <xsl:template match="databank_export">
    -->
    <xsl:template match="/" mode="start">
        <xsl:apply-templates select="databank_export/copy-of(db_entry)" mode="entry"/>
    </xsl:template>
    <xsl:template match="db_entry">
        <xsl:apply-templates select="public_data, text_data, research_data"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:template>
    <xsl:template match="public_data">
        <xsl:value-of select="sflname"/>
        <xsl:text>; </xsl:text>
        <xsl:apply-templates select="bdata"/>
        <xsl:text>; </xsl:text>
        <xsl:value-of select="gender"/>
        <xsl:text>; PHOTO : |</xsl:text>
        <xsl:value-of select="name, gender, rating, datatype/@sdatatype, datatype/@sdatasource, bdata/sbdate, bdata/sbdate/@ccalendar" separator=" - "/>
        <xsl:text>|</xsl:text>
        <xsl:value-of select="bdata/sbtime, bdata/sbtime/@sbtime_ampm, bdata/sbtime/@ctimetype, bdata/sbtime/@stimetype, bdata/sbtime/@stmerid, bdata/sbtime/@ctzauto, bdata/sbtime/@jd_ut, bdata/sbtime/@sznabbr, bdata/sbtime/@time_unknown, bdata/sbtime/@itimeaac, bdata/sbtime/@stimeaac" separator=" "/>
        <xsl:text>|</xsl:text>
        <xsl:value-of select="bdata/place, bdata/country, bdata/country/@sctr" separator=", "/>
        <xsl:text>, </xsl:text>
        <xsl:value-of select="bdata/place/@slati, bdata/place/@slong" separator=" "/>
        <xsl:text>|</xsl:text>
        <xsl:value-of select="scollector, seditor, biographer" separator=" "/>
        <xsl:text>|</xsl:text>
        <xsl:value-of select="screationdate, slasteditdate" separator=" "/>
    </xsl:template>
    <xsl:template match="bdata">
        <xsl:value-of select="sbdate/@iday, sbdate/@imonth, sbdate/@iyear" separator="."/>
        <xsl:text>; </xsl:text>
        <xsl:value-of select="sbtime"/>
        <xsl:text>; </xsl:text>
        <xsl:analyze-string select="sbtime/@stmerid" regex="([hm]{{1}})([0-9]{{1,2}})([ew]{{1}})([0-9]{{0,2}})">
            <xsl:matching-substring>
                <xsl:choose>
                    <xsl:when test="regex-group(3) = 'e'">
                        <xsl:text>+</xsl:text>
                    </xsl:when>
                    <xsl:when test="regex-group(3) = 'w'">
                        <xsl:text>-</xsl:text>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:text>+</xsl:text>
                    </xsl:otherwise>
                </xsl:choose>
                <xsl:choose>
                    <xsl:when test="regex-group(1) = 'h'">
                        <xsl:number value="regex-group(2)" format="01"/>
                    </xsl:when>
                    <xsl:when test="regex-group(1) = 'm'">
                        <xsl:text>00:</xsl:text>
                        <xsl:number value="regex-group(2)" format="01"/>
                        <xsl:text>:</xsl:text>
                        <xsl:number value="regex-group(4)" format="01"/>
                    </xsl:when>
                    <xsl:otherwise>
                        <xsl:text>+1</xsl:text>
                    </xsl:otherwise>
                </xsl:choose>
            </xsl:matching-substring>
        </xsl:analyze-string>
        <xsl:text>; </xsl:text>
        <xsl:value-of select="place, country" separator=","/>
        <xsl:text>; </xsl:text>
        <xsl:value-of select="place/@slati, place/@slong" separator="; "/>
    </xsl:template>
    <xsl:template match="text_data">
        <xsl:value-of select="shortbiography, wikipedia_link, db_link, sourcenotes" separator="|"/>
    </xsl:template>
    <xsl:template match="research_data">
        <xsl:apply-templates select="categories"/>
        <xsl:text>|</xsl:text>
        <xsl:apply-templates select="relationships"/>
        <xsl:text>|</xsl:text>
        <xsl:value-of select="events/@count"/>
        <xsl:text>|</xsl:text>
        <xsl:apply-templates select="events"/>
    </xsl:template>
    <xsl:template match="categories">
        <xsl:iterate select="category">
            <xsl:value-of select="@cat_id, @db_id, @catnotes" separator=" "/>
            <xsl:text> - </xsl:text>
            <xsl:value-of select="text()"/>
            <xsl:text> |</xsl:text>
        </xsl:iterate>
    </xsl:template>
    <xsl:template match="relationships">
        <xsl:iterate select="relationship">
            <xsl:value-of select="@rel_id, @rel_db_id, @db_id, @relcat" separator=" "/>
            <xsl:text> - </xsl:text>
            <xsl:value-of select="@relnotes, text()" separator=" - "/>
            <xsl:text> |</xsl:text>
        </xsl:iterate>
    </xsl:template>
    <xsl:template match="events">
        <xsl:iterate select="event">
            <xsl:value-of select="@sevcode, @evn_id, @db_id, @evnotes" separator=" "/>
            <xsl:text> |</xsl:text>
            <xsl:apply-templates select="event_data"/>
            <xsl:text> |</xsl:text>
        </xsl:iterate>
    </xsl:template>
    <xsl:template match="event">
        <xsl:apply-templates select="event_data"/>
    </xsl:template>
    <xsl:template match="event_data">
        <xsl:value-of select="sbdate, sbdate/@ccalendar, sbdate_dmy" separator=" "/>
    </xsl:template>

</xsl:stylesheet>

You then need to run Saxon 9.8 EE with -im:start. The above is basically the code you posted, with default-mode="entry" added so that all the templates you have written, and their apply-templates calls, belong to the mode named entry. A streamable start mode is added on top, using my earlier suggestion of

    <xsl:template match="/" mode="start">
        <xsl:apply-templates select="databank_export/copy-of(db_entry)" mode="entry"/>
    </xsl:template>

to pull in the db_entry elements with streaming, but to push a copy of each to a non-streamable mode where normal processing can be used with e.g.

    <xsl:template match="db_entry">
        <xsl:apply-templates select="public_data, text_data, research_data"/>
        <xsl:text>&#10;</xsl:text>
    </xsl:template>

If I had written that from scratch, I would have used the approach from my shorter snippet: keep the unnamed default mode streamable and push the copies to a named, non-streamable mode. Given all the code you had added, however, it was easier to set the default-mode attribute and use the named mode start, with streaming, as the initial mode.

As said, this mixture of streaming and traditional processing only makes sense as a workaround for memory problems if the size of the input comes from thousands of db_entry elements, each small enough to be held in memory on its own. It also needs Saxon 9.8 EE (or perhaps 9.7 EE), the only current implementation of streaming.

Note that I have made no effort to check or correct all the templates you have posted. In general, you could use xsl:for-each in many places where you have tried xsl:iterate, or, as I would prefer, use e.g. <xsl:apply-templates select="event"/> together with an <xsl:template match="event">...</xsl:template>, to keep the code modular and the processing approach consistent.
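As a sketch of that last suggestion, the events template could delegate to a template for event instead of iterating. The attribute and element names are taken from the posted stylesheet; this is untested against the real data and is shown in the unnamed mode for brevity. In the extended stylesheet above, the same templates would pick up default-mode="entry" from the xsl:stylesheet element, and the event template here would replace the existing one.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="3.0">

    <xsl:output method="text"/>

    <!-- Delegate to the event template rather than looping with xsl:iterate -->
    <xsl:template match="events">
        <xsl:apply-templates select="event"/>
    </xsl:template>

    <!-- Per-event output, moved here from the body of the former loop -->
    <xsl:template match="event">
        <xsl:value-of select="@sevcode, @evn_id, @db_id, @evnotes" separator=" "/>
        <xsl:text> |</xsl:text>
        <xsl:apply-templates select="event_data"/>
        <xsl:text> |</xsl:text>
    </xsl:template>

    <xsl:template match="event_data">
        <xsl:value-of select="sbdate, sbdate/@ccalendar, sbdate_dmy" separator=" "/>
    </xsl:template>

</xsl:stylesheet>
```

The categories and relationships templates could be restructured the same way, since none of those loops carries state between items.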