How do I separate xmlstarlet output with nul?


I'm trying to use nul (U+0000) to delimit XML values in xmlstarlet output. xmlstarlet ignores -o '', -o $'\0', and -o '\0'.

I'm aware that I can use other characters like the various field separators to delimit output. The problem with this approach is that these characters can also exist as data. I don't want any ambiguity.

I want to use nul specifically because it's the only value that can't be represented in raw XML.

So, to repeat my question: How do I separate xmlstarlet output with nul?

More information

I've included the following information at the request of commenters. While I appreciate your desire to help, please avoid suggesting XY solutions. I'm only looking for an answer to my question as presented.

The data I'm working with looks like this:

<data>
    <datapoint attribute-1="val-1" attribute-2="val-a" />
    <datapoint attribute-1="val-2" attribute-2="val-b"  />
    <datapoint attribute-1="val-3">
        <sub-datapoint />
    </datapoint>
</data>

The way I'm trying to use xmlstarlet:

mapfile -d '' -t ARRAY < <( xmlstarlet sel -t -m /data/datapoint -o 'datapoint' -o $'\0' -v ./@attribute-1 -o $'\0' data.xml )

A hexdump of the output I'm looking for:

64 61 74 61 70 6f 69 6e  74 00 76 61 6c 2d 31 00  |datapoint.val-1.|
64 61 74 61 70 6f 69 6e  74 00 76 61 6c 2d 32 00  |datapoint.val-2.|
64 61 74 61 70 6f 69 6e  74 00 76 61 6c 2d 33 00  |datapoint.val-3.|

There are 3 best solutions below

Answer 1 (best answer)

Unfortunately, xmlstarlet doesn't seem to be capable of producing nul in its output.
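
A likely reason, at least for the $'\0' form: command-line arguments are nul-terminated C strings, so bash cannot pass a nul inside an argument at all. The sketch below only inspects what bash would hand to xmlstarlet. The variable names are illustrative, and it says nothing about how xmlstarlet itself treats the literal '\0' form.

```shell
# $'\0' is truncated at the nul, so the resulting argument is empty.
nul_arg=$'\0'
printf 'length of nul_arg: %d\n' "${#nul_arg}"

# '\0' in single quotes is two ordinary characters: a backslash and a zero.
literal_arg='\0'
printf 'length of literal_arg: %d\n' "${#literal_arg}"
```

In bash the first length is 0 and the second is 2, so -o $'\0' is indistinguishable from -o ''.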

xmlstarlet is, however, capable of producing U+FFFF, a code point that is not a legal character in any XML version. You can use it to safely delimit XML values, and then use another program to replace it with nul:

xmlstarlet sel -t \
   -m /data/datapoint \
   -o 'datapoint' \
   -o $'\uffff' \
   -v ./@attribute-1 \
   -o $'\uffff' data.xml \
 | python3 -c 'import sys; sys.stdout.write(sys.stdin.read().replace("\uffff", "\0"))'
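
Once the stream is nul-delimited, bash can consume it field by field. The following is a minimal sketch, not part of the original answer: emit_records is a hypothetical stand-in that uses printf to imitate the pipeline's output, so the loop can be tried without xmlstarlet or python3.

```shell
# Hypothetical stand-in for the xmlstarlet | python3 pipeline above:
# emits name/value pairs, each field terminated by a nul byte.
emit_records() {
    printf 'datapoint\0val-1\0datapoint\0val-2\0datapoint\0val-3\0'
}

names=()
values=()
# read -d '' stops at each nul; IFS= and -r keep the field contents untouched.
while IFS= read -r -d '' name && IFS= read -r -d '' value; do
    names+=("$name")
    values+=("$value")
done < <(emit_records)

printf '%s=%s\n' "${names[0]}" "${values[0]}"
```

Reading the fields in pairs keeps each element name next to its attribute value. An incomplete trailing pair is silently dropped.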
Answer 2

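You can use $'\1' (U+0001), which should be just as good as nul in the majority of situations. It is a single byte, so mapfile accepts it as a delimiter, and it is not a legal character in XML 1.0: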

mapfile -d $'\1' -t ARRAY < <( xmlstarlet sel -t -m "XPATH" -v "XPATH" -o $'\1' -v 'XPATH' "FILE" )
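
To see the splitting behaviour without xmlstarlet, here is a small sketch. emit_soh_records is a hypothetical stand-in that imitates U+0001-delimited output with printf, and mapfile -d assumes bash 4.4 or later.

```shell
# Hypothetical stand-in for xmlstarlet output delimited with U+0001.
emit_soh_records() {
    printf 'datapoint\001val-1\001datapoint\001val-2\001'
}

# -d $'\1' splits on the U+0001 byte; -t strips that delimiter from each element.
mapfile -d $'\1' -t ARRAY < <(emit_soh_records)

printf '%d fields, first value: %s\n' "${#ARRAY[@]}" "${ARRAY[1]}"
```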
Answer 3

Here's a variation of @TendersMcChiken's answer with perl substituted for python:

xmlstarlet sel -t -m /data/datapoint \
  -o 'datapoint' -o $'\uFFFF' -v ./@attribute-1 -o $'\uFFFF' data.xml \
  | perl -CS -0xFFFF -l0 -pe '' \
  | hexdump -e '16/1 "%-3.2x"' -e '"|" 16/1 "%_p" "|\n"'

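The output exactly matches the hexdump shown in the question. Here -CS treats the standard streams as UTF-8, -0xFFFF sets the input record separator to U+FFFF, and -l0 strips that separator from each record and writes nul in its place.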

Aside: since the goal was to capture the result into a bash array, I tried this:

mapfile -d $'\uFFFF' -t arr < <(
  xmlstarlet sel -t -m /data/datapoint \
  -o 'datapoint' -o $'\uFFFF' -v ./@attribute-1 -o $'\uFFFF' data.xml
)

It didn't work, however, because bash does not support a multibyte character as the delimiter for its mapfile builtin. [discussion]

What you could do is have xmlstarlet output U+FFFF, use perl (or a similar tool) to translate U+FFFF to nul, and finally use mapfile with a nul delimiter:

mapfile -d '' -t arr < <(
  xmlstarlet sel -t -m /data/datapoint \
  -o 'datapoint' -o $'\uFFFF' -v ./@attribute-1 -o $'\uFFFF' data.xml \
  | perl -CS -0xFFFF -l0 -pe ''
)
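
If perl is unavailable, one alternative is to skip nul entirely and split on the U+FFFF sentinel in bash itself. This is a sketch under assumptions, not something the answers above propose: emit_ffff_records is a hypothetical stand-in for the xmlstarlet command, and the separator is written as its UTF-8 bytes so the example does not depend on the locale.

```shell
# Hypothetical stand-in for xmlstarlet output delimited with U+FFFF
# (written as its UTF-8 bytes EF BF BF so the sketch is locale-independent).
sep=$'\xef\xbf\xbf'
emit_ffff_records() {
    printf 'datapoint%sval-1%sdatapoint%sval-2%s' "$sep" "$sep" "$sep" "$sep"
}

# Command substitution is safe here because the stream contains no nul bytes.
data=$(emit_ffff_records)

arr=()
# Peel off one field at a time, up to each occurrence of the separator.
while [[ $data == *"$sep"* ]]; do
    arr+=("${data%%"$sep"*}")
    data=${data#*"$sep"}
done

printf '%d fields, last: %s\n' "${#arr[@]}" "${arr[3]}"
```

Anything left after the final separator, such as a trailing newline, is discarded. The loop rescans the remaining string on every pass, so it is best suited to modest amounts of output.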