
xmllint encodes special chars


This is my file (UTF-8 encoded):

<?xml version="1.0" encoding="UTF-8"?>
<foo>
  <bar>Hello World äüö</bar>
</foo>

I would like to use xmllint to produce this result:

<bar>Hello World äüö</bar>

But every command I tried prints the non-ASCII characters as numeric character references:

$ xmllint --xpath "//bar" file.xml
<bar>Hello World &#xE4;&#xFC;&#xF6;</bar>
$ xmllint --xpath "//bar" --encode utf-8 file.xml
<bar>Hello World &#xE4;&#xFC;&#xF6;</bar>
$ xmllint --xpath "//bar" --noenc file.xml
<bar>Hello World &#xE4;&#xFC;&#xF6;</bar>

Do you have any idea how to get the unencoded result? (I cannot install other tools such as xmlstarlet.)

$ xmllint --version
xmllint: using libxml version 20907
   compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib Lzma
$ locale
LANG=C.utf8
LC_CTYPE="C.utf8"
LC_NUMERIC="C.utf8"
LC_TIME="C.utf8"
LC_COLLATE="C.utf8"
LC_MONETARY="C.utf8"
LC_MESSAGES="C.utf8"
LC_PAPER="C.utf8"
LC_NAME="C.utf8"
LC_ADDRESS="C.utf8"
LC_TELEPHONE="C.utf8"
LC_MEASUREMENT="C.utf8"
LC_IDENTIFICATION="C.utf8"
LC_ALL=
$ cat /etc/*-release
Rocky Linux release 8.8 (Green Obsidian)

Answer by LMC (best answer)

The best option seems to be the cat command of xmllint's internal shell.

Given this file (tmp.xml):

<?xml version="1.0" encoding="UTF-8"?>
<A>
  <B>Hello World äüö  &#xE4;&#xFC;&#xF6;</B>
</A>

Sending cat <xpath expression> to the internal shell and filtering out the prompt and separator lines:

printf "%s\n" "cat //B/text()" 'bye' |  xmllint --shell tmp.xml | grep -Ev '^([/]| -----)'
Hello World äüö  äüö
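If grep is awkward to use in a larger script, the same filtering can be sketched in Python. This only illustrates what the grep -Ev pattern above removes; the function name strip_shell_noise and the sample string are made up for the example, and the sample is an assumption about what the shell prints, not captured output.

```python
import re

# Same pattern as the grep -Ev filter above: drop prompt lines ("/ > ...")
# and the " -------" separator that the shell prints before each node.
NOISE = re.compile(r"^(/| -----)")


def strip_shell_noise(output):
    """Return the lines of xmllint --shell output that are not prompt or separator lines.

    Hypothetical helper, not part of xmllint.
    """
    return [line for line in output.splitlines() if not NOISE.match(line)]


# Assumed sample of the shell output, for illustration only.
sample = "/ > cat //B/text()\n -------\nHello World äüö  äüö\n/ > bye\n"
print("\n".join(strip_shell_noise(sample)))
```

Like the grep version, this also drops any content line that happens to start with "/", so it is only safe when the extracted text cannot begin with a slash.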

The issue looks related to the xmllint version (which ultimately means the libxml2 version). The question reports libxml 20907; the tests below were run with 20914. Details follow.

xmllint --version
xmllint: using libxml version 20914

Using xmllint --shell

echo "cat //B" | xmllint --shell tmp.xml 
/ > cat //B
 -------
<B>Hello World äüö  äüö</B>

--noenc without --xpath: --noenc takes precedence over --noent, which makes sense. Every character in the output is ASCII.

xmllint --noenc tmp.xml 
<?xml version="1.0"?>
<A>
  <B>Hello World &#xE4;&#xFC;&#xF6;  &#xE4;&#xFC;&#xF6;</B>
</A>

--noent (this looks like the default behavior)

xmllint --noent tmp.xml 
<?xml version="1.0" encoding="UTF-8"?>
<A>
  <B>Hello World äüö  äüö</B>
</A>

--xpath: --noenc is ignored

xmllint --noenc --noent --xpath '//B' tmp.xml 
<B>Hello World äüö  äüö</B>

xmllint --xpath '//B' tmp.xml 
<B>Hello World äüö  äüö</B>

--shell: --noenc is ignored by the internal cat command, while the internal xpath command prints the non-ASCII characters in escaped form.

xmllint --shell --noenc tmp.xml 
/ > cat //B/text()
 -------
Hello World äüö  äüö
/ > xpath //B/text()
Object is a Node Set :
Set contains 1 nodes:
1  TEXT
    content=Hello World #C3#A4#C3#BC#C3#B6  #C3#A4#C3#BC#C3#B6
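The #C3#A4 sequences in the xpath output look like the raw UTF-8 bytes of each character written in hex (ä is U+00E4, which UTF-8 encodes as the bytes C3 A4), not XML character references. Assuming that reading is right, a small Python sketch can turn such a dump back into text. The function name decode_hex_dump is hypothetical.

```python
import re


def decode_hex_dump(dumped):
    """Turn '#C3#A4'-style byte escapes back into text.

    Hypothetical helper. Assumes each '#XX' stands for one byte of a
    UTF-8 sequence, which is how the output above appears to be built.
    """
    out = bytearray()
    pos = 0
    for match in re.finditer(r"#([0-9A-F]{2})", dumped):
        # Copy the plain text before the escape, then the escaped byte itself.
        out += dumped[pos:match.start()].encode("utf-8")
        out.append(int(match.group(1), 16))
        pos = match.end()
    out += dumped[pos:].encode("utf-8")
    return out.decode("utf-8")


print(decode_hex_dump("Hello World #C3#A4#C3#BC#C3#B6"))
```

A literal "#C3" in the original text cannot be told apart from an escaped byte, so this is only useful for reading debug output, not for extracting data.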

ASCII encoding

xmllint --encode ASCII tmp.xml
<?xml version="1.0" encoding="ASCII"?>
<A>
  <B>Hello World &#228;&#252;&#246;  &#228;&#252;&#246;</B>
</A>
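When xmllint cannot be persuaded to emit the characters directly, one workaround is to post-process its output. The following is a minimal sketch using only the Python standard library, assuming Python 3 is available on the machine. The function name decode_non_ascii_refs is hypothetical. It converts decimal and hexadecimal references above the ASCII range back into characters and leaves ASCII references and named entities alone, so escaped markup characters stay escaped.

```python
import re

# Matches hexadecimal (&#xE4;) and decimal (&#228;) character references.
_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_non_ascii_refs(text):
    """Replace numeric character references above U+007F with the character.

    Hypothetical helper for illustration, not part of xmllint or libxml2.
    """
    def repl(match):
        code = int(match.group(1), 16) if match.group(1) else int(match.group(2))
        # Keep ASCII references (such as &#38; for "&") untouched so the
        # result remains well-formed XML.
        if 0x7F < code <= 0x10FFFF:
            return chr(code)
        return match.group(0)

    return _REF.sub(repl, text)


print(decode_non_ascii_refs("<B>Hello World &#228;&#252;&#246;  &#xE4;&#xFC;&#xF6;</B>"))
```

The same function could be applied to text read from standard input to act as a filter after xmllint --xpath; that wiring is left out here.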

The lxml Python module is also based on libxml2, so here is a one-liner that does the same:

python3 -c 'import sys; from lxml import etree;doc=etree.parse(sys.argv[1]); print(doc.xpath(sys.argv[2]))' tmp.xml '//B/text()'

Text result:

['Hello World äüö  äüö']

Serializing without specifying an encoding

python3 -c 'import sys; from lxml import etree;doc=etree.parse(sys.argv[1]); print(etree.tostring(doc.xpath(sys.argv[2])[0]).decode("utf-8"))' tmp.xml '//B'
<B>Hello World &#228;&#252;&#246;  &#228;&#252;&#246;</B>

Serializing with an explicit encoding

python3 -c 'import sys; from lxml import etree;doc=etree.parse(sys.argv[1]); print(etree.tostring(doc.xpath(sys.argv[2])[0], encoding="utf-8").decode("utf-8"))' tmp.xml '//B'
<B>Hello World äüö  äüö</B>
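If lxml cannot be installed either, the standard library's xml.etree.ElementTree shows the same split between ASCII and Unicode serialization. This is a sketch under the assumption that Python 3 is present. It uses ElementTree's own parser rather than libxml2, so it illustrates the idea and is not a drop-in replacement for xmllint. The variable names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Same sample document as tmp.xml above, inlined to keep the example self-contained.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    "<A>\n"
    "  <B>Hello World äüö  &#xE4;&#xFC;&#xF6;</B>\n"
    "</A>\n"
)

root = ET.fromstring(SAMPLE.encode("utf-8"))
node = root.find(".//B")
node.tail = None  # drop the whitespace that follows </B> in the source

# Default serialization targets US-ASCII, so non-ASCII characters
# come out as numeric character references.
ascii_form = ET.tostring(node).decode("ascii")

# encoding="unicode" returns a str with the characters left as they are.
unicode_form = ET.tostring(node, encoding="unicode")

print(ascii_form)
print(unicode_form)
```

ElementTree supports only a limited subset of XPath, so expressions that work with xmllint --xpath may need adjusting.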