The ¬ character (0xAC in ISO-8859-1) works for normal text if I ensure that ISO-8859-1 is always used as the encoding throughout. However, when using it in attributes it is escaped to: %C2%AC
. I understand that it needs to be escaped for urls, but not why it escapes it in the same way as it would for UTF-8, rather than just %AC
as I'd expect it to for ISO-8859-1.
Since the escapes are in the output html file the only conclusion is that the xslt processor is the cause.
Example:
Which for me generates:
Output was generated using xsltproc
, compiled against libxml 20707, libxslt 10126 and libexslt 815. This was on #! Linux (amd64). I have also tried: xmlstarlet tr
(also uses libxml), xalan
and google chrome (by adding an <?xml-stylesheet ... >
, see input_ss.xml tag) with the same result.
Opera doesn't escape it at all, and it allows ¬ to be used literally in the url and attribute.
Is this standard behaviour for xslt or is this a bug in the way the attributes are escaped? And either way, is there a solution other than replacing %C2%AC
with %AC
bearing in mind it is almost certainly the same for other characters that are valid ISO-8859-1 and invalid in UTF-8.
There are 3 different text-based technologies in use here, XML, HTML and URIs.
All of these have escape mechanisms - that is to say, ways to use text to indicate other text that it is impossible or difficult to indicate in a given context.
The not-sign character
¬
(U+00AC) could be escaped in the first two as¬
; or¬
perhaps with some leading zeros, in both XML and HTML (¬
would also work in HTML). This escape would be used no matter what encoding the XML or HTML was in, because it relates to the character¬
, not to its set of octets in a given character encoding - indeed, we would generally only use it in the case where there was no such set of octets in the encoding being used.In this case, this is unnecessary, since the output is in a character encoding in which there is no need to escape it, and so in the source you can see
The ¬ character
unescaped.This HTML includes the text of a URI. The encoding of the HTML has nothing to do with this, because the encoding is how we get the text of the HTML from one machine to another, but when the HTML is being parsed to read this URI we're past that point and are dealing with some text at the level of text - that is to say, it doesn't have an encoding any more.
Now, URIs have their own escape mechanisms. This must be used in the case of
¬
, as it is not a character allowed in URIs (as opposed to IRIs). Sadly, unlike the escapes in XML and HTML, these escapes are based on octets in a given encoding rather than the code-point of the character itself.It's easy to see this as a mistake now, but URIs were specified in 1994 and that formalised work going back to 1989/1990 while Unicode 1.0 was released in 1991 and didn't have the ground-breaking 2.0 until 1996, so hindsight has considerably more benefits than URI's inventors. (HTML had the same problem many years ago, but the format of its encodings made it much easier to fix this without as many backwards-compatibility issues).
So, what encoding should we use for those octets? The original specs left this undefined, but really the only possible choice is UTF-8. It's the only encoding that gives those escapes commonly used for chracters special to URIs their escapes in the range 0x20 - 0x7F while also covering all of the UCS.
There's also no way to indicate another choice could be more appropriate. Remember, we're working at the level of text, so your use of ISO-8859-1 is completely irrelevant. Even if we kept track of the encoding while parsing the HTML, the URI is going to be made use of in a way that is nothing to do with the document, so we still couldn't use it. In all, if we have to make use of an octet-based encoding, and we have to keep characters in the ASCII range matching the octets they'd have in ASCII, the only possible basis for the encoding is UTF-8.
For that reason, the escape in any URI for
¬
must always be%C2%AC
.There can be some legacy systems that expect URIs to use other encodings, but the solution is to fix the bit that's broken, not the bit that works, so if something expects
¬
to be%AC
then catch it close to that by converting%C2%AC
close to its use (and if it outputs%AC
itself then of course you'll need to fix it to%C2%AC
before it hits the outside world).