I’m trying to understand why certain HTML attributes are failing W3C validation. I encountered this in a real codebase, but here’s a minimal reproduction:
<!DOCTYPE html><html lang="en"><head><title>a</title></head><body>
<img alt="1" src="⭐">
<img alt="2" src="/⭐">
<img alt="3" src="/a⭐">
<img alt="4" src="/a/⭐">
<img alt="5" src="🌈">
<img alt="6" src="/🌈"> <!-- Only this is invalid. -->
<img alt="7" src="/a🌈">
<img alt="8" src="/a/🌈">
</body></html>
The W3C validator reports only one error, affecting the sixth image:
Error: Bad value /🌈 for attribute src on element img: Illegal character in path segment: ? is not allowed.
<img alt="6" src="/🌈">
Why is only that one a problem, and not the others? What’s different about it?
The behavior described in the question was caused by a bug in the checker (validator) code that’s fixed now; see https://github.com/validator/galimatias/pull/2. The bug had gone unnoticed because the test suite lacked coverage for the case of a relative URL that starts with a slash followed by a code point greater than U+FFFF, like the U+1F308 (rainbow) character in the question. So the test suite was also updated to add coverage for that case; see https://github.com/web-platform-tests/wpt/pull/36213.
Incidentally, the reason the U+2B50 (⭐) case wasn’t affected by the bug while the U+1F308 (🌈) case was is this: Java uses UTF-16, and U+1F308 is in the range of so-called supplementary characters (that is, the set of code points above U+FFFF). So, as noted in a comment above, in UTF-16 the code point U+1F308 is represented by a surrogate pair of two char values, while U+2B50 is represented by a single char value.
And the reason the difference in how many char values affects how the URL is parsed is that the state machine in the HTML checker’s URL-parsing code maintains a character index and decrements it during state changes. So, if it’s handling a URL segment that can contain code points above U+FFFF, it must be smart about how many characters it decrements the index by: it needs to decrement it by 2 for code points above U+FFFF, and by 1 otherwise.
To do that, the code has a decrIdx() method that calls Character.charCount(). So the code change that got made to the checker replaced a simple idx-- decrementing of the index value with a smarter Character.charCount()-enabled decrIdx() call.
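The checker’s own decrIdx() source isn’t reproduced here, but the idea can be illustrated with a minimal, self-contained sketch. The class name SurrogateIndexDemo and the helper stepBack() are hypothetical names made up for this example, not part of the checker. The sketch assumes the index has just been advanced past the code point following the slash, and compares a plain idx - 1 with a step sized by Character.charCount():

```java
// Illustrative sketch only; this is NOT the checker's actual source code.
class SurrogateIndexDemo {

    // Hypothetical helper: move idx back over the whole code point that ends
    // just before it (2 chars for a surrogate pair, 1 char otherwise).
    static int stepBack(String input, int idx) {
        int cp = input.codePointBefore(idx);
        return idx - Character.charCount(cp);
    }

    public static void main(String[] args) {
        String star = "/" + new String(Character.toChars(0x2B50));     // "/" + U+2B50
        String rainbow = "/" + new String(Character.toChars(0x1F308)); // "/" + U+1F308

        // Same number of code points, different number of char values.
        System.out.println(star.length());    // 2
        System.out.println(rainbow.length()); // 3

        // Index just past the code point that follows the slash.
        int idx = 1 + Character.charCount(rainbow.codePointAt(1)); // 3

        // Naive decrement lands in the middle of the surrogate pair.
        int naive = idx - 1;
        System.out.println(Integer.toHexString(rainbow.codePointAt(naive))); // df08

        // Code-point-aware decrement lands on the start of the pair.
        int fixed = stepBack(rainbow, idx);
        System.out.println(Integer.toHexString(rainbow.codePointAt(fixed))); // 1f308

        // For a BMP character such as U+2B50, both approaches agree.
        System.out.println(stepBack(star, 2)); // 1
    }
}
```

With the naive decrement, the next read starts at a lone low surrogate (U+DF08) rather than at U+1F308, which is the kind of value a URL parser would reject as an illegal character. That is presumably why only the supplementary-character case tripped the bug, while the U+2B50 case, which needs only one char value, was unaffected.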