After viewing this previous SO question regarding percent encoding, I'm curious as to which styles of encodings are correct - the Wikipedia article on percent encoding alludes to using + instead of %20 for spaces, while still having an application/x-www-form-urlencoded content type.
This leads me to think the + vs. %20 behavior depends on which part of the URL is being encoded. What differences are preferred for path segments vs. query strings? Details and references for this specification would be greatly appreciated.
Note: I assume that non-alphanumeric characters will be encoded via UTF-8, in that each octet for a character becomes a %XX string. Correct me if I am wrong here (for instance latin-1 instead of utf-8), but I am more interested in the differences between the encodings of different parts of a URL.
Not only does it depend on the particular URL component, but it also depends on the circumstances in which that component is populated with data.
The use of '+' for encoding space characters is specific to the application/x-www-form-urlencoded format, which applies to webform data that is being submitted in an HTTP request. It does not apply to a URL itself.
The application/x-www-form-urlencoded format is formally defined by W3C in the HTML specifications. In HTML 4.01, the relevant definitions are in:
Section 17.13.3 Processing form data, Step four: Submit the encoded form data set
Section 17.13.4 Form content types, application/x-www-form-urlencoded
The corresponding HTML5 definitions (Section 4.10.22.3 Form submission algorithm and Section 4.10.22.6 URL-encoded form data) are far more refined and detailed, but for purposes of this discussion, the gist is roughly the same.
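As a rough illustration of the '+' convention, here is a minimal Python sketch using the standard urllib.parse module. It is not a model of what any particular browser does. The field name and values are just sample data, and Python is only one convenient way to show the two encodings side by side.

```python
from urllib.parse import parse_qs, quote, quote_plus, urlencode

# Form-style encoding: urlencode() uses quote_plus() by default,
# so a space becomes '+'.
form_query = urlencode({"field": "hello world"})

# Generic percent-encoding: quote() turns a space into %20.
generic = quote("hello world")

# A literal '+' in form data is itself escaped (as %2B), so that it is
# not read back as a space by a form decoder.
literal_plus = quote_plus("1+1 = 2")

# A form-aware parser maps both '+' and %20 back to a space.
decoded_plus = parse_qs(form_query)
decoded_pct = parse_qs("field=hello%20world")

print(form_query)    # field=hello+world
print(generic)       # hello%20world
print(literal_plus)  # 1%2B1+%3D+2
print(decoded_plus)  # {'field': ['hello world']}
```

Note that the decoder accepts %20 as well as '+', which is why both spellings usually work in a query string that is parsed as form data.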
So, in the situation where the webform data is submitted via an HTTP GET request instead of a POST request, the webform data is encoded using application/x-www-form-urlencoded and placed as-is in the URL query component.
Per RFC 3986: Uniform Resource Identifier (URI): Generic Syntax, '+' is a reserved character, specifically a member of the sub-delims set.
The query component explicitly allows unencoded '+' characters, because query is defined as any sequence of pchar, '/' and '?' characters, and pchar includes sub-delims.
So, in the context of a webform submission, spaces are encoded using '+' prior to then being put as-is into the query component. This is allowed by the URL syntax, since the encoded form of application/x-www-form-urlencoded is compatible with the definition of the query component.
So, for example:
http://server/script?field=hello+world
However, outside of a webform submission, putting a space character directly into the query component requires the use of pct-encoded, since ' ' is not included in either unreserved or sub-delims, and is not explicitly allowed by the query definition.
So, for example:
http://server/script?hello%20world
Similar rules also apply to the path component, due to its use of pchar: each path segment is a sequence of pchar characters, except for segment-nz-nc, which additionally excludes ':'.
So, although path does allow for unencoded sub-delims characters, a '+' character gets treated as-is, not as an encoded space. application/x-www-form-urlencoded is not used with the path component, so a space character has to be encoded as %20 due to the definitions of pchar and segment-nz-nc.
Now, regarding the charset used to encode characters:
For a webform submission, that charset is dictated by rules defined in the webform encoding algorithm (more so in HTML5 than HTML4) used to prepare the webform data prior to inserting it into the URL. In a nutshell, the HTML can specify an accept-charset attribute or hidden _charset_ field directly in the <form> itself, otherwise the charset is typically the charset used by the parent HTML.
However, outside of a webform submission, there is no formal standard for which charset is used to encode non-ASCII characters in a URL component (the IRI syntax, on the other hand, requires UTF-8, especially when converting an IRI into a URI/URL). Outside of IRI, it is up to particular URI schemes to dictate their charsets (the HTTP scheme does not), otherwise the server decides which charset it wants to use. Most schemes/servers use UTF-8 nowadays, but there are still some servers/schemes that use other charsets, typically based on the server's locale (Latin1, Shift-JIS, etc). There have been attempts to add charset reporting directly in the URL and/or in HTTP (such as Deterministic URI Encoding), but those are not commonly used.
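To make the path and charset points concrete, here is a small Python sketch, again using only urllib.parse. The path and the sample word are made up for illustration, and treating '+' as safe in the path is a choice made here to mirror the RFC 3986 grammar, not something the library does by default.

```python
from urllib.parse import quote, unquote, unquote_plus

# Path segment: keep '+' literal (it is a sub-delim) and escape the space
# as %20. quote() would escape '+' by default, so it is listed in `safe`.
path = quote("/docs/a+b c.txt", safe="/+")

# Decoding: unquote() follows the generic URI rules and leaves '+' alone,
# while unquote_plus() applies the form rules and turns '+' into a space.
generic_decoded = unquote("a+b%20c")
form_decoded = unquote_plus("a+b%20c")

# Charset: the same character yields different octets, and so different
# %XX sequences, depending on the encoding chosen by the sender.
utf8_encoded = quote("café")                       # UTF-8 is the default
latin1_encoded = quote("café", encoding="latin-1")

# A receiver that assumes UTF-8 cannot recover the Latin-1 octet and,
# with the default error handling, substitutes U+FFFD.
mismatched = unquote(latin1_encoded)
matched = unquote(latin1_encoded, encoding="latin-1")

print(path)             # /docs/a+b%20c.txt
print(generic_decoded)  # a+b c
print(form_decoded)     # a b c
print(utf8_encoded)     # caf%C3%A9
print(latin1_encoded)   # caf%E9
```

The last two lines show why sender and receiver need to agree on the charset: nothing in the URL itself says which one was used.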