I want to create an instance of java.net.URI using individual URI components, namely:
- scheme
- userInfo
- host
- port
- path
- query
- fragment
There is a constructor in the java.net.URI class that allows me to do this. Here is the code from the library:
    public URI(String scheme,
               String authority,
               String path, String query, String fragment)
        throws URISyntaxException
    {
        String s = toString(scheme, null,
                            authority, null, null, -1,
                            path, query, fragment);
        checkPath(s, scheme, path);
        new Parser(s).parse(false);
    }
This constructor will also encode path, query, and fragment parts of the URI, so for example if I pass already encoded strings as arguments, they will be double encoded.
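The double encoding described above can be reproduced with a small sketch. The class name `DoubleEncodingDemo`, its helper methods, and the sample paths are made up for illustration; only the `java.net.URI` constructor and `URI.create` come from the library.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class DoubleEncodingDemo {
    // Builds a URI from components. The multi-argument constructors
    // always quote '%', so an existing escape gets encoded again.
    static String fromComponents(String path) {
        try {
            return new URI("http", "example.com", path, null, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Parses a complete URI string. Existing escapes are left untouched.
    static String fromString(String uri) {
        return URI.create(uri).toString();
    }

    public static void main(String[] args) {
        System.out.println(fromComponents("/a b"));     // http://example.com/a%20b
        System.out.println(fromComponents("/a%20b"));   // http://example.com/a%2520b
        System.out.println(fromString("http://example.com/a%20b")); // unchanged
    }
}
```

So the component constructors appear to expect raw, unencoded values, while the single-string constructor and `URI.create` expect an already encoded URI.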
JavaDoc on this function states:
- If a path is given then it is appended. Any character not in the unreserved, punct, escaped, or other categories, and not equal to the slash character ('/') or the commercial-at character ('@'), is quoted.
- If a query is given then a question-mark character ('?') is appended, followed by the query. Any character that is not a legal URI character is quoted.
- Finally, if a fragment is given then a hash character ('#') is appended, followed by the fragment. Any character that is not a legal URI character is quoted.
It states that unreserved, punct, and escaped characters are NOT quoted. The punct characters include:
!
#
$
&
'
(
)
*
+
,
;
=
:
According to RFC 3986 reserved characters are:
reserved = gen-delims / sub-delims
gen-delims = ":" / "/" / "?" / "#" / "[" / "]" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
So, if the characters `@`, `/` and `+` are reserved and should always be encoded (or am I missing something?) according to the most up-to-date RFC on URIs, then why does the java.net.URI JavaDoc state that it will not encode punct characters (which include `+` and `=`), `@` and `/`?
Here is a little example I ran:
String scheme = "http";
String userInfo = "username:password";
String host = "example.com";
int port = 80;
String path = "/path/t+/resource";
String query = "q=search+term";
String fragment = "section1";
URI uri = new URI(scheme, userInfo, host, port, path, query, fragment);
uri.toString(); // does not encode `+` in the path
I don't understand: if this is the correct behavior and those characters indeed don't need to be encoded, then why are they referred to as "reserved" in the RFC? I'm trying to implement a function that takes a whole URI string and encodes it (that is, extracts the path, query, and fragment, encodes the reserved characters in them, and puts the URI back together).
It is exactly because these characters are reserved that Java's API does not encode them. Being reserved means that they have special meaning when they are not escaped. The same section of the RFC you linked says that URIs differing only in whether a reserved character is percent-encoded are not equivalent.
If `java.net.URI` always escaped them, then you would not be able to express whatever special meaning the reserved characters have. You would only be able to create URIs in which they are percent-encoded, but not URIs in which they appear literally, and according to the RFC those can be URIs that mean different things.
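As a rough illustration of that difference, the sketch below compares a literal `/` with its escaped form `%2F`. The class name `ReservedMeaningDemo`, the helper `sameUri`, and the example URIs are hypothetical; `URI.create`, `getRawPath`, and `equals` are the standard `java.net.URI` methods.

```java
import java.net.URI;

public class ReservedMeaningDemo {
    // Returns true if java.net.URI considers the two URI strings equal.
    static boolean sameUri(String a, String b) {
        return URI.create(a).equals(URI.create(b));
    }

    // Returns the path exactly as written, without decoding escapes.
    static String rawPath(String uri) {
        return URI.create(uri).getRawPath();
    }

    public static void main(String[] args) {
        String literal = "http://example.com/a/b";   // '/' separates two segments
        String escaped = "http://example.com/a%2Fb"; // '/' is data inside one segment

        System.out.println(rawPath(literal));          // /a/b
        System.out.println(rawPath(escaped));          // /a%2Fb
        System.out.println(sameUri(literal, escaped)); // false
    }
}
```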
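Further down that section, it is also said that URI producing applications should percent-encode data octets that correspond to characters in the reserved set unless these characters are specifically allowed by the URI scheme to represent data in that component.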
In other words, if "these characters are specifically allowed by the URI scheme to represent data in that component", then "URI producing applications should NOT percent-encode data octets...". This is very much the case in the path component, which uses a subset of the reserved characters: `/`, `@`, `:`, and everything in "sub-delims".

This matches what the JavaDoc says about what it doesn't escape. Note that the wording in the JavaDoc (words like "escaped" and "punct") is actually from an older RFC, RFC 2396. With a bit of careful checking, you can see that they are indeed equivalent in this regard.
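A minimal sketch of how this plays out with the component constructor from the question, assuming a made-up helper `rawPathOf` and made-up sample paths. Characters that are allowed in a path are kept as they are, while characters that would change the structure of the URI (or are not legal at all) are quoted.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class PathQuotingDemo {
    // Builds a URI with the seven-argument constructor and returns the
    // path as it appears in the resulting URI string (not decoded).
    static String rawPathOf(String path) {
        try {
            return new URI("http", null, "example.com", 80, path, null, null)
                    .getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // '/', '@', ':' and sub-delims such as '+', ';', '=' are left alone.
        System.out.println(rawPathOf("/path/t+/a@b:c;d=e")); // /path/t+/a@b:c;d=e

        // A space is not a legal URI character, so it is quoted.
        System.out.println(rawPathOf("/a b"));               // /a%20b

        // '?' and '#' would end the path component, so they are quoted.
        System.out.println(rawPathOf("/a?b#c"));             // /a%3Fb%23c
    }
}
```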