I have to write a script in Perl that parses URIs from HTML. The real problem is how to resolve relative URIs.
I have a base URI (the base href in the HTML), for example http://a/b/c/d;p?q (the base used in the RFC 3986 examples), and several relative URIs:
/g, //g, ///g, ////g, h//g, g////h, h///g:f
Section 5.4.1 of the RFC gives an example only for //g:
"//g" = "http://g"
What about all the other cases? As far as I understand from RFC 3986, section 3.3, multiple slashes are allowed. So, is the following resolution correct?
"///g" = "http://a/b/c///g"
If not, what should it be? Can anyone explain this and back it up with a non-obsolete RFC or other documentation?
Update #1: Take a look at this working URL: https:///stackoverflow.com////////a/////10161264/////6618577
What's going on here?
I'll start by confirming that all the URIs you provided are valid, and by providing the outcome of the URI resolutions you mentioned (and the outcome of a couple of my own):
Next, we'll look at the syntax of relative URIs, since that's what your question revolves around.
The key things from these rules for answering your question:
- An absolute path (path-absolute) can't start with //. The first segment, if provided, must be non-zero in length. If the relative URI starts with //, what follows must be an authority.
- // can otherwise occur in a path because segments can have zero length.

Now, let's look at each of the resolutions you provided in turn.
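Before going case by case, the two syntax points above can be sketched as a small Python classifier. This is only an illustration: the function name classify is hypothetical, it assumes the input is a relative reference with no scheme, and it does not validate the characters inside each segment.

```python
def classify(ref):
    """Name the form of a scheme-less relative reference (RFC 3986, section 4.2).

    Assumption: `ref` has no scheme and contains only legal characters.
    """
    # Only the part before any query or fragment decides the form.
    path = ref.split('#', 1)[0].split('?', 1)[0]
    if path.startswith('//'):
        # A leading "//" always introduces an authority, even an empty one.
        return 'network-path'
    if path.startswith('/'):
        return 'path-absolute'
    if path == '':
        return 'path-empty'
    # Anything else is a relative path; "//" may still appear later in it,
    # because later segments are allowed to be empty.
    return 'path-noscheme'


for ref in ('/g', '//g', '///g', '////g', 'h//g', 'g////h', 'h///g:f'):
    print(ref, '->', classify(ref))
```

Under these assumptions, everything that starts with two slashes lands in the network-path branch, which is why ///g and ////g are handled like //g rather than like /g.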
/g is an absolute path (path-absolute), and thus a valid relative URI (relative-ref), and thus a valid URI (URI-reference).

Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
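As a rough sketch of this parsing step, the Appendix B regular expression can be applied with Python's re module. The helper name parse and the dictionary layout are my own choices, not something the RFC prescribes. A value of None stands for an undefined component, which is not the same thing as an empty string.

```python
import re

# The regular expression given in RFC 3986, Appendix B.
URI_RE = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)


def parse(ref):
    """Split a URI reference into components; None means "undefined"."""
    m = URI_RE.match(ref)  # every part is optional, so this always matches
    return {
        'scheme': m.group(2),
        'authority': m.group(4),
        'path': m.group(5),
        'query': m.group(7),
        'fragment': m.group(9),
    }


for uri in ('http://a/b/c/d;p?q', '/g'):
    print(uri, '->', parse(uri))
```

For the base URI this yields scheme http, authority a, path /b/c/d;p and query q. For /g only the path is defined.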
Following the algorithm in §5.2.2, we get:
Following the algorithm in §5.3, we get:
//g is different. //g isn't an absolute path (path-absolute) because an absolute path can't start with an empty segment ("/" [ segment-nz *( "/" segment ) ]).

Instead, it matches the following pattern:
Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
Following the algorithm in §5.2.2, we get the following:
Following the algorithm in §5.3, we get the following:
Note: This contacts server g!

///g is similar to //g, except the authority is blank! This is surprisingly valid.

Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
Following the algorithm in §5.2.2, we get the following:
Following the algorithm in §5.3, we get the following:
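The recomposition step of §5.3 can be sketched in Python as below. The function name recompose is hypothetical, and None is my convention for an undefined component. The values passed in are the ones the §5.2.2 step would produce for ///g and ////g against the base http://a/b/c/d;p?q: the scheme comes from the base, while the empty authority and the path come from the reference.

```python
def recompose(scheme, authority, path, query=None, fragment=None):
    """Rebuild a URI string from components (RFC 3986, section 5.3).

    None means "undefined"; an empty string is defined but empty.
    """
    result = ''
    if scheme is not None:
        result += scheme + ':'
    if authority is not None:
        # An empty authority still contributes the "//" marker.
        result += '//' + authority
    result += path
    if query is not None:
        result += '?' + query
    if fragment is not None:
        result += '#' + fragment
    return result


print(recompose('http', '', '/g'))    # target for ///g  -> http:///g
print(recompose('http', '', '//g'))   # target for ////g -> http:////g
print(recompose('http', None, '/g'))  # undefined authority -> http:/g
```

The last call is there only for contrast: an undefined authority and an empty authority recompose differently.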
Note: While valid, this URI is useless because the server name (T.authority) is blank!

////g is the same as ///g except the R.path is //g, so we get

Note: While valid, this URI is useless because the server name (T.authority) is blank!

The final three (h//g, g////h, h///g:f) are all relative paths (path-noscheme).

Parsing the URIs (say, using the regular expression in Appendix B) gives us the following:
Following the algorithm in §5.2.2, we get the following:
Following the algorithm in §5.3, we get the following:
I don't think the examples are suitable for answering what I think you really want to know, though.
Take a look at the following two URIs. They aren't equivalent.
and
Most servers will treat them the same (which is fine, since servers are free to interpret paths in any way they wish), but it makes a difference when applying relative paths. For example, if each of these were used as the base URI for ../../e, you'd get
and
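To see the effect concretely, here is a minimal Python sketch of the strict RFC 3986 resolution steps (§5.2.2 to §5.2.4 and §5.3). The base URIs http://host/a/b/c/d and http://host/a/b/c//d are hypothetical stand-ins chosen for illustration, and all function names are my own. It is a sketch under those assumptions, not a replacement for a tested URI library.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?'
)


def parse(ref):
    """Return (scheme, authority, path, query, fragment); None = undefined."""
    m = URI_RE.match(ref)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def remove_dot_segments(path):
    """Section 5.2.4. Empty segments are kept; only "." and ".." are removed."""
    out = []
    while path:
        if path.startswith('../'):
            path = path[3:]
        elif path.startswith('./'):
            path = path[2:]
        elif path.startswith('/./'):
            path = path[2:]
        elif path == '/.':
            path = '/'
        elif path.startswith('/../'):
            path = path[3:]
            if out:
                out.pop()
        elif path == '/..':
            path = '/'
            if out:
                out.pop()
        elif path in ('.', '..'):
            path = ''
        else:
            # Move one segment (with its leading "/", if any) to the output.
            end = path.find('/', 1)
            if end == -1:
                end = len(path)
            out.append(path[:end])
            path = path[end:]
    return ''.join(out)


def merge(base_authority, base_path, ref_path):
    """Section 5.2.3: replace everything after the last "/" of the base path."""
    if base_authority is not None and base_path == '':
        return '/' + ref_path
    return base_path[:base_path.rfind('/') + 1] + ref_path


def resolve(base, ref):
    """Section 5.2.2 (strict), followed by recomposition per section 5.3."""
    b_scheme, b_auth, b_path, b_query, _ = parse(base)
    scheme, auth, path, query, fragment = parse(ref)
    if scheme is not None:
        path = remove_dot_segments(path)
    elif auth is not None:
        scheme, path = b_scheme, remove_dot_segments(path)
    elif path == '':
        scheme, auth, path = b_scheme, b_auth, b_path
        if query is None:
            query = b_query
    else:
        if not path.startswith('/'):
            path = merge(b_auth, b_path, path)
        scheme, auth, path = b_scheme, b_auth, remove_dot_segments(path)
    result = ''
    if scheme is not None:
        result += scheme + ':'
    if auth is not None:
        result += '//' + auth
    result += path
    if query is not None:
        result += '?' + query
    if fragment is not None:
        result += '#' + fragment
    return result


# Hypothetical base URIs that differ only by one empty path segment.
print(resolve('http://host/a/b/c/d', '../../e'))   # http://host/a/e
print(resolve('http://host/a/b/c//d', '../../e'))  # http://host/a/b/e
```

In the second call the empty segment between c and d counts as a segment of its own, so the first .. only removes that empty segment and the second removes c, leaving /a/b/e. The same resolve function can be pointed at http://a/b/c/d;p?q with the references from the question to check the earlier cases.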