Below is the Ruby code I am using to get the HTML content of webpages. I am not allowed to change this code.
def getHtmlFromUrl(url)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.read_timeout = 2
  html = http.get(uri.to_s)
  # ...
  # Handle any error that may have occurred (return nil)
  # ...
  return html.body
end
This code seems to have problems reading certain URLs that do not have trailing slashes. For example, an error occurs when I try to read http://drive.google.com, but not http://drive.google.com/. Why is this the case? I decided to implement a fix where I add a trailing slash to a domain if no path is specified. Is that a safe fix? Is it possible that an error occurs for http://somedomain.com/ while http://somedomain.com works correctly?
You shouldn't have any problems always using a trailing slash: for a bare domain, an empty path and "/" refer to the same resource. Another option would be to follow redirects (drive.google.com is probably redirecting you to drive.google.com/).
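Since the method itself can't be changed, the trailing slash can be added to the URL before it is passed in. Here is a minimal sketch of that; the helper name ensure_root_path is made up for illustration and is not part of the original code:

```ruby
require 'uri'

# Hypothetical helper: returns the URL with "/" as its path when no path
# was given, and leaves every other URL unchanged.
def ensure_root_path(url)
  uri = URI.parse(url)
  uri.path = '/' if uri.path.nil? || uri.path.empty?
  uri.to_s
end
```

You would then call something like getHtmlFromUrl(ensure_root_path(url)). URLs that already have a path, including a bare "/", pass through untouched.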
See this answer (and comments) for more information on how to deal with redirects using Net::HTTP: https://stackoverflow.com/a/6934503/1691
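If you are able to fetch the page with separate code, following redirects could look roughly like the sketch below. It is modeled on the usual Net::HTTP redirect pattern, not on the linked answer specifically; the method name fetch_following_redirects and the default limit of 5 are assumptions made for this example:

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: fetches a URL, following up to `limit` redirects.
# Returns the response body on success, or nil for any other response.
def fetch_following_redirects(url, limit = 5)
  raise ArgumentError, 'too many HTTP redirects' if limit <= 0

  response = Net::HTTP.get_response(URI.parse(url))
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # The Location header may be relative, so resolve it against the current URL.
    next_url = URI.join(url, response['location']).to_s
    fetch_following_redirects(next_url, limit - 1)
  else
    nil
  end
end
```

The limit guards against redirect loops. Timeouts and network exceptions are not handled here and would need the same kind of error handling as your existing method.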