Ruby - Should I always append a slash?


Below is the Ruby code I am using to get the HTML content of webpages. I am not allowed to change this code.

require 'net/http'
require 'uri'

def getHtmlFromUrl(url)
    uri = URI.parse(url)
    http = Net::HTTP.new(uri.host, uri.port)
    http.read_timeout = 2
    html = http.get(uri.to_s)
    # ...
    # Handle any error that may have occurred (return nil)
    # ...
    return html.body
end

This code seems to have problems reading certain URLs that do not have a trailing slash. For example, an error occurs when I try to read http://drive.google.com, but not http://drive.google.com/. Why is this the case? I decided to implement a fix that adds a trailing slash to a domain whenever no path is specified. Is that a safe fix? Is it possible that an error occurs for http://somedomain.com/ but everything works correctly for http://somedomain.com?
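
To be concrete, the fix I have in mind looks roughly like the sketch below (normalize_url is just an illustrative name, not part of the code above):

require 'uri'

# Append "/" when the parsed URI has an empty path, so
# "http://drive.google.com" becomes "http://drive.google.com/".
def normalize_url(url)
  uri = URI.parse(url)
  uri.path = '/' if uri.path.empty?
  uri.to_s
end

normalize_url('http://drive.google.com')    # => "http://drive.google.com/"
normalize_url('http://drive.google.com/a')  # => "http://drive.google.com/a"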

There are 2 answers below.

First answer:

You shouldn't have any problems always using a trailing slash, but another option would be to follow redirects (drive.google.com is probably redirecting you to drive.google.com/).

See this answer (and its comments) for more information on how to deal with redirects using Net::HTTP: https://stackoverflow.com/a/6934503/1691
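
If you want to confirm that a redirect is what's happening, a quick check with Net::HTTP will show the status code and the Location header (a sketch; the exact values depend on the server):

require 'net/http'
require 'uri'

response = Net::HTTP.get_response(URI.parse('http://drive.google.com'))
puts response.code          # e.g. "301" or "302" for a redirect
puts response['location']   # the URL the server is redirecting to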

Second answer:

It really sounds like the problem is that you're not handling redirects. The Net::HTTP documentation explains how to handle them, and it's a pretty simple process:

Following Redirection

Each Net::HTTPResponse object belongs to a class for its response code.

For example, all 2XX responses are instances of a Net::HTTPSuccess subclass, a 3XX response is an instance of a Net::HTTPRedirection subclass, and a 200 response is an instance of the Net::HTTPOK class. For details of the response classes, see the "HTTP Response Classes" section of the Net::HTTP documentation.

Using a case statement you can handle various types of responses properly:

require 'net/http'

def fetch(uri_str, limit = 10)
  # You should choose a better exception.
  raise ArgumentError, 'too many HTTP redirects' if limit == 0

  response = Net::HTTP.get_response(URI(uri_str))

  case response
  when Net::HTTPSuccess then
    response
  when Net::HTTPRedirection then
    location = response['location']
    warn "redirected to #{location}"
    fetch(location, limit - 1)
  else
    response.value
  end
end

print fetch('http://www.ruby-lang.org')

That said, redirects are so common that a number of Ruby HTTP clients handle them for you, letting you concentrate on more important things like handling missing pages, timeouts, running multiple requests in parallel, and decoding JSON/XML/YAML. I'd recommend looking into these (in no particular order) to see what they have to offer; a short usage sketch follows the list:

  • Typhoeus: Typhoeus.get("www.example.com", followlocation: true)
  • HTTPClient: puts clnt.get('http://dev.ctor.org/', :follow_redirect => true)
  • Curb:

    follow_location = boolean → boolean
    

    Configure whether this Curl instance will follow Location: headers in HTTP responses. Redirects will only be followed to the extent specified by max_redirects.

  • HTTParty:

    Proceed to the location header when an HTTP response dictates a redirect. Redirects are always followed by default.

    Examples:

    class Foo
      include HTTParty
      base_uri 'http://google.com'
      follow_redirects true
    end
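
As a usage sketch, a one-off request with HTTParty looks roughly like this (follow_redirects is already true by default; it's only passed here to be explicit):

require 'httparty'

# Redirects are followed automatically, so the bare domain works fine.
html = HTTParty.get('http://drive.google.com', follow_redirects: true).body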