When to use 'http://' or 'http://www.' when scraping?


I am scraping a small number of sites with the Ruby Anemone gem.

Anemone.crawl("http://www.somesite.com") do |anemone|
  anemone.on_every_page do |page|
    ...
  end
end

Depending on the site, some require 'www' to be present in the URL while others require that it be omitted. How can I configure the crawler, or write my code, so that it knows which form of the URL to use?


the Tin Man (best answer)

You can't know in advance, so do what you would do while sitting in front of a browser.

Try one form of the URL: check that you get a connection, that the response is a 200, and that the page title doesn't have "error" in it. If none of those checks fail, consider it good.

If not, try the other.
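
A minimal sketch of that try-one-then-the-other check, using only Ruby's standard Net::HTTP. The helper names `page_ok?` and `pick_base_url` and the `fetcher:` parameter are made up for illustration and are not part of Anemone; the title check is a rough heuristic, not a reliable error detector.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: true if the page at url connects, returns 200,
# and does not have "error" in its <title>.
def page_ok?(url, fetcher: ->(u) { Net::HTTP.get_response(URI(u)) })
  response = fetcher.call(url)
  return false unless response.code == '200'

  # Pull the title text out with a simple regex; good enough for a heuristic.
  title = response.body.to_s[/<title[^>]*>(.*?)<\/title>/mi, 1].to_s
  !title.downcase.include?('error')
rescue SocketError, SystemCallError, Timeout::Error
  # DNS failure, refused connection, or timeout: treat as "not usable".
  false
end

# Hypothetical helper: tries the www form first, then the bare domain.
# Returns the first URL that passes page_ok?, or nil if neither does.
def pick_base_url(domain, fetcher: ->(u) { Net::HTTP.get_response(URI(u)) })
  ["http://www.#{domain}", "http://#{domain}"].find do |url|
    page_ok?(url, fetcher: fetcher)
  end
end
```

The result of `pick_base_url("somesite.com")` could then be handed to `Anemone.crawl`, after checking that it is not nil. The `fetcher:` argument is only there so the check can be exercised without touching the network.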

The problem with using a canned spider/crawler is that you have to work around its code when the situation differs from what its authors expected when they wrote the software.

Casper

Most sites automatically redirect www.somesite.com to somesite.com, or the other way around, so you should not have to worry about that.

I would think Anemone can handle redirects(?). But if it can't, then I suggest you pre-check the URLs for redirects before you hand them over to Anemone. See this question for how to do that:

How can I get the final URL after redirects using Ruby?

I.e.:

final_url = check_base_url_for_redirect('www.somesite.com')
Anemone.crawl(final_url) ...
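
One possible body for the `check_base_url_for_redirect` helper used above, sketched with Ruby's standard Net::HTTP. This is an assumption about what such a helper might look like, not code from the linked question; the `limit:` and `fetcher:` parameters are invented for illustration, and the helper prepends `http://` when the scheme is missing so that the call above works as written.

```ruby
require 'net/http'
require 'uri'

# Follows HTTP redirects starting at url and returns the final URL as a string.
# limit caps the number of requests so a redirect loop cannot run forever.
def check_base_url_for_redirect(url, limit: 5,
                                fetcher: ->(u) { Net::HTTP.get_response(URI(u)) })
  # Accept a bare host such as 'www.somesite.com' by assuming plain HTTP.
  url = "http://#{url}" unless url.start_with?('http://', 'https://')

  limit.times do
    response = fetcher.call(url)
    location = response['location']

    # Anything other than a 3xx with a Location header is the final stop.
    return url unless response.code.start_with?('3') && location

    # Location may be relative, so resolve it against the current URL.
    url = URI.join(url, location).to_s
  end

  raise "gave up after #{limit} requests, last URL: #{url}"
end
```

This only reports where the redirects end; it does not check that the final page is a 200, so it may be worth combining with a status check before crawling.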