How to handle 500 Internal Server Error and 404 Page Not Found with Anemone, Boilerpipe and Nokogiri

I'm implementing a tool that needs to crawl a website. I'm using Anemone to crawl, and for each page Anemone returns I'm using Boilerpipe and Nokogiri to process the HTML.

My problem is: if I get a 500 Internal Server Error, Nokogiri fails because there is no page.

Anemone.crawl(name) do |anemone|
  anemone.on_every_page do |page|
    if not (page.nil? && page.not_found?)
      result = Boilerpipe.extract(page.url, {:output => :htmlFragment, :extractor => :ArticleExtractor})
      doc = Nokogiri::HTML.parse(result)
    end
  end
end

In the case above, if there is a 500 Internal Server Error, the application raises an error in Nokogiri::HTML.parse(). I want to avoid this: if the server returns an error, I want to skip that page and continue with the rest of the crawl.

Is there any way to handle 500 Internal Server Error and 404 Page Not Found with these tools?

Kind regards, Hugo

There are 2 best solutions below

1
Bala On

I ran into a similar problem. The question and its answer are here:

How to handle 404 errors with Nokogiri
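
The linked thread is not reproduced here. One common pattern for this situation, sketched below, is to rescue the HTTP error and skip the page. This assumes the fetch raises OpenURI::HTTPError, which is what open-uri does on 4xx and 5xx responses; whether Boilerpipe.extract raises the same class is not confirmed. The helper name fetch_or_skip is hypothetical, and the failing fetch is simulated so the sketch runs without network access.

```ruby
require 'open-uri'

# Hypothetical helper: yields the URL to the block and returns the block's
# result, or nil if the block raises an HTTP error (e.g. 404 or 500).
def fetch_or_skip(url)
  yield url
rescue OpenURI::HTTPError => e
  warn "Skipping #{url}: #{e.message}"
  nil
end

# Simulated failing fetch (assumption: a real fetch would raise the same way).
html = fetch_or_skip("http://example.com/broken") do |_url|
  raise OpenURI::HTTPError.new("500 Internal Server Error", nil)
end

puts "page skipped" if html.nil?
```

In the crawler, the block would contain the extraction and parsing calls, and a nil result would mean "move on to the next page".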

4
davegson On
require 'net/http'

# get the response for the link (assumes this runs inside a loop over URLs)
res = Net::HTTP.get_response(URI.parse(url))

# if it returns a good code (good codes are between 200 and 399)
if res.code.to_i >= 200 && res.code.to_i < 400
  # do something with the url
else
  # skip this url
  next
end
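
The status check above can be pulled into a small predicate so the same test covers both 404 and 500. This is only a sketch: good_status? and the pages hash are hypothetical names, and the hash stands in for crawled pages so the code runs without a network connection.

```ruby
# Hypothetical predicate: true for 2xx and 3xx codes, false otherwise.
# Accepts integers or strings; nil converts to 0 and is therefore rejected.
def good_status?(code)
  (200...400).cover?(code.to_i)
end

# Stand-in data for crawled pages (assumption): url => HTTP status code.
pages = {
  "http://example.com/"        => 200,
  "http://example.com/missing" => 404,
  "http://example.com/broken"  => "500"
}

processed = []
pages.each do |url, code|
  next unless good_status?(code) # skip error pages and keep going
  processed << url
end

puts processed
```

Inside the on_every_page block from the question, the same guard could be written as "next unless good_status?(page.code)", assuming Anemone's page object exposes the response status through its code accessor, so that Boilerpipe and Nokogiri only see pages that were fetched successfully.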