How to avoid getting blocked by websites when using Ruby Mechanize for web crawling


I am successfully scraping building data from a website (www.propertyshark.com) using a single address, but it looks like I get blocked once I use a loop to scrape multiple addresses. Is there a way around this? FYI, the information I'm trying to access is not prohibited according to their robots.txt.

The code for a single run is as follows:

require 'mechanize'

class PropShark
  def initialize(key, link_key)
    @key = key
    @link_key = link_key
    @result_hash = {} # must exist before []= is called on it below
  end

  def crawl_propshark_single
    agent = Mechanize.new { |a| a.user_agent_alias = 'Mac Safari' }
    agent.ignore_bad_chunking = true
    agent.verify_mode = OpenSSL::SSL::VERIFY_NONE

    page = agent.get('https://www.google.com/')
    form = page.forms.first
    form['q'] = @key
    page = agent.submit(form)
    page.links.each do |link|
      next unless link.text.include?(@link_key)
      next unless link.text.include?("PropertyShark")

      property_page = link.click
      next unless property_page

      data_value = property_page.css("div.cols").css("td.r_align")[4].text # <--- error points to these commands
      data_name = property_page.css("div.cols").css("th")[4].text
      @result_hash[data_name] = data_value
    end

    @result_hash
  end
end # endof: class PropShark

# run
key = '41 coral St, Worcester, MA 01604 propertyshark'
key_link = '41 Coral Street'
spider = PropShark.new(key,key_link)
puts spider.crawl_propshark_single

I get the following error, but after an hour or two it disappears:

undefined method `text' for nil:NilClass (NoMethodError)

When I loop over multiple addresses using the above code, I delay the process with sleep 80 between addresses.


There are 2 best solutions below

Answer 1:

The first thing you should do, before anything else, is contact the website owner(s). Right now, your actions could be interpreted as anything from overly aggressive to illegal. As others have pointed out, the owners may not want you scraping the site. Alternatively, they may have an API or product feed available for exactly this purpose. Either way, if you are going to depend on this website for your product, you may want to consider playing nice with them.

With that said, you are moving through their website with all the grace of an elephant in a china shop. Between the abnormal user agent, unusual usage patterns from a single IP, and a predictable delay between requests, you've completely blown your cover. Consider taking a more organic path through the site, with a more natural, human-emulating delay. Also, you should either disguise your user agent or make it super obvious (Josh's Big Bad Scraper). You might even consider using something like Selenium, which drives a real browser, instead of Mechanize, to give away fewer hints.
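To make the request rhythm less predictable, the fixed sleep 80 can be replaced with a base pause plus random jitter, and the user agent can vary between sessions. This is only a sketch: the 30–90 second range is an assumption, and the alias strings are standard Mechanize `user_agent_alias` names.

```ruby
# Sketch: a base pause plus random jitter, so requests are not evenly
# spaced. The 30..90 second range is an assumption, not a known threshold.
def human_delay(base: 30, jitter: 60)
  base + rand * jitter # seconds; rand returns a float in [0, 1)
end

# Standard Mechanize user_agent_alias names; pick one per session:
USER_AGENTS = ['Mac Safari', 'Linux Firefox', 'Windows Mozilla'].freeze

# Usage inside the crawl loop (hypothetical):
#   agent.user_agent_alias = USER_AGENTS.sample
#   sleep human_delay
```

Varying both the pacing and the fingerprint removes the two most obvious machine signatures: perfectly even intervals and an unchanging agent string.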

You should also consider adding more robust error handling. Perhaps the site is under excessive load (or something), and the page you are parsing is not the desired page but some random error page. A simple retry may be all you need to get the data in question. When scraping, a poorly functioning or inefficient site can be as much of an impediment as deliberate scraping protections.
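A simple retry can be wrapped in a small helper like the one below. This is a generic sketch — the name `with_retries` and the backoff numbers are made up for illustration, not part of Mechanize.

```ruby
# Retry a block up to `attempts` times, sleeping a little longer after
# each failure, then re-raise the error if it still fails.
def with_retries(attempts: 3, wait: 5)
  tries = 0
  begin
    yield
  rescue StandardError
    tries += 1
    raise if tries >= attempts
    sleep wait * tries # linear backoff: wait, 2*wait, ...
    retry
  end
end

# Usage (hypothetical): re-fetch and re-parse the page on failure:
#   row = with_retries { link.click.css("div.cols").css("td.r_align")[4].text }
```

If the failing page really was a transient error page, the second or third attempt usually lands on the real one.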

If none of that works, you could consider setting up elaborate arrays of proxies, but at that point you would be much better off using one of the many web-scraping/API-creation/data-extraction services that currently exist. They are fairly inexpensive and already do everything discussed above, plus more.

Answer 2:

It is very likely nothing is "blocking" you. As you pointed out

property_page.css("div.cols").css("td.r_align")[4].text

is the problem. So let's focus on that line of code for a second.

Say the first time around your columns are columns = [1,2,3,4,5]; then columns[4] will return 5 (the element at index 4).

Now, for fun, let's assume the next time around your columns are columns = ['a','b','c','d']; then columns[4] will return nil because there is nothing at index 4.

This appears to be your case: sometimes there are 5 columns and sometimes there are not, leading to nil.text and the error you are receiving.
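The fix is to guard against the missing cell before calling .text. A minimal sketch with plain arrays (strings stand in for Nokogiri nodes here, so upcase stands in for text):

```ruby
five_cols = %w[a b c d e]
four_cols = %w[a b c d]

five_cols[4]  # => "e"  (index 4 exists)
four_cols[4]  # => nil  (index 4 is past the end)

# Safe navigation (&.) short-circuits on nil instead of raising:
cell  = four_cols[4]
value = cell&.upcase || 'N/A'
```

In the scraper this becomes property_page.css("div.cols").css("td.r_align")[4]&.text, or an explicit nil check followed by next, so a page with fewer columns is skipped instead of crashing the loop.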