Scrubyt gives 404 error when clicking a link using the _detail method

This might be similar to my two earlier questions (see here and here), but I'm trying to use the _detail command to automatically click each link so I can scrape the details page for each individual event.

The code I'm using is:

require 'rubygems'
require 'scrubyt'

nuffield_data = Scrubyt::Extractor.define do
  fetch 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'

  event do
    title 'The Coast of Mayo'
    link_url
    event_detail do
      dates "1-4 October"
      times "7:30pm"
    end
  end

  next_page "Next Page", :limit => 20
end

nuffield_data.to_xml.write($stdout,1)

Is there any way to print out the URL that event_detail is trying to access? The error doesn't seem to include the URL that returned the 404.

Update: I think the link may be a relative link - could this be causing problems? Any ideas how to deal with that?

There are 4 best solutions below

0
On BEST ANSWER

I had the same issue with relative links and fixed it like this: you have to set the :resolve param to the correct base URL.

  event do
    title 'The Coast of Mayo'
    link_url
    event_detail :resolve => 'http://www.nuffieldtheatre.co.uk/cn/events' do
      dates "1-4 October"
      times "7:30pm"
    end
  end
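
To see why the base URL matters, here is a minimal sketch using Ruby's standard URI library, not scrubyt itself. The relative href 'event_detail.php?id=1' is made up for illustration, and resolve_link is a hypothetical helper. scrubyt's :resolve option may combine the base and the link differently from URI.join, so treat this only as a way to reason about relative links in general.

```ruby
require 'uri'

# Resolve a link found on a page against a base URL.
def resolve_link(base_url, href)
  URI.join(base_url, href).to_s
end

listing = 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'

# Hypothetical relative href, for illustration only.
href = 'event_detail.php?id=1'

# Resolved against the listing page, the link stays in /cn/events/.
puts resolve_link(listing, href)

# With URI.join, a trailing slash on the base directory matters:
puts resolve_link('http://www.nuffieldtheatre.co.uk/cn/events/', href)
puts resolve_link('http://www.nuffieldtheatre.co.uk/cn/events', href)
```

With URI.join, the last call drops the "events" segment because the base has no trailing slash. If the detail pages still return a 404 after setting :resolve, comparing the URL scrubyt requests with what you expect is a reasonable next step.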
0
On
    sudo gem install ruby-debug

This will give you access to a nice Ruby debugger. Start the debugger by altering your script:

    require 'rubygems'
    require 'ruby-debug'
    Debugger.start
    Debugger.settings[:autoeval] = true if Debugger.respond_to?(:settings)

    require 'scrubyt'

    nuffield_data = Scrubyt::Extractor.define do
      fetch 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'

      event do
        title 'The Coast of Mayo'
        link_url
        event_detail do
          dates "1-4 October"
          times "7:30pm"
        end
      end

      next_page "Next Page", :limit => 2

    end

    nuffield_data.to_xml.write($stdout,1)

Then find out where scrubyt is throwing an exception - in this case:

    /Library/Ruby/Gems/1.8/gems/scrubyt-0.3.4/lib/scrubyt/core/navigation/fetch_action.rb:52:in `fetch'

Find the scrubyt gem on your system, and add a rescue clause to the method in question so that the end of the method looks like this:

      if @@current_doc_protocol == 'file'
        @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(open(@@current_doc_url).read))
      else
        @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc.body))
        store_host_name(self.get_current_doc_url)   # in case we're on a new host
      end
    rescue
      debugger
      self # the self is here because debugger doesn't like being at the end of a method
    end

Now run the script again and you should be dropped into a debugger when the exception is raised. Try typing this at the debug prompt to see what the offending URL is:

    @@current_doc_url

You can also add a debugger statement anywhere in that method if you want to check what is going on. For example, you may want to add one between lines 51 and 52 of this method to check how the URL being fetched changes and why.
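
If you would rather not use an interactive debugger, the same rescue idea can simply print the URL and re-raise. The following is a generic sketch, not scrubyt code: with_url_reporting and fake_fetch are hypothetical names, and fake_fetch stands in for whatever call actually performs the request.

```ruby
# Run a block that fetches `url`; if it raises, report the URL and re-raise.
def with_url_reporting(url)
  yield url
rescue StandardError => e
  warn "Fetch failed for #{url}: #{e.class}: #{e.message}"
  raise
end

# Hypothetical stand-in for a real HTTP request.
fake_fetch = lambda do |url|
  raise IOError, '404 Not Found' if url.end_with?('missing.php')
  "body of #{url}"
end

begin
  with_url_reporting('http://example.com/missing.php') { |u| fake_fetch.call(u) }
rescue IOError
  # The failing URL has already been written to stderr.
end
```

The re-raise keeps the original exception and backtrace intact, so the script still fails in the same place, just with the URL visible.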

This is basically how I figured out the answer to your previous questions.

Good luck.

0
On

I've tried to access doc_url, but that also seems to return nil. When I have access to my server (later in the day) I'll post the code with the debugging bit in it.

0
On

Sorry, I have no idea why this would be nil. Every time I have run this it returns a URL. The method self.fetch requires a URL, which you should be able to access as the local variable doc_url. If this also returns nil, maybe you should post the code where you have included the debugger call.