I need to collect the "title" of every page on a site.
The site is protected by HTTP Basic Auth.
Without auth, I do the following:
require 'anemone'

Anemone.crawl("http://example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.doc.at('title').inner_html rescue nil
  end
end
But I run into a problem with HTTP Basic Auth.
How can I collect the titles from a site that uses HTTP Basic Auth?
If I use Anemone.crawl("http://username:password@example.com/"), I only get the title of the first page. The other links have the plain http://example.com/ form, and I get a 401 error for them.
HTTP Basic Auth works via HTTP headers. A client that wants to access a restricted resource must send an Authorization header of the form "Authorization: Basic <credentials>".
Here <credentials> is the string "username:password", Base64-encoded. More info is in the Wikipedia article: Basic Access Authentication.
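To make the header concrete, here is a small Ruby sketch that builds the value by hand and compares it with what the basic_auth helper on a Net::HTTP request object produces. The credentials 'username' and 'password' are placeholders, not values from the question.

```ruby
require 'net/http'

# Placeholder credentials, for illustration only.
username = 'username'
password = 'password'

# Build the header value by hand: "Basic " + Base64("username:password").
# pack('m0') is Base64 encoding without a trailing newline.
manual_value = 'Basic ' + ["#{username}:#{password}"].pack('m0')

# Let Net::HTTP set the same header on a request object (no network I/O here).
request = Net::HTTP::Get.new('/')
request.basic_auth(username, password)

puts manual_value
puts request['Authorization'] == manual_value
```

Base64 is only an encoding, so anyone who sees the header can read the credentials; over plain http:// they travel effectively in the clear.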
I googled a little bit and didn't find a way to make Anemone accept custom request headers. Maybe you'll have more luck.
But I found another crawler that claims it can do it: Messie. Maybe you should give it a try.
Update
Here's the place where Anemone sets its request headers: Anemone::HTTP. Indeed, there's no customization there. You can monkeypatch it. Something like this should work (put this somewhere in your app):
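One possible shape for such a patch is sketched here one level lower, on Ruby's Net::HTTP rather than on Anemone::HTTP itself. This assumes Anemone issues its requests through Net::HTTP. The module name BasicAuthForHost, the host, and the credentials are all hypothetical placeholders, so treat this as a rough sketch rather than a drop-in fix.

```ruby
require 'net/http'

# Hypothetical module name; host and credentials are placeholders.
module BasicAuthForHost
  HOST     = 'example.com'
  USERNAME = 'username'
  PASSWORD = 'password'

  # Adds the Authorization header when the request targets HOST
  # and does not already carry one.
  def self.authorize(req, host)
    req.basic_auth(USERNAME, PASSWORD) if host == HOST && req['Authorization'].nil?
    req
  end

  # Wraps Net::HTTP#request; `address` is the host this connection talks to.
  def request(req, body = nil, &block)
    BasicAuthForHost.authorize(req, address)
    super
  end
end

# Put the wrapper in front of the original Net::HTTP#request.
Net::HTTP.prepend(BasicAuthForHost)
```

If this is loaded before Anemone.crawl("http://example.com/") runs, requests to example.com should carry the credentials, while requests to other hosts are left alone. The host check is there so the password is not sent to every site the process talks to.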
Obviously, you should provide your own values for the username and password params to the basic_auth method call. It's quick and dirty and hardcoded, yes. But sometimes you don't have time to do things in a proper manner. :)