Anemone Ruby spider - create key value array without domain name

Question

Asked by boldfacedesignuk on 23 October 2013 at 11:55

I'm using Anemone to spider a domain and it works fine.

The code to initiate the crawl looks like this:

require 'anemone'

Anemone.crawl("http://www.example.com/") do |anemone|
  anemone.on_every_page do |page|
    puts page.url
  end
end

This very nicely prints out all the page URLs for the domain, like so:

http://www.example.com/
http://www.example.com/about
http://www.example.com/articles
http://www.example.com/articles/article_01
http://www.example.com/contact

What I would like to do is create an array of key/value pairs, using the last part of the URL as the key and the URL minus the domain as the value.

E.g.

[
   ['','/'],
   ['about','/about'],
   ['articles','/articles'],
   ['article_01','/articles/article_01']
]

Apologies if this is rudimentary stuff but I'm a Ruby novice.

Original Q&A

**mcfinnigan** · Answer 1 · 2013-10-23T11:59:28.783000

The simplest and possibly least robust way to do this would be to use

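page.url.to_s.split('/').last

to obtain your 'key'. Anemone hands you page.url as a URI object, so it has to be converted to a string before calling split. You would need to test various edge cases to ensure it works reliably.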

Edit: this returns 'www.example.com' as the key for 'http://www.example.com/', which is not the result you want.
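
One way around the root-URL edge case is to split the path instead of the whole URL. The following is only a sketch: key_and_path is a hypothetical helper name (it is not part of Anemone), and it assumes the URL is absolute and parseable by Ruby's standard uri library.

```ruby
require 'uri'

# Hypothetical helper, not part of Anemone: returns [key, path] for one URL.
# Accepts a String or a URI object (Anemone's page.url is a URI).
def key_and_path(url)
  path = URI(url.to_s).path
  path = '/' if path.empty?        # "http://www.example.com" has an empty path
  key = path.split('/').last.to_s  # '/'.split('/') is [], so the key becomes ""
  [key, path]
end

p key_and_path('http://www.example.com/')
p key_and_path('http://www.example.com/articles/article_01')
```

Inside the crawl block this could be called as key_and_path(page.url). Query strings and fragments are ignored here because only the path component is used.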

**Sean Larkin** · Answer 2 · 2013-10-23T12:33:46.077000

I would define an array or hash first outside of the block of code and then add your key value pairs to it:

require 'anemone'

path_array = []
crawl_url = "http://www.example.com/"    

Anemone.crawl(crawl_url) do |anemone|
  anemone.on_every_page do |page|
    path_array << page.url.to_s  # page.url is a URI object; store it as a String
    puts page.url
  end
end

From here you can then .map your array into a useable multi-dimensional array:

path_array.map { |x| [x[crawl_url.length..-1], x.sub(crawl_url.chomp("/"), "")] }

=> [["", "/"], ["about", "/about"], ["articles", "/articles"], ["articles/article_01", "/articles/article_01"], ["contact", "/contact"]]

I'm not sure this will work in every scenario, but it should give you a good start on how to collect the data and manipulate it. Note that the key here is the full path without its leading slash (so "articles/article_01" rather than "article_01"). Also, if you want key/value pairs, you should look into Ruby's Hash class for more information on how to create and use hashes.
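
As a rough illustration of the Hash suggestion, here is a sketch that turns a list of collected URLs into a Hash keyed by the last path segment. The urls array stands in for whatever the crawl collected, and lookup is just an illustrative variable name; Array#to_h needs Ruby 2.1 or later.

```ruby
require 'uri'

# Stand-in for the URLs collected during the crawl (as Strings).
urls = [
  'http://www.example.com/',
  'http://www.example.com/about',
  'http://www.example.com/articles',
  'http://www.example.com/articles/article_01',
  'http://www.example.com/contact'
]

# Build [key, path] pairs: key is the last path segment, value is the path.
pairs = urls.map do |url|
  path = URI(url).path
  [path.split('/').last.to_s, path]
end

# Convert the pairs to a Hash for lookup by key.
lookup = pairs.to_h

puts lookup['article_01']
```

One caveat with a Hash: if two pages share the same last segment (say /articles/index and /news/index), the later pair overwrites the earlier one, so the array of pairs is the safer structure when keys may repeat.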
