How to go through array of URLs using Curb


I need to parse this page https://www.petsonic.com/snacks-huesos-para-perros/ and extract information from every item (name, price, image, etc.). The problem is that I don't know how to work through the array of URLs. If I were using 'open-uri' I would do something like this:

require 'nokogiri'
require 'open-uri'


page = "https://www.petsonic.com/snacks-huesos-para-perros/"

doc = Nokogiri::HTML(open(page))
links = doc.xpath('//a[@class="product-name"]/@href')

links.to_a.each do |url|
  doc2 = Nokogiri::HTML(open(url.to_s))
  text = doc2.xpath('//a[@class="product-name"]').text
  puts text
end

However, I am only allowed to use 'curb', and that is what is confusing me.


Accepted answer (by lacostenycoder):

You can use the curb gem:

gem install curb

Then, in your Ruby script:

require 'curb'

# Fetch the listing page with Curb and work on the raw HTML string.
page = "https://www.petsonic.com/snacks-huesos-para-perros/"
str = Curl.get(page).body

# Grab every <a> tag, keep only the ones whose class is product-name,
# then pull out the text after the closing > of the opening tag.
links = str.scan(/<a(.*?)<\/a\>/).flatten.select{|l| l[/class\=\"product-name/]}
inner_text_of_links = links.map{|l| l[/(?<=>).*/]}
puts inner_text_of_links

The hard part of this was the regex, so let's break it down. To get the links we scan the string for <a> tags, which returns the captures as an array of arrays, and flatten that into a single array.

str.scan(/<a(.*?)<\/a\>/)
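As an aside (not part of the original answer): scan with a capture group returns one inner array per match, which is why the flatten call is needed. A quick illustration with a made-up HTML fragment:

html = '<a class="product-name" href="/a">Snack A</a><a class="foo" href="/b">Other</a>'

html.scan(/<a(.*?)<\/a\>/)
#=> [[" class=\"product-name\" href=\"/a\">Snack A"], [" class=\"foo\" href=\"/b\">Other"]]

html.scan(/<a(.*?)<\/a\>/).flatten
#=> [" class=\"product-name\" href=\"/a\">Snack A", " class=\"foo\" href=\"/b\">Other"]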

Then we select the items which match our pattern. We are looking for the class you specified.

.select{|l| l[/class\=\"product-name/]}

Now, to get the inner text of each tag, we map over the links using a look-behind regex:

inner_text_of_links = links.map{|l| l[/(?<=>).*/]}
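For the original goal of visiting each product page, only the fetching has to change: Curb replaces open-uri, while Nokogiri can still do the parsing exactly as in the question. A minimal sketch along those lines, assuming the listing's href values are absolute URLs and that each product page carries the name in an <h1> (neither assumption verified against the live site):

require 'curb'
require 'nokogiri'

page = "https://www.petsonic.com/snacks-huesos-para-perros/"

# Fetch with Curb, parse with Nokogiri, mirroring the question's open-uri version.
doc = Nokogiri::HTML(Curl.get(page).body)
links = doc.xpath('//a[@class="product-name"]/@href').map(&:value)

links.each do |url|
  # Assumption: the hrefs are absolute; prepend the site root if they turn out to be relative.
  item = Nokogiri::HTML(Curl.get(url).body)

  # Assumption: the product name sits in an <h1>; swap in the real selectors for price, image, etc.
  name = item.xpath('//h1').first&.text&.strip
  puts name
end

If many product pages need to be fetched, curb also ships a Curl::Multi interface for running requests in parallel, but the sequential loop above is the simpler place to start.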