Using Ruby's Anemone Gem to Scrape All Email Addresses From a Site


I am trying to scrape all the email addresses on a given site using a single-file Ruby script. At the bottom of the file I have a hardcoded test case using a URL that has an email address listed on that specific page (so it should find an email address on the first iteration of the first loop).

For some reason, my regex does not seem to be matching:

#get_emails.rb
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'mechanize'
require 'uri'
require 'anemone'

class GetEmails

  def initialize
      @urlCounter, @anemoneCounter  = 0
      $allUrls, $emailUrls, $emails = []
  end


  def has_email?(listingUrl)
   hasListing = false
   Anemone.crawl(listingUrl) do |anemone|
      anemone.on_every_page do |page|
      body_text = page.body.to_s
      matchOrNil = body_text.match(/\A[^@\s]+@[^@\s]+\z/)
       if matchOrNil != nil
        $emailUrls[$anemoneCounter] = listingUrl
        $emails[$anemoneCounter] = body_text.match
        $anemoneCounter += 1
        hasListing = true
      else 
      end
    end
   end
   return hasListing
  end

end 

emailGrab = GetEmails.new()
emailGrab.has_email?("http://genuinestoragesheds.com/contact/")
puts $emails[0]

There are 2 best solutions below

dezull On BEST ANSWER

\A and \z in your regex match the beginning and end of the string, respectively. Obviously that web page contains more than just an email string, or you wouldn't do the regex test at all.

You can simplify it to just /[^@\s]+@[^@\s]+/, but you would still need to clean up the matched string to extract the email.
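
The effect of the anchors can be seen in a small sketch. The SAMPLE_HTML string here is a made-up stand-in for a page body (an assumption, not the actual content of the site in the question), and the constant names are only for illustration:

```ruby
# Made-up fragment standing in for page.body (assumption, not the real page).
SAMPLE_HTML = 'Contact: <a href="mailto:info@example.com">Email us</a>'

ANCHORED   = /\A[^@\s]+@[^@\s]+\z/  # must match the entire string
UNANCHORED = /[^@\s]+@[^@\s]+/      # may match anywhere inside the string

# The anchored pattern returns nil: the string holds whitespace and other
# text around the address, so it is not "just an email" from start to end.
ANCHORED_MATCH = SAMPLE_HTML.match(ANCHORED)

# The unanchored pattern matches, but the match also includes the
# non-whitespace markup on either side of the address.
UNANCHORED_MATCH = SAMPLE_HTML.match(UNANCHORED)

puts ANCHORED_MATCH.inspect
puts UNANCHORED_MATCH[0]
```

The second line printed still carries the href attribute and part of the tag, which is why some cleanup is needed after the match.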

HMLDude On

So here is the working version of the code. It uses a single regex to find a string containing an email and three more to clean it.

#get_emails.rb
require 'rubygems'
require 'open-uri'
require 'nokogiri'
require 'mechanize'
require 'uri'
require 'anemone'

class GetEmails

  def initialize
      @urlCounter = 0
      $anemoneCounter  = 0
      $allUrls = []
      $emailUrls = []
      $emails = []
  end

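  def email_clean(email)
    email = email.gsub(/(\w+=)/, "")   # drop attribute prefixes such as href=
    email = email.gsub(/(\w+:)/, "")   # drop scheme prefixes such as mailto:
    # gsub rather than gsub!: gsub! returns nil when nothing is replaced
    email = email.gsub(/\A"|"\Z/, '')  # strip a leading or trailing double quote
    return email
  end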


  def has_email?(listingUrl)
    hasListing = false
    Anemone.crawl(listingUrl) do |anemone|
      anemone.on_every_page do |page|
        body_text = page.body.to_s
        #matchOrNil = body_text.match(/\A[^@\s]+@[^@\s]+\z/)
        matchOrNil = body_text.match(/[^@\s]+@[^@\s]+/)
        if matchOrNil != nil
          $emailUrls[$anemoneCounter] = listingUrl
          $emails[$anemoneCounter] = matchOrNil[0]  # store the matched text, not the MatchData
          $anemoneCounter += 1
          hasListing = true
        end
      end
    end
    return hasListing
  end

end 

emailGrab = GetEmails.new()
found_email = "href=\"mailto:[email protected]\""
puts emailGrab.email_clean(found_email)
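
Note that match only returns the first hit on each page. As a rough sketch of collecting every address instead, String#scan can be combined with a somewhat stricter pattern so that the surrounding markup is never captured. EMAIL_PATTERN and extract_emails are illustrative names, not part of the code above, and the pattern is a pragmatic approximation that will miss some valid addresses:

```ruby
# Deliberately simple approximation of an email address (assumption):
# a local part, "@", then a dotted domain ending in at least two letters.
EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/

# Hypothetical helper: returns every distinct address found in a text.
def extract_emails(text)
  text.scan(EMAIL_PATTERN).uniq
end

# Made-up sample text standing in for a page body.
sample = 'Sales: <a href="mailto:sales@example.com">sales@example.com</a>, ' \
         'support: help@example.org'
puts extract_emails(sample)
```

Inside the on_every_page block, something along the lines of $emails.concat(extract_emails(page.body.to_s)) could then replace the single match and the cleanup step, assuming duplicates across pages are handled separately.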