How to extract JS rendered HTML using Selenium-webdriver and nokogiri?

Consider two webpages, one and two. Site two is easy to scrape with Nokogiri because it doesn't rely on JS. Site one, however, cannot be scraped with Nokogiri alone. After a lot of searching, I found that if I load the page in an automated web browser, I can scrape the rendered HTML. I have the following code:

require 'selenium-webdriver'

# creates a browser instance
driver = Selenium::WebDriver.for :chrome

# opens an existing webpage
driver.get 'http://www.bigstub.com/search.aspx'

# wait is used to let the webpage load and let the JS render
wait = Selenium::WebDriver::Wait.new(:timeout => 5)

My question: I want to let the page load and then stop waiting as soon as the class I need appears. For example, if I set the timeout to 10 seconds, how would I write code that waits until it can find the class .title-holder?

Pseudocode: keep checking rendered_source_page and stop waiting once .include?("title-holder") is true, or give up when the timeout expires. I just don't know how to write it.

Kenkuts On BEST ANSWER

Regarding the headless question: Selenium lets you run Chrome headless by adding an argument to the Chrome options, as in the code below:

options = Selenium::WebDriver::Chrome::Options.new
options.add_argument('--headless')
driver = Selenium::WebDriver.for :chrome, options: options

As for my next question: to give the JS time to render the HTML before scraping, I set the wait timeout to 5 seconds:

wait = Selenium::WebDriver::Wait.new(:timeout => 5)
wait.until { /title-holder/.match(driver.page_source) }

wait.until re-runs the block until it returns a truthy value, here until title-holder appears in page_source (the rendered HTML), and gives up with a timeout error if that has not happened after 5 seconds. This solved all my questions.
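To make the polling behaviour concrete, here is a minimal sketch of the idea behind wait.until in plain Ruby. It is not Selenium's implementation: poll_until, PollTimeout, and the page_source lambda are hypothetical names used only for illustration, and the lambda stands in for driver.page_source so the example runs without a browser.

```ruby
# Hypothetical error class and helper; Selenium's own Wait class is not used here.
class PollTimeout < StandardError; end

# Runs the block repeatedly until it returns a truthy value,
# or raises PollTimeout once `timeout` seconds have passed.
def poll_until(timeout: 5, interval: 0.2)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    result = yield
    return result if result
    if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
      raise PollTimeout, "condition not met within #{timeout} seconds"
    end
    sleep interval
  end
end

# Stand-in for driver.page_source: the class only shows up on the third read,
# imitating a page whose JS has not finished rendering yet.
reads = 0
page_source = lambda do
  reads += 1
  reads >= 3 ? '<div class="title-holder">Show</div>' : '<div>loading</div>'
end

match = poll_until(timeout: 2, interval: 0.05) { /title-holder/.match(page_source.call) }
puts "found #{match[0]} after #{reads} reads"
```

The point is that the call returns as soon as the condition holds, so a 5 or 10 second timeout is only an upper bound, not a fixed delay.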

Vikram Sharma On

I am assuming you are running Selenium on a server, so first install Xvfb:

sudo apt-get install xvfb

Install firefox

sudo apt-get install firefox

Add the following two gems to your Gemfile. You need headless because you are running the Selenium WebDriver on a server without a display; the gem starts and stops Xvfb for you.

# Gemfile

gem 'selenium-webdriver'
gem 'headless'

Code for scraping

  require 'selenium-webdriver'
  require 'headless'

  headless = Headless.new
  headless.start
  driver = Selenium::WebDriver.for :firefox
  driver.navigate.to 'http://example.com'
  wait = Selenium::WebDriver::Wait.new(:timeout => 30)
  # scraping code goes here, e.g. wait.until { ... }

Housekeeping, so that leftover browser and Xvfb processes don't use up memory:

  driver.quit
  headless.destroy

Hope this helps.