Make NodeJS/JSDom wait for full rendering before scraping


I'm trying to scrape data from a website that I need to log into. Unfortunately, I'm getting different results with JSDom/NodeJS than I get in a web browser such as Firefox. In particular, I'm not getting the login form with the username field, password field and submit button.

I understand that much of JavaScript is asynchronous. However, I thought the "done" callback of JSDom only fires after the page has been fully rendered. What I'd like to do is simulate an HTTPS GET and wait until the document is fully ready (the equivalent of document.ready) before scraping.

var jsdom = require("jsdom");
var fs = require("fs");
// Local copy of jQuery, injected into the page through the `src` option.
var jquery = fs.readFileSync("./jquery-3.1.1.min.js", "utf-8");

jsdom.env({
  url: "https://wemc.smarthub.coop/Login.html#login:",
  src: [jquery],
  done: function (err, window) {
    // Report load failures instead of dereferencing an undefined window.
    if (err) {
      console.error(err);
      return;
    }
    var $ = window.$;
    if ($("button#LoginSubmitButton").length) {
      console.log('Click button found');
    } else {
      console.log('Click button not found');
    }
    // The following text boxes are not coming back:
    // $("input#LoginUsernameTextBox")
    // $("input#LoginPasswordTextBox")

    // If I enable the line below, I see a lot less than I would if I
    // do a view source in any reasonable browser.
    //console.log($("body").html());
  }
});

There is 1 solution below


Usually, this happens because JSDOM doesn't execute the page's own JavaScript when it loads the page. In that case, the only elements returned are those in the server-rendered HTML, so a login form that is built client-side never shows up.
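If you want to stay with JSDOM, one thing to try is switching script loading on explicitly. The sketch below only builds the options object for the legacy jsdom.env call used in the question (jsdom versions before 10). The feature names FetchExternalResources and ProcessExternalResources are my recollection of that API's README, so check them against the version you have installed; buildEnvOptions is a hypothetical helper name, not part of jsdom.

```javascript
// Hypothetical helper: builds options for the legacy jsdom.env API.
// Assumption: the `features` flags below match the README of the jsdom
// version in use (they belong to the pre-10 API, not the current one).
function buildEnvOptions(url, done) {
  return {
    url: url,
    features: {
      // Ask jsdom to download external <script> resources...
      FetchExternalResources: ["script"],
      // ...and to execute them inside the jsdom window.
      ProcessExternalResources: ["script"]
    },
    done: done
  };
}

// Usage sketch (needs the legacy jsdom package installed):
// var jsdom = require("jsdom");
// jsdom.env(buildEnvOptions(
//   "https://wemc.smarthub.coop/Login.html#login:",
//   function (err, window) {
//     if (err) { return console.error(err); }
//     console.log(window.document.body.innerHTML);
//   }
// ));
```

Even with scripts enabled, there is no guarantee the form exists by the time "done" fires, since the page may build it after further asynchronous requests.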

You could try a headless browser module such as PhantomJS and see whether that works for you. There's a section about the distinction between the two near the bottom of the JSDOM README on GitHub.
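Whichever tool ends up running the page's scripts, "wait for full rendering" usually comes down to polling until the element you need exists. Here is a minimal sketch of that idea; waitFor is a hypothetical helper written for this answer, not a JSDOM or PhantomJS function, and the timeout values are arbitrary.

```javascript
// Hypothetical helper: polls `predicate` every `intervalMs` milliseconds and
// resolves with its first truthy return value, or rejects after `timeoutMs`.
function waitFor(predicate, timeoutMs = 5000, intervalMs = 50) {
  return new Promise(function (resolve, reject) {
    var deadline = Date.now() + timeoutMs;
    (function poll() {
      var result;
      try {
        result = predicate();
      } catch (err) {
        return reject(err); // a throwing predicate aborts the wait
      }
      if (result) {
        return resolve(result);
      }
      if (Date.now() >= deadline) {
        return reject(new Error("Timed out after " + timeoutMs + " ms"));
      }
      setTimeout(poll, intervalMs);
    })();
  });
}

// Usage sketch, assuming `window` belongs to a page whose scripts are running:
// waitFor(function () {
//   return window.document.querySelector("button#LoginSubmitButton");
// }).then(function () {
//   console.log("Click button found");
// }).catch(function (err) {
//   console.error(err.message);
// });
```

Polling is crude, but it does not depend on knowing which event the site fires when it finishes building the form.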