Scraping a page - one section isn't loaded using cheerio


I'm scraping a social network with cheerio and Meteor. I can log in, search for some information, and scrape the page for the info I want. I'm making requests and passing the HTML to cheerio, as described in Scraping with Meteor.js.

The problem is that there is a section of the page that only appears when I load the page through a web browser:

In browser:

<div A>
    <div B>
        <ul (...)>
            <li (...)>...</li>
            ...
            <li (...)>...</li>
        </ul>
    </div> <!-- end B -->
    <script id="NAME_1" type="fs/embed+m"></script>
    <script type="text/javascript">fs.dupeXHR("NAME_1","NAME_2",{"renderControl":"custom","templateId":"NAME_1"});</script>
</div> <!-- end A -->

In console.log(cheerio.load(html)):

<div A>
    <script id="NAME_1" type="fs/embed+m"></script>
    <script type="text/javascript">fs.dupeXHR("NAME_1","NAME_2",{"renderControl":"custom","templateId":"NAME_1"});</script>
</div> <!-- end A -->

I'm assuming the HTML is loaded by cheerio without executing the scripts. Am I right? If so, is there some way to make cheerio execute the scripts so I can scrape the page after the content has been inserted?
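For reference, a minimal sketch of the behaviour in question, assuming cheerio is available on the server via Npm.require: cheerio only parses the markup it is given and never runs script tags, so nothing a script would inject in a browser shows up in the parsed document.

var cheerio = Npm.require('cheerio');

// This script would add a <p> element if it ran in a browser.
var html = '<div id="a"><script>document.write("<p>hi</p>")</script></div>';
var $ = cheerio.load(html);

console.log($('#a p').length);      // 0: the script was never executed
console.log($('#a script').html()); // the raw script source is still there as text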

I'm making HTTP requests with the following options to simulate a browser request, so I don't think the problem is the request itself (headless browsers don't make it any better).

// Request options used for the scraping requests; carries the session cookie when available.
Options = function (cookie) {
  this.headers = {
    "Accept": "*/*",
    "Connection": "keep-alive",
    "User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.132 Safari/537.36"
  };
  this.params = {};
  if (cookie) {
    this.headers.Cookie = cookie.get();
  }
};
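For context, a usage sketch under the assumption that the requests go through Meteor's server-side HTTP package; the URL and the cookie object are placeholders.

var cheerio = Npm.require('cheerio');

var options = new Options(sessionCookie);   // sessionCookie comes from the login step
var result = HTTP.get('https://example.com/profile/12345', {
  headers: options.headers,
  params: options.params
});
var $ = cheerio.load(result.content);       // result.content is the raw, unrendered HTML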
There are 3 answers below.

BEST ANSWER

Well, I did some reverse engineering and found that the missing section can be retrieved by making a request to another page, using the same header options, etc. Although Meteor.js uses Node.js behind the scenes, maybe the other answers are right and this can't be done the way I originally thought. Who knows (:
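A rough sketch of that idea; the endpoint URL below is hypothetical and stands in for whatever URL the in-page fs.dupeXHR call fetches.

var cheerio = Npm.require('cheerio');

var options = new Options(sessionCookie);   // same Options constructor as in the question
var result = HTTP.get('https://example.com/embed/NAME_1', {   // hypothetical endpoint
  headers: options.headers,
  params: options.params
});
var $ = cheerio.load(result.content);
var items = $('ul li').map(function () { return $(this).text(); }).get();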

ANSWER

There are a few things to consider while scraping.

Modern sites are built with newer frameworks like Angular or EmberJS, and their HTML is rendered by JavaScript (right-click in the browser window and choose View Page Source: you see the bare markup without the rendered content).

The same is true for Meteor apps.

So for these kinds of sites you need to use a headless browser like PhantomJS or ZombieJS to fetch the rendered HTML content and then scrape it, roughly as in the sketch below.
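A rough sketch with ZombieJS, assuming the zombie npm package is installed; the URL is a placeholder.

var Browser = Npm.require('zombie');

var browser = new Browser();
browser.visit('https://example.com/profile/12345', function (err) {
  if (err) throw err;
  // browser.html() returns the DOM after the page's scripts have run,
  // so the section missing from the plain HTTP response should now be present.
  console.log(browser.html('div'));
});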

Hope this helps.

ANSWER

You are correct that your method only fetches the HTML without executing the JavaScript. To achieve what you want, consider using packages such as CasperJS or PhantomJS. Here is an example with PhantomJS spawned from a Meteor method:

var phantomjs = Npm.require('phantomjs');
var spawn = Npm.require('child_process').spawn;

Meteor.methods({
  runTest: function (options) {
    // Spawn PhantomJS and run the driver script that loads and renders the page.
    var command = spawn(phantomjs.path, ['assets/app/phantomDriver.js']);
    command.stdout.on('data', function (data) {
      console.log('stdout: ' + data);
    });
    command.stderr.on('data', function (data) {
      console.log('stderr: ' + data);
    });
    command.on('exit', function (code) {
      console.log('child process exited with code ' + code);
    });
  }
});


And the PhantomJS driver script (assets/app/phantomDriver.js):

var page = require('webpage').create();
page.open('http://github.com/', function() {
    console.log('Page Loaded');
    page.render('github.png');
    phantom.exit();
});
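If the goal is scraping rather than a screenshot, a variation on the driver script (again with a placeholder URL) can print the rendered HTML to stdout, where the Meteor method above picks it up and can hand it to cheerio:

var page = require('webpage').create();
page.open('https://example.com/profile/12345', function (status) {
  if (status === 'success') {
    // page.content is the full DOM serialization after scripts have run.
    console.log(page.content);
  }
  phantom.exit();
});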

References:

http://www.meteorpedia.com/read/PhantomJS

https://atmospherejs.com/gadicohen/phantomjs