I'm scraping a social network with cheerio and Meteor. I can log in, search for some information, and scrape the page for the data I want. I make the requests and pass the HTML to cheerio, as in Scraping with Meteor.js.
The problem is that a section of the page only appears when I load the page through a web browser:
In browser:
<div A>
<div B>
<ul (...)>
<li (...)>...</li>
...
<li (...)>...</li>
</ul>
</div> <!-- end B -->
<script id="NAME_1" type="fs/embed+m"></script>
<script type="text/javascript">fs.dupeXHR("NAME_1","NAME_2",{"renderControl":"custom","templateId":"NAME_1"});</script>
</div> <!-- end A -->
In console.log(cheerio.load(html)):
<div A>
<script id="NAME_1" type="fs/embed+m"></script>
<script type="text/javascript">fs.dupeXHR("NAME_1","NAME_2",{"renderControl":"custom","templateId":"NAME_1"});</script>
</div> <!-- end A -->
I'm assuming cheerio loads the HTML without executing the scripts. Am I right? If so, is there a way to make cheerio execute the scripts so I can scrape the page after the content has been inserted?
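For what it's worth, cheerio is a static HTML parser with no JavaScript engine, so the script's side effects never happen and the parsed tree only contains what the server sent. A minimal sketch of the situation (plain Node, no cheerio required, with the markup abbreviated from the example above):

```javascript
// The raw HTML as the server sends it: the <ul> is not in the markup,
// it is injected later by fs.dupeXHR when a browser executes the script.
const html = `
<div id="A">
  <script id="NAME_1" type="fs/embed+m"></script>
  <script type="text/javascript">fs.dupeXHR("NAME_1","NAME_2",{"renderControl":"custom","templateId":"NAME_1"});</script>
</div>`;

// A static parser (cheerio included) never runs the script, so the
// injected list simply is not there, before or after parsing.
console.log(html.includes("<ul")); // false
```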
I'm making the HTTP requests with the following options to simulate a browser request, so I don't think the problem is the request itself (headless browsers don't make it any better):
Options = function (cookie) {
  this.headers = {
    "Accept": "*/*",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.132 Safari/537.36"
  };
  this.params = {};
  if (cookie) {
    this.headers.Cookie = cookie.get();
  }
};
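For clarity, here is how the constructor gets used. The cookie wrapper below is hypothetical, just something exposing the get() method the constructor expects (the constructor is repeated so this sketch runs standalone):

```javascript
// Same constructor as above, repeated so this sketch is self-contained.
var Options = function (cookie) {
  this.headers = {
    "Accept": "*/*",
    "Connection": "keep-alive",
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.132 Safari/537.36"
  };
  this.params = {};
  if (cookie) {
    this.headers.Cookie = cookie.get();
  }
};

// Hypothetical cookie wrapper exposing the get() call used above.
const cookie = { get: () => "sessionid=abc123" };
const opts = new Options(cookie);
console.log(opts.headers.Cookie); // "sessionid=abc123"
```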
Well, I did some reverse engineering and found that the missing section can be retrieved by making a request to another page, using the same headers and options. Although Meteor.js uses Node.js behind the scenes, maybe the answers are right and this can't be done the way I thought it could. Who knows (: