Cannot Crawl Web Page Data Using ScrapySharp


I am facing a technical issue. I browsed several articles looking for an answer but couldn't find a proper one on any website.

I am using ScrapySharp in my project to crawl web page data. The issue appeared when I tried to crawl data from http://edition.cnn.com/POLITICS.

First, I loaded the page in IE and opened Developer Tools to inspect the tags. I then selected the tag I need for my code, "//div[@class='cd__content']". However, when I load the same page through ScrapySharp:

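using HtmlAgilityPack;
using ScrapySharp.Network;

ScrapingBrowser browser = new ScrapingBrowser();
// NavigateToPageAsync returns a Task<WebPage>; the synchronous call returns the WebPage directly.
WebPage rootPage = browser.NavigateToPage(new Uri(url));
HtmlNodeCollection rootNodes = rootPage.Html.SelectNodes("//div[@class='cd__content']");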

The result for rootNodes is null.

When I investigated further, I saw that the cd__content element sits inside a "SECTION" tag, and when the page is loaded programmatically that "SECTION" tag is empty. When I inspect the page in IE or Chrome, all the tags are filled with content, which is why I was able to pick the element there, but when I load the page programmatically they are not. My question is: how can I load the page with all of its content filled in using ScrapySharp?

Experts, please help with this.

Answer by Zhaph - Ben Duguid:

If you analyse the network traffic for the page, you'll see that the JavaScript makes a number of calls to load content from http://edition.cnn.com/data/ocs/section/politics/index.html for each "content zone" on the page. The responses to those requests contain the HTML and content that appear in the page.
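As a rough sketch of that idea, the snippet below requests the content URL mentioned above with HttpClient and runs the question's XPath over whatever comes back, using HtmlAgilityPack (the parser ScrapySharp builds on). The class and method names are made up for illustration, and it assumes the response is plain HTML at exactly that URL; the real requests may carry extra path segments or wrap the markup in another format, so check the network traffic first.

```csharp
using System;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Hypothetical helper, not part of ScrapySharp.
public static class ContentZoneProbe
{
    // Parses an HTML fragment and returns the trimmed text of each cd__content div.
    public static string[] ExtractContentText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//div[@class='cd__content']");
        if (nodes == null)
        {
            return Array.Empty<string>();
        }

        return nodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .ToArray();
    }

    public static async Task Main()
    {
        // Assumption: this URL returns HTML directly, as described in the answer.
        const string zoneUrl = "http://edition.cnn.com/data/ocs/section/politics/index.html";

        using (var client = new HttpClient())
        {
            string body = await client.GetStringAsync(zoneUrl);
            foreach (string text in ExtractContentText(body))
            {
                Console.WriteLine(text);
            }
        }
    }
}
```

If ExtractContentText returns an empty array here too, the response is probably not plain HTML and needs to be unwrapped before parsing.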

You would need to review those requests and make similar ones yourself, or see whether one or more of their RSS feeds meets your needs and provides a more parseable set of content, for example: http://rss.cnn.com/rss/cnn_allpolitics.rss
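For the RSS route, a minimal sketch might look like the following. It assumes the feed is standard RSS 2.0, with un-namespaced item, title and link elements; RssReader and ParseItems are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using System.Xml.Linq;

// Hypothetical helper for reading headline titles and links from an RSS 2.0 feed.
public static class RssReader
{
    // Returns one (Title, Link) pair per <item>; missing elements become empty strings.
    public static List<(string Title, string Link)> ParseItems(string rssXml)
    {
        XDocument doc = XDocument.Parse(rssXml);
        return doc.Descendants("item")
            .Select(item => (
                (string)item.Element("title") ?? "",
                (string)item.Element("link") ?? ""))
            .ToList();
    }

    public static async Task Main()
    {
        using (var client = new HttpClient())
        {
            string xml = await client.GetStringAsync("http://rss.cnn.com/rss/cnn_allpolitics.rss");
            foreach (var (title, link) in ParseItems(xml))
            {
                Console.WriteLine(title + " -> " + link);
            }
        }
    }
}
```

Because the feed is already structured XML, this avoids depending on the page's class names, which can change whenever the site's markup does.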