I am facing a technical issue. I browsed several articles to find an answer, but I couldn't get a proper one from any website.
I am using ScrapySharp in my project to crawl web page data. The issue came up when I tried to crawl data from http://edition.cnn.com/POLITICS.
First, I loaded the page in IE and used the Developer tools to inspect the tags, and I picked out the element I need for my code: "//div[@class='cd__content']". However, when I load the same page through ScrapySharp:
using ScrapySharp.Network;   // ScrapingBrowser, WebPage
using HtmlAgilityPack;       // HtmlNodeCollection

ScrapingBrowser browser = new ScrapingBrowser();
WebPage rootPage = await browser.NavigateToPageAsync(new Uri(url)); // NavigateToPageAsync returns a Task<WebPage>, so it must be awaited
HtmlNodeCollection rootNodes = rootPage.Html.SelectNodes("//div[@class='cd__content']");
The result for rootNodes comes back as null.
When I investigated further, I saw that the cd__content div sits inside a SECTION tag, and when the page is loaded programmatically that SECTION tag is empty. When I inspect the page in IE or Chrome, however, all the tags are filled with content, which is why I can pick the element there; when I load the page programmatically, they are not. My question is: how can I load the page with all of its content filled in using ScrapySharp?
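A quick way to confirm this (reusing the rootPage variable from the snippet above) is to write out the HTML that ScrapySharp actually receives; the SECTION elements in that file are empty because no JavaScript has run:

using System.IO;

// Dev tools show the DOM after JavaScript has executed; ScrapySharp only sees
// the raw server response, so the SECTION placeholders are still empty here.
File.WriteAllText("raw.html", rootPage.Html.OuterHtml);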
Experts, please help with this.
If you analyse the network traffic for the page, you'll see that the JavaScript makes a number of calls to http://edition.cnn.com/data/ocs/section/politics/index.html to load the content for each "content zone" on the page. The responses to those requests contain the HTML and content that appears in the page, so you would need to review them and make similar requests yourself.
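A minimal sketch of that approach, assuming the endpoint above returns a plain HTML fragment (if it turns out to be wrapped in JSON, you would first have to pull the markup out of the relevant field):

using System;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

// Hypothetical helper: request one "content zone" directly and parse the fragment.
// Confirm the exact URL and response format in the browser's network tab first.
static async Task<HtmlNodeCollection> LoadPoliticsZoneAsync()
{
    using (var client = new HttpClient())
    {
        string fragment = await client.GetStringAsync(
            "http://edition.cnn.com/data/ocs/section/politics/index.html");

        var doc = new HtmlDocument();
        doc.LoadHtml(fragment);   // parse the returned markup with HtmlAgilityPack
        return doc.DocumentNode.SelectNodes("//div[@class='cd__content']");
    }
}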
Alternatively, see if one or more of their RSS feeds meets your needs and provides you with a more parse-able set of content, for example: http://rss.cnn.com/rss/cnn_allpolitics.rss
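If the RSS route works for you, the feed can be read with the built-in SyndicationFeed class (a sketch; it assumes a reference to System.ServiceModel.Syndication is available):

using System;
using System.ServiceModel.Syndication;
using System.Xml;

// Read the politics feed; each item exposes a title, summary and link,
// with no JavaScript-rendered page to scrape at all.
using (XmlReader reader = XmlReader.Create("http://rss.cnn.com/rss/cnn_allpolitics.rss"))
{
    SyndicationFeed feed = SyndicationFeed.Load(reader);
    foreach (SyndicationItem item in feed.Items)
    {
        Console.WriteLine(item.Title.Text);
        Console.WriteLine(item.Summary.Text);
        Console.WriteLine(item.Links[0].Uri);
    }
}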