Different response received when using HttpClient and browser

I am trying to scrape the NSE website, but when I request the page with this method:

    static async void DownloadPageAsync(string url)
    {
        HttpClient client = new HttpClient();
        client.DefaultRequestHeaders.TryAddWithoutValidation("Accept", "text/html,application/xhtml+xml,application/xml");
        client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
        client.DefaultRequestHeaders.TryAddWithoutValidation("Accept-Charset", "ISO-8859-1");
        HttpResponseMessage response = await client.GetAsync(url);
        Thread.Sleep(30000);
        response.EnsureSuccessStatusCode();
        var responseStream = await response.Content.ReadAsStreamAsync();
        var streamReader = new StreamReader(responseStream);
        var str = streamReader.ReadToEnd();

    }

I get this response: [image]

But when I open the same link in Chrome, I get this response: [image]

Where am I going wrong? How can I get the same response as Chrome from code? Please help. Regards, Srivastava

There is 1 best solution below

So, first off: crawling web pages is not a trivial task. Parsing HTML correctly is particularly tricky.

There is also some netiquette around web crawling that you should be aware of before you start writing your crawler. One rule in particular is to include details on how to find more information about your crawler in your User-Agent string. In other words, don't send the header below as it is, but make it something more descriptive - even if you need the 'Gecko' token because of browser detection, it's proper to put something that identifies your crawler between the '(' and ')'.

    client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", "Mozilla/5.0 (Windows NT 6.2; WOW64; rv:19.0) Gecko/20100101 Firefox/19.0");
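As a rough illustration of that advice, the sketch below puts a crawler name and an information URL inside the parenthesised part of the User-Agent while keeping the 'Gecko' token. The names `CrawlerClient` and `ExampleCrawler` and the `example.com` URL are made up for this example; substitute details that actually describe your crawler.

```csharp
using System;
using System.Net.Http;

static class CrawlerClient
{
    // Hypothetical crawler name and info URL (assumptions for illustration only).
    // The Gecko/Firefox tokens are kept in case the site does browser detection.
    public const string UserAgent =
        "Mozilla/5.0 (compatible; ExampleCrawler/1.0; +http://example.com/crawler-info) " +
        "Gecko/20100101 Firefox/19.0";

    // Creates an HttpClient that identifies the crawler on every request.
    public static HttpClient Create()
    {
        var client = new HttpClient();
        client.DefaultRequestHeaders.TryAddWithoutValidation("User-Agent", UserAgent);
        return client;
    }
}
```

A site operator who sees this string in the server logs can then look up who is crawling and why, instead of seeing what looks like an outdated Firefox.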

One thing that's notoriously difficult to handle in a web crawler is AJAX calls. Having an incorrect user agent might even make this worse, because some websites decide whether or not to use AJAX based on the browser's capabilities. For the context of this question, it's best to simply assume that you cannot properly handle JavaScript or AJAX in your crawler (the truth is far more complex, but it would take too long to describe here).

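Knowing some stock websites, I think this is also your problem. These numbers are often refreshed using AJAX 'in real time', so the HTML that HttpClient downloads does not contain the values that Chrome shows after the page's scripts have run.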
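If that is the cause, one possible workaround (an assumption on my part, not verified against this site) is to look in the browser's network tab for the request the page makes via AJAX and call that URL directly. A minimal sketch follows; the class name `AjaxFetch` and any URL you pass in are hypothetical, and the `X-Requested-With` header is only something that some servers check because libraries such as jQuery send it.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

static class AjaxFetch
{
    // Builds a GET request that resembles a script-initiated (AJAX) call.
    public static HttpRequestMessage BuildRequest(string url)
    {
        var request = new HttpRequestMessage(HttpMethod.Get, url);
        request.Headers.TryAddWithoutValidation("Accept", "application/json");
        // Assumption: the endpoint may expect this header; drop it if not needed.
        request.Headers.TryAddWithoutValidation("X-Requested-With", "XMLHttpRequest");
        return request;
    }

    // Sends the request and returns the raw response body as a string.
    public static async Task<string> FetchAsync(HttpClient client, string url)
    {
        using (var request = BuildRequest(url))
        using (var response = await client.SendAsync(request))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```

Whether this works depends entirely on the site: the endpoint may also require cookies, a Referer header, or a session token that the page sets first, and the site's terms may not allow automated access at all.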