How to continue scraping the next URL if one URL returns a 404 when using IronWebScraper


I'm building a small scraper that navigates through a set of URLs.

Currently I have something like this:

public class MyScraper : WebScraper
{
    private Queue<string> _urlToParse = new Queue<string>();

    public override void Init()
    {
        // Initializing _urlToParse with more than 1000 URLs
        Request(_urlToParse.Dequeue(), Parse);
    }

    public override void Parse(Response response)
    {
        if (response.WasSuccessful)
        {
            // ...parsing
        }
        else
        {
            // logging the error
        }

        Request(_urlToParse.Dequeue(), Parse);
    }
}

But the Parse method isn't called when the request returns a 404 error.

Consequences:

  1. I cannot log the error (and once the initial Request call returns, I have no way to know whether it succeeded).
  2. The next URL is never requested.

I expected Parse to be called with response.WasSuccessful = false so that I could then check the status code.

How should I handle this 404?

1 Answer

David Specht answered:

The only way I could find to log the failed URL is to override the Log(string Message, LogLevel Type) method. response.WasSuccessful doesn't seem to serve much purpose here; as you said, Parse() only appears to be called when the request is successful.

public class MyScraper : WebScraper
{
    private Queue<string> _urlToParse = new Queue<string>();

    public override void Init()
    {
        _urlToParse.Enqueue("https://stackoverflow.com/");
        _urlToParse.Enqueue("https://stackoverflow.com/nothing");
        _urlToParse.Enqueue("https://google.com/");

        // Kick off the first request; the rest are chained from Parse/Log
        Request(_urlToParse.Dequeue(), Parse);
    }

    public override void Parse(Response response)
    {
        Console.WriteLine("Handling response");

        // Chain the next URL after a successful response
        if (_urlToParse.Count > 0)
        {
            Request(_urlToParse.Dequeue(), Parse);
        }
    }

    public override void Log(string Message, LogLevel Type)
    {
        // Failed requests never reach Parse, but they do produce a
        // critical "Url failed permanently" log entry we can hook into
        if (Type.HasFlag(LogLevel.Critical) && Message.StartsWith("Url failed permanently"))
        {
            Console.WriteLine($"Logging failed Url: {Message}");

            // Keep the queue draining even though Parse was never called
            if (_urlToParse.Count > 0)
            {
                Request(_urlToParse.Dequeue(), Parse);
            }
        }
    }
}
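
For reference, a minimal sketch of driving this scraper, assuming the usual IronWebScraper entry point where Start() begins the crawl and the framework then invokes Init(), Parse(), and Log():

class Program
{
    static void Main()
    {
        // Start() hands control to IronWebScraper, which calls
        // Init() and then the request callbacks defined above.
        var scraper = new MyScraper();
        scraper.Start();
    }
}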

Another option: WebScraper appears to have a MaxHttpConnectionLimit property you can use to ensure only one connection is open at a time, which lets you hand the whole URL list to Request() up front.

public class MyScraper : WebScraper
{
    public override void Init()
    {
        // Only one connection at a time, so URLs are fetched in order
        MaxHttpConnectionLimit = 1;

        var urls = new string[]
        {
            "https://stackoverflow.com/",
            "https://stackoverflow.com/nothing",
            "https://google.com/"
        };

        // Queue every URL up front; a failure no longer stalls the crawl
        Request(urls, Parse);
    }

    public override void Parse(Response response)
    {
        Console.WriteLine("Handling response");
    }

    public override void Log(string Message, LogLevel Type)
    {
        // Failed URLs still surface here as critical log messages
        if (Type.HasFlag(LogLevel.Critical) && Message.StartsWith("Url failed permanently"))
        {
            Console.WriteLine($"Logging failed Url: {Message}");
        }

        base.Log(Message, Type);
    }
}
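
One difference between the two snippets worth noting: the first override of Log never calls base.Log(Message, Type), so the scraper's default logging is suppressed, while the second preserves it. Also, the chained-Request pattern in the first snippet assumes Parse and Log are not invoked concurrently; if they can be, a thread-safe collection such as ConcurrentQueue<string> would be a safer replacement for the plain Queue<string>.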