Using threads to parse multiple HTML pages faster


Here's what I'm trying to do:

  1. Get one HTML page from a URL which contains multiple links
  2. Visit each link
  3. Extract some data from each visited link and create an object from it

So far, all I have is the simple and slow way:

public List<Link> searchLinks(string name)
{
    List<Link> foundLinks = new List<Link>();
    // getHtmlDocument() just returns an HtmlDocument for the given url.
    HtmlDocument doc = getHtmlDocument(AU_SEARCH_URL + fixSpaces(name));
    var link_list = doc.DocumentNode.SelectNodes(@"/html/body/div[@id='parent-container']/div[@id='main-content']/ol[@id='searchresult']/li/h2/a");
    foreach (var link in link_list)
    {
        // TODO Threads

        // getObject() creates an object from the gathered data.
        foundLinks.Add(getObject(link.InnerText, link.Attributes["href"].Value, getLatestEpisode(link.Attributes["href"].Value)));
    }
    return foundLinks;
}

To make it faster/more efficient I need to use threads, but I'm not sure how to approach it. I can't just start threads at random; I need to wait for them to finish. Thread.Join() solves the 'wait for threads to finish' part, but if I Join() each thread right after starting it, the threads run one after another and nothing gets faster.
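For example, this is a rough sketch of the pattern I have in mind (DoWork(link) is just a made-up placeholder for the per-link fetch and parse), where all the threads are started first and only joined afterwards so they at least overlap:

// Sketch only: DoWork(link) stands in for the per-link work.
var threads = new List<Thread>();
foreach (var link in link_list)
{
    var current = link;                          // copy for the closure
    var t = new Thread(() => DoWork(current));
    t.Start();
    threads.Add(t);
}

// Join *after* starting all of them, so they actually run concurrently.
foreach (var t in threads)
    t.Join();

Is something like this the right direction, or is there a better way?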


There are 2 answers below

BEST ANSWER

The simplest way to offload the work to multiple threads would be to use Parallel.ForEach() in place of your current loop. Something like this:

Parallel.ForEach(link_list, link =>
{
    var item = getObject(link.InnerText, link.Attributes["href"].Value, getLatestEpisode(link.Attributes["href"].Value));
    lock (foundLinks)   // List<T> is not thread-safe, so synchronize the Add
    {
        foundLinks.Add(item);
    }
});

I'm not sure if there are other threading concerns in your overall code. (Note, for example, that this no longer guarantees the results are added to foundLinks in the original order.) But as long as nothing explicitly prevents the work from running concurrently, this will spread it across multiple threads and CPU cores.
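If the original order does matter, one way around that (just a sketch built on the question's own getObject/getLatestEpisode helpers, not tested) is to have each iteration write into a fixed slot of an array and copy the array into the list at the end:

// Requires using System.Threading.Tasks;
// Each index is written by exactly one iteration, so no locking is needed.
var results = new Link[link_list.Count];

Parallel.For(0, link_list.Count, i =>
{
    var link = link_list[i];
    results[i] = getObject(link.InnerText,
                           link.Attributes["href"].Value,
                           getLatestEpisode(link.Attributes["href"].Value));
});

foundLinks.AddRange(results);   // preserves the original link order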


Maybe you should use a thread pool:

Example from MSDN:

using System;
using System.Threading;

public class Fibonacci
{
    private int _n;
    private int _fibOfN;
    private ManualResetEvent _doneEvent;

    public int N { get { return _n; } }
    public int FibOfN { get { return _fibOfN; } }

    // Constructor.
    public Fibonacci(int n, ManualResetEvent doneEvent)
    {
        _n = n;
        _doneEvent = doneEvent;
    }

    // Wrapper method for use with thread pool.
    public void ThreadPoolCallback(Object threadContext)
    {
        int threadIndex = (int)threadContext;
        Console.WriteLine("thread {0} started...", threadIndex);
        _fibOfN = Calculate(_n);
        Console.WriteLine("thread {0} result calculated...", threadIndex);
        _doneEvent.Set();
    }

    // Recursive method that calculates the Nth Fibonacci number.
    public int Calculate(int n)
    {
        if (n <= 1)
        {
            return n;
        }

        return Calculate(n - 1) + Calculate(n - 2);
    }
}

public class ThreadPoolExample
{
    static void Main()
    {
        const int FibonacciCalculations = 10;

        // One event is used for each Fibonacci object.
        ManualResetEvent[] doneEvents = new ManualResetEvent[FibonacciCalculations];
        Fibonacci[] fibArray = new Fibonacci[FibonacciCalculations];
        Random r = new Random();

        // Configure and start threads using ThreadPool.
        Console.WriteLine("launching {0} tasks...", FibonacciCalculations);
        for (int i = 0; i < FibonacciCalculations; i++)
        {
            doneEvents[i] = new ManualResetEvent(false);
            Fibonacci f = new Fibonacci(r.Next(20, 40), doneEvents[i]);
            fibArray[i] = f;
            ThreadPool.QueueUserWorkItem(f.ThreadPoolCallback, i);
        }

        // Wait for all threads in pool to calculate.
        WaitHandle.WaitAll(doneEvents);
        Console.WriteLine("All calculations are complete.");

        // Display the results.
        for (int i = 0; i < FibonacciCalculations; i++)
        {
            Fibonacci f = fibArray[i];
            Console.WriteLine("Fibonacci({0}) = {1}", f.N, f.FibOfN);
        }
    }
}
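Applied to the question, a rough sketch of the same pattern (reusing the link_list, getObject, and getLatestEpisode names from the question; not tested) might look like this:

// Queue one work item per link, then block until every item has signalled.
var doneEvents = new ManualResetEvent[link_list.Count];
var results = new Link[link_list.Count];

for (int i = 0; i < link_list.Count; i++)
{
    doneEvents[i] = new ManualResetEvent(false);
    int index = i;                               // copy for the closure
    ThreadPool.QueueUserWorkItem(_ =>
    {
        var link = link_list[index];
        results[index] = getObject(link.InnerText,
                                   link.Attributes["href"].Value,
                                   getLatestEpisode(link.Attributes["href"].Value));
        doneEvents[index].Set();
    });
}

WaitHandle.WaitAll(doneEvents);                  // limited to 64 handles
foundLinks.AddRange(results);

One caveat: WaitHandle.WaitAll accepts at most 64 handles, so for a long list of links a CountdownEvent (or the Parallel.ForEach approach from the other answer) scales better.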