I'm just beginning to learn C# threading and concurrent collections, and am not sure of the proper terminology to pose my question, so I'll describe briefly what I'm trying to do. My grasp of the subject is rudimentary at best at this point. Is my approach below even feasible as I've envisioned it?
I have 100,000 urls in a Concurrent collection that must be tested--is the link still good? I have another concurrent collection, initially empty, that will contain the subset of urls that an async request determines to have been moved (400, 404, etc errors).
I want to spawn as many of these async requests concurrently as my PC and our bandwidth will allow, and was going to start at 20 async-web-request-tasks per second and work my way up from there.
Would it work if a single async task handled both things: it would make the async request and then add the url to the BadUrls collection if it encountered a 4xx error? A new instance of that task would be spawned every 50ms:
class TestArgs args {
ConcurrentBag<UrlInfo> myCollection { get; set; }
System.Uri currentUrl { get; set; }
}
ConcurrentQueue<UrlInfo> Urls = new ConncurrentQueue<UrlInfo>();
// populate the Urls queue
<snip>
// initialize the bad urls collection
ConcurrentBag<UrlInfo> BadUrls = new ConcurrentBag<UrlInfo>();
// timer fires every 50ms, whereupon a new args object is created
// and the timer callback spawns a new task; an autoEvent would
// reset the timer and dispose of it when the queue was empty
void SpawnNewUrlTask(){
// if queue is empty then reset the timer
// otherwise:
TestArgs args = {
myCollection = BadUrls,
currentUrl = getNextUrl() // take an item from the queue
};
Task.Factory.StartNew( asyncWebRequestAndConcurrentCollectionUpdater, args);
}
public async Task asyncWebRequestAndConcurrentCollectionUpdater(TestArgs args)
{
//make the async web request
// add the url to the bad collection if appropriate.
}
Feasible? Way off?
The approach seems fine, but there are some issues with the specific code you've shown.
But before I get to that, there have been suggestions in the comments that Task Parallelism is the way to go. I think that's misguided. There's a common misconception that if you want to have lots of work going on in parallel, you necessarily need lots of threads. That's only true if the work is compute-bound. But the work you're doing will be IO bound - this code is going to spend the vast majority of its time waiting for responses. It will do very little computation. So in practice, even if it only used a single thread, your initial target of 20 requests per second doesn't seem like a workload that would cause a single CPU core to break into a sweat.
In short, a single thread can handle very high levels of concurrent IO. You only need multiple threads if you need parallel execution of code, and that doesn't look likely to be the case here, because there's so little work for the CPU in this particular job.
(This misconception predates
await
andasync
by years. In fact, it predates the TPL - see http://www.interact-sw.co.uk/iangblog/2004/09/23/threadless for a .NET 1.1 era illustration of how you can handle thousands of concurrent requests with a tiny number of threads. The underlying principles still apply today because Windows networking IO still basically works the same way.)Not that there's anything particularly wrong with using multiple threads here, I'm just pointing out that it's a bit of a distraction.
Anyway, back to your code. This line is problematic:
While you've not given us all your code, I can't see how that will be able to compile. The overloads of
StartNew
that accept two arguments require the first to be either anAction
, anAction<object>
, aFunc<TResult>
, or aFunc<object,TResult>
. In other words, it has to be a method that either takes no arguments, or accepts a single argument of typeobject
(and which may or may not return a value). Your 'asyncWebRequestAndConcurrentCollectionUpdater' takes an argument of typeTestArgs
.But the fact that it doesn't compile isn't the main problem. That's easily fixed. (E.g., change it to
Task.Factory.StartNew(() => asyncWebRequestAndConcurrentCollectionUpdater(args));
) The real issue is what you're doing is a bit weird: you're usingTask.StartNew
to invoke a method that already returns aTask
.Task.StartNew
is a handy way to take a synchronous method (i.e., one that doesn't return aTask
) and run it in a non-blocking way. (It'll run on the thread pool.) But if you've got a method that already returns aTask
, then you didn't really need to useTask.StartNew
. The weirdness becomes more apparent if we look at whatTask.StartNew
returns (once you've fixed the compilation error):That
Task<Task>
reveals what's happening. You've decided to wrap a method that was already asynchronous with a mechanism that is normally used to make non-asynchronous methods asynchronous. And so you've now got aTask
that produces aTask
.One of the slightly surprising upshots of this is that if you were to wait for the task returned by
StartNew
to complete, the underlying work would not necessarily be done:All that will actually do is wait for
asyncWebRequestAndConcurrentCollectionUpdater
to return aTask
. And sinceasyncWebRequestAndConcurrentCollectionUpdater
is already an async method, it will return a task more or less immediately. (Specifically, it'll return a task the moment it performs anawait
that does not complete immediately.)If you want to wait for the work you've kicked off to finish, you'll need to do this:
or, potentially more efficiently, this:
That says: get me the
Task
that my async method returned, and then wait for that. This may not be usefully different from this much simpler code:You may not have gained anything useful by introducing `Task.Factory.StartNew'.
I say "may" because there's an important qualification: it depends on the context in which you start the work. C# generates code which, by default, attempts to ensure that when an
async
method continues after anawait
, it does so in the same context in which theawait
was initially performed. E.g., if you're in a WPF app and youawait
while on the UI thread, when the code continues it will arrange to do so on the UI thread. (You can disable this withConfigureAwait
.)So if you're in a situation in which the context is essentially serialized (either because it's single-threaded, as will be the case in a GUI app, or because it uses something resembling a rental model, e.g. the context of an particular ASP.NET request), it may actually be useful to kick an async task off via
Task.Factory.StartNew
because it enables you to escape the original context. However, you just made your life harder - tracking your tasks to completion is somewhat more complex. And you might have been able to achieve the same effect simply by usingConfigureAwait
inside yourasync
method.And it may not matter anyway - if you're only attempting to manage 20 requests a second, the minimal amount of CPU effort required to do that means that you can probably manage it entirely adequately on one thread. (Also, if this is a console app, the default context will come into play, which uses the thread pool, so your tasks will be able to run multithreaded in any case.)
But to get back to your question, it seems entirely reasonable to me to have a single
async
method that picks a url off the queue, makes the request, examines the response, and if necessary, adds an entry to the bad url collection. And kicking the things off from a timer also seems reasonable - that will throttle the rate at which connections are attempted without getting bogged down with slow responses (e.g., if a load of requests end up attempting to talk to servers that are offline). It might be necessary to introduce a cap for the maximum number of requests in flight if you hit some pathological case where you end up with tens of thousands of URLs in a row all pointing to a server that isn't responding. (On a related note, you'll need to make sure that you're not going to hit any per-client connection limits with whichever HTTP API you're using - that might end up throttling the effective throughput.)You will need to add some sort of completion handling - just kicking off asynchronous operations and not doing anything to handle the results is bad practice, because you can end up with exceptions that have nowhere to go. (In .NET 4.0, these used to terminate your process, but as of .NET 4.5, by default an unhandled exception from an asynchronous operation will simply be ignored!) And if you end up deciding that it is worth launching via
Task.Factory.StartNew
remember that you've ended up with an extra layer of wrapping, so you'll need to do something likemyTask.Unwrap().ContinueWith(...)
to handle it correctly.