Thursday, April 2, 2009

A Bit About the Performance of Concurrent Collections in .Net 4.0

Copyright 2008-2009, Paul Jackson, all rights reserved

A post I made a couple days ago about the side-effect of concurrency (the concurrent collections in the .Net 4.0 Parallel Extensions) allowing modifications to collections while enumerating them has been quite popular, with a lot of attention coming from www.dotnetguru.org, what appears to be a French-language news/aggregation site (I only know enough French to get my face slapped, so it’s hard for me to tell).  I’m not sure why the post would be so popular in France, but the Internet’s weird that way … things take off for unexpected reasons. 

Regardless, it occurred to me that some further research might be in order, before folks get all hot for .Net 4.0 and want to change their collections so they can be modified while enumerating.  The question is: what’s the performance penalty for these threadsafe collections?  If I use one in a single-threaded environment, just to get that modification capability, is the performance-price something I’m willing to pay?

So I set up a simple test to satisfy my curiosity – but first the test platform (your mileage may vary):

  1. The test was done using the Visual Studio 2010 CTP, converted to run under Windows Server 2008 Hyper-V. 
  2. The virtual server was set to use all four cores on the test machine and have 3GB of RAM.
  3. The host CPU was a Q9100 quad-core, 2.26 GHz, with 4GB.

It’s also important to note that the Dictionary class has been around for a while and my guess is it’s been optimized once or twice, while the ConcurrentDictionary is part of a CTP.

The test is set up as two loops – the first a for that adds a million items to a Dictionary; the second a foreach that enumerates them:

   1: static void Main(string[] args)
   2: {
   3:     var dictionary = new Dictionary<int, DateTime>();
   4:  
   5:     var watch = Stopwatch.StartNew();
   6:  
   7:     for (int i = 0; i < 1000000; i++)
   8:     {
   9:         dictionary.Add(i, DateTime.Now);
  10:     }
  11:  
  12:     watch.Stop();
  13:     Console.WriteLine("Adding: {0}", watch.ElapsedMilliseconds);
  14:  
  15:     int count = 0;
  16:     watch.Reset();
  17:     watch.Start();
  18:     foreach (var item in dictionary)
  19:     {
  20:         count += item.Key;
  21:     }
  22:  
  23:     watch.Stop();
  24:     Console.WriteLine("Enumerating: {0}", watch.ElapsedMilliseconds);
  25:     Console.ReadLine();
  26:  
  27: }

Not the most scientific of tests, nor the most comprehensive, but enough to sate my curious-bone until I have time to do a more thorough analysis.  Running this nine times, I got the following results:

  Adding: Enumerating:
  2235 41
  1649 39
  1781 39
  1587 45
  2001 46
  1895 40
  1540 39
  1587 40
  2081 46
Average: 1817 41

 

Then I changed to a ConcurrentDictionary (also note the change from Add() to TryAdd() on line 9):

   1: static void Main(string[] args)
   2: {
   3:     var dictionary = new ConcurrentDictionary<int, DateTime>();
   4:  
   5:     var watch = Stopwatch.StartNew();
   6:  
   7:     for (int i = 0; i < 1000000; i++)
   8:     {
   9:         dictionary.TryAdd(i, DateTime.Now);
  10:     }
  11:  
  12:     watch.Stop();
  13:     Console.WriteLine("Adding: {0}", watch.ElapsedMilliseconds);
  14:  
  15:     int count = 0;
  16:     watch.Reset();
  17:     watch.Start();
  18:     foreach (var item in dictionary)
  19:     {
  20:         count += item.Key;
  21:     }
  22:  
  23:     watch.Stop();
  24:     Console.WriteLine("Enumerating: {0}", watch.ElapsedMilliseconds);
  25:     Console.ReadLine();
  26:  
  27: }

This change resulted in the following times:

  Adding: Enumerating:
  4332 80
  3795 80
  4560 77
  5489 75
  4283 76
  3734 74
  4288 79
  4904 96
  3591 83
Average: 4330 80

 

So there’s clearly a performance difference, with the ConcurrentDictionary being slower, but keep in mind a few key facts:

  • Again, we’re running the CTP of .Net 4.0, so ConcurrentDictionary is new code that hasn’t been optimized yet, while Dictionary is probably unchanged from previous framework versions;
  • We’re dealing with a million-item collection here, and the enumeration time-difference is an average of 39-milliseconds, or 0.000000039-seconds per item in the collection;

The time necessary to do the adding is more troublesome to me, but in dealing with a million-item set, is it really that unreasonable?  That’s a design decision you’d have to make for your application.

Having satisfied the curiosity-beast to a certain extent, yet another question arose (curiosity is like that): Since this post came about from the ability to alter a collection while enumerating it, what effect would that have on the numbers?  So I changed the code to remove each item from the collection as it enumerates:

   1: var dictionary = new ConcurrentDictionary<int, DateTime>();
   2:  
   3: var watch = Stopwatch.StartNew();
   4:  
   5: for (int i = 0; i < 1000000; i++)
   6: {
   7:     dictionary.TryAdd(i, DateTime.Now);
   8: }
   9:  
  10: watch.Stop();
  11: Console.WriteLine("Adding: {0}", watch.ElapsedMilliseconds);
  12:  
  13: int count = 0;
  14: watch.Reset();
  15: watch.Start();
  16: foreach (var item in dictionary)
  17: {
  18:     count += item.Key;
  19:     DateTime temp;
  20:     dictionary.TryRemove(item.Key, out temp);
  21: }
  22:  
  23: watch.Stop();
  24: Console.WriteLine("Enumerating: {0}", watch.ElapsedMilliseconds);
  25: Console.WriteLine("Items in Dictionary: {0}", dictionary.Count);
  26: Console.ReadLine();

Which added significantly to the enumeration time:

  Adding: Enumerating:
  4162 258
  4124 201
  4592 239
  3959 333
  4155 252
  4026 269
  4573 283
  4471 204
  5434 258
Average: 4388 255

 

Removing the current item from the collection during enumeration triples the time spent in the foreach loop – a disturbing development, but we’re still talking about a total of a quarter-second to process a million items, so maybe not worrisome?  Depends on your application and how many items you actually have to process – and other processing that you may have to do.

Now, with the whole purpose of the concurrent collections being parallel development, you have to know that I couldn’t leave it without doing one more test.  After all, those two loops have been sitting there this entire post fairly screaming to try parallelizing them with Parallel.For and Parallel.ForEach:

   1: var dictionary = new ConcurrentDictionary<int, DateTime>();
   2:  
   3: var watch = Stopwatch.StartNew();
   4:  
   5: Parallel.For(0, 1000000, (i) =>
   6: {
   7:     dictionary.TryAdd(i, DateTime.Now);
   8: }
   9: );
  10:  
  11: watch.Stop();
  12: Console.WriteLine("Adding: {0}", watch.ElapsedMilliseconds);
  13:  
  14: int count = 0;
  15: watch.Reset();
  16: watch.Start();
  17:  
  18: Parallel.ForEach(dictionary, (item) =>
  19: {
  20:   //  count += item.Key;
  21:     DateTime temp;
  22:     dictionary.TryRemove(item.Key, out temp);
  23: }
  24: );
  25:  
  26: watch.Stop();
  27: Console.WriteLine("Enumerating: {0}", watch.ElapsedMilliseconds);
  28: Console.WriteLine("Items in Dictionary: {0}", dictionary.Count);
  29: Console.ReadLine();

  Adding: Enumerating:
  7550 482
  4433 464
  7534 482
  4452 464
  4216 393
  3441 264
  6094 483
  5953 676
  5462 446
Average: 5459 462

 

Not good numbers at all, but not unexpected when you think about it.  Each iteration of the two loops would become a Task when parallelized, which means we’re incurring the overhead of instantiating two million Task objects, scheduling them and executing them – but each Task consists of very little code; code that doesn’t take that long to begin with, so any performance improvement we gain by executing in parallel is offset (and more) by the overhead of managing the Tasks.  Something to keep in mind as you’re looking for parallelization candidates in a real application.

So what about the more traditional way of handling this – the situation where we make the decision to remove an item from a collection while enumerating over it.  Typically we’d probably make a list of the items to be removed, then remove them after the first enumeration was complete.

   1: var dictionary = new Dictionary<int, DateTime>();
   2:  
   3: var watch = Stopwatch.StartNew();
   4:  
   5: for (int i = 0; i < 1000000; i++)
   6: {
   7:     dictionary.Add(i, DateTime.Now);
   8: }
   9:  
  10: watch.Stop();
  11: Console.WriteLine("Adding: {0}", watch.ElapsedMilliseconds);
  12:  
  13: watch.Reset();
  14: watch.Start();
  15: var toRemove = new List<int>();
  16:  
  17: foreach (var item in dictionary)               
  18: {
  19:     toRemove.Add(item.Key);
  20: }
  21: foreach (var item in toRemove)
  22: {
  23:     dictionary.Remove(item);
  24: }
  25:  
  26: watch.Stop();
  27: Console.WriteLine("Enumerating: {0}", watch.ElapsedMilliseconds);
  Enumerating:
  190
  266
  106
  113
  129
  105
  107
  142
  117
Average: 141
image image

 

Based on this limited test, the traditional method of waiting until the first enumeration of a collection is complete before removing items from it appears to still be the most efficient.

Adding to a Dictionary is faster than adding to a ConcurrentDictionary, even if the adding is parallelized … provided the parallelized code is so brief that the overhead of parallelization outweighs the benefits.  That last bit is important, because if the parallelized example had done significantly more than just add an item to a Dictionary, the results would likely be different.

When enumerating the items in a collection, the simple Dictionary again proves faster than ConcurrentDictionary; and when actually modifying the collection by removing items, the traditional method of building a list of items to remove and then doing so after the foreach is complete proves to be fastest.

Does this mean that you should never use one of the new concurrent collections in this way?

That’s a design decision that you’ll have to make based on your particular application.  Keeping in mind that the concurrent collections are still in CTP and will likely improve dramatically in performance by the time .Net 4 is released – but also that the very nature of making them threadsafe and, consequently, able to be modified while enumerating will likely mean that they’re always going to be somewhat less performant than their counterparts.

There may be instances, though, where making the decision to sacrifice performance for this capability is the best solution.  For instance, what if the results of processing one item in the collection result in a need to remove an item (or items) that haven’t been processed yet? 

In that case, simply removing the item at the point the decision’s made, rather than maintaining a list of items not to be processed, might be the simplest, most maintainable solution and sacrificing a bit of performance might be worth it.  Like so many things in software development, the answer is simple …

It depends.

Added 04/03/09: One thing I should probably stress more is that these numbers are reflective the use (or misuse[?]) of the concurrent collection in a single-threaded environment.  The premise is “what’s the price I pay for being able to modify the collection while enumerating”.  As such, this post is really about the performance hit of doing a couple things you maybe shouldn’t be doing in the first place: i.e. using a concurrent collection in a single-thread or parallelizing a million-iteration loop with one line of code in it (!). 

As Josh Phillips points out in the Comments, an optimized version of the collection, used for what it’s intended, has much better numbers – but a post on those has to wait until the betas or RCs are available and I can play with the newer bits.  Boy … sure would like some newer bits to play with … wonder who could get me those … <insert subtle raise of eyebrows here>

;)

kick it on DotNetKicks.com Shout it

4 comments:

Josh Phillips said...

Hi Paul,

Thanks for the really nice write-up. We love it when we see people playing with this stuff and we especially love any feedback on performance. :)

As you alluded to, ConcurrentDictionary is really meant to only be used in multi-threaded scenarios. If there's no chance that you're going to access the same instance concurrently, as with most of our types, we highly recommend that you stick to the single-threaded counterparts that don’t need to muck with synchronization primitives.

When we designed ConcurrentDictionary, we spent some time thinking about the most common scenarios where a thread-safe dictionary would be useful. Making a collection thread-safe really forces you to think about which scenarios are most common so that you can structure the elements and synchronization primitives optimally. Of course, using just a single lock or a lock per element wouldn’t provide much benefit in terms of performance. We concluded that the most common scenarios were very read-heavy – think caches, look-ups, etc. Typically you add some items to the structure then spend a lot of time checking if a certain key exists or simply retrieving the value. That said, ConcurrentDictionary is structured in a way that will do somewhat poorly for write-heavy concurrent scenarios (as you’ve discovered) and does really well in read-heavy concurrent scenarios.

In fact, one of the perf tests we’ve written tracks the speedup between a Dictionary + a single lock vs. ConcurrentDictionary with a 90%/10% read/write mix. On an 8-way machine, ConcurrentDictionary consistently sees a 6x speedup. :) As you’ve guessed, we have done a bit of optimization since the last CTP release.

Also, take a look at the constructor overload that has a concurrencyLevel parameter. This parameter is really a hint to the collection that helps it restructure itself for optimal efficiency on a targeted number of writing threads. So if you have 2 threads that will constantly be writing and 2 threads that will constantly be reading, passing in 2 to this parameter should help performance significantly.

Thanks again for the great post!

Josh Phillips
Program Manager
Microsoft
Parallel Computing Platform

Paul Jackson said...

Josh,

Thanks for the comment! As you can probably tell, I'm very excited about all the great stuff your team(s) are working on. I work in a fairly heavily-threaded environment and, well, I frankly hate threading -- as evidenced by the "Work of the Devil" tag that I attach to some threading posts. But your new stuff is so different and easier to work with that I'm looking forward to its release.

I wasn't intending to do a post on performance until closer to release, but my previous one on the use (or misuse(?)) of the concurrent collections in a single-threaded environment prompted me to take this look.

From your comment it sounds like we can look forward to much improved performance when we actually use these things as intended -- instead of the methods we'd have to use to roll our own concurrency today. That's great news, and I'm eagerly awaiting the Betas/RCs, so I can get my hands on some newer bits.

Paul

John Sharratt said...

Hi Paul,

Great article, it's good to see people taking the overhead of locking mechanisms seriously. Often in programming we can slip into the trap of using the wrong tool for the wrong job.

On another note, the parallel task library is the most exciting part of the .Net 4 release and I can't wait for the full release. It's potential to bring software up to the level of multithreading needed for multicore processors is massive.

Just a couple of small points really, firstly

The overhead of the 'DateTime.Now' operation in the add loop can add significatly to the overall overhead (likely more than the dictionary itself). This is because this property ends up making a low level call to query the performance counter.

Secondly the Parallel loop you have used can be optimised massively for tight loops by using partitioning, this will still yield the performance gains of parallel processing but avoid the overhead of tasks per item.

Check this article out for details:

http://msdn.microsoft.com/en-us/library/dd560853%28VS.100%29.aspx

Thomas Braun said...

As such parallel programming has the same problem that is with concurrent programming but .net 4.0 parallel is one step forward. The new API resolves a lot of issues of parallel section, and significantly assists in easing up parallel debugging.