Sunday, April 26, 2009

Bulk Email Software

Copyright 2008-2009, Paul Jackson, all rights reserved

No, I haven’t turned SPAMmer. 

But the company I work for needs to send email communications to its customers, and the system we threw together in Java almost a decade ago just isn’t up to the current task.  It wasn’t even part of the original requirements for our website; we just decided it would be cool if we could send customers emails, so we built it in with a couple days’ effort.

Over time, it’s become a critical communication tool and the slapped-together nature of the system is showing its age and limitations – so we started looking at options and came up with three basic possibilities:  enhance the current system, use a hosted service or buy desktop software.

The basic parameters: we have around 26,000 customer email addresses and a half-dozen opt-in news topics.  In addition, we send mandatory “alerts” to all customers, regardless of their opt-in choices.  This is disclosed to them as part of their membership agreement for using the website – our business involves statutory and legal announcements that customers are required to receive, so for these, they don’t get to say no.

There are also some things that the current system doesn’t support which the business has requested over the years, or that would just be smart to do, so we also need a solution that will:

  • Support HTML emails.  The original system is text-only, and our all-text communications look pretty dated at this point.  In fact, their appearance has more in common with Viagra ads than an email from a professional company.
  • Not lock up midway through a send.  The current solution is written in Java running under IBM’s WebSphere and uses Enterprise JavaBeans – I’ll wait while those of you with experience in that realm shudder a bit – so it has some memory and performance issues.  One symptom is that it sometimes locks up and stops sending, and because it has no way to resume from where it stopped, we wind up sending duplicates.  Like I said, we slapped this thing together ten years ago on a whim.
  • Provide better filtering.  Over time the business has asked to be able to send alerts to subsets of our customers, mostly based on geography.  The current system can’t do this – it only filters by customer type and subscription; it doesn’t know which customers are in Florida or which are in particular counties.  I know this sounds trivial, but the nature of our data doesn’t make it easy: the web system doesn’t know about the customer data (where the address is), because they’re two different systems.

And, finally, in case you hadn’t noticed, the economy sucks right now and we’re in the real-estate sector, so cost is a big deal – we want something very, very inexpensive.

There are pros and cons to each option – build it ourselves, use a hosted solution or buy desktop software. 

Being a developer, I like the idea of writing my own solution.  That way, it does exactly what I want it to do the way I want it to do it.  (Okay, fine, in reality it does exactly what I tell it to do in exactly the way I coded it to do it, which isn’t always what I “want”, but that’s a different post.)  But is this really the best use of the company’s resources if there are commercial products that will do the job?  That depends on the capabilities and cost, of course, which is why we do the build vs. buy evaluation.

The online solutions, like Constant Contact, look good – they meet all of the requirements except that they require every mailing to offer an opt-out, something one type of mailing we send doesn’t allow.  Also, we’d fall into their $150/month pricing bracket, so it might be more economical, in the long run, to just write it ourselves.

That brought us to desktop solutions in our evaluation and we found SendBlaster.

SendBlaster definitely met our cost requirement – there’s a free version that sends up to 100 emails at a time, and the paid version, which sends an unlimited number, costs a flat €75, or about $100.


It supports HTML emails and hasn’t locked up at all during testing – and if it ever does, it can resume a mailing from where it left off.

SendBlaster has some nice import/export features that will let us run queries – or even create a web page – to get customers into and out of the lists, and it supports an unlimited number of lists, so we can have one list for each of the news topics we allow our customers to subscribe to and one that includes all of them for the mandatory emails.

SendBlaster satisfies most of what the business wants for geographic filtering, too.  We can import our customers’ state, ZIP code and county and filter the lists based on these, which will let us stop sending Florida emails to customers in other states.
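
Just to sketch how we expect to bridge the two systems – and to be clear, the file names, column layouts and the Florida filter below are hypothetical, not our real schemas, and I’m assuming SendBlaster’s import will take a simple comma-separated file – a small console program could join the web system’s subscriber export to the customer system’s address export and produce a CSV to import:

    // Hypothetical sketch: merge two exports on email address and write a
    // flat CSV for import.  File names and columns are invented for illustration.
    using System;
    using System.IO;
    using System.Linq;

    class BuildImportList
    {
        static void Main()
        {
            // customers.csv: email,state,county,zip   (export from the customer/address system)
            // Assumes one row per email address.
            var addresses = File.ReadAllLines("customers.csv")
                .Skip(1)
                .Select(line => line.Split(','))
                .ToDictionary(
                    f => f[0].Trim().ToLowerInvariant(),
                    f => new { State = f[1].Trim(), County = f[2].Trim(), Zip = f[3].Trim() });

            // subscribers.csv: email,topic   (export from the web system)
            var rows = File.ReadAllLines("subscribers.csv")
                .Skip(1)
                .Select(line => line.Split(',')[0].Trim().ToLowerInvariant())
                .Distinct()
                .Where(email => addresses.ContainsKey(email))       // join the two exports on email
                .Where(email => addresses[email].State == "FL")     // the geographic filter
                .Select(email => string.Format("{0},{1},{2},{3}",
                    email, addresses[email].State, addresses[email].County, addresses[email].Zip));

            File.WriteAllLines("florida-import.csv",
                new[] { "email,state,county,zip" }.Concat(rows).ToArray());
        }
    }

Something like that could be scheduled ahead of each send so the SendBlaster lists stay in sync with both systems.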

For $100, we really can’t go wrong with this software – if it turns out not to work well for us, we’ve spent less than one month of an online service charge or less than it would cost to put four people in a room for an hour to talk about writing our own.

Monday, April 13, 2009

Two Years Without Television – Windows Media Center and Streaming Alternatives to Cable

Copyright 2008-2009, Paul Jackson, all rights reserved

March marked two years since I stopped paying for cable or satellite television.  No, I didn’t string a cable from the neighbor’s box, neither did I give up watching movies and television shows altogether – I simply stopped paying cable and satellite providers for content that was already being paid for via advertising.

At the time I made the decision to do this I’d noticed that most of the programs I watched were made available on DVD at some point, and that I’d almost always rent the DVDs to catch up on missed episodes; also, online alternatives were becoming more prevalent, so what was the point in paying exorbitant cable or satellite fees?

So I set about building a Windows Media Center PC and, since I’m lazy, ripping all my DVDs so I wouldn’t have to get off the couch to watch one.  Actually, Media Center wasn’t my first choice – I was originally going to simply build a file server and get a media extender device, like those from DLink.  In fact, my first purchase toward this goal was a DLink DSM-320RD. 

The 320 worked wonderfully for streaming music, but I soon found problems with video – especially DVD-quality video – and with the device’s interface.  The way these devices typically work is that video is streamed from a PC to the device, so the server software is important.  After trying the software that came with the 320, the compatible software from Nero and the free TVersity, I finally gave up on the dedicated-device solution and decided to hook a PC directly to the television.

Base Hardware

One nice thing about building a Media Center PC these days is that you don’t need to get the latest, fastest hardware in order to make it work.  Ultimately all it’s doing is playing video, so there’s not a lot of horsepower required.

Intel Dual-Core 1.6GHz
2 GB RAM
On-board 256 MB Video with DVI connector
300 GB Main Drive
SoundBlaster Live! 24-bit External
Media Center Remote
Wireless Keyboard and Mouse

This has proven to be plenty of power, even with some other things I run on the system, which we’ll get to in a bit.

Television

I started with a 42” Mitsubishi rear-projection television that we already had.  In fact, this is the reason I chose a motherboard with DVI out for the video instead of HDMI.  Though HDMI was newer and better, the existing television, already a few years old, didn’t support it, so I went with DVI.

After connecting the television and the PC, I ran into the problem of overscan – tube and projection televisions project the image beyond the edges of the visible screen.  This is somewhat incompatible with computer video, so some of the desktop was cut off along the four sides.  Media Center adjusts for this phenomenon, but the Windows desktop doesn’t, making the experience fine in Media Center but rather awkward when using the browser.

Last year I replaced the television with an LCD – since LCDs don’t have an overscan problem, I now have a true 1080p desktop.

Software

Everything you really need is included in Windows Vista Home Premium or Ultimate.  There are a lot of people out there who still use XP Media Center Edition, but I’ve been running Vista for two years now and haven’t had a single problem that was Vista’s fault.  That being said, I’ll still be switching to Windows 7 Media Center, because of the new features.

A nice thing about Media Center, though, is that it’s extensible and there are two plugins that are essential.

MyNetflix

First is the MyNetflix plugin, written by Anthony Park.  Obviously, since my goal was to replace cable and satellite with online sources and DVDs, Netflix was a critical component of my solution – but beyond the mailed DVDs there’s Netflix’s Watch Now feature.  The MyNetflix plugin provides a Media Center interface to Watch Now, allowing you to browse, search and view Watch Now offerings from within Media Center.

You can also browse your mail queue and search for movies, adding them to your mail or Watch Now queues with a Media Center remote.

MyNetflix is free, but donations are accepted. 

My Movies

The second essential plugin is My Movies from Binnerup Consulting.  My Movies is important to me because it helps organize my DVD collection and plays the ripped discs from the hard drive.

My Movies has tons of features, including a comprehensive database of DVD information – front and back cover art, cast and crew, categories, and descriptions.  All of this is searchable and browsable through the Media Center interface, so you can, for instance, find all the movies you have that star a particular actor.

My Movies manages a DVD collection whether you rip the discs to hard drive or not, and it supports several DVD carousels.  Even without having the DVDs online, it’s nice to have a database of your collection, and My Movies also allows you to export your collection data to a website provided by Binnerup – here’s mine.

Getting movies into My Movies is easy: the software can recognize a DVD by its Disc ID when it’s inserted into the drive.  Alternatively, you can enter its barcode or scan the barcode with a webcam, search the main My Movies database by title, or enter the data manually.  That last option is useful for entering your own videos – for instance, I have all of my family’s Christmas and birthday videos, as well as my daughter’s dance recitals, as part of the collection, so when someone visits I can just browse to these with the remote and show hours of family videos …

There’s also a My Movies client, written by a third-party, that can access the My Movies database from any Windows client, even without Media Center.  I put this on my son’s XP system so he can watch movies in his room.  With My Movies and this client, my original DVD discs remain safely stored away, no matter how many times he wants to watch Star Wars.

My Movies is also free, with donations accepted.

One change I’ve made to my Vista configuration is to enable multiple Remote Desktop Sessions – this is a hack that lets me access the Media Center PC remotely while it’s also running a session on the television.  This allows me to use the Media Center PC to perform other tasks even while I’m watching something.  What tasks?  Well, I’ve used it to capture and process video from old VHS tapes, download large files, even to rip a new DVD while watching a different movie – the CPU isn’t really struggling to play video, either streaming or from a file, so there’s plenty of processing power available for me to use.

Ripping DVDs

I use two programs from Slysoft for ripping DVDs: AnyDVD and CloneDVD.  This combination allows me to rip just the title and audio tracks I want, leaving behind the trailers, menus, special features and other languages.  Doing this reduces the amount of space necessary to store the main title, which is what I typically watch anyway – if I want to watch a special feature, I can always dig the original disc out of storage.

This space savings is important to me – a typical DVD, with all features and sound, runs upwards of 7 GB, but stripped of non-essentials, they average between 3 and 6 GB.  One or two gigs may not seem like a lot to save, but added up over four hundred titles, it becomes significant.

Storage

I went ‘round-and-‘round on the storage issue at the beginning, debating between internal storage and external.  Remember my original intent was to build a file server and stream to a media device, so internal storage made sense, but when I changed my mind on that it also changed the storage solution.

With a streaming solution, I could put the file server in a back room or closet, which would solve two problems: dirt and noise.  See there are three dogs and nine cats in my house … yeah, I know … so dust and hair are issues.  A lot of internal drives would mean a case with a lot of fans, and a lot of fans add up to two things: noise and openings in the case.  Having that amount of fan and drive noise in my living room when I’m trying to enjoy a movie wasn’t optimal – neither was having a case with so many infiltration points for dust and hair. 

I could have built two systems, and may take that route one day, but I also wanted to keep startup costs low and build up drive space over time as I expanded my library, so I went with external drives.  The current storage solution is made up of:

Western Digital My Book World Edition 1 TB Network Attached Storage (quantity 2)
Western Digital My Book Essential Edition 1 TB USB 2.0 External Hard Drive
Iomega Prestige 1 TB USB 2.0 Desktop External Hard Drive

The World Books attach directly to the network and the USB drives hang off of them, giving me 4TB of storage on the network.  All the drives were acquired over the last two years by carefully scouring sales, closeouts and store-closings, so I was able to pick them up for much less than the normal price, making it a pretty cheap 4TB.

On a separate note, I think it’s utterly amazing that we live in a world where I can have 4TB of data storage in my house.  To put it in perspective, when I bought my first hard drive, a 20MB drive for my Apple IIgs, it cost around $600 – today, for less than that, I have 4TB just for digital media.  I can’t imagine what will be available in another twenty years or what we’ll use it for … it’s amazing.

Network

I quickly found that wireless simply wasn’t fast or reliable enough to serve video in my environment, so I had to string some cable.  My house has steel studs and wireless has always been a challenge here.  I took the opportunity, while up in the attic, to drop cables into the kids’ rooms as well, so now the only computers not wired are the laptops.  Since they’re usually in the same place all the time, I may break down and string some cable for them, too, leaving the wireless solely for visitors.  From a security-standpoint, I like this idea, too.

Streaming

In addition to DVDs, the Internet is a source of video content for me, with Hulu and FanCast being my favorites.  Although there isn’t an official Media Center plugin available for either site, there is a third-party option in beta.  The SecondRun.tv plugin works not only with Hulu and FanCast, but with other providers as well.

SecondRun has an advantage in that it’s network- and show-based, so content from multiple sources is aggregated based on the program or network, not the streaming provider.  A disadvantage is that, since it’s not provider-based, you have to browse for shows, rather than setting up your Hulu queue and playing that.

I like SecondRun for browsing shows … a lot … but I still want a Hulu plugin for Media Center so that my subscriptions will just show up in my queue.

In the meantime, I’ll simply use a browser.  With Vista set to a large font and a wireless keyboard/mouse, the browser has a decent ten-foot experience.  I can live with it until someone (maybe me) gets fed up and writes a plugin that accesses the queue.

Done

So that’s it … two years with no cable or satellite bill.  I figure that’s saved me close to a couple thousand dollars – more than paying for the hardware.  The Netflix subscription has a fee, but I’d have been paying that anyway to get new movies to watch, so it’s a wash.

It’s a little different – with some shows there’s a delay before they’re available online, so I’m not always “up-to-date” when talking to others about the show, and I’m a year behind if I wait for the DVDs; but even that has advantages, because I can sit down with the full set of DVDs and watch an entire season without having to wait a week between episodes.  I like that.

I really think streaming is the future and sites like Hulu and FanCast have it right – on-demand content supported by advertising.  I’m okay with the ads because they pay for the show and they’re shorter and fewer than in broadcast television.  In fact, I’m looking forward to the day when one of those sites ties the viewer’s profile to the ad-server, so the ads can be targeted better.  Like Google Adwords, being able to target specific markets, instead of just everyone who watches a show, will make the ads more valuable – which translates into fewer ads being necessary and better content … and maybe seeing ads that the viewer’s actually interested in.

As more content moves online, and more viewers realize they have options other than the traditional cable and satellite companies, it’ll be interesting to see how those companies react to the changing market.

Monday, April 6, 2009

The Programmer Responsible for the World’s Financial Collapse

Copyright 2008-2009, Paul Jackson, all rights reserved

I knew it.  You knew it.  We just didn’t want to admit it.  There had to be software at the heart of the world’s financial woes and some way to blame a programmer. 

Sure enough, the press has tracked down one of the programmers responsible for writing the software that helped financial companies with “securitisation”, or the turning of regular mortgage notes into derivative securities that no one actually understands.

Michael Osinski, 55, retired from programming and now farming premium oysters off Long Island, was one of the programmers for a company that supplied this software to many financial firms, and he was involved in the new “feature” that added the subprime mortgage market to the process.

Osinski decided to go public after being called a “devil” and “facilitator” by people who found out what he used to do. 

Now, if I’d written that software I don’t think I’d go public … there’d probably be a gap on my resume, even, but Osinski decided to talk to the press, and he makes a really valid point:

"Securitisation is a good thing when it allows firms correctly to price risk into their calculations," he said. "If people are re-paying their mortgages, then the process works fine. But if you put garbage in, you'll get junk out."

Poll: Should developers consider the larger impact of their work?  (BuzzDash: http://www.buzzdash.com/polls/should-developers-consider-the-larger-impact-of-their-work-158820/)

The good or bad of the software depends on how it’s used; the program itself is just a tool.  Like eBay, which is a great auction site but has had prevalent fraud – is that the fault of the software or the users?

But do we, as developers, have any responsibility to think about how the software we’re writing could be misused and what the consequences might be? 

We do have a responsibility to think about how it might be attacked by someone malicious, right?  The security of a system is our responsibility and we’re supposed to analyze the possible attack surfaces and methods to ensure that the system can’t be misused by an attacker – but what about misuse by a legitimate user?

When dealing with security, I always tell my team to assume the client (whether the end-user or another development team) is either stupid or malicious.  Meaning that they will, at some point, send the most damaging input possible, either because they’re deliberately trying to break the system or they don’t know what they’re doing – and so the system must be protected from that.
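
In code, that mindset is nothing fancy – just validating everything at the boundary before any real logic runs.  A minimal, hypothetical sketch (the type, the limits and the rules here are invented for illustration, not from any real system):

    using System;

    public class TransferRequest
    {
        public string AccountId { get; set; }
        public decimal Amount { get; set; }
    }

    public static class TransferValidator
    {
        // Assume the caller is stupid or malicious: reject anything suspect
        // at the boundary, before any business logic runs.
        public static void Validate(TransferRequest request)
        {
            if (request == null)
                throw new ArgumentNullException("request");

            if (string.IsNullOrEmpty(request.AccountId) || request.AccountId.Length > 20)
                throw new ArgumentException("Account id is missing or malformed.");

            if (request.Amount <= 0m || request.Amount > 1000000m)
                throw new ArgumentException("Amount is outside the allowed range.");
        }
    }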

So does that extend to analyzing the possible negative impact of a new “feature” or system outside of the software itself?  Do we have a responsibility to ask: What will using this software as designed do to the user, company or world?

Or, even if we don’t have the responsibility, should we do it anyway because, sure-as-shootin’, somebody’ll say it’s our fault?

http://www.telegraph.co.uk/finance/financetopics/financialcrisis/5106510/Former-Wall-Street-computer-whizz-Michael-Osinski-admits-his-work-broke-the-banks.html


Saturday, April 4, 2009

Book Recommendation: One Second After

Copyright 2008-2009, Paul Jackson, all rights reserved

Normally I recommend technology and programming related books on this blog and One Second After, by William Forstchen, is fiction, but it’s fiction about technology, so that’s okay … or, rather, it’s fiction about non-technology.

One of my favorite types of fiction is the displaced-person genre, stories where a person or group are dramatically displaced from their normal environment, especially where technology is involved.  Whether it’s about someone with technological knowledge displaced to where that knowledge doesn’t exist yet (A Connecticut Yankee in King Arthur's Court, Island in the Sea of Time and 1632) or apocalyptic tales of people reliant on technology when that technology or the society fails (Dies the Fire and The Stand), there’s something about these stories that appeals to me.

One Second After is different from the other books I’ve read in the genre because most of those rely on a certain amount of fantasy or, at least, willing suspension of disbelief in order to achieve the displacement.  One Second After doesn’t need to – its premise is all too real, believable and possible.

The premise of One Second After is: What would happen to a society if an EMP eliminated most technology?

A generally accepted fact is that an EMP (Electromagnetic Pulse) caused by a nuclear explosion at high-altitude would damage or destroy most unshielded electronics – so imagine the impact on your life if most electronics stopped working.

Not just our convenience and entertainment, but our very survival relies on electronics.  The average city needs power to pump water to its citizens and trucks (modern trucks and cars need their electronics to run) to bring in more food.  Modern medicine is based around technologically-sophisticated diagnostic equipment and drugs have to be shipped on a regular basis.  Our entire financial system is dependent on electronics – it doesn’t matter how much money you have in the bank if the bank records are inaccessible and no one can take a debit card for payment.

One Second After does an excellent job of examining the consequences of all these things and more.  It also explores the societal changes and how people and groups would behave in such a situation – not always pleasantly or likably, but very realistically.

It’s a well-written, plausible story … disturbing because of its very possibility.

Thursday, April 2, 2009

A Bit About the Performance of Concurrent Collections in .Net 4.0

Copyright 2008-2009, Paul Jackson, all rights reserved

A post I made a couple of days ago about a side-effect of the concurrent collections in the .Net 4.0 Parallel Extensions – that they allow a collection to be modified while it’s being enumerated – has been quite popular, with a lot of attention coming from www.dotnetguru.org, which appears to be a French-language news/aggregation site (I only know enough French to get my face slapped, so it’s hard for me to tell).  I’m not sure why the post would be so popular in France, but the Internet’s weird that way … things take off for unexpected reasons.

Regardless, it occurred to me that some further research might be in order, before folks get all hot for .Net 4.0 and want to change their collections so they can be modified while enumerating.  The question is: what’s the performance penalty for these threadsafe collections?  If I use one in a single-threaded environment, just to get that modification capability, is the performance-price something I’m willing to pay?

So I set up a simple test to satisfy my curiosity – but first the test platform (your mileage may vary):

  1. The test was done using the Visual Studio 2010 CTP, converted to run under Windows Server 2008 Hyper-V. 
  2. The virtual server was set to use all four cores on the test machine and have 3GB of RAM.
  3. The host CPU was a Q9100 quad-core, 2.26 GHz, with 4GB.

It’s also important to note that the Dictionary class has been around for a while and my guess is it’s been optimized once or twice, while the ConcurrentDictionary is part of a CTP.

The test is set up as two loops – the first a for that adds a million items to a Dictionary; the second a foreach that enumerates them:

    static void Main(string[] args)
    {
        var dictionary = new Dictionary<int, DateTime>();

        var watch = Stopwatch.StartNew();

        for (int i = 0; i < 1000000; i++)
        {
            dictionary.Add(i, DateTime.Now);
        }

        watch.Stop();
        Console.WriteLine("Adding: {0}", watch.ElapsedMilliseconds);

        int count = 0;
        watch.Reset();
        watch.Start();
        foreach (var item in dictionary)
        {
            count += item.Key;
        }

        watch.Stop();
        Console.WriteLine("Enumerating: {0}", watch.ElapsedMilliseconds);
        Console.ReadLine();
    }

Not the most scientific of tests, nor the most comprehensive, but enough to sate my curious-bone until I have time to do a more thorough analysis.  Running this nine times, I got the following results:

  Adding (ms)   Enumerating (ms)
  2235          41
  1649          39
  1781          39
  1587          45
  2001          46
  1895          40
  1540          39
  1587          40
  2081          46

  Average: 1817 ms adding, 41 ms enumerating

 

Then I changed to a ConcurrentDictionary (also note the change from Add() to TryAdd()):

    static void Main(string[] args)
    {
        var dictionary = new ConcurrentDictionary<int, DateTime>();

        var watch = Stopwatch.StartNew();

        for (int i = 0; i < 1000000; i++)
        {
            dictionary.TryAdd(i, DateTime.Now);
        }

        watch.Stop();
        Console.WriteLine("Adding: {0}", watch.ElapsedMilliseconds);

        int count = 0;
        watch.Reset();
        watch.Start();
        foreach (var item in dictionary)
        {
            count += item.Key;
        }

        watch.Stop();
        Console.WriteLine("Enumerating: {0}", watch.ElapsedMilliseconds);
        Console.ReadLine();
    }

This change resulted in the following times:

  Adding (ms)   Enumerating (ms)
  4332          80
  3795          80
  4560          77
  5489          75
  4283          76
  3734          74
  4288          79
  4904          96
  3591          83

  Average: 4330 ms adding, 80 ms enumerating

 

So there’s clearly a performance difference, with the ConcurrentDictionary being slower, but keep in mind a few key facts:

  • Again, we’re running the CTP of .Net 4.0, so ConcurrentDictionary is new code that hasn’t been optimized yet, while Dictionary is probably unchanged from previous framework versions;
  • We’re dealing with a million-item collection here, and the enumeration time-difference averages 39 milliseconds – roughly 39 nanoseconds (0.000000039 seconds) per item in the collection;

The time necessary to do the adding is more troublesome to me, but in dealing with a million-item set, is it really that unreasonable?  That’s a design decision you’d have to make for your application.

Having satisfied the curiosity-beast to a certain extent, yet another question arose (curiosity is like that): Since this post came about from the ability to alter a collection while enumerating it, what effect would that have on the numbers?  So I changed the code to remove each item from the collection as it enumerates:

    var dictionary = new ConcurrentDictionary<int, DateTime>();

    var watch = Stopwatch.StartNew();

    for (int i = 0; i < 1000000; i++)
    {
        dictionary.TryAdd(i, DateTime.Now);
    }

    watch.Stop();
    Console.WriteLine("Adding: {0}", watch.ElapsedMilliseconds);

    int count = 0;
    watch.Reset();
    watch.Start();
    foreach (var item in dictionary)
    {
        count += item.Key;
        DateTime temp;
        dictionary.TryRemove(item.Key, out temp);
    }

    watch.Stop();
    Console.WriteLine("Enumerating: {0}", watch.ElapsedMilliseconds);
    Console.WriteLine("Items in Dictionary: {0}", dictionary.Count);
    Console.ReadLine();

Which added significantly to the enumeration time:

  Adding (ms)   Enumerating (ms)
  4162          258
  4124          201
  4592          239
  3959          333
  4155          252
  4026          269
  4573          283
  4471          204
  5434          258

  Average: 4388 ms adding, 255 ms enumerating

 

Removing the current item from the collection during enumeration triples the time spent in the foreach loop – a disturbing development, but we’re still talking about a total of a quarter-second to process a million items, so maybe not worrisome?  Depends on your application and how many items you actually have to process – and other processing that you may have to do.

Now, with the whole purpose of the concurrent collections being parallel development, you have to know that I couldn’t leave it without doing one more test.  After all, those two loops have been sitting there this entire post fairly screaming to try parallelizing them with Parallel.For and Parallel.ForEach:

    var dictionary = new ConcurrentDictionary<int, DateTime>();

    var watch = Stopwatch.StartNew();

    Parallel.For(0, 1000000, (i) =>
    {
        dictionary.TryAdd(i, DateTime.Now);
    });

    watch.Stop();
    Console.WriteLine("Adding: {0}", watch.ElapsedMilliseconds);

    int count = 0;
    watch.Reset();
    watch.Start();

    Parallel.ForEach(dictionary, (item) =>
    {
        //  count += item.Key;
        DateTime temp;
        dictionary.TryRemove(item.Key, out temp);
    });

    watch.Stop();
    Console.WriteLine("Enumerating: {0}", watch.ElapsedMilliseconds);
    Console.WriteLine("Items in Dictionary: {0}", dictionary.Count);
    Console.ReadLine();

  Adding (ms)   Enumerating (ms)
  7550          482
  4433          464
  7534          482
  4452          464
  4216          393
  3441          264
  6094          483
  5953          676
  5462          446

  Average: 5459 ms adding, 462 ms enumerating

 

Not good numbers at all, but not unexpected when you think about it.  Each iteration of the two loops would become a Task when parallelized, which means we’re incurring the overhead of instantiating two million Task objects, scheduling them and executing them – but each Task consists of very little code; code that doesn’t take that long to begin with, so any performance improvement we gain by executing in parallel is offset (and more) by the overhead of managing the Tasks.  Something to keep in mind as you’re looking for parallelization candidates in a real application.
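
One mitigation worth noting – this is a sketch only, I haven’t benchmarked it, and Partitioner.Create is the API as documented for the final .Net 4 (System.Collections.Concurrent), so the CTP bits may differ – is to hand the parallel loop ranges of indexes instead of single indexes, so each delegate call does enough work to be worth scheduling:

    // Sketch: range-partitioned parallel add.  Partitioner lives in
    // System.Collections.Concurrent; Parallel in System.Threading.Tasks.
    var dictionary = new ConcurrentDictionary<int, DateTime>();

    Parallel.ForEach(Partitioner.Create(0, 1000000), range =>
    {
        // Each delegate call now handles a whole block of indexes,
        // amortizing the scheduling overhead across many adds.
        for (int i = range.Item1; i < range.Item2; i++)
        {
            dictionary.TryAdd(i, DateTime.Now);
        }
    });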

So what about the more traditional way of handling this – the situation where we decide to remove an item from a collection while enumerating over it?  Typically we’d make a list of the items to be removed, then remove them after the first enumeration is complete:

    var dictionary = new Dictionary<int, DateTime>();

    var watch = Stopwatch.StartNew();

    for (int i = 0; i < 1000000; i++)
    {
        dictionary.Add(i, DateTime.Now);
    }

    watch.Stop();
    Console.WriteLine("Adding: {0}", watch.ElapsedMilliseconds);

    watch.Reset();
    watch.Start();
    var toRemove = new List<int>();

    foreach (var item in dictionary)
    {
        toRemove.Add(item.Key);
    }
    foreach (var item in toRemove)
    {
        dictionary.Remove(item);
    }

    watch.Stop();
    Console.WriteLine("Enumerating: {0}", watch.ElapsedMilliseconds);

  Enumerating (ms)
  190
  266
  106
  113
  129
  105
  107
  142
  117

  Average: 141 ms enumerating

 

Based on this limited test, the traditional method of waiting until the first enumeration of a collection is complete before removing items from it appears to still be the most efficient.

Adding to a Dictionary is faster than adding to a ConcurrentDictionary, even if the adding is parallelized … provided the parallelized code is so brief that the overhead of parallelization outweighs the benefits.  That last bit is important, because if the parallelized example had done significantly more than just add an item to a Dictionary, the results would likely be different.

When enumerating the items in a collection, the simple Dictionary again proves faster than ConcurrentDictionary; and when actually modifying the collection by removing items, the traditional method of building a list of items to remove and then doing so after the foreach is complete proves to be fastest.

Does this mean that you should never use one of the new concurrent collections in this way?

That’s a design decision you’ll have to make based on your particular application.  Keep in mind that the concurrent collections are still in CTP and will likely improve dramatically in performance by the time .Net 4 is released – but also that the very nature of making them threadsafe and, consequently, modifiable while enumerating likely means they’ll always be somewhat less performant than their counterparts.

There may be instances, though, where the decision to sacrifice performance for this capability is the best one.  For instance, what if processing one item in the collection creates a need to remove an item (or items) that haven’t been processed yet?

In that case, simply removing the item at the point the decision’s made, rather than maintaining a list of items not to be processed, might be the simplest, most maintainable solution and sacrificing a bit of performance might be worth it.  Like so many things in software development, the answer is simple …

It depends.
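
For illustration, here’s roughly what that “remove it at the point the decision’s made” approach looks like – the keys and the removal rule below are invented just to show the mechanics:

    var work = new ConcurrentDictionary<int, DateTime>();
    for (int i = 0; i < 100; i++)
    {
        work.TryAdd(i, DateTime.Now);
    }

    foreach (var item in work)
    {
        Console.WriteLine("Processing {0}", item.Key);

        // Hypothetical rule: finishing item n makes its partner, item n + 50,
        // unnecessary.  Removing it here means that, if the enumerator hasn't
        // handed it out yet, it simply never gets processed – no separate
        // skip-list to maintain.
        DateTime ignored;
        work.TryRemove(item.Key + 50, out ignored);
    }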

Added 04/03/09: One thing I should probably stress more is that these numbers reflect the use (or misuse[?]) of the concurrent collection in a single-threaded environment.  The premise is “what’s the price I pay for being able to modify the collection while enumerating?”  As such, this post is really about the performance hit of doing a couple of things you maybe shouldn’t be doing in the first place: i.e. using a concurrent collection in a single thread, or parallelizing a million-iteration loop with one line of code in it (!).

As Josh Phillips points out in the Comments, an optimized version of the collection, used for what it’s intended, has much better numbers – but a post on those has to wait until the betas or RCs are available and I can play with the newer bits.  Boy … sure would like some newer bits to play with … wonder who could get me those … <insert subtle raise of eyebrows here>

;)
