August 2004 - Posts

Presenting Performance Figures
Authors need to be very careful about the exact terms they use to describe the performance characteristics of a particular piece of technology. As Simon Robinson, the technical reviewer for my book, often pointed out, benchmark results apply directly only to the test they were derived from, and are influenced by many factors such as hardware, framework version, operating system and other running processes.

In my benchmark harness, the test cases are executed many times on a high-priority thread to minimize the impact of some of these potential interferences, but at the end of the day, the test results are only as good as the code inside the benchmarks.
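As a rough illustration of the "execute many times" idea, here is a minimal Python sketch. It is not the harness from the book: the function name `run_benchmark` and its parameters are made up for this example, and it does not raise thread priority, it only repeats the test case and reports the best and median per-call time.

```python
import statistics
import time


def run_benchmark(test_case, iterations=1000, repeats=5, warmup=1,
                  clock=time.perf_counter):
    """Time test_case and return per-call timings (in clock units) per repeat.

    Hypothetical helper for illustration only; not the author's harness.
    """
    # Warm-up passes let one-off costs (caches, lazy initialization) settle
    # before any measurement is taken.
    for _ in range(warmup):
        for _ in range(iterations):
            test_case()

    samples = []
    for _ in range(repeats):
        start = clock()
        for _ in range(iterations):
            test_case()
        elapsed = clock() - start
        # Average cost of a single call within this repeat.
        samples.append(elapsed / iterations)

    return {
        "best": min(samples),                  # least disturbed repeat
        "median": statistics.median(samples),  # typical repeat
        "samples": samples,
    }


if __name__ == "__main__":
    result = run_benchmark(lambda: sum(range(100)))
    print("best per-call time:", result["best"])
    print("median per-call time:", result["median"])
```

Reporting the best repeat alongside the median is one common way to reduce the influence of other running processes; the figures still only describe this test case on this machine.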

Even for great tests run using the harness, it is critically important that the results are described accurately. Take, for example, virtual methods. In a test for the book, I compared the cost of calling methods with various types of modifiers applied, and found that there was a performance impact for virtual methods. The impact is quite small, and as John Lam points out, depending on the type of modifier and how the processor groups instructions for execution, the performance impact may be eliminated.

Before my book came out, Fawcette published an article by Francesco Balena in which he stated (or did when this link worked):

"Methods that implement interface members are two to seven times slower than regular methods, depending on the language you use and the coding technique you use to declare and call the method."

See the problem? Francesco implies that by being interface-implementing methods, THE METHOD itself will run "two to seven times slower". A software engineer I work with actually stopped using interfaces in his code based on this advice. What Francesco should have said is "Methods that implement interface members are two to seven times slower to call than …". The actual method's execution speed won't be affected, and for a method that does any type of complex processing, the impact will be negligible. Francesco's slight linguistic slip-up meant that this part of the article conveyed a piece of information that was totally wrong. (I'm not trying to pick on Francesco here - he is a great author, and I enjoy his work. I'm just quoting his slip-up because of the interesting results it has caused. I'm sure there are a few places in my writing where I've made the same slip-up.)
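The difference between "slower to call" and "slower to run" can be put into numbers. The Python sketch below is only arithmetic: the function name `overall_slowdown` and the cost figures are assumptions chosen for illustration, not measurements from the article or from my benchmarks.

```python
def overall_slowdown(call_cost, call_factor, body_cost):
    """Return how many times slower a whole invocation becomes when only
    the call overhead (not the method body) grows by call_factor.

    All costs are in the same arbitrary time unit. Hypothetical helper.
    """
    baseline = call_cost + body_cost
    slower = call_cost * call_factor + body_cost
    return slower / baseline


# An empty method: the call overhead is all there is, so the worst-case
# factor of seven shows up in full.
print(overall_slowdown(call_cost=1, call_factor=7, body_cost=0))     # 7.0

# A method whose body costs 1000 units: the same sevenfold call overhead
# makes the whole invocation about 0.6% slower (1007 / 1001).
print(overall_slowdown(call_cost=1, call_factor=7, body_cost=1000))  # roughly 1.006
```

With these made-up figures, the "seven times slower" claim only holds for a method that does nothing at all.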

The motto

Authors: Be careful what you write - people take your words literally at times.

Readers: Be skeptical of all benchmarks, run the test yourself if in doubt, and always be careful how you apply the result of your benchmark to a specific problem.

In Defence Of DataSets
The DataSet vs business objects debate has flared up on the project I am working on, with the pro-business-object lobby pushing for the removal of all DataSet traces from the system. Up front I should declare my preference - I am pretty sold on the benefits of DataSets, and while I wouldn't go to the same extent as Adam Cogan ("There are only two types of programmers - those that use DataSets, and those that wish they did"), I'd want to be sure that the motivations for ditching DataSets were solid before they got the cut.

The arguments against DataSets in this case come down to two main factors:
  • They are Microsoft-specific, and don't play well with other technologies and platforms. This criticism is entirely valid, and I don't disagree with it, as far as it goes. The counter-argument is that wrapping DataSet-centric systems with business object facades so that they play well in the SOA world is certainly possible, and while it isn't easy to accommodate all the semantics of DataSets with business objects, it is a bridge that can be crossed. For the case of a service that provides data from a database, this is essentially what the anti-DataSet crowd are suggesting that we do from the start, so if it is something that can be accomplished ahead of time, there is no reason that it can't be achieved just in time.
  • DataSets perform worse than business objects. A number of benchmarks exist that show using business objects can result in greater throughput than DataSets. Any hand-crafted type or algorithm is going to perform better than a general-purpose equivalent. Performing the comparison is valid, because you want to understand the cost of the general-purpose solution. It is critical to interpret the raw results of performance comparisons intelligently - is the performance cost of the general-purpose solution offset by the other features it offers, and are those other features (which include having the code already built) worth the cost?

    Before putting the raw performance figures to bed, it is worth addressing two of the DataSet's performance issues - the serialization/deserialization cost of persisting schema as well as data, and the fatness of the bits sent across the wire caused by the schema. By changing the custom tool associated with DataSets from MSDataSetGenerator to XsdCodeGen, it is a pretty simple task to get rid of the schema. Sharing schema information out-of-band with WSDL or project references is fine in many situations, so the loss of schema information in every persisted DataSet instance is not a drama.

    Small DataSets are typical for the project in discussion, so a test case of a single Order with three OrderDetails children from Northwind was chosen to test the performance improvement that could be won by removing the schema from the persisted format. Benchmarking showed that the schema-free DataSets were five times quicker to serialize and three times quicker to deserialize, with about half as much data transmitted over the wire.

Given the potential performance issues and Microsoft-specific nature of DataSets, why bother with them? To me, DataSets have the following benefits or features that are either impossible, difficult or tedious to achieve with a business object framework:
  • Developer familiarity. While this isn't a show-stopper for business object frameworks, it should not be underestimated.
  • No need to cross the object-relational impedance mismatch bridge. This is a huge one. Look at the success of ObjectSpaces at crossing this bridge.
  • Good designer support in VS.NET.
  • In-built support for concurrency management, and the ability to retrieve only data changes.
  • The ability to merge two sets of data that share the same schema. Important for data-binding when data is being updated from external sources, as it means that you don't have to do a full re-bind every time this occurs.
  • In-built filtering support with DataView
  • Data query capabilities ("give me all the employees who joined before 1 Mar 2000")
  • Support for storing error information inline with the data (SetColumnError)
  • Excellent binding capabilities, both at runtime and design time.
  • Rich (if slightly imperfect) eventing infrastructure.
  • Support for any-relation navigation. Object graphs typically only offer parent-child (or child only) navigation.
  • Loss-less persistence with the XML Serializer.
  • Rich in-built XML support (with the help of XmlDataDocument)
  • Ability to extract type information (in the form of an XSD) without needing to use the reflection API.
  • Ability to merge and split an arbitrary number of "instance graphs" together for storage or transport.

Nasty Windows Forms Bug - Hidden Windows, Worker Threads and Delayed Handle Creation
An issue came up on the project I am working on at the moment where one of the applications was freezing up while populating the UI from data that was being sent over a web service. All the code to correctly manage Windows calls being made on the correct thread was being automatically generated, so it was a pretty big surprise that the problem cropped up. The freeze was pretty easy to replicate, and after setting up full debug symbols for Windows and the .NET framework, it was apparent that the call that was hanging was the Win32 function SetWindowPos.

After a bit of frigging around, we noticed that the callback that occurred when the web service ended (which was obviously occurring on a thread pool worker thread and not the UI thread) was actually making direct calls against Control-derived objects. The code in this method was correct - we were checking InvokeRequired, and had the logic to Invoke back onto the UI thread if required, but InvokeRequired was returning false in our case. Looking at the logic of InvokeRequired, the current thread was being compared to the thread that the Windows handle was created on, and these were the same in this case. This occurred despite the fact that the Control-derived object that we were accessing was created back on the UI thread. What the hell was happening?

A bit more investigating confirmed that Windows handles are not created until they are actually required. The Handle property only creates the real Windows handle on the first get_ call, which doesn't occur when a Control-derived object is created. The problem in this case was that the window being populated was not actually being shown until the data had come back from the web service, so no one had asked for the handle until it was accessed as part of the InvokeRequired check. This in turn resulted in the handle being created on the worker thread, which we did not own, and which didn't have a message pump set up to handle Windows calls. The result - the app locked up when other calls were made to the previously hidden window, as these calls were made on the main UI thread, which reasonably assumed that all other Control-derived objects had also been created on this thread.

The work-around is simple - access the Handle property somewhere during the Control-derived object's creation, which forces the real underlying Windows handle to be created. After that, all works well. The problem was found in the .NET Framework V1.1, and is still there in the current Beta 1 release of the 2.0 Framework. We've submitted the problem to Microsoft, and I'll update you when we get word back. The fix is reasonably simple - they need to track the handle of the thread that the object was created on, and if this is different when the Handle property is accessed for the first time, the call should go back to the object-creating thread.
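The mechanism can be mimicked with a small toy model. The Python sketch below is not Windows Forms code and is not the repro from this post: the `Control` class, the `force_handle` flag and `check_from_worker` are all invented for illustration, and the only behaviour modelled is "the handle belongs to whichever thread asks for it first".

```python
import threading


class Control:
    """Toy stand-in for a Control-derived object (hypothetical, not WinForms)."""

    def __init__(self, force_handle=False):
        self._handle_thread = None  # thread that created the handle, if any
        self._lock = threading.Lock()
        if force_handle:
            # The work-around: touch the handle while still on the thread
            # that is constructing the object.
            _ = self.handle

    @property
    def handle(self):
        # The handle is created lazily, on whichever thread asks first.
        with self._lock:
            if self._handle_thread is None:
                self._handle_thread = threading.get_ident()
            return id(self)

    @property
    def invoke_required(self):
        # The check itself touches the handle, so it may create it.
        _ = self.handle
        return threading.get_ident() != self._handle_thread


def check_from_worker(control):
    """Evaluate control.invoke_required on a freshly started worker thread."""
    result = []
    worker = threading.Thread(
        target=lambda: result.append(control.invoke_required))
    worker.start()
    worker.join()
    return result[0]


if __name__ == "__main__":
    # Hidden window: nobody asked for the handle before the worker did.
    print(check_from_worker(Control()))                   # False
    # Handle forced during construction on the main thread.
    print(check_from_worker(Control(force_handle=True)))  # True
```

In the first case the worker thread ends up owning the handle and is told that no marshalling is needed, which matches the symptom described above; forcing the handle during construction restores the expected answer.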

A simple repro that shows the handle being created on the wrong thread is shown here

Calling All Visual C++ 6 Programmers
One of the topics I discussed with Eric Rudder last night was the reasonably large group of programmers who have stuck with Visual C++ 6. This group seems to be the programmers that time forgot - they have a simple and smooth migration path to 7.0 and 7.1, and these newer products add a heap of functionality that is valuable in the real world - better standards compliance (98.11% in the current release), better security with the /GS switch, better performance (/G7 and SIMD), and the OPTIONAL ability to access the .NET Framework and CLR. The migration path is trivial - even for a large project, a migration is less than a day's work. I have successfully migrated largish (>100k LOC) projects from Visual C++ 1.52 to 7.1 without too many dramas, and that is going from 16->32 bit as well.

So, the question that Eric and I have, is WHY ARE YOU STILL USING VISUAL C++ 6? If there is some reason (real or imagined) for avoiding the migration, please email me with the reason (nick at dotnetperformance dot com), and I'll compile that list and send it to Eric. If you've been putting the move off, now is the time to move. I'd hold off the move to Managed C++ until 2005 ships, but if native code is where you are at, Visual C++ 2003 is an excellent product.