As with almost any performance work, your mileage may vary (in particular the 64-bit JIT may work differently) and you almost certainly shouldn't care. Relatively few people write production code which is worth micro-optimizing. Please don't take this post as an invitation to make code more complicated for the sake of irrelevant and possibly mythical performance changes.
I've been doing quite a bit of work on Noda Time recently - and have started getting my head round all the work that James Keesey has put into the parsing/formatting. I've been reworking it so that we can do everything without throwing any exceptions, and also to work on the idea of parsing a pattern once and building a sequence of actions for both formatting and parsing from the action. To format or parse a value, we then just need to apply the actions in turn. Simples.
Given that this is all in the name of performance (and I consider Noda Time to be worth optimizing pretty hard) I was pretty cross when I ran a complete revamp through the little benchmarking tool we use, and found that my rework had made everything much slower. Even parsing a value after parsing the pattern was slower than parsing both the value and the pattern together. Something was clearly very wrong.
In fact, it turns out that at least two things were very wrong. The first (the subject of this post) was easy to fix and actually made the code a little more flexible. The second (the subject of the next post, which may be tomorrow) is going to be harder to work around.
The new() constraint
In my SteppedPattern type, I have a generic type parameter - TBucket. It's already constrained in terms of another type parameter, but that's irrelevant as far as I'm aware. (After today though, I'm taking very little for granted...) The important thing is that before I try to parse a value, I want to create a new bucket. The idea is that bits of information end up in the bucket as they're being parsed, and at the very end we put everything together. So each parse operation requires a new bucket. How can we create one in a nice generic way?
Well, we can just call its public parameterless constructor. I don't mind the types involved having such a constructor, so all we need to do is add the new() constraint, and then we can call new TBucket():
internal sealed class SteppedPattern<TResult, TBucket> : IParsePattern<TResult>
where TBucket : new()
{
public ParseResult<TResult> Parse(string value)
{
TBucket bucket = new TBucket();
}
}
Great! Nice and simple. Unfortunately, it turned out that that one line of code was taking 75% of the time to parse a value. Just creating an empty bucket - pretty much the simplest bit of parsing. I was amazed when I discovered that.
Fixing it with a provider
The fix is reasonably easy. We just need to tell the type how to create an instance, and we can do that with a delegate:
internal sealed class SteppedPattern<TResult, TBucket> : IParsePattern<TResult>
{
private readonly Func<TBucket> bucketProvider;
internal SteppedPattern(Func<TBucket> bucketProvider)
{
this.bucketProvider = bucketProvider;
}
public ParseResult<TResult> Parse(string value)
{
TBucket bucket = bucketProvider();
}
}
Now I can just call new SteppedPattern(() => new OffsetBucket()) or whatever. This also means I can keep the constructor internal, not that I care much. I could even reuse old parse buckets if that wouldn't be a semantic problem - in other cases it could be useful. Hooray for lambda expressions - until we get to the next post, anyway.
Show me the figures!
You don't want to have to run Noda Time's benchmarks to see the results for yourself, so I wrote a small benchmark to time just the construction of a generic type. As a measure of how insignificant this would be for most apps, these figures are in milliseconds, performing 100 million iterations of the action in question. Unless you're going to do this in performance-critical code, you just shouldn't care.
Anyway, the benchmark has four custom types: two classes, and two structs - a small and a large version of each. The small version has a single int field; the large version has eight long fields. For each type, I benchmarked both approaches to initialization.
The results on two machines (32-bit and 64-bit) are below, for both the v2 CLR and v4. The 64-bit machine is much faster in general - you should only compare results within one machine, as it were.)
CLR v4: 32-bit results (ms per 100 million iterations)
| Test type | new() constraint | Provider delegate |
| Small struct | 689 | 1225 |
| Large struct | 11188 | 7273 |
| Small class | 16307 | 1690 |
| Large class | 17471 | 3017 |
CLR v4: 64-bit results (ms per 100 million iterations)
| Test type | new() constraint | Provider delegate |
| Small struct | 473 | 868 |
| Large struct | 2670 | 2396 |
| Small class | 8366 | 1189 |
| Large class | 8805 | 1529 |
CLR v2: 32-bit results (ms per 100 million iterations)
| Test type | new() constraint | Provider delegate |
| Small struct | 703 | 1246 |
| Large struct | 11411 | 7392 |
| Small class | 143967 | 1791 |
| Large class | 143107 | 2581 |
CLR v2: 64-bit results (ms per 100 million iterations)
| Test type | new() constraint | Provider delegate |
| Small struct | 510 | 686 |
| Large struct | 2334 | 1731 |
| Small class | 81801 | 1539 |
| Large class | 83293 | 1896 |
(An earlier version of this post had a mistake - my original tests used structs for everything, despite the names.)
Others have reported slightly different results, including the new() constraint being better for both large and small structs.
Just in case you hadn't spotted them, look at the results for classes. Those are the real results - it took over 2 minutes to run the test using the new() constraint on my 32-bit laptop, compared with under two seconds for the provider. Yikes. This was actually the situation I was in for Noda Time, which is built on .NET 2.0 - it's not surprising that so much of my benchmark's time was spent constructing classes, given results like this.
Of course you can download the benchmark program for yourself and see how it performs on your machine. It's a pretty cheap-and-cheerful benchmark, but when the differences are this big, minor sources of inaccuracy don't bother me too much. The simplest way to run under CLR v2 is to compile with the .NET 3.5 C# compiler to start with.
What's going on under the hood?
As far as I'm aware, there's no IL to give support for the new() constraint. Instead, the compiler emits a call to Activator.CreateInstance<T>. Apparently, that's slower than calling a delegate - presumably due to trying to find an accessible constructor with reflection, and invoking it. I suspect it could be optimized relatively easily - e.g. by caching the results per type it's called with, in terms of delegates. I'm slightly surprised this hasn't (apparently) been optimized, given how easy it is to cache values by generic type. No doubt there's a good reason lurking there somewhere, even if it's only the memory taken up by the cache.
Either way, it's easy to work around in general.
Conclusion
I wouldn't have found this gotcha if I didn't have before and after tests (or in this case, side-by-side tests of the old way and the new way of parsing). The real lesson of this post shouldn't be about the new() constraint - it should be how important it is to test performance (assuming you care), and how easy it is to assume certain operations are cheap.
Next post: something much weirder.