The irritation of bad names

A couple of days ago I accidentally derailed the comments on Eric Lippert's blog post about unused "using" directives. The reason that redundant code doesn't generate a warning in Visual Studio is that it's what you get to start with in Visual Studio. This led me to rant somewhat about other aspects of Visual Studio's behaviour which sacrifice long term goodness in favour of short term efficiency. Almost all the subsequent comments (at the time of writing this post) are concerned with my rant rather than Eric's post. Some agree with me, some don't - but it's only now that I've spotted the bigger picture behind my annoyances.

All of them are to do with names and the defaults provided. I've blogged before about how hard it is to find a good name - it's a problem I run into time and time again, and the ability to rename something is one of the most important refactorings around.

If you don't know, ask

Now if it's hard to find a good name, it stands to reason that anything the IDE can generate automatically is likely to be a poor name... such as "Form1", "textBox1" or "button1_Click". And yet, in various situations, Visual Studio will happily generate such names, and it can sometimes be a small but significant pain to correct it.

The situation which causes me personally a lot of pain is copying a file. For C# in Depth, I have a lot of very small classes, each with a Main method. When I'm evolving an example, I often want to take the existing code and just change it slightly, but in a new file. So I might have a file called OrderByName.cs containing a class called OrderByName. (I agree this would normally be a bad class name, but in the context of a short but complete example it's okay.) I want to just select the file, hit copy and paste, and be asked for a new filename. The class within the file would then be renamed for me as well. As an aside, this is the behaviour Eclipse has in its Java tooling.

In reality, I'd end up with a new file called "Copy of OrderByName.cs", still containing a class called OrderByName. Renaming the file wouldn't offer to rename the class, as the filename wouldn't match the class name. Renaming the class by just changing it and then hitting Ctrl-. would also rename the original class, which is intensely annoying. You're basically stuck doing it manually with find and replace, as far as I can see. There may be some automated aid available, but at the very least it's non-obvious.

Now the question is: why would I ever want a file called "Copy of OrderByName.cs"? That's always going to be the wrong name, so why doesn't Visual Studio ask me for the right name? It could provide a default so I can skip through if I really want to (and probably an "Apply to all items" if I'm copying multiple files) but at least give me the chance to specify the right filename at the crucial point. Once it knows the right new filename before it's got a broken build, I would hope it would be easy to then apply the new name to the class too.

The underlying point is that if you absolutely have to have a name for something, and there's no way a sensible suggestion can be provided, the user should be consulted. I know there's a lot of discussion these days about not asking the user pointless questions, but this isn't a pointless question... at least when it comes to filenames.

If you don't need a name, don't use one

I'm not a UI person, so some of this section may be either outdated or at least only partially applicable. In particular, I believe WPF does a better job than the Windows Forms designer.

Names have two main purposes, in my view. They can provide semantic meaning to aid the reader, even if a name isn't strictly required (think of the "introduce local variable" refactoring) and they can be used for identification.

Now suppose I'm creating a label on a form. If I'm using the designer, I can probably see the text on the label - its meaning is obvious. I quite possibly don't have to refer to the label anywhere in code, unless I'm changing the value programmatically... so why does it need a name? If you really think it needs a name, is "label1" ever going to be the right name - the one you'd have come up with as the most meaningful one you could think of?

In the comments in Eric's blog, someone pointed out that being prompted for a name every time you dragged a control onto the designer would interrupt workflow... and I quite agree. Many of those controls won't need names. However, as soon as they do need a name, prompting for the name at that point (or just typing it into the property view) isn't nearly such a distraction... indeed, I'd suggest it's actually guiding the developer in question to crystallize their thoughts about the purpose of that control.

Conclusion

Okay, this has mostly been more ranting - but at least it's now on my blog, and I've been able to give a little bit more detail about the general problem I see in Visual Studio - a problem which leads to code containing utterly useless names.

The fundamental principle is that I want every name in my code to be a meaningful one. The IDE should use two approaches to help me with that goal:

  • Don't give a name to anything that doesn't deserve or need one
  • If a name is really necessary, and you can't guess it from the rest of the context, ask the user

I don't expect anything to change, but it's good to have it off my chest.

Posted by skeet | 22 comment(s)

Type initialization changes in .NET 4.0

This morning, while checking out an email I'd received about my brain-teasers page, I discovered an interesting change to the CLR in .NET 4.0. At least, I think it's interesting. It's possible that different builds of the CLR have exhibited different behaviour for a while - I only have 32-bit versions of Windows installed, so that's what I'm looking at for this whole post. (Oh, and all testing was done under .NET 4.0b2 - it could still change before release.)

Note: to try any of this code, build in release mode. Running in the debugger or even running a debug build without the debugger may well affect the behaviour.

Precise initialization: static constructors

I've written before about static constructors in C# causing types to be initialized immediately before the type is first used, either by constructing an instance or referring to a static member. In other words, consider the following program:

using System;

class StaticConstructorType
{
    private static int x = Log();
    
    // Force "precise" initialization
    static StaticConstructorType() {}
    
    private static int Log()
    {
        Console.WriteLine("Type initialized");
        return 0;
    }
    
    public static void StaticMethod() {}
}

class StaticConstructorTest
{
    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("No args");
        }
        else
        {
            StaticConstructorType.StaticMethod();
        }
    }
}

Note how the static variable x is initialized using a method that writes to the console. This program is guaranteed to write exactly one line to the console: StaticConstructorType will not be initialized unless you give a command line argument to force it into the "else" branch. The C# compiler controls this using the beforefieldinit flag: a type with a static constructor is not marked beforefieldinit, so it can only be initialized immediately before its first use.
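If you want to see the flag for yourself, reflection exposes it via TypeAttributes.BeforeFieldInit. The following is a minimal sketch rather than anything from the original test code; the type names are made up for illustration, and the result depends on the C# compiler behaving as described above.

```csharp
using System;
using System.Reflection;

class WithStaticConstructor
{
    private static int x = 1;

    // An explicit static constructor, even an empty one, means the
    // C# compiler does not mark the type as beforefieldinit.
    static WithStaticConstructor() {}
}

class WithoutStaticConstructor
{
    private static int x = 1;
}

class BeforeFieldInitCheck
{
    // Reads the flag from the type's metadata; this doesn't run any initializers.
    public static bool IsBeforeFieldInit(Type type)
    {
        return (type.Attributes & TypeAttributes.BeforeFieldInit) != 0;
    }

    static void Main()
    {
        Console.WriteLine("With static constructor: {0}",
            IsBeforeFieldInit(typeof(WithStaticConstructor)));
        Console.WriteLine("Without static constructor: {0}",
            IsBeforeFieldInit(typeof(WithoutStaticConstructor)));
    }
}
```

With the Microsoft C# compiler I'd expect this to report False for the first type and True for the second.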

So far, so boring. We know exactly when the type will be initialized - I'm going to call this "precise" initialization. This behaviour hasn't changed, and couldn't change without it being backwardly incompatible. Now let's consider what happens without the static constructor.

Eager initialization: .NET 3.5

Let's take the previous program and just remove the (code-less) static constructor - and change the name of the type, for clarity:

using System;

class Eager
{
    private static int x = Log();
    
    private static int Log()
    {
        Console.WriteLine("Type initialized");
        return 0;
    }
    
    public static void StaticMethod() {}
}

class EagerTest
{
    static void Main(string[] args)
    {
        if (args.Length == 0)
        {
            Console.WriteLine("No args");
        }
        else
        {
            Eager.StaticMethod();
        }
    }
}

Under .NET 3.5, this either writes both "Type initialized" and "No args" (if you don't pass any command line arguments) or just "Type initialized" (if you do). In other words, the type initialization is eager. In my experience, a type is initialized at the start of execution of the first method which refers to that type.

So what about .NET 4.0? Under .NET 4.0, the above code will never print "Type initialized".

If you don't pass in a command line argument, you see "No args" as you might expect... if you do, there's no output at all. The type is being initialized extremely lazily. Let's see how far we can push it...

Lazy initialization: .NET 4.0

The CLR guarantees that the type initializer will be run at some point before the first reference to any static field. If you don't use a static field, the type doesn't have to be initialized... and it looks like .NET 4.0 obeys that in a fairly lazy way. Another test app:

using System;

class Lazy
{
    private static int x = Log();
    private static int y = 0;
    
    private static int Log()
    {
        Console.WriteLine("Type initialized");
        return 0;
    }
    
    public static void StaticMethod()
    {
        Console.WriteLine("In static method");
    }

    public static void StaticMethodUsingField()
    {
        Console.WriteLine("In static method using field");
        Console.WriteLine("y = {0}", y);
    }
    
    public void InstanceMethod()
    {
        Console.WriteLine("In instance method");
    }
}

class LazyTest
{
    static void Main(string[] args)
    {
        Console.WriteLine("Before static method");
        Lazy.StaticMethod();
        Console.WriteLine("Before construction");
        Lazy lazy = new Lazy();
        Console.WriteLine("Before instance method");
        lazy.InstanceMethod();
        Console.WriteLine("Before static method using field");
        Lazy.StaticMethodUsingField();
        Console.WriteLine("End");
    }
}

This time the output is:

Before static method
In static method
Before construction
Before instance method
In instance method
Before static method using field
Type initialized
In static method using field
y = 0
End

As you can see, the type is initialized when StaticMethodUsingField is called. It's not as lazy as it could be - the first line of the method could execute before the type is initialized. Still, being able to construct an instance and call a method on it without triggering the type initializer is slightly surprising.

I've got one final twist... what would you expect this program to do?

using System;

class CachingSideEffect
{
    private static int x = Log();

    private static int Log()
    {
        Console.WriteLine("Type initialized");
        return 0;
    }
    
    public CachingSideEffect()
    {
        Action action = () => Console.WriteLine("Action");
    }
}

class CachingSideEffectTest
{
    static void Main(string[] args)
    {
        new CachingSideEffect();
    }
}

In .NET 4.0, using the Microsoft C# 4 compiler, this does print "Type initialized"... because the C# compiler has created a static field in which to cache the action. The lambda expression doesn't capture any variables, so the same delegate instance can be reused every time. That involves caching it in a static field, triggering type initialization. If you change the action to use Console.WriteLine(this) then it can't cache the delegate, and the constructor no longer triggers initialization.
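Roughly speaking, the constructor behaves as if the caching had been written by hand, something like the sketch below. The names cachedAction and WriteAction are made up for illustration: the real compiler-generated names can't be written in C#, and exactly where the cache lives is a compiler implementation detail which may differ between compiler versions.

```csharp
using System;

class HandWrittenCaching
{
    private static int x = Log();

    // Hypothetical stand-in for the compiler-generated cache field.
    private static Action cachedAction;

    private static int Log()
    {
        Console.WriteLine("Type initialized");
        return 0;
    }

    // Hypothetical stand-in for the method generated from the lambda body.
    private static void WriteAction()
    {
        Console.WriteLine("Action");
    }

    public HandWrittenCaching()
    {
        // Reading a static field requires the type to have been initialized,
        // so this is the line which forces the type initializer to run.
        if (cachedAction == null)
        {
            cachedAction = WriteAction;
        }
        Action action = cachedAction;
    }
}

class HandWrittenCachingTest
{
    static void Main()
    {
        new HandWrittenCaching();
    }
}
```

Written this way, the static field access is visible in the source, so it's less of a surprise that constructing an instance prints "Type initialized".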

This bit is completely implementation-specific in terms of the C# compiler, but I thought it might tickle your fancy anyway.

Conclusion

I'd like to stress that none of this should cause your code any problems. The somewhat eager initialization of types without static constructors was entirely legitimate according to the C# and CLR specs, and so is the new lazy behaviour of .NET 4.0. If your code assumed that just calling a static method, or creating an instance, would trigger initialization, then that's your own fault to some extent. That doesn't stop it being an interesting change to spot though :)
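If you do find code which relies on initialization having happened by a particular point, one option (beyond adding a static constructor) is to ask for it explicitly with RuntimeHelpers.RunClassConstructor. This is only a sketch of the idea; ForcedInit and InitializationLog are hypothetical names, and the counter lives in a separate type so that reading it doesn't touch ForcedInit itself.

```csharp
using System;
using System.Runtime.CompilerServices;

static class InitializationLog
{
    // Kept in a separate type so that reading it doesn't touch ForcedInit.
    public static int Count;
}

class ForcedInit
{
    private static int x = Log();

    private static int Log()
    {
        InitializationLog.Count++;
        Console.WriteLine("Type initialized");
        return 0;
    }
}

class ForcedInitTest
{
    public static void EnsureInitialized()
    {
        // Runs the type initializer if it hasn't already run;
        // otherwise this call does nothing.
        RuntimeHelpers.RunClassConstructor(typeof(ForcedInit).TypeHandle);
    }

    static void Main()
    {
        EnsureInitialized();
        EnsureInitialized();
        Console.WriteLine("Initialized {0} time(s)", InitializationLog.Count);
    }
}
```

The second call is there to show that the initializer only ever runs once: the final line should report a count of 1 however many times EnsureInitialized is called.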

Posted by skeet | 12 comment(s)

LINQ to Rx: second impressions

My previous post had the desired effect: it generated discussion on the LINQ to Rx forum, and Erik and Wes very kindly sent me a very detailed response too. There's no better way to cure ignorance than to display it to the world.

Rather than regurgitating the email verbatim, I've decided to try to write it in my own words, with extra thoughts where appropriate. That way if I've misunderstood anything, I may be corrected - and the very act of trying to explain all this is likely to make me explore it more deeply than I would otherwise.

I'm leaving out the bits I don't yet understand. One of the difficulties with LINQ to Rx at the moment is that the documentation is somewhat sparse - there are loads of videos, and at least there is a CHM file for each of the assemblies bundled in the framework, but many methods just have a single sentence of description. This is entirely understandable - the framework is still in flux, after all. I'd rather have the bits but sparse docs than immaculate docs for a framework I can't play with - but it makes it tricky to go deeper unless you've got time to experiment extensively. There's an rxwiki site which looks like it may be the community's attempt to solve this problem - but it needs a bit more input, I think. When I get a bit of time to breathe, I'd like to try to contribute there.

The good news is that I don't think there were any mechanical aspects that I got definitively wrong in what I wrote... but the bad news is that I wasn't thinking in Rx terms. We'll look at the different aspects separately.

Subscriptions and collections

My first "complaint" was about the way that IEnumerable<T>.ToObservable() worked. Just to recap, I was expecting a three stage startup process:

  • Create the observable
  • Subscribe any observers
  • Tell the observable to "start"

Instead, as soon as an observer subscribes, the observable publishes everything in the sequence to it (on a different thread, by default). Separate calls to Subscribe make the observable iterate over the sequence multiple times.

Now, my original viewpoint makes sense if you think of Subscribe as being like an event subscription. It feels like something which should be passive: another viewer turning on their television to watch a live broadcast.

However, as soon as you think of IObservable.Subscribe as being the dual of IEnumerable.GetEnumerator, the Rx way makes more sense. Each call to GetEnumerator starts the sequence from scratch, and so does each call to Subscribe. This is more like inserting a disc into the DVD player - you're still watching something, but there's a more active element to it. You put the DVD in, it starts playing. I guess following the analogy further would make my suggested model more like a PVR :)
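To make the duality concrete, here's a hand-rolled "cold" observable over a pull sequence. It's an illustrative stand-in rather than the Rx implementation of ToObservable: ColdObservable and CollectingObserver are made-up names, and everything is pushed synchronously on the subscribing thread instead of on a different one.

```csharp
using System;
using System.Collections.Generic;

// Illustrative stand-in only; not the Rx implementation of ToObservable.
class ColdObservable<T> : IObservable<T>
{
    private readonly IEnumerable<T> source;

    public ColdObservable(IEnumerable<T> source)
    {
        this.source = source;
    }

    public IDisposable Subscribe(IObserver<T> observer)
    {
        // Each subscription calls GetEnumerator afresh (via foreach),
        // so each observer sees the whole sequence from the start.
        foreach (T item in source)
        {
            observer.OnNext(item);
        }
        observer.OnCompleted();
        return new NoOpDisposable();
    }

    private class NoOpDisposable : IDisposable
    {
        public void Dispose() {}
    }
}

class CollectingObserver<T> : IObserver<T>
{
    public readonly List<T> Items = new List<T>();
    public bool Completed;

    public void OnNext(T value) { Items.Add(value); }
    public void OnError(Exception error) { throw error; }
    public void OnCompleted() { Completed = true; }
}

class ColdObservableTest
{
    static void Main()
    {
        var observable = new ColdObservable<int>(new[] { 1, 2, 3 });
        var first = new CollectingObserver<int>();
        var second = new CollectingObserver<int>();
        observable.Subscribe(first);
        observable.Subscribe(second);
        Console.WriteLine("First saw {0} items; second saw {1} items",
            first.Items.Count, second.Items.Count);
    }
}
```

Both observers see all three items, even though the second one subscribed after the first had already been sent everything.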

Additionally, this "subscription as an action" model makes more sense of methods like Return and Repeat, and also works better as a reusable object: my own idea of "now push the collection" feels dreadfully stateful: why can't I push the collection twice? What happens if an observer subscribes after I've pushed?

I suspect this will trip up many an Rx neophyte; the video Wes recorded on hot and cold observables should help. Admittedly I'd already watched it before writing the blog post, so I've no excuse...

The subscription model can effectively be modified via composition though; using Subject (as per the blog post), AsyncSubject (which remembers the first value it sees, and only yields that), BehaviorSubject (which remembers the last value it's seen), and ReplaySubject (which remembers everything it sees, optionally limited by a buffer) you can do quite a bit.

Wes included in his email a StartableObservable which one could start and stop. I'd come up with a slightly similar idea at home, an ObservableSequence (not nearly such a good name) but which was limited to sequences: effectively it made the steps listed above explicit for a pull sequence. The code Wes provided was completely isolated from IEnumerable<T> - you would create a StartableObservable from any existing observable, then subscribe to it, then start it. It uses a Subject to effectively collect the subscriptions - starting the observable merely subscribed the subject to the original observable passed into the constructor.
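Wes's code isn't reproduced here, so the following is only a guess at the general shape, written without any Rx types at all. Instead of delegating to a Subject, this hypothetical StartableObservable plays the subject's role itself by implementing both IObservable<T> and IObserver<T>. It isn't thread-safe, and it ignores unsubscription for brevity. CountingSource and SummingObserver are made-up helpers for the demo.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: collects observers first, and only subscribes
// to the underlying source when Start is called.
class StartableObservable<T> : IObservable<T>, IObserver<T>
{
    private readonly IObservable<T> source;
    private readonly List<IObserver<T>> observers = new List<IObserver<T>>();

    public StartableObservable(IObservable<T> source)
    {
        this.source = source;
    }

    // Subscribing is passive: the observer is just remembered.
    public IDisposable Subscribe(IObserver<T> observer)
    {
        observers.Add(observer);
        return new NoOpDisposable();
    }

    // Starting subscribes this object to the original source exactly once.
    public IDisposable Start()
    {
        return source.Subscribe(this);
    }

    public void OnNext(T value)
    {
        foreach (IObserver<T> observer in observers) { observer.OnNext(value); }
    }

    public void OnError(Exception error)
    {
        foreach (IObserver<T> observer in observers) { observer.OnError(error); }
    }

    public void OnCompleted()
    {
        foreach (IObserver<T> observer in observers) { observer.OnCompleted(); }
    }
}

class NoOpDisposable : IDisposable
{
    public void Dispose() {}
}

// Pushes 1..count synchronously to whoever subscribes.
class CountingSource : IObservable<int>
{
    private readonly int count;

    public CountingSource(int count) { this.count = count; }

    public IDisposable Subscribe(IObserver<int> observer)
    {
        for (int i = 1; i <= count; i++) { observer.OnNext(i); }
        observer.OnCompleted();
        return new NoOpDisposable();
    }
}

class SummingObserver : IObserver<int>
{
    public int Sum;
    public bool Completed;

    public void OnNext(int value) { Sum += value; }
    public void OnError(Exception error) { throw error; }
    public void OnCompleted() { Completed = true; }
}

class StartableObservableTest
{
    static void Main()
    {
        var startable = new StartableObservable<int>(new CountingSource(3));
        var first = new SummingObserver();
        var second = new SummingObserver();
        startable.Subscribe(first);
        startable.Subscribe(second);
        Console.WriteLine("Before start: {0}, {1}", first.Sum, second.Sum);
        startable.Start();
        Console.WriteLine("After start: {0}, {1}", first.Sum, second.Sum);
    }
}
```

Nothing is pushed until Start is called, at which point both observers receive the same three values from a single pass over the source.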

The difference between Wes's solution and mine is more fundamental than whether his is more general-purpose than mine or not (although it clearly is). Wes didn't have to go outside the world of Rx at all. All the building blocks were there, he just put them together - and ended up with another building block, ready to be used with the rest. That's a common theme in this blog post :)

Asynchronous aggregation

I did get one thing right in my previous post: my suggestion that there should be asynchronous versions of the aggregation operators is apparently not a bad one. We may see the framework come with such things in the future... but they won't revolve around Task<T>.

What do we have to represent an asynchronous computation? Why, IObservable<T> of course. It will present the result at some future point in time. Ideally, I suppose you would deal with the count (or maximum line length, or whatever) by reacting to it asynchronously too. If necessary though, you can always just take the value produced and stash it somewhere... which is exactly what an AsyncSubject does, as mentioned above. You can get the value from that by just calling First() on it, which will block if the value hasn't been seen yet - and you don't need to worry about "missing" it, because of the caching within the subject.

When I started this blog post, I didn't understand Prune, but I've found that writing about the whole process has made it somewhat clearer to me. Calling Prune on an observable returns an AsyncSubject - but which also unsubscribes itself from the original observable when the subject is disposed, basically allowing a more orderly cleanup. So, all we need to do is call Prune on the result of our asynchronous aggregation, and we're away.

That's one part of the "non-Rx" framework removed... what else can we take out of the code from the previous blog post? Well, if you look at the FutureAggregate method I posted, it does two things: maintains a running aggregate, and publishes the last result (via a Task<T>). Now the "maintain a running aggregate" looks remarkably like Scan, doesn't it? All the future aggregates (FutureCount etc) can be built from one new building block: an observable which subscribes to an existing one, and yields the last value it sees before completion.

I'll check with Wes whether he's happy for me to share his code - if so, I'll put that and the original code into a zip file so it's easy to compare the dull version with the shiny one.

Conclusion

It's not enough to be able to think about Rx. To really appreciate it, you've got to be able to think in Rx. As I'd written a sort of "mini-Rx" before, I was arrogant enough to assume I already knew how to think in observable sequences... but apparently not. (To be fair to myself, it's been a while and Push LINQ didn't try to do anything genuinely asynchronous.)

I'm certainly not "in the zone" yet when it comes to Rx... but I think I can see it in the distance now. I'm heartily glad I raised my concerns over asynchronous aggregation - partly as encouragement to the team to consider including them in the framework, but mostly because it's helped me appreciate the framework a lot better. With any luck, these two somewhat "stream of consciousness" posts may have helped you too.

Now to go over what I wrote last night for the book, and see how much of it was rubbish :)

Posted by skeet | 1 comment(s)

First encounters with Reactive Extensions

I've been researching Reactive Extensions for the last few days, with an eye to writing a short section in chapter 12 of the second edition of C# in Depth. (This is the most radically changed chapter from the first edition; it will be covering LINQ to SQL, IQueryable, LINQ to XML, Parallel LINQ, Reactive Extensions, and writing your own LINQ to Objects operators.) I've watched various videos from Channel 9, but today was the first time I actually played with it. I'm half excited, and half disappointed.

My excited half sees that there's an awful lot to experiment with, and loads to learn about join patterns etc. I'm also looking forward to trying genuine events (mouse movements etc) – so far my tests have been to do with collections.

My disappointed half thinks it's missing something. You see, Reactive Extensions shares some concepts with my own Push LINQ library… except it's had smarter people (no offense meant to Marc Gravell) working harder on it for longer. I'd expect it to be easier to use, and make it a breeze to do anything you could do in Push LINQ. Unfortunately, that's not quite the case.

Subscription model

First, the way that subscription is handled for collections seems slightly odd. I've been imagining two kinds of observable sources:

  • Genuine "event streams" which occur somewhat naturally – for instance, mouse movement events. Subscribing to such an observable wouldn't do anything to it other than adding subscribers.
  • Collections (and the like) where the usual use case is "set up the data pipeline, then tell it to go". In that case calling Subscribe should just add the relevant observers, but not actually "start" the sequence – after all, you may want to add more observers (we'll see an example of this in a minute).

In the latter case, I could imagine an extension method to IEnumerable<T> called ToObservable which would return a StartableObservable<T> or something like that – you'd subscribe what you want, and then call Start on the StartableObservable<T>. That's not what appears to happen though – if you call ToObservable(), you get an implementation which iterates over the source sequence as soon as anything subscribes to it – which just doesn't feel right to me. Admittedly it makes life easy in the case where that's really all you want to do, but it's a pain otherwise.

There's a way of working round this in Reactive Extensions: there's Subject<T> which is both an observer and an observable. You can create a Subject<T>, Subscribe all the observers you want (so as to set up the data pipeline) and then subscribe the subject to the real data source. It's not exactly hard, but it took me a while to work out, and it feels a little unwieldy. The next issue was somewhat more problematic.

Blocking aggregation

When I first started thinking about Push LINQ, it was motivated by a scenario from the C# newsgroup: someone wanted to group a collection in a particular way, and then count how many items were in each group. This is effectively the "favourite colour voting" scenario outlined in the link at the top of this post. The problem to understand is that the normal Count() call is blocking: it fetches items from a collection until there aren't any more; it's in control of the execution flow, effectively. That means if you call it in a grouping construct, the whole group has to be available before you call Count(). So, you can't stream an enormous data set, which is unfortunate.

In Push LINQ, I addressed this by making Count() return Future<int> instead of int. The whole query is evaluated, and then you can ask each future for its actual result. Unfortunately, that isn't the approach that the Reactive Framework has taken – it still returns int from Count(). I don't know the reason for this, but fortunately it's somewhat fixable. We can't change Observable of course, but we can add our own future-based extensions:

public static class ObservableEx
{
    public static Task<TResult> FutureAggregate<TSource, TResult>
        (this IObservable<TSource> source,
        TResult seed, Func<TResult, TSource, TResult> aggregation)
    {
        TaskCompletionSource<TResult> result = new TaskCompletionSource<TResult>();
        TResult current = seed;
        source.Subscribe(value => current = aggregation(current, value),
            error => result.SetException(error),
            () => result.SetResult(current));
        return result.Task;
    }

    public static Task<int> FutureMax(this IObservable<int> source)
    {
        // TODO: Make this generic and throw exception on
        // empty sequence. Left as an exercise for the reader.
        return source.FutureAggregate(int.MinValue, Math.Max);
    }

    public static Task<int> FutureMin(this IObservable<int> source)
    {
        // TODO: Make this generic and throw exception on
        // empty sequence. Left as an exercise for the reader.
        return source.FutureAggregate(int.MaxValue, Math.Min);
    }

    public static Task<int> FutureCount<T>(this IObservable<T> source)
    {
        return source.FutureAggregate(0, (count, _) => count + 1);
    }
}

This uses Task<T> from Parallel Extensions, which gives us an interesting ability, as we'll see in a moment. It's all fairly straightforward - TaskCompletionSource<T> makes it very easy to specify a value when we've finished, or indicate that an error occurred. As mentioned in the comments, the maximum/minimum implementations leave something to be desired, but it's good enough for a blog post :)
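In case TaskCompletionSource<T> is unfamiliar, here's a tiny standalone sketch of the two paths FutureAggregate relies on: completing the task with a value, and completing it with an error. The class and method names are made up for the example.

```csharp
using System;
using System.Threading.Tasks;

class TaskCompletionSourceDemo
{
    // Returns a task which has already completed successfully.
    public static Task<int> CompletedWith(int value)
    {
        var source = new TaskCompletionSource<int>();
        source.SetResult(value);
        return source.Task;
    }

    // Returns a task which has already failed with the given error.
    public static Task<int> FailedWith(Exception error)
    {
        var source = new TaskCompletionSource<int>();
        source.SetException(error);
        return source.Task;
    }

    static void Main()
    {
        Console.WriteLine(CompletedWith(42).Result);

        Task<int> failed = FailedWith(new InvalidOperationException("Bang"));
        try
        {
            int ignored = failed.Result;
        }
        catch (AggregateException e)
        {
            // The original exception is wrapped by the task.
            Console.WriteLine(e.InnerException.Message);
        }
    }
}
```

Note that asking a failed task for its Result throws an AggregateException wrapping the original error, so callers of the Future* methods would need to be ready for that.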

Using the non-blocking aggregation operators

Now that we've got our extension methods, how can we use them? First I decided to do a demo which would count the number of lines in a file, and find the maximum and minimum line lengths:

private static IEnumerable<string> ReadLines(string filename)
{
    using (TextReader reader = File.OpenText(filename))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}
...
var subject = new Subject<string>();
var lengths = subject.Select(line => line.Length);
var min = lengths.FutureMin();
var max = lengths.FutureMax();
var count = lengths.FutureCount();
            
var source = ReadLines("../../Program.cs");
source.ToObservable(Scheduler.Now).Subscribe(subject);
Console.WriteLine("Count: {0}, Min: {1}, Max: {2}",
                  count.Result, min.Result, max.Result);

As you can see, we use the Result property of a task to find its eventual result - this call will block until the result is ready, however, so you do need to be careful about how you use it. Each line is only read from the file once, and pushed to all three observers, who carry their state around until the sequence is complete, whereupon they publish the result to the task.

I got this working fairly quickly - then went back to the "grouping lines by line length" problem I'd originally set myself. I want to group the lines of a file by their length (all lines of length 0, all lines of length 1 etc) and count each group. The result is effectively a histogram of line lengths. Constructing the query itself wasn't a problem - but iterating through the results was. Fundamentally, I don't understand the details of ToEnumerable yet, particularly the timing. I need to look into it more deeply, but I've got two alternative solutions for the moment.

The first is to implement my own ToList extension method. This simply creates a list and subscribes an observer which adds items to the list as it goes. There's no attempt at "safety" here - if you access the list before the source sequence has completed, you'll see whatever has been added so far. I am still just experimenting :) Here's the implementation:

public static List<T> ToList<T>(this IObservable<T> source)
{
    List<T> ret = new List<T>();
    source.Subscribe(x => ret.Add(x));
    return ret;
}

Now we can construct a query expression, project each group using our future count, make sure we've finished pushing the source before we read the results, and everything is fine:

var subject = new Subject<string>();
var groups = from line in subject
             group line.Length by line.Length into grouped
             select new { Length = grouped.Key, Count = grouped.FutureCount() };
var results = groups.ToList();

var source = ReadLines("../../Program.cs");
source.ToObservable(Scheduler.Now).Subscribe(subject);
foreach (var group in results)
{
    Console.WriteLine("Length: {0}; Count: {1}", group.Length, group.Count.Result);
}

Note how the call to ToList is required before calling source.ToObservable(...).Subscribe - otherwise everything would have been pushed before we started collecting it.

All well and good... but there's another way of doing it too. We've only got a single task being produced for each group - instead of waiting until everything's finished before we dump the results to the console, we can use Task.ContinueWith to write it (the individual group result) out as soon as that group has been told that it's finished. We force this extra action to occur on the same thread as the observer just to make things easier in a console app... but it all works very neatly:

var subject = new Subject<string>();
var groups = from line in subject
             group line.Length by line.Length into grouped
             select new { Length = grouped.Key, Count = grouped.FutureCount() };
                                    
groups.Subscribe(group =>
{
    group.Count.ContinueWith(
         x => Console.WriteLine("Length: {0}; Count: {1}",
                                group.Length, x.Result),
         TaskContinuationOptions.ExecuteSynchronously);
});
var source = ReadLines("../../Program.cs");
source.ToObservable(Scheduler.Now).Subscribe(subject);

Conclusion

That's the lot, so far. It feels like I'm sort of in the spirit of Reactive Extensions, but that maybe I'm pushing it (no pun intended) in a direction which Erik and Wes either didn't anticipate, or at least don't view as particularly valuable/elegant. I very much doubt that they didn't consider deferred aggregates - it's much more likely that either I've missed some easy way of doing this, or there are good reasons why it's a bad idea. I hope to find out which at some point... but in the meantime, I really ought to work out a more idiomatic example for C# in Depth.

Posted by skeet | 14 comment(s)

"Magic" null argument testing

Warning: here be dragons. I don't think this is the right way to check for null arguments, but it was an intriguing idea.

Today on Stack Overflow, I answered a question about checking null arguments. The questioner was already using an extension similar to my own one in MiscUtil, allowing code like this:

public void DoSomething(string name)
{
    name.ThrowIfNull("name");

    // Normal code here
}
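The extension method itself isn't shown here. A minimal sketch of what such a ThrowIfNull might look like follows; it's an assumption for illustration, not the MiscUtil source.

```csharp
using System;

public static class NullCheckExtensions
{
    // Constrained to reference types, as a value type can never be null.
    public static void ThrowIfNull<T>(this T value, string name) where T : class
    {
        if (value == null)
        {
            throw new ArgumentNullException(name);
        }
    }
}

class ThrowIfNullDemo
{
    public static void DoSomething(string name)
    {
        name.ThrowIfNull("name");
        Console.WriteLine("Hello, {0}", name);
    }

    static void Main()
    {
        DoSomething("Jon");
        try
        {
            DoSomething(null);
        }
        catch (ArgumentNullException e)
        {
            Console.WriteLine("Caught: {0}", e.ParamName);
        }
    }
}
```

Calling an extension method on a null reference is fine, because it's really just a static method call with the null value as the first argument.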

That's all very well, but it's annoying to have to repeat the name part. Now in an ideal world, I'd say it would be nice to add an attribute to the parameter and have the check performed automatically (and when PostSharp works with .NET 4.0, I'm going to give that a go, mixing Code Contracts and AOP…) – but for the moment, how far can we go with extension methods?

I stand by my answer from that question – the code above is the simplest way to achieve the goal for the moment… but another answer raised the interesting prospect of combining anonymous types, extension methods, generics, reflection and manually-created expression trees. Now that's a recipe for hideous code… but it actually works.

The idea is to allow code like this:

public void DoSomething(string name, string canBeNull, int foo, Stream input)
{
    new { name, input }.CheckNotNull();

    // Normal code here
}

That should check name and input, in that order, and throw an appropriate ArgumentNullException - including parameter name - if one of them is null. It uses the fact that projection initializers in anonymous types use the primary expression's name as the property name in the generated type, and the value of that expression ends up in the instance. Therefore, given an instance of the anonymous type initializer like the above, we have both the name and value despite having only typed it in once.
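To see the projection-initializer behaviour in isolation, here is a minimal sketch using plain reflection. The class and method names (ProjectionInitializerDemo, Describe, Example) are made up for illustration and are not part of the project described in this post.

```csharp
using System;
using System.Linq;

public static class ProjectionInitializerDemo
{
    // Returns "propertyName=value" for each public property of the object,
    // discovered with ordinary (slow) reflection.
    public static string[] Describe(object container)
    {
        return container.GetType()
                        .GetProperties()
                        .Select(p => p.Name + "=" + (p.GetValue(container, null) ?? "(null)"))
                        .ToArray();
    }

    public static string[] Example()
    {
        string name = "Jon";
        string input = null;
        // No property names are typed here: the compiler infers
        // "name" and "input" from the variable names.
        return Describe(new { name, input });
    }
}
```

Calling ProjectionInitializerDemo.Example() yields one entry per variable, each carrying both the variable's name and its value. The order of the entries should not be relied upon.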

Now obviously this could be done with normal reflection – but that would be slow as heck. No, we want to effectively find the properties once, and generate strongly typed delegates to perform the property access. That sounds like a job for Delegate.CreateDelegate, but it's not quite that simple… to create the delegate, we'd need to know (at compile time) what the property type is. We could do that with another generic type, but we can do better than that. All we really need to know about the value is whether or not it's null. So given a "container" type T, we'd like a bunch of delegates, one for each property, returning whether that property is null for a specified instance – i.e. a Func<T, bool>. And how do we build delegates at execution time with custom logic? We use expression trees…

I've now implemented this, along with a brief set of unit tests. The irony is that the tests took longer than the implementation (which isn't very unusual) – and so did writing it up in this blog post. I'm not saying that it couldn't be improved (and indeed in .NET 4.0 I could probably make the delegate throw the relevant exception itself) but it works! I haven't benchmarked it, but I'd expect it to be nearly as fast as manual tests – insignificant in methods that do real work. (The same wouldn't be true using reflection every time, of course.)

The full project including test cases is now available, but here's the (almost completely uncommented) "production" code.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;
using System.Linq.Expressions;

public static class Extensions
{
    public static void CheckNotNull<T>(this T container) where T : class
    {
        if (container == null)
        {
            throw new ArgumentNullException("container");
        }
        NullChecker<T>.Check(container);
    }

    private static class NullChecker<T> where T : class
    {
        private static readonly List<Func<T, bool>> checkers;
        private static readonly List<string> names;

        static NullChecker()
        {
            checkers = new List<Func<T, bool>>();
            names = new List<string>();
            // We can't rely on the order of the properties, but we
            // can rely on the order of the constructor parameters
            // in an anonymous type - and that there'll only be
            // one constructor.
            foreach (string name in typeof(T).GetConstructors()[0]
                                             .GetParameters()
                                             .Select(p => p.Name))
            {
                names.Add(name);
                PropertyInfo property = typeof(T).GetProperty(name);
                // I've omitted a lot of error checking, but here's
                // at least one bit...
                if (property.PropertyType.IsValueType)
                {
                    throw new ArgumentException
                        ("Property " + property + " is a value type");
                }
                ParameterExpression param = Expression.Parameter(typeof(T), "container");
                Expression propertyAccess = Expression.Property(param, property);
                Expression nullValue = Expression.Constant(null, property.PropertyType);
                Expression equality = Expression.Equal(propertyAccess, nullValue);
                var lambda = Expression.Lambda<Func<T, bool>>(equality, param);
                checkers.Add(lambda.Compile());
            }
        }

        internal static void Check(T item)
        {
            for (int i = 0; i < checkers.Count; i++)
            {
                if (checkers[i](item))
                {
                    throw new ArgumentNullException(names[i]);
                }
            }
        }
    }
}

Oh, and just as a miracle – the expression tree worked first time. I'm no Marc Gravell, but I'm clearly improving :)

Update: Marc Gravell pointed out that the order of the results of Type.GetProperties isn't guaranteed - something I should have remembered myself. However, the order of the constructor parameters will be the same as in the anonymous type initialization expression, so I've updated the code above to reflect that. Marc also showed how it could almost all be put into a single expression tree which returns either null (for no error) or the name of the "failing" parameter. Very clever :)
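For the curious, here is my own rough sketch of what a single-expression-tree version might look like. This is not Marc's code: the type and member names (SingleTreeNullChecker, FindNullName, FirstNullName) are invented for illustration, and it assumes every property is a reference type, with no error checking at all. It nests one conditional per property, so the compiled delegate returns the first null property's name, or null if there isn't one.

```csharp
using System;
using System.Linq.Expressions;
using System.Reflection;

public static class SingleTreeNullChecker<T> where T : class
{
    // Returns the name of the first null property, or null if none is null.
    public static readonly Func<T, string> FindNullName = Build();

    private static Func<T, string> Build()
    {
        ParameterExpression param = Expression.Parameter(typeof(T), "container");
        // Innermost result: nothing was null.
        Expression body = Expression.Constant(null, typeof(string));
        ParameterInfo[] parameters = typeof(T).GetConstructors()[0].GetParameters();
        // Work backwards so the first constructor parameter ends up as the
        // outermost test, i.e. the one evaluated first.
        for (int i = parameters.Length - 1; i >= 0; i--)
        {
            PropertyInfo property = typeof(T).GetProperty(parameters[i].Name);
            Expression isNull = Expression.Equal(
                Expression.Property(param, property),
                Expression.Constant(null, property.PropertyType));
            body = Expression.Condition(
                isNull,
                Expression.Constant(parameters[i].Name, typeof(string)),
                body);
        }
        return Expression.Lambda<Func<T, string>>(body, param).Compile();
    }
}

public static class SingleTreeExtensions
{
    // Lets type inference pick T, which is essential for anonymous types.
    public static string FirstNullName<T>(this T container) where T : class
    {
        return SingleTreeNullChecker<T>.FindNullName(container);
    }
}
```

A caller could then write something like new { name, input }.FirstNullName() and throw ArgumentNullException if the result is non-null.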

Custom value types are like buses

You wait years to write one… and then six of them come along at once.

(Cross-posted to the Noda Time blog and my coding blog as it's relevant to both.)

When we started converting Joda Time to .NET, there was always going to be the possibility of using custom value types (structs) – an opportunity which isn't available in Java. This has meant reducing the type hierarchy a fair amount, but that's actually made things simpler. However, I didn't realise quite how many we'd end up with – or how many would basically just wrap a long.

So far, we have 4 value types whose only field is a long. They are:

  • Instant: an instant on the theoretical timeline, which doesn't know about days of the week, time zones etc. It has a single reference point – the Unix epoch – but that's only so that we can represent it in a concrete fashion. The long represents the number of ticks since the Unix epoch.
  • LocalInstant: this is a tricky one to explain, and I'm still looking for the right way of doing so. The basic idea is that it represents a day and a time within that day, but without reference to a time zone or calendar system. So if I'm talking to someone in a different time zone and an Islamic calendar, we can agree on the idea of "3pm tomorrow" even if we have different ideas of what month that's in and when it starts. A LocalInstant is effectively the instant at which that date/time would occur if you were considering it in UTC… but importantly it's not a genuine instant, in that it doesn't unambiguously represent a point in time.
  • Duration: a number of ticks, which can be added to an instant to get another instant. This is a pure number of ticks – again, it doesn't really know anything about days, months etc (although you can find the duration for the number of ticks in a standard day – that's not the same as adding one day to a date and time within a time zone though, due to daylight savings).
  • Offset: very much like a duration, but only used to represent the offset due to a time zone. This is possibly the most unusual of the value types in Noda, because of the operators using it. You can add an offset to an instant to get a local instant, or you can subtract an offset from a local instant to get an instant… but you can't do those things the other way round.
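As a rough illustration of those asymmetric operators, here is a stripped-down sketch. It is not the actual Noda Time source; the constructors and the Ticks properties are assumptions made to keep the example self-contained. The point is simply that only two operator overloads exist, so any other combination fails to compile.

```csharp
public struct Offset
{
    private readonly long ticks;
    public Offset(long ticks) { this.ticks = ticks; }
    public long Ticks { get { return ticks; } }
}

public struct Instant
{
    private readonly long ticks;
    public Instant(long ticks) { this.ticks = ticks; }
    public long Ticks { get { return ticks; } }

    // Instant + Offset gives a LocalInstant; there is no Instant - Offset.
    public static LocalInstant operator +(Instant instant, Offset offset)
    {
        return new LocalInstant(instant.Ticks + offset.Ticks);
    }
}

public struct LocalInstant
{
    private readonly long ticks;
    public LocalInstant(long ticks) { this.ticks = ticks; }
    public long Ticks { get { return ticks; } }

    // LocalInstant - Offset gives an Instant; there is no LocalInstant + Offset.
    public static Instant operator -(LocalInstant local, Offset offset)
    {
        return new Instant(local.Ticks - offset.Ticks);
    }
}
```

With these definitions, instant + offset and localInstant - offset compile, while instant - offset and localInstant + offset are compile-time errors.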

The part about the odd nature of the operators using Offset really gets to the heart of what I like about Joda Time and what makes Noda Time even better. You see, Joda Time already has a lot of types for different concepts, where .NET only has DateTime/DateTimeOffset/TimeSpan – having these different types and limiting what you can do with them helps to lead developers into the pit of success; the type system helps you naturally get things right.

However, the Joda Time API uses long internally to represent all of these, presumably for the sake of performance: Java doesn't have custom value types, so you'd have to create a whole new object every time you manipulated anything. This could be quite significant in some cases. Using the types above has made the code a lot simpler and more obviously correct – except for a couple of cases where the code I've been porting appears to do some very odd things, which I've only noticed are odd because of the type system. James Keesey, who's been porting the time zone compiler, has had similar experiences: since introducing the offset type with its asymmetric operators, he found that he had a bug in some of his ported code – which immediately caused a compile-time error when he'd converted to using offsets.

When I first saw the C# spec, I was dubious about the value of user-defined value types and operator overloading. Indeed I still suspect that both features are overused… but when they're used appropriately, they're beautiful.

Noda Time is still a long way from being a useful API, but I'm extremely pleased with how it's shaping up.

Posted by skeet | 4 comment(s)

Where do you benefit from dynamic typing?

Disclaimer: I don't want this to become a flame war in the comments. I'm coming from a position of ignorance, and well aware of it. While I'd like this post to provoke thought, it's not meant to be provocative in the common use of the term.

Chapter 14 of C# in Depth is about dynamic typing in C#. A couple of reviewers have justifiably said that I'm fairly keen on the mantra of "don't use dynamic typing unless you need it" – and that possibly I'm doing dynamic typing a disservice by not pointing out more of its positive aspects. I completely agree, and I'd love to be more positive – but the problem is that I'm not (yet) convinced about why dynamic typing is something I would want to embrace.

Now I want to start off by making something clear: this is meant to be about dynamic typing. Often advocates for dynamically typed languages will mention:

  • REPL (read-eval-print-loop) abilities which allow for a very fast feedback loop while experimenting
  • Terseness – the lack of type names everywhere makes code shorter
  • Code evaluated at execution time (so config files can be scripts etc)

I don't count any of these as benefits of dynamic typing per se. They're benefits which often come alongside dynamic typing, but they're not dependent on dynamic typing. The terseness argument is the one most closely tied to their dynamic nature, but various languages with powerful type inference show that being statically typed doesn't mean having to specify type names everywhere. (C#'s var keyword is a very restricted form of type inference, compared with – say – that of F#.)

What I'm talking about is binding being performed at execution time and only at execution time. That allows for:

  • Duck typing
  • Dynamic reaction to previously undeclared messages
  • Other parts of dynamic typing I'm unaware of (how could there not be any?)
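To make the first two bullets concrete, here is a small sketch using C# 4's dynamic support. All the types here (Duck, Robot, BookFinder, DynamicDemo) are hypothetical, and the "query" is just a descriptive string rather than anything a real data layer would produce.

```csharp
using System;
using System.Dynamic;

// Two unrelated types which happen to share a method: no common interface.
public class Duck { public string Speak() { return "Quack"; } }
public class Robot { public string Speak() { return "Beep"; } }

// Reacts to method calls which were never declared, by inspecting the name.
public class BookFinder : DynamicObject
{
    public override bool TryInvokeMember(InvokeMemberBinder binder, object[] args, out object result)
    {
        const string prefix = "FindBooksBy";
        if (binder.Name.StartsWith(prefix, StringComparison.Ordinal) && args.Length == 1)
        {
            string field = binder.Name.Substring(prefix.Length);
            result = "Books where " + field + " = " + args[0];
            return true;
        }
        result = null;
        return false;
    }
}

public static class DynamicDemo
{
    // Duck typing: Speak is bound at execution time against whatever is passed in.
    public static string MakeItSpeak(dynamic speaker)
    {
        return speaker.Speak();
    }

    // Dynamic reaction: FindBooksByAuthor exists nowhere in BookFinder's source.
    public static string Query()
    {
        dynamic finder = new BookFinder();
        return finder.FindBooksByAuthor("Josh Bloch");
    }
}
```

Passing an object without a Speak method to MakeItSpeak compiles happily and fails only at execution time, which is exactly the trade-off in question.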

What I'm interested in is how often these are used within real world (rather than demonstration) code. It may well be that I'm suffering from Blub's paradox – that I can't see the valid uses of these features simply because I haven't used them enough. Just to be clear, I'm not saying that I never encounter problems where I would welcome dynamic typing – but I don't run into them every day, whereas I get help from the compiler every day.

Just as an indicator of how set in my statically typed ways I am, at the recent Stack Overflow DevDays event in London, Michael Sparks went through Peter Norvig's spelling corrector. It's a neat piece of code (and yes, I'll finish that port some time) but I kept finding it hard to understand simply because the types weren't spelled out. Terseness can certainly be beneficial, but in this case I would personally have found it simpler if the variable and method types had been explicitly declared.

So, for the dynamic typing fans (and I'm sure several readers will come into that category):

  • How often do you take advantage of dynamic typing in a way that really wouldn't be feasible (or would be very clunky) in a statically typed language?
  • Is it usually the same single problem which crops up regularly, or do you find a wide variety of problems benefit from dynamic typing?
  • When you declare a variable (or first assign a value to a variable, if your language doesn't use explicit declarations) how often do you really either not know its type or want to use some aspect of it which wouldn't typically have been available in a statically typed environment?
  • What balance do you find in your use of duck typing (the same method/member/message has already been declared on multiple types, but there's no common type or interface) vs truly dynamic reaction based on introspection of the message within code (e.g. building a query based on the name of the method, such as FindBooksByAuthor("Josh Bloch"))?
  • What aspects of dynamic typing do I appear to be completely unaware of?

Hopefully someone will be able to turn the light bulb on for me, so I can be more genuinely enthusiastic about dynamic typing, and perhaps even diversify from my comfort zone of C#…

Posted by skeet | 58 comment(s)

Just how spiky is your traffic?

No, this isn't the post about dynamic languages I promised. That will come soon. This is just a quick interlude. This afternoon, while answering a question on Stack Overflow[1] about the difference between using an array and a Dictionary<string, string> (where each string was actually the string representation of an integer) I posted the usual spiel about preferring readable code to micro-optimisation. The response in a comment - about the performance aspect - was:

Well that's not so easily said for a .com where performance on a site that receives about 1 million hits a month relies on every little ounce of efficiency gains you can give it.

A million hits a month, eh? That sounds quite impressive, until you actually break it down. Let's take a month of 30 days - that has 30 * 24 * 60 * 60 = 2,592,000 seconds[2]. In other words, a million hits a month is less than one hit every two seconds. Not so impressive. At Google we tend to measure traffic in QPS (queries per second, even if they're not really queries - the search terminology becomes pervasive) so this is around 0.39 QPS. Astonished that someone would make such a claim in favour of micro-optimisation at that traffic level, I tweeted about it. Several of the replies were along the lines of "yeah, but traffic's not evenly distributed." That's entirely true. Let's see how high we can make the traffic without going absurd though.

Let's suppose this is a site which is only relevant on weekdays - that cuts us down to 20 days in the month. Now let's suppose it's only relevant for one hour per day - it's something people look at when they get to work, and most of the users are in one time zone. That's a pretty massive way of spiking. We've gone down from 30 full days of traffic to 20 hours - or 20 * 60 * 60 = 72000 seconds, giving 14 QPS. Heck, let's say the peak of the spike is double that - a whopping 28 QPS.

Four points about this:

  • 28 QPS is still not a huge amount of traffic.
  • If you're really interested in handling peak traffic of ~28 QPS without latency becoming huge, it's worth quoting that figure rather than "a million hits a month" because the latter is somewhat irrelevant, and causes us to make wild (and probably wildly inaccurate) guesses about your load distribution.
  • If you're going to bring the phrase "a .com" into the picture, attempting to make it sound particularly important, you really shouldn't be thinking about hosting your web site on one server - so the QPS gets diluted again.
  • Even at 28 QPS, the sort of difference that would be made here is tiny. A quick microbenchmark (with all the associated caveats) showed that on my laptop (hardly a server-class machine) I could build the dictionary and index into it 3 times 2.8 million times in about 5 seconds. If every request needed to do that 100 times, then the cost of doing it 28 requests per second on my laptop would still only be 0.5% of that second - not a really significant benefit, despite the hugely exaggerated estimates of how often we needed to do that.
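The arithmetic behind these figures is simple enough to check in a few lines. This is only a sketch of the sums in this post; TrafficMath and its method names are invented for the purpose.

```csharp
public static class TrafficMath
{
    // Average queries per second for a number of hits spread over a period.
    public static double QueriesPerSecond(double hits, double seconds)
    {
        return hits / seconds;
    }

    // Fraction of each second spent on an operation, given a benchmark in
    // which benchmarkOperations operations took benchmarkSeconds seconds.
    public static double FractionOfSecond(double requestsPerSecond,
                                          double operationsPerRequest,
                                          double benchmarkOperations,
                                          double benchmarkSeconds)
    {
        double secondsPerOperation = benchmarkSeconds / benchmarkOperations;
        return requestsPerSecond * operationsPerRequest * secondsPerOperation;
    }
}
```

QueriesPerSecond(1000000, 30 * 24 * 60 * 60) comes to roughly 0.39, and QueriesPerSecond(1000000, 20 * 60 * 60) to roughly 13.9. FractionOfSecond(28, 100, 2800000, 5) gives 0.005, i.e. the 0.5% quoted above.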

There are various other ways in which it's not a great piece of code, but the charge against premature optimization still stands. You don't need to get every little ounce of efficiency out of your code. Chances are, if you start guessing at where you can get efficiency, you're going to be wrong. Measure, measure, measure - profile, profile, profile. Once you've done all of that and proved that a change reducing clarity has a significant benefit, go for it - but until then, write the most readable code you can. Likewise work out your performance goals in a meaningful fashion before you worry too much - and hits per month isn't a meaningful figure.

Performance is important - too important to be guessed about instead of measured.


[1] I'm not linking to it because the Streisand effect would render this question more important than it really is. I'm sure you can find it if you really want to, but that's not the point of the post.

[2] Anyone who wants to nitpick and talk about months which are a bit longer or shorter than that due to daylight saving time changes (despite still being 30 days) can implement that logic for me in Noda Time.

Posted by skeet | 14 comment(s)

Noda Time gets its own blog

I've decided it's probably not a good idea to make general Noda Time posts on my personal blog. I'll still post anything that's particularly interesting in a "general coding" kind of way here, even if I discover it in Noda Time, but I thought it would be good for the project to have a blog of its very own, which other team members can post to.

I still have plenty of things I want to blog about here. Next up is likely to be a request for help: I want someone to tell me why I should love the "dynamic" bit of dynamic languages. Stay tuned for more details :)

Posted by skeet | 3 comment(s)

Noda Time is born

There was an amazing response to yesterday's post – not only did readers come up with plenty of names, but lots of people volunteered to help. As a result, I'm feeling under a certain amount of pressure for this project to actually take shape.

The final name chosen is Noda Time. We now have a Google Code Project and a Google Group (/mailing list). Now we just need some code…

I figured it would be worth explaining a bit more about my vision for the project. Obviously I'm only one contributor, and I'm expecting everyone to add their own views, but this can act as a starting point.

I want this project to be more than just a way of getting better date and time handling on .NET. I want it to be a shining example of how to build, maintain and deploy an open source .NET library. As some of you know, I have a few other open source projects on the go, and they have different levels of polish. Some have downloadable binaries, some don't. They all have just-about-enough-to-get-started documentation, but not nearly enough, really. They have widely varying levels of test coverage. Some are easier to build than others, depending on what platform you're using.

In some ways, I'm expecting the code to be the easy part of Noda Time. After all, the implementation is there already – we'll have plenty of interesting design decisions to make in order to marry the concepts of Joda Time with the conventions of .NET, but that shouldn't be too hard. Here are the trickier things, which need discussion, investigation and so forth:

  • What platforms do we support? Here's my personal suggested list:
    • .NET 4.0
    • .NET 3.5
    • .NET 2.0SP1 (require the service pack for DateTimeOffset)
    • Mono (versions TBD)
    • Silverlight 2, 3 and 4
    • Compact Framework 2.0 and 3.5
  • What do we ship, and how do we handle different platforms? For example, can we somehow use Code Contracts to give developers a better experience on .NET 4.0 without making it really hard to build for other versions of .NET? Can we take advantage of the availability of TimeZoneInfo in .NET 3.5 and still build fairly easily for earlier versions? Do developers want debug or release binaries? Can we build against the client profile of .NET 3.5/4.0?
  • What should we use to build? I've previously used NAnt for the overall build process and MSBuild for the code building part. While this has worked quite well, I'm nervous of the dependency on the NAnt-Contrib library for the <msbuild> task, and generally being dependent on a build project whose last release was a beta nearly two years ago. Are there better alternatives?
  • How should documentation be created and distributed?
    • Is Sandcastle the best way of building docs? How easy is it to get it running so that any developer can build the docs at any time? (I've previously tried a couple of times, and failed miserably.)
    • Would Monodoc be a better approach?
    • How should non-API documentation be handled? Is the wiki which comes with the Google Code project good enough? Do we need to somehow suck the wiki into an offline format for distribution with the binaries?
  • What do we need to do in order to work in low-trust environments, and how easily can we test that?
  • What do we do about signing? Ship with a "public" snk file which anyone can build with, but have a private version which the team uses to validate a "known good" release? Or just have the private key and use deferred signing?
  • While the library itself will support i18n for things like date/time formatting, do we need to apply it to "developer only" messages such as exceptions?
  • I'm used to testing with NUnit and Rhino.Mocks, but they're not the last word in testing on .NET – what should we use, and why? What about coverage?
  • Do we need any dependencies (e.g. logging)? If so, how do we handle versioning of those dependencies? How are we affected by various licences?

These are all interesting topics, but they're not really specific to Noda Time. Information about them is available all over the place, but that's just the problem – it's all over the place. I would like there to be some sort of documentation saying, "These are the decisions you need to think about, here are the options we chose for Noda Time, and this is why we did so." I don't know what form that documentation will take yet, but I'm considering an ebook.

As you can tell, I'm aiming pretty high with this project – especially as I won't even be using Google's 20% time on it. However, there's little urgency in it for me personally. I want to work out how to do things right rather than how to do them quickly. If it takes me a bit of time to document various decisions, and the code itself ships later, so be it… it'll make the next project that much speedier.

I'm expecting a lot of discussion in the group, and no doubt some significant disagreements. I'm expecting to have to ask a bunch of questions on Stack Overflow, revealing just how ignorant I am on a lot of the topics above (and more). I think it'll be worth it though. I think it's worth setting a goal:

In one year, I want this to be a first-class project which is the natural choice for any developers wanting to do anything more than the simplest of date/time handling on .NET. In one year, I want to have a guide to developing open source class libraries on .NET which tells you everything you need to know other than how to write the code itself.

A year may seem like a long time, but I'm sure everyone who has expressed an interest in the project has significant other commitments – I know I do. Getting there in a year is going to be a stretch – but I'm expecting it to be a very enlightening journey.

Posted by skeet | 50 comment(s)

What's in a name (again)?

I have possibly foolishly decided to stop resisting the urge to port Joda Time to .NET. For those of you who are unaware, "use Joda Time" is almost always the best answer to any question involving "how do I achieve X with java.util.Date/Calendar?" It's a Java library for handling dates and times, and it rocks. There is a plan to include a somewhat redesigned version in some future edition of Java (JSR-310) but it's uncertain whether this will ever happen.

Now, .NET only gained the ability to work with time zones other than UTC and the local time zone (using only managed code) in .NET 3.5 – it has a bit of catching up to do. It's generally easier to work with the .NET BCL than the Java built-in libraries, but it's still not a brilliant position to be in. I think .NET deserves good date/time support, and as no-one else appears to be porting Joda Time, I'm going to do it. (A few people have already volunteered to help. I don't know how easily we'll be able to divvy up the work, but we'll see. I suspect the core may need to be done first, and then people can jump in to implement different chronologies etc. As a side-effect, I may try to use this project as a sort of case study in terms of porting, managing an open source project, and properly implementing a .NET library with useful versioning etc.)

The first two problems, however, are to do with naming. First, the project name. Contenders include:

  • Joda Time.NET (sounds like it would be an absolutely direct port; while I intend to port all the tricky bits directly, it's going to be an idiomatic port with appropriate .NET bits. It's also a bit of a mouthful.)
  • Noda Time (as suggested in the comments and in email)
  • TonyTime (after Tony the Pony)
  • CoffeeTime
  • TeaTime
  • A progression of BreakfastTime, CoffeeTime, LunchTime, TeaTime, DinnerTime and SupperTime for different versions (not a serious contender)
  • ParsleySageRosemaryAndThyme (not a serious contender)
  • A few other silly ones too

I suspect I'm going to go for CoffeeTime, but we'll see.

The second problem is going to prove more awkward. I want to mostly copy the names given in Joda Time – aside from anything else, it'll make it familiar to anyone who uses Joda Time in Java (such as me). Now one of the most commonly used classes in Joda is "DateTime". Using that name in my port would be a Bad Idea. Shadowing a name in the System namespace is likely to lead to very disgruntled users who may prove hard to regruntle before they abandon the library.

So what do I do? Go for the subtly different DateAndTime? Tie it to the library with CoffeeDateTime? Change it to Instant? (It'll derive from AbstractInstant anyway – assuming I keep the same hierarchy instead of moving to a composition model and value types.)

Obviously this is a decision which the "team" can make, when we've got one… but it feels like a decision which is lurking round the corner in a hostile way.

What I find interesting is that these are two very different naming problems: one is trying to name something in a relatively arbitrary way – I know I want something reasonably short and memorable for the overall name, but beyond that it doesn't matter too much. The other is trying to nail a very specific name which really has to convey its meaning clearly… but where the obvious name is already taken. Also interestingly, neither is a particularly good example of my most common issue with naming: attempting to come up with a two or three word noun for something that actually needs a whole sentence to describe it adequately.

Oh well – we'll see what happens. In another blog post I'll suggest some of the goals I have in terms of what I'm hoping to learn from the project, and how I'd like it to progress. In other words, expect a work of complete fiction…

If you're interested in helping out with the project, please mail me directly (rather than adding comments here) and as soon as I've set the project up, I'll invite you to the mailing list.

UPDATE: I've already got a few interested names, which is great. Rather than be dictatorial about this, I'll put it to a vote of the people who are willing to help out on it.

Posted by skeet | 54 comment(s)

Revisiting randomness

Almost every Stack Overflow question which includes the words "random" and "repeated" has the same basic answer. It's one of the most common "gotchas" in .NET, Java, and no doubt other platforms: creating a new random number generator without specifying a seed will depend on the current instant of time. The current time as measured by the computer doesn't change very often compared with how often you can create and use a random number generator – so code which repeatedly creates a new instance of Random and uses it once will end up showing a lot of repetition.
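The underlying behaviour is easy to demonstrate deterministically: two generators created with the same seed produce the same sequence. As far as I'm aware, the parameterless constructor in the .NET Framework of this era takes its seed from the system tick count, so instances created in quick succession typically end up with identical seeds. The sketch below makes the seed explicit so the effect doesn't depend on timing; SeedDemo is a made-up name.

```csharp
using System;

public static class SeedDemo
{
    // Creates a brand new generator for every call, as the problematic
    // code does, and takes a single die roll from it.
    public static int RollWithNewGenerator(int seed)
    {
        return new Random(seed).Next(1, 7);
    }
}
```

Calling SeedDemo.RollWithNewGenerator repeatedly with the same seed always returns the same roll, which is the repetition people see when the clock hasn't moved on between calls.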

One common solution is to use a static field to store a single instance of Random and reuse it. That's okay in Java (where Random is thread-safe) but it's not so good in .NET – if you use the same instance repeatedly from multiple threads, you can corrupt the internal data structures.

A long time ago, I created a StaticRandom class in MiscUtil – essentially, it was just a bunch of static methods (to mirror the instance methods found in Random) wrapping a single instance of Random and locking appropriately. This allows you to just call StaticRandom.Next(1, 7) to roll a die, for example. However, it has a couple of problems:

  • It doesn't scale well in a multi-threaded environment. When I originally wrote it, I benchmarked an alternative approach using [ThreadStatic] and at the time, locking won (at least on my computer, which may well have only had a single core).
  • It doesn't provide any way of getting at an instance of Random, other than by using new Random(StaticRandom.Next()).

The latter point is mostly a problem because it encourages a style of coding where you just use StaticRandom.Next(…) any time you want a random number. This is undoubtedly convenient in some situations, but it goes against the idea of treating a source of randomness as a service or dependency. It makes it harder to get repeatability and to see what needs that dependency.

I could have just added a method generating a new instance into the existing class, but I decided to play with a different approach – going back to per-thread instances, but this time using the ThreadLocal<T> class introduced in .NET 4.0. Here's the resulting code:

using System;
using System.Threading;

namespace RandomDemo
{
    /// <summary>
    /// Convenience class for dealing with randomness.
    /// </summary>
    public static class ThreadLocalRandom
    {
        /// <summary>
        /// Random number generator used to generate seeds,
        /// which are then used to create new random number
        /// generators on a per-thread basis.
        /// </summary>
        private static readonly Random globalRandom = new Random();
        private static readonly object globalLock = new object();

        /// <summary>
        /// Random number generator
        /// </summary>
        private static readonly ThreadLocal<Random> threadRandom = new ThreadLocal<Random>(NewRandom);

        /// <summary>
        /// Creates a new instance of Random. The seed is derived
        /// from a global (static) instance of Random, rather
        /// than time.
        /// </summary>
        public static Random NewRandom()
        {
            lock (globalLock)
            {
                return new Random(globalRandom.Next());
            }
        }

        /// <summary>
        /// Returns an instance of Random which can be used freely
        /// within the current thread.
        /// </summary>
        public static Random Instance { get { return threadRandom.Value; } }

        /// <summary>See <see cref="Random.Next()" /></summary>
        public static int Next()
        {
            return Instance.Next();
        }

        /// <summary>See <see cref="Random.Next(int)" /></summary>
        public static int Next(int maxValue)
        {
            return Instance.Next(maxValue);
        }

        /// <summary>See <see cref="Random.Next(int, int)" /></summary>
        public static int Next(int minValue, int maxValue)
        {
            return Instance.Next(minValue, maxValue);
        }

        /// <summary>See <see cref="Random.NextDouble()" /></summary>
        public static double NextDouble()
        {
            return Instance.NextDouble();
        }

        /// <summary>See <see cref="Random.NextBytes(byte[])" /></summary>
        public static void NextBytes(byte[] buffer)
        {
            Instance.NextBytes(buffer);
        }
    }
}

The user can still call the static Next(…) methods if they want, but they can also get at the thread-local instance of Random by calling ThreadLocalRandom.Instance – or easily create a new instance with ThreadLocalRandom.NewRandom(). (The fact that NewRandom uses the global instance rather than the thread-local one is an implementation detail really; it happens to be convenient from the point of view of the ThreadLocal<T> constructor. It wouldn't be terribly hard to change this.)

Now it's easy to write a method which needs randomness (e.g. to shuffle a deck of cards) and give it a Random parameter, then call it using the thread-local instance:

public void Shuffle(Random rng)
{
    ...
}
...
deck.Shuffle(ThreadLocalRandom.Instance);

The Shuffle method is then easier to test and debug, and expresses its dependency explicitly.

Performance

I tested the "old" and "new" implementations in a very simple way – for varying numbers of threads, I called Next() a fixed number of times (from each thread) and timed how long it took for all the threads to finish. I also tried a .NET-3.5-compatible version using ThreadStatic instead of ThreadLocal<T>, and ran the original code and the ThreadStatic version on .NET 3.5 as well.

Threads  StaticRandom (4.0b2)  ThreadLocalRandom (4.0b2)  ThreadStaticRandom (4.0b2)  StaticRandom (3.5)  ThreadStaticRandom (3.5)
1        11582                 6016                       10150                       10373               16049
2        24667                 7214                       9043                        25062               17257
3        38095                 10295                      14771                       36827               25625
4        49402                 13435                      19116                       47882               34415
A few things to take away from this:

  • The numbers were slightly erratic; somehow it was quicker to do twice the work with ThreadStaticRandom on .NET 4.0b2! This isn't the perfect benchmarking machine; we're interested in trends rather than absolute figures.
  • Locking hasn't changed much in performance between framework versions
  • ThreadStatic has improved massively between .NET 3.5 and 4.0
  • Even on 3.5, ThreadStatic wins over a global lock as soon as there's contention

The only downside to the ThreadLocal<T> version is that it requires .NET 4.0 - but the ThreadStatic version works pretty well too.

It's worth bearing in mind that of course this is testing the highly unusual situation where there's a lot of contention in the global lock version. The performance difference in the single-threaded version where the lock is always uncontended is still present, but very small.
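The ThreadStatic version isn't listed in this post. As a rough sketch of what such a class might look like (the name ThreadStaticRandom is taken from the table, but the body below is an assumption rather than the code that was benchmarked), the per-thread field has to be created lazily, because a field initializer on a [ThreadStatic] field only runs on the first thread to touch the type:

```csharp
using System;

public static class ThreadStaticRandom
{
    // Seed source shared by all threads; guarded by globalLock.
    private static readonly Random globalRandom = new Random();
    private static readonly object globalLock = new object();

    // One value per thread. Deliberately no initializer: it would
    // only run for the first thread that touches the type.
    [ThreadStatic]
    private static Random threadRandom;

    /// <summary>
    /// Returns the Random for the current thread, creating it on first use.
    /// </summary>
    public static Random Instance
    {
        get
        {
            if (threadRandom == null)
            {
                lock (globalLock)
                {
                    threadRandom = new Random(globalRandom.Next());
                }
            }
            return threadRandom;
        }
    }

    public static int Next()
    {
        return Instance.Next();
    }
}
```

The null check is only ever seen by the owning thread, so it doesn't need the lock itself; the lock just protects the shared seed generator.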

Update

After reading the comments and thinking further, I would indeed get rid of the static methods elsewhere. Also, for the purposes of dependency injection, I agree that it's a good idea to have a factory interface where that's not overkill. The factory implementation could use either the ThreadLocal or ThreadStatic implementations, or effectively use the global lock version (by having its own instance of Random and a lock). In many cases I'd regard that as overkill, however.

One other interesting option would be to create a thread-safe instance of Random to start with, which delegated to thread-local "normal" implementations. That would be very useful from a DI standpoint.
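As a sketch of that last idea (ThreadSafeRandom is a hypothetical name, and this is just one way it might be done), a subclass of Random can override the virtual members and forward each call to a per-thread instance. It only covers the members Random had in .NET 4.0; overloads added in later framework versions would need the same treatment.

```csharp
using System;
using System.Threading;

/// <summary>
/// A Random which can be shared between threads: every call is
/// forwarded to a "normal" Random owned by the calling thread.
/// </summary>
public sealed class ThreadSafeRandom : Random
{
    // Used only to seed the per-thread instances; guarded by seedLock.
    private readonly Random seedSource = new Random();
    private readonly object seedLock = new object();
    private readonly ThreadLocal<Random> perThread;

    public ThreadSafeRandom()
    {
        perThread = new ThreadLocal<Random>(CreateForThread);
    }

    private Random CreateForThread()
    {
        lock (seedLock)
        {
            return new Random(seedSource.Next());
        }
    }

    public override int Next()
    {
        return perThread.Value.Next();
    }

    public override int Next(int maxValue)
    {
        return perThread.Value.Next(maxValue);
    }

    public override int Next(int minValue, int maxValue)
    {
        return perThread.Value.Next(minValue, maxValue);
    }

    public override double NextDouble()
    {
        return perThread.Value.NextDouble();
    }

    public override void NextBytes(byte[] buffer)
    {
        perThread.Value.NextBytes(buffer);
    }

    // Sample is the protected hook the base class uses internally.
    protected override double Sample()
    {
        return perThread.Value.NextDouble();
    }
}
```

A single instance could then be registered with a DI container and handed to anything that takes a Random, without callers needing to know about the threading.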

OMG Ponies!!! (Aka Humanity: Epic Fail)

(Meta note: I tried to fix the layout for this, I really did. But my CSS skills are even worse than Tony's. If anyone wants to send me a complete sample of how I should have laid this out, I'll fix it up. Otherwise, this is as good as you're going to get :)

Last week at Stack Overflow DevDays, London I presented a talk on how humanity had made life difficult for software developers. There's now a video of it on Vimeo - the audio is fairly poor at the very start, but it improves pretty soon. At the very end my video recorder ran out of battery, so you've just got my slides (and audio) for that portion. Anyway, here's my slide deck and what I meant to say. (A couple of times I forgot exactly which slide was coming next, unfortunately.)


Good afternoon. This talk will be a little different from the others we've heard today... Joel mentioned on the podcast a long time ago that I'd talk about something "fun and esoteric" – and while I personally find C# 4 fun, I'm not sure that anyone could really call it esoteric. So instead, I thought I'd rant for half an hour about how mankind has made our life so difficult.

By way of introduction, I'm Jon Skeet. You may know me from questions such as Jon Skeet Facts, Why does Jon Skeet never sleep? and a few C# questions here and there. This is Tony the Pony. He's a developer, but I'm afraid he's not a very good one.

(Tony whispers) Tony wants to make it clear that he's not just a developer. He has another job, as a magician. Are you any better at magic than development then? (Tony whispers) Oh, I see. He's not very good at magic either – his repertoire is extremely limited. Basically he's a one trick pony.

Anyway, when it comes to software, Tony gets things done, but he's not terribly smart. He comes unstuck with some of the most fundamental data types we have to work with. It's really not his fault though – humanity has let him down by making things just way too complicated.

You see, the problem is that developers are already meant to be thinking about difficult things... coming up with a better widget to frobjugate the scarf handle, or whatever business problem they're thinking about. They've really got enough to deal with – the simple things ought to be simple.

Unfortunately, time and time again we come up against problems with core elements of software engineering. Any resemblance between this slide and the coding horror logo is truly coincidental, by the way. Tasks which initially sound straightforward become insanely complicated. My aim in this talk is to distribute the blame amongst three groups of people.

First, let's blame users – or mankind as a whole. Users always have an idea that what they want is easy, even if they can't really articulate exactly what they do want. Even if they can give you requirements, chances are those will conflict – often in subtle ways – with requirements of others. A lot of the time, we wouldn't even think of these problems as "requirements" – they're just things that everyone expects to work in "the obvious way". The trouble is that humanity has come up with all kinds of entirely different "obvious ways" of doing things. Mankind's model of the universe is a surprisingly complicated one.

Next, I want to blame architects. I'm using the word "architect" in a very woolly sense here. I'm trying to describe the people who come up with operating systems, protocols, libraries, standards: things we build our software on top of. These are the people who have carefully considered the complicated model used by real people, stroked their beards, and designed something almost exactly as complicated, but not quite compatible with the original.

Finally, I'm going to blame us – common or garden developers. We have four problems: first, we don't understand the complex model designed by mankind. Second, we don't understand the complex model designed by the architects. Third, we don't understand the applications we're trying to build. Fourth, even when we get the first three bits right individually, we still screw up when we try to put them together.

For the rest of this talk, I'm going to give three examples of how things go wrong. First, let's talk about numbers.

You would think we would know how numbers work by now. We've all been doing maths since primary school. You'd also think that computers knew how to handle numbers by now – that's basically what they're built on. How is it that we can search billions of web pages in milliseconds, but we can't get simple arithmetic right? How many times are we going to see Stack Overflow questions along the lines of "Is double multiplication broken in .NET?"

I blame evolution.

We have evolved with 8 fingers and 2 thumbs – a total of 10 digits. This was clearly a mistake. It has led to great suffering for developers. Life would have been a lot simpler if we'd only had eight digits.

Admittedly this gives us three bits, which isn't quite ideal – but having 16 digits (fourteen fingers and two thumbs) or 4 digits (two fingers and two thumbs) could be tricky. At least with eight digits, we'd be able to fit in with binary reasonably naturally. Now just so you don't think I'm being completely impractical, there's another solution – we could have just counted up to eight and ignored our thumbs. Indeed, we could even have used thumbs as parity bits. But no, mankind decided to count to ten, and that's where all the problems started.

Now, Tony – here's a little puzzle for you. I want you to take a look at this piece of Java code (turn Tony to face screen). (Tony whispers) What do you mean you don't know Java? All right, here's the C# code instead...

Is that better? (Tony nods enthusiastically) So, Tony, I want you to tell me the value of d after this line has executed. (Tony whispers)

Tony thinks it's 0.3. Poor Tony. Why on earth would you think that? Oh dear. Sorry, no it's not.

No, you were certainly close, but the exact value is:

0.299999 - Well, I'm not going to read it all out, but that's the exact value. And it is an exact value – the compiler has approximated the 0.3 in the source code to the nearest number which can be exactly represented by a double. It's not the computer's fault that we have this bizarre expectation that a number in our source code will be accurately represented internally.
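The slide with the code isn't reproduced here. Assuming the line was something like "double d = 0.3;", the following is a minimal sketch of how to see the exact value that ends up stored. ExactDouble and ToExactString are made-up names for illustration; the idea is simply to pull apart the IEEE 754 bits and do the arithmetic with BigInteger so that nothing gets rounded on the way out.

```csharp
using System;
using System.Numerics;

public static class ExactDouble
{
    /// <summary>
    /// Returns the exact decimal expansion of a finite double.
    /// </summary>
    public static string ToExactString(double value)
    {
        if (double.IsNaN(value) || double.IsInfinity(value))
        {
            return value.ToString();
        }

        long bits = BitConverter.DoubleToInt64Bits(value);
        bool negative = bits < 0;
        int exponent = (int)((bits >> 52) & 0x7FF);
        long mantissa = bits & 0xFFFFFFFFFFFFFL;

        if (exponent == 0)
        {
            exponent = 1;            // subnormal: no implicit leading bit
        }
        else
        {
            mantissa |= 1L << 52;    // normal: restore the implicit bit
        }
        exponent -= 1075;            // now value == mantissa * 2^exponent

        BigInteger digits = mantissa;
        int decimalPlaces = 0;
        if (exponent >= 0)
        {
            digits <<= exponent;
        }
        else
        {
            // mantissa / 2^n == (mantissa * 5^n) / 10^n
            decimalPlaces = -exponent;
            digits *= BigInteger.Pow(5, decimalPlaces);
        }

        string text = digits.ToString();
        if (decimalPlaces > 0)
        {
            text = text.PadLeft(decimalPlaces + 1, '0');
            text = text.Insert(text.Length - decimalPlaces, ".");
            text = text.TrimEnd('0').TrimEnd('.');
        }
        return (negative ? "-" : "") + text;
    }
}
```

Calling ExactDouble.ToExactString(0.3) gives a long string of digits just below 0.3, whereas a value such as 0.5, which binary can represent exactly, comes back as "0.5".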

Let's take a look at two more numbers... 5 and a half in both cases. Now it doesn't look like these are really different – but they are. Indeed, if I were representing these two numbers in a program, I'd quite possibly use different types for them. The first value is discrete – there's a single jump from £5.50 to £5.51, and those are exact amounts of money... whereas when we measure the mass of something, we always really mean “to two decimal places” or something similar. Nothing weighs exactly five and a half kilograms. They're fundamentally different concepts, they just happen to have the same value. What do you do with them? Well, continuous numbers are often best represented as float/double, whereas discrete decimal numbers are usually best represented using a decimal-based type.
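To make the difference concrete, here's a small sketch (the class and method names are invented for illustration) comparing the same sum in double and in decimal. The decimal type stores base-10 digits, so amounts like 0.1 are held exactly; double stores the nearest binary fraction.

```csharp
using System;

public static class DecimalVersusDouble
{
    // Binary floating point: 0.1, 0.2 and 0.3 are all approximations,
    // and the approximations don't add up neatly.
    public static bool DoubleSumMatches()
    {
        double sum = 0.1 + 0.2;
        return sum == 0.3;
    }

    // Decimal floating point: each literal is stored exactly,
    // so the sum compares equal.
    public static bool DecimalSumMatches()
    {
        decimal sum = 0.1m + 0.2m;
        return sum == 0.3m;
    }
}
```

DoubleSumMatches() returns false and DecimalSumMatches() returns true, which is the usual argument for keeping money in decimal and leaving double for measurements where a tiny relative error doesn't matter.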

Now I've ignored an awful lot of things about numbers which can also trip us up – signed and unsigned, overflow, not-a-number values, infinities, normalised and denormal numbers, parsing and formatting, all kinds of stuff. But we should move on. Next stop, text.

Okay, so numbers aren't as simple as we'd like them to be. Text ought to be easy though, right? I mean, my five year old son can read and write – how hard can it be? One bit of trivia - when I originally copied this out by hand, I missed out "ipsum." Note to self: if you're going to copy out "lorem ipsum", you really, really need to get at least those two words right. Fail.

Of course, I'm sure pretty much everyone here knows that text is actually a pain in the neck. Again, I will blame humanity. Here we have two sets of people using completely different characters, speaking different languages, and quite possibly reading in different directions. Apologies if the characters on the right accidentally spell a rude word, by the way - I just picked a few random Kanji characters from the Unicode charts. (As pointed out in the comments, these aren't actually Kanji characters anyway. They're Katakana characters. Doh!) Cultural diversity has screwed over computing, basically.

However, let's take the fact that we've got lots of characters as a given. Unicode sorts all that out, right? Let's see. Time for a coding exercise – Tony, I'd like you to write some code to reverse a string. (Tony whispers) No, I'm not going to start up Visual Studio for you. (Tony whispers) You've magically written it on the next slide? Okay, let's take a look.

Well, this looks quite promising. We're taking a string, converting it into a character array, reversing that array, and then building a new string. I'm impressed, Tony – you've avoided pointless string concatenation and everything. (Tony is happy.) Unfortunately...

... it's broken. I'm just going to give one example of how it's broken – there are lots of others along the same lines. Let's reverse one of my favourite musicals...

Here's one way of representing Les Miserables as a Unicode string. Instead of using one code point for the “e acute”, I've used a combining character to represent the accent, and then an unaccented ASCII e. Display this in a GUI, and it looks fine... but when we apply Tony's reversing code...

... the combining character ends up after the e, so we get an “s acute” instead. Sorry Tony. The Unicode designers with their fancy schemes have failed you.

EDIT: In fact, not only have the Unicode designers made things difficult, but so have implementers. You see, I couldn't remember whether combining characters came before or after base characters, so I wrote a little Windows Forms app to check. That app displayed "Les Mis\u0301erables" as "Les Misérables". Then, based on the comments below, I checked with the standard – and the Unicode combining marks FAQ indicates pretty clearly that the base character comes before the combining character. Further failure points to both me and someone in Microsoft, unless I'm missing something. Thanks to McDowell for pointing this out in the comments. If I ever give this presentation again, I'll be sure to point it out. WPF gets it right, by the way. Update: this can be fixed in Windows Forms by setting the UseCompatibleTextRendering property to false (or setting the default to false). Apparently the default is set to false when you create a new WinForms project in VS2008. Shame I tend to write "quick check" programs in a plain text editor…

Of course the basic point about reversal still holds, but with the correct starting string you'd end up with an acute over the r, not the s.
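Tony's slide isn't reproduced here, so the following is a reconstruction from the description (character array, Array.Reverse, new string), alongside a sketch of one possible fix. The class and method names are invented. The second method reverses "text elements" as reported by StringInfo, which keeps a base character and its combining marks together; it is still not a complete answer to every Unicode reversal problem.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class StringReversal
{
    // The approach described in the talk: reverse the UTF-16 code units.
    public static string NaiveReverse(string text)
    {
        char[] chars = text.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    // Reverse text elements instead, so that a base character stays
    // in front of the combining marks which belong to it.
    public static string ReverseTextElements(string text)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(text);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        elements.Reverse();
        return string.Concat(elements);
    }
}
```

With the input "Les Mise\u0301rables" (base character first, as the standard requires), NaiveReverse leaves the combining acute immediately after the "r", while ReverseTextElements keeps it attached to the "e".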

It's not like the problems are solely in the realm of non-ASCII characters though. I present to you...

A line break. Or rather, one of the representations of a line break. As if the natural cultural diversity of humanity hasn't caused enough problems, software decided to get involved and have line break diversity. Heck, we're not even just limited to CR, LF and CRLF – Unicode has its own special line terminator character as well, just for kicks.

To prove this isn't just a problem for toy examples, here's something that really bit me, back about 9 or 10 years ago. Here's some code which tries to do a case-insensitive comparison for the text "MAIL" in Java. Can anyone spot the problem?

It fails in Turkey. This is reasonably well known now – there's a page about the “Turkey test” encouraging you to try your applications in a Turkish locale – but at the time it was a mystery to me. If you're not familiar with this, the problem is that if you upper-case an “i” in Turkish, you end up with an “I” with a dot on it. This code went into production, and we had a customer in Turkey whose server was behaving oddly. As you can imagine, if you're not aware of that potential problem, it can take a heck of a long time to find that kind of bug.
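The original was Java and isn't shown here, but the same trap exists in .NET. As a hedged sketch (MailComparison and its methods are hypothetical names), the fragile version upper-cases using a culture and then compares, while the robust version asks for an ordinal, case-insensitive comparison and never consults the culture at all:

```csharp
using System;
using System.Globalization;

public static class MailComparison
{
    // Fragile: the result of ToUpper depends on the culture's casing rules.
    public static bool IsMailFragile(string text, CultureInfo culture)
    {
        return text.ToUpper(culture) == "MAIL";
    }

    // Robust: ordinal comparison ignoring case, independent of culture.
    public static bool IsMailRobust(string text)
    {
        return string.Equals(text, "MAIL", StringComparison.OrdinalIgnoreCase);
    }
}
```

On a system with Turkish culture data available, IsMailFragile("mail", new CultureInfo("tr-TR")) would be expected to return false, because the "i" upper-cases to a dotted capital I; IsMailRobust("mail") returns true regardless of locale.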

Here's some code from a newsgroup post. It's somewhat inefficient code to collapse multiple spaces down to a single one. Leaving aside the inefficiency, it looks like it should work. This was before we had String.Contains, so it's using IndexOf to check whether we've got a double space. While we can find two spaces in a row, we'll replace any occurrence of two spaces with a single space. We're assigning the result of string.Replace back to the same variable, so that's avoided one common problem... so how could this fail?

This string will cause that code to go into a tight loop, due to this evil character here. It's a "zero-width non-joiner" – basically a hint that the two characters either side of it shouldn't be squashed up too closely together. IndexOf ignores it, but Replace doesn't. Ouch.
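The newsgroup code isn't reproduced here. A minimal sketch of the loop as described, with the mismatch removed, might look like this (SpaceCollapser is an invented name). The only change from the description is asking IndexOf for an ordinal search, so that it agrees with Replace about what counts as two spaces in a row; how the culture-sensitive overload treats zero-width characters can vary between framework versions, so being explicit seems the safer choice.

```csharp
using System;

public static class SpaceCollapser
{
    /// <summary>
    /// Collapses runs of spaces down to a single space.
    /// Still inefficient, but the loop condition and the
    /// replacement now use the same (ordinal) rules.
    /// </summary>
    public static string CollapseSpaces(string text)
    {
        while (text.IndexOf("  ", StringComparison.Ordinal) != -1)
        {
            text = text.Replace("  ", " ");
        }
        return text;
    }
}
```

With this version, a string such as "a \u200C b" is left alone: the two spaces are separated by the zero-width non-joiner, so neither the search nor the replacement treats them as adjacent.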

Now I'm not showing these examples to claim I'm some sort of Unicode expert – I'm really, really not. These are just corner cases I happen to have run into. Just like with numbers, I've left out a whole bunch of problems like bidi, encodings, translation, culture-sensitive parsing and the like.

Given the vast array of writing systems the world has come up with – and variations within those systems – any attempt to model text is going to be complicated. The problems come from the inherent complexity, some additional complexity introduced by things like surrogate pairs, and developers simply not having the time to become experts on text processing.

So, we fail at both numbers and text. How about time?

I'm biased when it comes to time-related problems. For the last year or so I've been working on Google's implementation of ActiveSync, mostly focusing on the calendar side of things. That means I've been exposed to more time-based code than most developers... but it's still a reasonably common area, as you can tell from the number of related questions on Stack Overflow.

To make things slightly simpler, let's ignore relativity. Let's pretend that time is linear – after all, most systems are meant to be modelling the human concept of time, which definitely doesn't include relativity.

Likewise, let's ignore leap seconds. This isn't always a good idea, and there are some wrinkles around library support. For example, Java explicitly says that java.util.Date and Calendar may or may not account for leap seconds depending on the host support. So, it's good to know how predictable that makes our software... I've tried reading various explanations of leap seconds, and always ended up with a headache. For the purposes of this talk, I'm going to assert that they don't exist.

Okay, so let's start with something simple. Tony, what's the time on this slide? (Tony whispers) Tony doesn't want to answer. Anyone? (Audience responds.) Yes, about 5 past 3 on October 28th. So what's the difference between now and the time on this slide? (Audience response.) No, it's actually nearly twelve hours... this clock is showing 5 past 3 in the morning. Tony's answer was actually the right one, in many ways... this slide has a hopeless amount of ambiguity. It's not as bad as it might be, admittedly. Imagine if it said October 11th... Jeff and Joel would be nearly a month out of sync with the rest of us. And then even if we get the date and the time right, it's still ambiguous... because of time zones.
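The October 11th problem is easy to reproduce in code. In this sketch (the names are invented, and the year in the usage note is an arbitrary value just to make the string parseable), the same text is parsed with a month-first pattern and a day-first pattern; using ParseExact with the invariant culture keeps the result independent of the machine's regional settings.

```csharp
using System;
using System.Globalization;

public static class AmbiguousDates
{
    // US-style reading: month, then day.
    public static DateTime ParseMonthFirst(string text)
    {
        return DateTime.ParseExact(text, "MM/dd/yyyy", CultureInfo.InvariantCulture);
    }

    // UK-style reading: day, then month.
    public static DateTime ParseDayFirst(string text)
    {
        return DateTime.ParseExact(text, "dd/MM/yyyy", CultureInfo.InvariantCulture);
    }
}
```

Given "10/11/2009", ParseMonthFirst produces October 11th and ParseDayFirst produces November 10th, 30 days apart. Nothing in the string itself says which reading was intended, which is why an unambiguous format such as yyyy-MM-dd is usually a better choice for anything a machine has to read.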

Ah, time zones. My favourite source of WTFs. I could rant for hours about them – but I'll try not to. I'd just like to point out a few of the idiosyncrasies I've encountered. Let's start off with the time zones on this slide. Notice anything strange? (Audience or whisper from Tony) Yes, CST is there three times. Once for Central Standard Time in the US – which is UTC-6. It's also Central Standard Time in Australia – where it's UTC+9.30. It's also Central Summer Time in Australia, where it's UTC+10.30. I think it takes a special kind of incompetence to use the same acronym in the same place for different offsets.

Then let's consider time zones changing. One of the problems I face is having to encode or decode a time zone representation from a single pattern – something like "It's UTC-3 or -2, and daylight savings are applied from the third Sunday in March to the first Sunday in November". That's all very well until the system changes. Some countries give plenty of warning of this... but on October 7th this year, Argentina announced that it wasn't going to use daylight saving time any more... 11 days before its next transition. The reason? Their dams are 90% full. I only heard about this due to one of my unit tests failing. For various complicated reasons, a unit test which expected to recognise the time zone for Godthab actually thought it was Buenos Aires. So due to rainfall thousands of miles away, my unit test had moved Greenland into Argentina. Fail.

If you want more time zone incidents, talk to me afterwards. It's a whole world of pain. I suggest we move away from time zones entirely. In fact, I suggest we adopt a much simpler system of time. I'm proud to present my proposal for coffee time. This is a system which determines the current time based on the answer to the question: "Is it time for coffee?" This is what the clock looks like:

This clock is correct all over the world, is very cheap to produce, and is guaranteed to be accurate forever. Batteries not required.

So where are we?

The real world has failed us. It has concentrated on local simplicity, leading to global complexity. It's easy to organise a meeting if everyone is in the same time zone – but once you get different continents involved, invariably people get confused. It's easy to get writing to work uniformly left to right or uniformly right to left – but if you've got a mixture, it becomes really hard to keep track of. The diversity which makes humanity such an interesting species is the curse of computing.

When computer systems have tried to model this complexity, they've failed horribly. Exhibit A: java.util.Calendar, with its incomprehensible set of precedence rules. Exhibit B: .NET's date and time API, which until relatively recently didn't let you represent any time zone other than UTC or the one local to the system.

Developers have, collectively, failed to understand both the models and the real world. We only need one exhibit this time: the questions on Stack Overflow. Developers asking questions around double, or Unicode, or dates and times aren't stupid. They've just been concentrating on other topics. They've made an assumption that the core building blocks of their trade would be simple, and it turns out they're not.

This has all been pretty negative, for which I apologise. I'm not going to claim to have a complete solution to all of this – but I do want to give a small ray of hope. All this complexity can be managed to some extent, if you do three things.

First, try not to take on more complexity than you need. If you can absolutely guarantee that you won't need to translate your app, it'll make your life a lot easier. If you don't need to deal with different time zones, you can rejoice. Of course, if you write a lot of code under a set of assumptions which then changes, you're in trouble... but quite often you can take the "You ain't gonna need it" approach.

Next, learn just enough about the problem space so that you know more than your application's requirements. You don't need to know everything about Unicode – but you need to be aware of which corner cases might affect your application. You don't need to know everything about denormal number representation, but you may well need to know how rounding should be applied in your reports. If your knowledge is just a bit bigger than the code you need to write, you should be able to be reasonably comfortable.

Pick the right platforms and libraries. Yes, there are some crummy frameworks around. There are also some good ones. What's the canonical answer to almost any question about java.util.Calendar? Use Joda Time instead. There are similar libraries like ICU – written by genuine experts in these thorny areas. The difference a good library can make is absolutely enormous.

None of this will make you a good developer. Tony's still likely to mis-spell his "main" method through force of habit. You're still going to get off-by-one errors. You're still going to forget to close database connections. But if you can at least get a handle on some of the complexity of software engineering, it's a start.

Thanks for listening.

Contract classes and nested types within interfaces

I've just been going through some feedback for the draft copy of the second edition of C# in Depth. In the contracts section, I have an example like this:

[ContractClass(typeof(ICaseConverterContracts))]
public interface ICaseConverter
{
    string Convert(string text);
}

[ContractClassFor(typeof(ICaseConverter))]
internal class ICaseConverterContracts : ICaseConverter
{
    string ICaseConverter.Convert(string text)
    {
        Contract.Requires(text != null);
        Contract.Ensures(Contract.Result<string>() != null);
        return default(string);
    }

    private ICaseConverterContracts() {}
}

public class InvariantUpperCaseFormatter : ICaseConverter
{
    public string Convert(string text) 
    {
        return text.ToUpperInvariant();
    }
}

The point is to demonstrate how contracts can be specified for interfaces, and then applied automatically to implementations. In this case, ICaseConverter is the interface, ICaseConverterContracts is the contract class which specifies the contract for the interface, and InvariantUpperCaseFormatter is the real implementation. The binary rewriter effectively copies the contract into each implementation, so you don't need to duplicate the contract in the source code.

The reader feedback asked where the contract class code should live - should it go in the same file as the interface itself, or in a separate file as normal? Now normally, I'm firmly of the "one top-level type per file" persuasion, but in this case I think it makes sense to keep the contract class with the interface. It has no meaning without reference to the interface, after all - it's not a real implementation to be used in the normal way. It's essentially metadata. This does, however, leave me feeling a little bit dirty. What I'd really like to be able to do is nest the contract class inside the interface, just like I do with other classes which are tightly coupled to an "owner" type. Then the code would look like this:

[ContractClass(typeof(ICaseConverterContracts))]
public interface ICaseConverter
{
    string Convert(string text);

    [ContractClassFor(typeof(ICaseConverter))]
    internal class ICaseConverterContracts : ICaseConverter
    {
        string ICaseConverter.Convert(string text)
        {
            Contract.Requires(text != null);
            Contract.Ensures(Contract.Result<string>() != null);
            return default(string);
        }

        private ICaseConverterContracts() {}
    }
}

public class InvariantUpperCaseFormatter : ICaseConverter
{
    public string Convert(string text) 
    {
        return text.ToUpperInvariant();
    }
}

That would make me feel happier - all the information to do with the interface would be specified within the interface type's code. It's possible that with that as a convention, the Code Contracts tooling could cope without the attributes - if interface IFoo contains a nested class IFooContracts which implements IFoo, assume it's a contract class and handle it appropriately. That would be sweet.

You know the really galling thing? I'm pretty sure VB does allow nested types in interfaces...

Posted by skeet | 9 comment(s)

MVP Again

I'm delighted to be able to announce that I'm now an MVP again.

Google has reconsidered the situation and worked out a compromise: I now receive no significant gifts from Microsoft, and I'm not under NDA with them. While that precludes me from a lot of MVP activities, it removes any concerns to do with Google's Code of Conduct. Basically my MVP status is truly just a token of Microsoft's recognition of what I've done in the C# community - and that's fine by me.

When I announced that I'd been advised not to seek renewal, I was amazed at the scale of the reaction in the comments, other blog posts, Twitter and personal email. I was touched by the response of the community. I really love working at Google, and the fact that we could figure out a solution to this situation is definitely one of the things that makes Google such an awesome place to work. Oh, and did I mention that we're hiring? :)

Anyway, the basic message of this post is: thanks to the community for caring, thanks to Google for reconsidering, and thanks to Microsoft for renewing my award. And they all lived happily ever after...


Migrating from Visual Studio 2010 beta 1 to beta 2 – solution file change required

Having installed Visual Studio 2010 beta 2 on my freshly-reinstalled netbook (now with Windows 7 and an SSD – yummy) I found that my solution file from Visual Studio 2010 beta 1 wasn’t recognised properly: double-clicking on the file didn’t do anything. Opening the solution file manually was absolutely fine, but slightly less convenient than being able to double-click.

After a bit of investigation, I’ve found the solution. Manually edit the solution file, and change the first few lines from this:

Microsoft Visual Studio Solution File, Format Version 11.00
# Visual Studio 10

to this:

Microsoft Visual Studio Solution File, Format Version 11.00
# Visual Studio 2010

It's just a case of changing "10" to "2010".

Hopefully between this and the linked SuperUser post, this should avoid others feeling the same level of bafflement :)


Iterating atomically

The IEnumerable<T> and IEnumerator<T> interfaces in .NET are interesting. They crop up an awful lot, but hardly anyone ever calls them directly - you almost always use a foreach loop to iterate over the collection. That hides all the calls to GetEnumerator(), MoveNext() and Current. Likewise iterator blocks hide the details when you want to implement the interfaces. However, sometimes details matter - such as for this recent Stack Overflow question. The question asks how to create a thread-safe iterator - one that can be called from multiple threads. This is not about iterating over a collection n times independently on n different threads - this is about iterating over a collection once without skipping or duplicating. Imagine it's some set of jobs that we have to complete. We assume that the iterator itself is thread-safe to the extent that calls from different threads at different times, with intervening locks, will be handled reasonably. That's a fair assumption - basically, so long as it isn't going out of its way to be thread-hostile, we should be okay. We also assume that no-one is trying to write to the collection at the same time.

Sounds easy, right? Well, no... because the IEnumerator<T> interface has two members which we effectively want to call atomically. In particular, we don't want the collection { "a", "b" } to be iterated like this:

Thread 1      Thread 2
MoveNext()
              MoveNext()
Current
              Current

That way we'll end up not processing the first item at all, and the second item twice.

There are two ways of approaching this problem. In both cases I've started with IEnumerable<T> for consistency, but in fact it's IEnumerator<T> which is the interesting bit. In particular, we're not going to be able to iterate over our result anyway, as each thread needs to have the same IEnumerator<T> - which it won't do if each of them uses foreach (which calls GetEnumerator() to start with).

Fix the interface

First we'll try to fix the interface to look how it should have looked to start with, at least from the point of view of atomicity. Here are the new interfaces:

public interface IAtomicEnumerable<T>
{
    IAtomicEnumerator<T> GetEnumerator();
}

public interface IAtomicEnumerator<T>
{
    bool TryMoveNext(out T nextValue);
}

One thing you may notice is that we're not implementing IDisposable. That's basically because it's a pain to do so when you think about a multi-threaded environment. Indeed, it's possibly one of the biggest arguments against something of this nature. At what point do you dispose? Just because one thread finished doesn't mean that the rest of them have... don't forget that "finish" might mean "an exception was thrown while processing the job, I'm bailing out". You'd need some sort of co-ordinator to make sure that everyone is finished before you actually do any clean-up. Anyway, the nice thing about this being a blog post is we can ignore that little thorny issue :)

The important point is that we now have a single method in IAtomicEnumerator<T> - TryMoveNext, which works the way you'd expect it to. It atomically attempts to move to the next item, returns whether or not it succeeded, and sets an out parameter with the next value if it did succeed. Now there's no chance of two threads using the method and stomping on each other's values (unless they're silly and use the same variable for the out parameter).

It's reasonably easy to wrap the standard interfaces in order to implement this interface:

/// <summary>
/// Wraps a normal IEnumerable[T] up to implement IAtomicEnumerable[T].
/// </summary>
public sealed class AtomicEnumerable<T> : IAtomicEnumerable<T>
{
    private readonly IEnumerable<T> original;

    public AtomicEnumerable(IEnumerable<T> original)
    {
        this.original = original;
    }

    public IAtomicEnumerator<T> GetEnumerator()
    {
        return new AtomicEnumerator(original.GetEnumerator());
    }

    /// <summary>
    /// Implementation of IAtomicEnumerator[T] to wrap IEnumerator[T].
    /// </summary>
    private sealed class AtomicEnumerator : IAtomicEnumerator<T>
    {
        private readonly IEnumerator<T> original;
        private readonly object padlock = new object();

        internal AtomicEnumerator(IEnumerator<T> original)
        {
            this.original = original;
        }

        public bool TryMoveNext(out T value)
        {
            lock (padlock)
            {
                bool hadNext = original.MoveNext();
                value = hadNext ? original.Current : default(T);
                return hadNext;
            }
        }
    }
}

Just ignore the fact that I never dispose of the original IEnumerator<T> :)

We use a simple lock to make sure that MoveNext() and Current always happen together - that nothing else is going to call MoveNext() between TryMoveNext() calling it and fetching the current value.

Obviously you'd need to write your own code to actually use this sort of iterator, but it would be quite simple:

T value;
while (iterator.TryMoveNext(out value))
{
    // Use value
}
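
As a rough sketch of how that loop might look with several threads sharing one iterator, here's a self-contained sample. LockingEnumerator<T> is a condensed stand-in for the AtomicEnumerator wrapper above (just so the sample compiles on its own), and SharedIterationDemo and ProcessAll are names I've made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

// Condensed stand-in for the AtomicEnumerator wrapper, so this sample compiles alone.
public sealed class LockingEnumerator<T>
{
    private readonly IEnumerator<T> original;
    private readonly object padlock = new object();

    public LockingEnumerator(IEnumerator<T> original)
    {
        this.original = original;
    }

    public bool TryMoveNext(out T value)
    {
        lock (padlock)
        {
            bool hadNext = original.MoveNext();
            value = hadNext ? original.Current : default(T);
            return hadNext;
        }
    }
}

public static class SharedIterationDemo
{
    // Drains one shared iterator from several threads and returns everything that was seen.
    public static List<int> ProcessAll(IEnumerable<int> jobs, int threadCount)
    {
        var iterator = new LockingEnumerator<int>(jobs.GetEnumerator());
        var processed = new List<int>();
        var threads = new List<Thread>();

        for (int i = 0; i < threadCount; i++)
        {
            var thread = new Thread(() =>
            {
                int value; // one local per thread, so the out parameters never collide
                while (iterator.TryMoveNext(out value))
                {
                    lock (processed)
                    {
                        processed.Add(value);
                    }
                }
            });
            threads.Add(thread);
            thread.Start();
        }
        threads.ForEach(t => t.Join());
        return processed;
    }

    public static void Main()
    {
        List<int> processed = ProcessAll(Enumerable.Range(1, 1000), 4);
        Console.WriteLine("Processed {0} jobs, {1} distinct",
            processed.Count, processed.Distinct().Count());
    }
}
```

Each job should turn up exactly once, although the order in which they're processed will vary from run to run.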

However, you may already have code which wants to use an IEnumerator<T>. Let's see what else we can do.

Using thread local variables to fake it

.NET 4.0 has a very useful type called ThreadLocal<T>. It does basically what you'd expect it to, with nice features such as being able to supply a delegate to be executed on each thread to provide the initial value. We can use a thread local to make sure that so long as we call both MoveNext() and Current atomically when we're asked to move to the next element, we can get back the right value for Current later on. It has to be thread local because we're sharing a single IEnumerator<T> across multiple threads - each needs its own separate storage.
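
If you've not used ThreadLocal<T> before, this small sketch shows the two features we care about: the delegate supplying an initial value on each thread, and each thread only ever seeing its own writes. ThreadLocalDemo and CountMismatches are hypothetical names for the purposes of the example.

```csharp
using System;
using System.Threading;

public static class ThreadLocalDemo
{
    // Returns the number of threads that saw anything other than their own value.
    public static int CountMismatches(int threadCount)
    {
        int mismatches = 0;
        // The delegate runs once per thread, on first access, to supply the initial value.
        using (var local = new ThreadLocal<int>(() => -1))
        {
            var threads = new Thread[threadCount];
            for (int i = 0; i < threadCount; i++)
            {
                int id = i; // copy the loop variable so each lambda captures its own id
                threads[i] = new Thread(() =>
                {
                    bool sawInitial = local.Value == -1;
                    local.Value = id;
                    Thread.Sleep(10); // let the other threads write their values too
                    if (!sawInitial || local.Value != id)
                    {
                        Interlocked.Increment(ref mismatches);
                    }
                });
                threads[i].Start();
            }
            foreach (Thread thread in threads)
            {
                thread.Join();
            }
        }
        return mismatches;
    }

    public static void Main()
    {
        // Prints: Mismatches: 0
        Console.WriteLine("Mismatches: {0}", CountMismatches(4));
    }
}
```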

This is also the approach we'd use if we wanted to wrap an IAtomicEnumerator<T> in an IEnumerator<T>, by the way. Here's the code to do it:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Threading;

public class ThreadSafeEnumerable<T> : IEnumerable<T>
{
    private readonly IEnumerable<T> original;

    public ThreadSafeEnumerable(IEnumerable<T> original)
    {
        this.original = original;
    }

    public IEnumerator<T> GetEnumerator()
    {
        return new ThreadSafeEnumerator(original.GetEnumerator());
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }

    private sealed class ThreadSafeEnumerator : IEnumerator<T>
    {
        private readonly IEnumerator<T> original;
        private readonly object padlock = new object();
        private readonly ThreadLocal<T> current = new ThreadLocal<T>();

        internal ThreadSafeEnumerator(IEnumerator<T> original)
        {
            this.original = original;
        }

        public bool MoveNext()
        {
            lock (padlock)
            {
                bool ret = original.MoveNext();
                if (ret)
                {
                    current.Value = original.Current;
                }
                return ret;
            }
        }

        public T Current
        {
            get { return current.Value; }
        }

        public void Dispose()
        {
            original.Dispose();
            current.Dispose();
        }

        object IEnumerator.Current
        {
            get { return Current; }
        }

        public void Reset()
        {
            throw new NotSupportedException();
        }
    }
}

I'm going to say it one last time - we're broken when it comes to disposal. There's no way of safely disposing of the original iterator at "just the right time" when everyone's finished with it. Oh well.

Other than that, it's quite simple. This code has the serendipitous property of actually implementing IEnumerator<T> slightly better than C#-compiler-generated implementations from iterator blocks - if you call the Current property without having called MoveNext(), this will throw an InvalidOperationException, just as the documentation says it should. (It doesn't do the same at the end, admittedly, but that's fixable if we really wanted to be pedantic.)

Conclusion

I found this an intriguing little problem. I think there are better ways of solving the bigger picture - a co-ordinator which takes care of disposing exactly once, and which possibly mediates the original iterator etc is probably the way forward... but I enjoyed thinking about the nitty gritty.
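
For what it's worth, here's one possible shape for such a co-ordinator - very much a sketch under my own assumptions rather than a finished design. It owns the iterator, hands items out under a lock, waits for every worker to finish (or fail), and only then disposes. IterationCoordinator, CoordinatorDemo and all their members are invented names.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

public static class IterationCoordinator
{
    // Runs work on threadCount threads over one shared iterator, and disposes of
    // that iterator exactly once, after every worker has finished or failed.
    public static void Run<T>(IEnumerable<T> source, int threadCount, Action<T> work)
    {
        IEnumerator<T> iterator = source.GetEnumerator();
        object padlock = new object();
        var failures = new List<Exception>();
        try
        {
            var threads = new List<Thread>();
            for (int i = 0; i < threadCount; i++)
            {
                var thread = new Thread(() =>
                {
                    try
                    {
                        while (true)
                        {
                            T item;
                            lock (padlock) // MoveNext() and Current happen together
                            {
                                if (!iterator.MoveNext()) return;
                                item = iterator.Current;
                            }
                            work(item); // the job itself runs outside the lock
                        }
                    }
                    catch (Exception e)
                    {
                        // This worker bails out; the others carry on.
                        lock (failures) { failures.Add(e); }
                    }
                });
                threads.Add(thread);
                thread.Start();
            }
            threads.ForEach(t => t.Join());
        }
        finally
        {
            iterator.Dispose(); // only reached once all workers have stopped
        }
        if (failures.Count > 0)
        {
            throw new AggregateException(failures);
        }
    }
}

public static class CoordinatorDemo
{
    public static int CleanupCount;

    public static IEnumerable<int> Jobs()
    {
        try
        {
            for (int i = 0; i < 100; i++) yield return i;
        }
        finally
        {
            CleanupCount++; // runs once, on completion or disposal
        }
    }

    public static int SumAllJobs()
    {
        int total = 0;
        IterationCoordinator.Run(Jobs(), 4, job => Interlocked.Add(ref total, job));
        return total;
    }

    public static void Main()
    {
        Console.WriteLine("Total: {0}, clean-ups: {1}", SumAllJobs(), CleanupCount);
    }
}
```

The price is that the callers no longer see an iterator at all - they just supply the work to be done for each item - which is rather the point.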

Generally speaking, I prefer the first of these approaches. Thread local variables always feel like a bit of a grotty hack to me - they can be useful, but it's better to avoid them if you can. It's interesting to see how an interface can be inherently thread-friendly or not.

One last word of warning - this code is completely untested. It builds, and I can't immediately see why it wouldn't work, but I'm making no guarantees...

Generic collections - relegate to an appendix?

(I tweeted a brief version of this suggestion and the results have been overwhelmingly positive so far, but I thought it would be worth fleshing out anyway.)

I'm currently editing chapter 3 of C# in Depth. In the first edition, it's nearly 48 pages long - the longest in the book, and longer than I want it to be.

One of the sections in there (only 6 pages, admittedly) is a description of various .NET 2.0 collections. However, it's mostly comparing them with the nongeneric collections from .NET 1.0, which probably isn't relevant any more. I suspect my readership has now moved on from "I only know C# 1" to "I've used C# 2 and I'm reasonably familiar with the framework, but I want to know the details of the language."

I propose moving the collections into an appendix. This will mean:

  • I'll cover all versions of .NET, not just 2.0
  • It will all be done in a fairly summary form, like the current appendix. (An appendix doesn't need as much of a narrative structure as a main chapter, IMO.)
  • I'll cover the interfaces as well as the classes - possibly even with pictures (type hierarchies)!
  • Chapter 3 can be a bit slimmer (although I've been adding a little bit here and there, so I'm not going to save a massive amount)
  • It will be easier to find as a quick reference (and I'll write it in a way which makes it easy to use as a reference too, hopefully)
  • I don't have to edit it right now :)

Does this sound like a plan? I don't know why I didn't think of it before, but I think it's the right move. In particular, it's in keeping with the LINQ operator coverage in the existing appendix.

Posted by skeet | 16 comment(s)

MVP no more

It's with some sadness that I have to announce that as of the start of October, I'm no longer a Microsoft MVP.

As renewal time came round again, I asked my employer whether it was okay for me to renew, and was advised not to do so. As a result, while I enjoyed being awarded as an MVP, I've asked not to be considered for renewal this year.

This doesn't mean I'm turning my back on that side of software development, of course. I'm still going to be an active member of the C# community. I'm still writing the second edition of C# in Depth. I'm still going to post on Stack Overflow. I'm still going to blog here about whatever interesting and wacky topics crop up.

I just won't be doing so as an MVP.

Thanks to all the friends I've made in the MVP community and Microsoft over the last 6 years, and I wish you all the best.

Keep in touch.

Posted by skeet | 84 comment(s)

An object lesson in blogging and accuracy; was: Efficient "vote counting" with LINQ to Objects - and the value of nothing

Well, this is embarrassing.

Yesterday evening, I excitedly wrote a blog post about an interesting little idea for making a particular type of LINQ query (basically vote counting) efficient. It was an idea that had occurred to me a few months back, but I hadn't got round to blogging about it.

The basic idea was to take a completely empty struct, and use that as the element type in the results of a grouping query - as the struct was empty, it would take no space, therefore "huge" arrays could be created for no cost beyond the fixed array overhead, etc. I carefully checked that the type used for grouping did in fact implement ICollection<T> so that the Count method would be efficient; I wrote sample code which made sure my queries were valid... but I failed to check that the empty struct really took up no memory.

Fortunately, I have smart readers, a number of whom pointed out my mistake in very kind terms.

Ben Voigt gave the reason for the size being 1 in a comment:

The object identity rules require a unique address for each instance... identity can be shared with super- or sub- class objects (Empty Base Optimization) but the total size of the instance has to be at least 1.

This makes perfect sense - it's just a shame I didn't realise it before.
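
Here's the sort of check I should have run first - a quick sketch, with Empty and EmptyStructDemo being names made up for this example. Note that Marshal.SizeOf reports the marshalled size rather than the managed layout, so treat it as a sanity check; the before/after memory figure is only approximate, which is why I'm not quoting an exact number for it.

```csharp
using System;
using System.Runtime.InteropServices;

public struct Empty { }

public static class EmptyStructDemo
{
    // Size in bytes that the marshaller reports for the empty struct.
    public static int MarshalledSize()
    {
        return Marshal.SizeOf(typeof(Empty));
    }

    // Rough number of bytes the heap grows by when allocating an array of Empty.
    public static long ApproximateArrayCost(int length)
    {
        long before = GC.GetTotalMemory(true);
        var votes = new Empty[length];
        long after = GC.GetTotalMemory(true);
        GC.KeepAlive(votes); // keep the array alive until after the second measurement
        return after - before;
    }

    public static void Main()
    {
        Console.WriteLine("Marshalled size: {0}", MarshalledSize());
        Console.WriteLine("Approximate cost of 1,000,000 elements: {0} bytes",
            ApproximateArrayCost(1000000));
    }
}
```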

Live and learn, I guess - but apologies for the poorly researched post. I'll attempt to be more careful next time.

Posted by skeet | 11 comment(s)