Very quick interlude: I know that CAPTCHAs aren't working on this blog if you're using Chrome

This is just a quick post to hopefully stem the tide of people commenting (in blog comments and Twitter messages) saying that this blog is broken when it comes to CAPTCHAs and Chrome. I know, but there's nothing I can do about it - I don't run the software on the blog. I haven't looked into why it's not working... if you open the image in a new tab, it displays that fine. Weird, but such is life... I'm hoping that an update to either Community Server or Chrome will fix it eventually.

Posted by skeet | 1 comment(s)
Filed under:

Reimplementing LINQ to Objects: Part 2 - "Where"

Warning: this post is quite long. Although I've chosen a simple operator to implement, we'll encounter a few of the corner cases and principles involved in LINQ along the way. This will also be a somewhat experimental post in terms of format, as I try to work out the best way of presenting the material.

We're going to implement the "Where" clause/method/operator. It's reasonably simple to understand in general, but goes into all of the deferred execution and streaming bits which can cause problems. It's generic, but only uses one type parameter (which is a big deal, IMO - the more type parameters a method has, the harder I find it to understand in general). Oh, and it's a starting point for query expressions, which is a bonus.

What is it?

"Where" has two overloads:

public static IEnumerable<TSource> Where(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static IEnumerable<TSource> Where(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)

Before I go into what it actually does, I'll point out a few things which are going to be common across nearly all of the LINQ operators we'll be implementing:

  • These are extension methods - they're declared in a top-level, non-nested, static class and the first parameter has a "this" modifier. They can be invoked roughly as if they were instance methods on the type of the first parameter.
  • They're generic methods - in this case there's just a single type parameter (TSource) which indicates the type of sequence we're dealing with. So for (say) a list of strings, TSource would be string.
  • They take generic delegates in the Func<...> family. These are usually specified with lambda expressions, but any other way of providing a delegate will work fine too.
  • They deal with sequences. These are represented by IEnumerable<T>, with an iterator over a sequence being represented by IEnumerator<T>.

I fully expect that most readers will be comfortable with all of these concepts, so I won't go into them in more detail. If any of the above makes you nervous, please familiarise yourself with them before continuing, otherwise you're likely to have a hard time.

The purpose of "Where" is to filter a sequence. It takes an input sequence and a predicate, and returns another sequence. The output sequence is of the same element type (so if you put in a sequence of strings, you'll get a sequence of strings out) and it will only contain elements from the input sequence which pass the predicate. (Each item will be passed to the predicate in turn. If the predicate returns true, the item will be part of the output sequence; otherwise it won't.)

Now, a few important details about the behaviour:

  • The input sequence is not modified in any way: this isn't like List<T>.RemoveAll, for example.
  • The method uses deferred execution - until you start trying to fetch items from the output sequence, it won't start fetching items from the input sequence
  • Despite deferred execution, it will validate that the parameters aren't null immediately
  • It streams its results: it only ever needs to look at one result at a time, and will yield it without keeping a reference to it. This means you can apply it to an infinitely long sequence (for example a sequence of random numbers)
  • It will iterate over the input sequence exactly once each time you iterate over the output sequence
  • Disposing of an iterator over the output sequence will dispose of the corresponding iterator over the input sequence. (In case you didn't realise, the foreach statement in C# uses a try/finally block to make sure the iterator is always disposed however the loop finishes.)

Many of these points will be true for a lot of our other operators too.

The overload which takes a Func<TSource, int, bool> lets the predicate use the index within the sequence as well as the value. The index always starts at 0, and increments by 1 each time regardless of previous results from the predicate.

What are we going to test?

Ideally, we'd like to test all of the points above. The details of streaming and how many times the sequence is iterated over are frankly a pain to deal with, unfortunately. Given how much there is to implement already, we'll come back to those.

Let's have a look at some tests. First, here's a simple "positive" test - we're starting with an array of integers, and using a lambda expression to only include elements less than 4 in the output. (The word "filter" is omnipresent but unfortunate. It's easier to talk about "filtering out" elements than "including" them, but the predicate is expressed in a positive way.)

[Test]
public void SimpleFiltering()
{
    int[] source = { 1, 3, 4, 2, 8, 1 };
    var result = source.Where(x => x < 4);
    result.AssertSequenceEqual(1, 3, 2, 1);
}

I've kept the TestExtensions from MoreLINQ, despite NUnit coming with CollectionAssert. I find the extension methods easier to work with for three reasons:

  • They're extension methods, which helps to reduce the clutter
  • They can use a parameter array for the expected output, which makes the test simpler to express
  • The message is clearer when the assertion fails

Basically, AssertSequenceEqual does what you'd expect it to - it checks that the actual result (usually expressed as the variable you call the method on) matches the expected result (usually expressed as a parameter array).

So far, so good. Now let's check argument validation:

[Test]
public void NullSourceThrowsNullArgumentException()
{
    IEnumerable<int> source = null;
    Assert.Throws<ArgumentNullException>(() => source.Where(x => x > 5));
}

[Test]
public void NullPredicateThrowsNullArgumentException()
{
    int[] source = { 1, 3, 7, 9, 10 };
    Func<int, bool> predicate = null;
    Assert.Throws<ArgumentNullException>(() => source.Where(predicate));
}

I'm not bothering to check the name within the ArgumentNullException, but importantly I'm testing that the arguments are being validated immediately. I'm not trying to iterate over the result - so if the validation is deferred, the test will fail.

The final interesting test for the moment is also around deferred execution, using a helper class called ThrowingEnumerable. This is a sequence which blows up with an InvalidOperationException if you ever try to iterate over it. Essentially, we want to check two things:

  • Just calling Where doesn't start iterating over the source sequence
  • When we call GetEnumerator() to get an iterator and then MoveNext() on that iterator, we should start iterating, causing an exception to be thrown.

We'll need to do something similar for other operators, so I've written a small helper method in ThrowingEnumerable:

internal static void AssertDeferred<T>(
    Func<IEnumerable<int>, IEnumerable<T>> deferredFunction)
{
    ThrowingEnumerable source = new ThrowingEnumerable();
    var result = deferredFunction(source);
    using (var iterator = result.GetEnumerator())
    {
        Assert.Throws<InvalidOperationException>(() => iterator.MoveNext());
    }
}

Now we can use that to check that Where really defers execution:

[Test]
public void ExecutionIsDeferred()
{
    ThrowingEnumerable.AssertDeferred(src => src.Where(x => x > 0));
}

These tests have all dealt with the simpler predicate overload - where the predicate is only passed the item, not the index. The tests involving the index are very similar.

Let's implement it!

With all the tests passing when running against the real LINQ to Objects, it's time to implement it in our production code. We're going to use iterator blocks, which were introduced in C# 2 to make it easier to implement IEnumerable<T>. I have a couple of articles you can read if you want more background details... or read chapter 6 of C# in Depth (either edition). These give us deferred execution for free... but that can be a curse as well as a blessing, as we'll see in a minute.

At its heart, the implementation is going to look something like this:

// Naive implementation
public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}

Simple, isn't it? Iterator blocks allow us to write the code pretty much how we'd describe it: we iterate over each item in the source, and if the predicate returns true for that particular item, we yield (include) it in the output sequence.

Lo and behold, some of our tests pass already. Now we just need argument validation. That's easy, right? Let's give it a go:

// Naive validation - broken!
public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}

Hmm. Our validation tests still seem to be red, and putting a breakpoint on the "throw" statements doesn't help... they're not getting hit. What's going on?

I've given a few pretty broad hints already. The problem is deferred execution. Until we start trying to iterate over the result, none of our code will run. Our tests deliberately don't iterate over the result, so validation is never performed.

We've just hit a design flaw in C#. Iterator blocks in C# simply don't work nicely when you want to split execution between "immediate" (usually for validation) and "deferred". Instead, we have to split our implementation into two: a normal method for validation, which then calls the iterator method for the deferred processing:

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return WhereImpl(source, predicate);
}

private static IEnumerable<TSource> WhereImpl<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            yield return item;
        }
    }
}

It's ugly, but it works: all our index-less tests go green. From here, it's a short step to implement the version using an index as well:

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<T, int, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return WhereImpl(source, predicate);
}

private static IEnumerable<TSource> WhereImpl<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int, bool> predicate)
{
    int index = 0;
    foreach (TSource item in source)
    {
        if (predicate(item, index))
        {
            yield return item;
        }
        index++;
    }
}

Now the bar is green, and we're done. Hang on though... we haven't used it every way we can yet.

Query expressions

So far, we've been calling the method directly (although as an extension method) - but LINQ also provides us with query expressions. Here's our "SimpleFiltering" test rewritten to use a query expression:

[Test]
public void QueryExpressionSimpleFiltering()
{
    int[] source = { 1, 3, 4, 2, 8, 1 };
    var result = from x in source
                 where x < 4
                 select x;
    result.AssertSequenceEqual(1, 3, 2, 1);
}

(Note that the name is different here to in the downloadable code, to stop the blog server software blocking the name of the method. Grr.)

That will produce exactly the same code as our earlier test. The compiler basically translates this form into the previous one, leaving the condition (x < 4) as a lambda expression and then converting it appropriately (into a delegate in this case). You may be surprised that this works as we have no Select method yet...  but in this case we have a "no-op" select projection; we're not actually performing a real transformation. In that case - and so long as there's something else in the query, in this case our "where" clause - the compiler effectively omits the "select" clause, so it doesn't matter that we haven't implemented it. If you changed "select x" to "select x * 2", it would fail to compile against our Where-only LINQ implementation.

The fact that query expressions are just based on patterns like this is a very powerful feature for flexibility - it's how LINQ to Rx is able to only implement the operators that make sense in that environment, for example. Similarly, there's nothing in the C# compiler that "knows" about IEnumerable<T> when it comes to query expressions - which is how entirely separate interfaces such as IObservable<T> work just as well.

What have we learned?

There's been a lot to take in here, in terms of both implementation and core LINQ principles:

  • LINQ to Objects is based on extension methods, delegates and IEnumerable<T>
  • Operators use deferred execution where appropriate and stream their data where possible
  • Operators don't mutate the original source, but instead return a new sequence which will return the appropriate data
  • Query expressions are based on compiler translations of patterns; you don't need to implement any more of the pattern than the relevant query expression requires
  • Iterator blocks are great for implementing deferred execution...
  • ... but make eager argument validation a pain

Code download

Linq-To-Objects-2.zip

Many people have asked for a source repository for the project, and that makes sense. I'm putting it together a source repository for it now; it's likely to be done before I post the next part.

Reimplementing LINQ to Objects: Part 1 - Introduction

About a year and a half ago, I gave a talk at a DDD day in Reading, attempting to reimplement as much of LINQ to Objects as possible in an hour. Based on the feedback from the session, I went far too fast... and I was still a long way from finishing. However, I still think it's an interesting exercise, so I thought I'd do it again in a more leisurely way, blogging as I go. Everything will be under the "Reimplementing LINQ to Objects" tag, so that's the best way to get all the parts in order, without any of my other posts.

General approach

The plan is to reimplement the whole of LINQ to Objects, explaining each method (or group of methods) in a blog post. I'm going to try to make the code itself production quality, but I'm not going to include any XML documentation - if I'm already writing up how things work, I don't want to do everything twice. I'll include optimizations where appropriate, hopefully doing better than LINQ to Objects itself.

The approach is going to be fairly simple: for each LINQ method, I'll write some unit tests (most of which I won't show in the blog posts) and make sure they run against the normal .NET implementation. I'll then comment out the using directive for System.Linq, and introduce one for JonSkeet.Linq instead. The tests will fail, I'll implement the methods, and gradually they'll go green again. It's not quite the normal TDD pattern, but it works pretty well.

I will write up a blog entry for each LINQ operator, probably including all of the production code but only "interesting" tests. I'll highlight important patterns as they come up - that's half the point of the exercise, of course.

At the end of each post, I'll include a link to download the "code so far". For the sake of anyone looking at these posts in the future, I'm going to keep the downloads numbered separately, rather than updating a single download over time. I'm hoping that the code will simply grow, but I dare say there'll be some modifications along the way too.

The aim is not to end up with LINQBridge: I'm going to be targeting .NET 3.5 (mostly so I can use extension methods without creating my own attribute) and I'm certainly not going to start worrying about installers and the like. The purpose of all of this is purely educational: if you follow along with these blog posts, with any luck you'll have a deeper understanding of LINQ in general and LINQ to Objects in particular. For example, topics like deferred execution are often misunderstood: looking at the implementation can clarify things pretty well.

Testing

The unit tests will be written using NUnit (just for the sake of my own familiarity). Fairly obviously, one of the things we'll need to test quite a lot is whether two sequences are equal. We'll do this using the TestExtensions class from MoreLINQ (which I've just copied into the project). The netbook on which I'll probably write most of this code only has C# Express 2010 on it, so I'm going to use the external NUnit GUI. I've set this in the project file as the program to start... which you can't do from C# Express directly, but editing the project file is easy, to include this:

<StartAction>Program</StartAction>
<StartProgram>C:\Program Files\NUnit-2.5.7.10213\bin\net-2.0\nunit-x86.exe</StartProgram>

It's a bit of a grotty hack, but it works. The "additional command line parameters" are then set to just JonSkeet.Linq.Tests.dll - the current directory is the bin/debug directory by default, so everything's fine. Obviously if you want to run the tests yourself and you have ReSharper or something similar, you can see the results integrated into Visual Studio.

Although I'm hoping to write reasonably production-level code, I doubt that there'll be as many unit tests as I'd really write against production code. I fully expect the number of lines of test code to dwarf the number of lines of production code even so. There are simply vast numbers of potential corner cases... and quite a few overloads in several cases. Remember that the goal here is to examine interesting facets of LINQ.

Code layout

Just like the real LINQ to Objects, I'll be creating an enormous static Enumerable class... but I'll do so using partial classes, with one method name (but multiple overloads) per file. So Where will be implemented in Where.cs and tested in WhereTest.cs, for example.

First code drop

The first zip file is available: Linq-To-Objects-1.zip. It doesn't contain any production code yet - just 4 tests for Where, so I could check that NUnit was working properly for me. Next stop... implementing Where.

Presentation preparation

As most of you know, I occasionally talk about C# at conferences, user groups or basically anywhere that people won't attack me. A while ago I rejected PowerPoint in favour of a rather less formal approach: hand-drawn slides. Quite a few people have now asked me about how they're prepared - even occasionally making the assumption that my awful handwriting is actually a real font - so I figured it's worth a blog post.

My process is both primitive and complex...

1. Draw the slides

I use plain A4 printer paper and a black flipchart marker to draw the slides. Where I need one bit overlaid on top of another (e.g. to cross something out), I'll draw the two sections separately, to merge later. Given that I'm fiddling around anyway, it doesn't matter if there are extra bits on the paper - so if I make a small mistake, I'll often keep going on the same sheet.

I rarely make a textual plan of my presentations beforehand - I think of roughly where I want to go, and then just draw. Often some slides will be reordered or deleted for the final presentation. I find a good fat pen in my hand helps the creative process.

2. Scan the paper

I have a relatively ancient but still trusty flatbed scanner (a Xerox 2400, if you're interested) which I use to scan everything in. A document scanner would be more convenient - scanning 25 slides individually is very tedious - but I don't really have the space at the moment.

I scan in greyscale at 400dpi, into Paintshop Pro v8.10. It may be 7 years old, but it does the job perfectly adequately. I then re-orient them to landscape mode and save them as BMP files. Yes, it's as daft as it sounds, but necessary for the next step. These files are pretty big - about 15MB - and large in terms of canvas size too... but here's an idea of what they look like (but as a PNG, scaled down significantly):

Slide before alterations

At this point I'm done with Paintshop Pro, so I close it down...

3. Convert into SVG

I used to edit the slides as PNG files in Paintshop Pro, but this had various issues:

  • Colouring was a pain
  • Resizing didn't work terribly well
  • Cutting and pasting shapes wasn't as easy as it might have been

For line drawing like mine, I suspect SVG is the ideal format. I found InkScape as a free tool which works pretty well... but I need to convert the images first. InkScape does actually have a conversion tool built-in, but I've never quite got it to work properly - whereas using potrace directly (I believe InkScape uses potrace internally) means I can batch the conversions and control them as far as I need to.

When I discovered potrace, I was amazed by it. It really does exactly what it says on the tin - it converts bitmaps into SVGs. Unfortunately it's limited in terms of its input formats - hence the BMPs - but that's fine.

For the record, here's the potrace command I use:

potrace -b svg -r 400 -t 8 -O 0.4 -k 0.7 %*

I can't remember what all the arguments mean offhand, but you can find out if you want to :)

4. Edit in InkScape

So, now I have black and white SVGs which I can edit in InkScape. I generally perform the following tweaks:

  • Remove any crud left from my dirty scanner - often there are extra shapes which need deleting
  • Resize everything and rotate shapes - usually to straighten them
  • Apply colour; I usually stick to red, black, blue and green with occasional other colours where necessary. A bit of colour's great, but we're not going for works of art here... and many colours look rubbish on some projectors.

Here's the result of editing the above picture - assuming your browser can handle SVG:

Slide as an SVG

5. Convert to PNG

So now we're done, right? Well, no... because I haven't found any way of presenting nicely from SVGs. Aside from anything else, they take a long time to render, and I really need to be able to flick from slide to slide whenever I want to. However, I've found that PNGs work well as a display format. I use ImageMagick to convert from SVG to PNG usually... although very occasionally it crashes, and I have to use InkScape to export to PNG, which it's perfectly capable of doing. (I use ImageMagick mostly so that I can do it easily in a batch.)

I convert to a 1024x768 (-ish) PNG, so that it won't be too time-consuming to render, and I don't need to worry about scaling it. (Most projectors I've used have been that size.) Here's the PNG for the same image again - shrunk to 600x436 just to avoid taking up too much room:

Final converted slide

Even if you couldn't see the SVG before, you should now be able to spot that not only is the image coloured in (and with a lot better contrast than the original) but I've tweaked the man to be falling instead of flying. (The aim of the slide is to indicate that C# should push you into the pit of success.)

6. Present!

At this point, I have three directories: one of the original BMPs, one of the SVGs, and one of the PNGs. I could then create a PowerPoint presentation consisting of all of the images... but I've found that PowerPoint takes an awfully long time to render images, and it would also be a bit of a pain to add them in the first place.

Instead, I use FastStone Image Viewer to present a slideshow of the images. I've tweaked a few options to make the images appear nice and quickly against a white background. It's easiest if the image files are in alphabetical order, so I tend to go for filenames such as "010 Introduction.png" - always iniitially leaving 10 between each number, so that I can easily slot in an extra slide or two if necessary.

So that's my process... four pieces of free software and one very old commercial one, a fair amount of time, and a slightly odd result.

Looking to the future...

Obviously this isn't an ideal process. I now have a Wacom Bamboo Fun tablet, and the hope is that eventually I'll be able to draw with it directly into InkScape, cutting out huge amounts of the process. However, so far I'm completely inept with it. If you think my drawings are bad and my handwriting is ridiculous when I'm using a pen and paper, you should see it with the tablet... but maybe that will change. I'd like to think so.

I suspect my style may change over time too, although I'm not sure in what way. Different presentations already get different treatment - for some talks I work entirely in Visual Studio, occasionally breaking to talk without coding, but without any slides... for others I'll have slides but no code at all. If you ever happen to see me present though, please give me feedback afterwards - good or bad. It's an area I'd love to improve on, and feedback is the best way of that happening.

Posted by skeet | 13 comment(s)

Don't let this get away

Josh Twist asked me this via Twitter:

is it possible to invoke a member before a ctor is finished (eg maybe using threaded IL trickery) or is this forbidden somehow? :D

Now I don't know why everyone seems to think I enjoy writing code which could have bizarre effects on either you, the compiler, the resulting execution or your co-workers... but it's an interesting topic to look at, anyway.

The perils of partially constructed objects

Hopefully it's reasonably obvious why it's dangerous to access a member before it's been properly constructed - but it may be worse than you've considered.

In particular, immutable types are only immutable after they're fully constructed. It's entirely reasonable for an immutable type to change read-only fields several times during the course of initialization. The fields can only be set in the constructor itself (or a variable initializer for the field) but this can occur several times. If the constructor for the immutable type exposes the instance it's constructing to other code, all the immutability guarantees go out of the window.

Even in mostly-mutable types, code may well assume that it's dealing with some fixed aspects. For example, you may have some database entity type which is either freshly created with a random GUID, or created from an existing record with an ID from the database. In either case, code consuming this type wouldn't expect to see an ID of Guid.Empty, or for the ID to change after it's been observed... even if other properties of the object can be changed later.

What C# does to protect you

C# as a language (plus conforming compiler, of course) protects you from some of this.

When you chain to another constructor, you can't use this to calculate any arguments you want to pass to the other constructor. The code is clearly dealing with a partially constructed object at this point - it knows none of your constructor body has been executed - so it's protecting you from harm. Unfortunately this means you can't even call this.GetType(), which can make it tricky to write objects which populate themselves using reflection.

During the constructor body, you have complete access to this of course - you have to, in order to set any state within the object. This is where things can get nasty.

Virtual methods

One way in which Java, C# and C++ diverge in their constructor behaviour is with regard to virtual methods:

  • In C++, the object only really "becomes" an instance of the subclass when the subclass constructor has been executed, so calling a virtual method will only execute the override in the "current" type hierarchy.
  • In Java, the object is of the final type from the start, so the most deeply overridden implementation of the virtual method is called - but this will occur without any initialization having taken place. All fields will still have their default values (null, 0 etc).
  • C# is like Java, except that variable initializers will have been executed (as they're executed before the base class constructor is called). In other words, initialization within the constructor body won't have taken place, but any fields which are initialized as part of the declaration will have their appropriate values.

This is really dangerous if you're not aware of it. In particular, any time you override a virtual method in Java or C#, you need to know whether it might be called in a partially-initialized state.

Wherever possible, try not to call virtual methods from the constructor for precisely this reason. I would advise that if you absolutely have to do it (I failed to remove this behaviour when porting Joda Time to Noda Time, for example) you document that fact very heavily and make sure that you don't call the method in any other place. Make it protected, too. Basically it should only be part of initialization. If you need similar behaviour at other times, create another method. This allows derived classes to tailor their implementation to the expected state at the time of invocation.

Callbacks

You may be thinking that this is all easy: just avoid virtual methods, don't do anything stupid like setting a static variable to this during a constructor (making it visible from other threads before initialization is completed) and you'll be fine.

Well, I suspect that almost every Windows Forms app in existence publishes this during the constructor. Any time you have an event handler, that's effectively providing a callback... and if that's an instance method, it's tied to the relevant instance, usually this.

How sure are you - really, really sure - that none of those event handlers will fire as part of the rest of the initialization? For example, if you use Visual Studio to hook up the ControlAdded event for a WinForms form, and also add a bunch of controls to the form... when is that event going to fire? Will the autogenerated code add the event handler after it adds the controls, or before? If it adds the handler at the start, then clearly the method handling the event will be called before your constructor finishes... so you need to be ready for that.

How much of a problem is this really?

Like many matters of purity, I suspect this is usually more of a theoretical issue than a practical one. In complicated situations like the Windows Forms one above, most event handlers are likely to be fired after initialization... and there's typically not as strong a sense of invariants being set up as there would be in an immutable data type, for example.

Immutable data types, in turn, are less likely to accidentally let this escape during construction... but the consequences of them doing so are much more severe, of course.

Conclusion

To answer Josh's question: Yes. At least on the simplest reading of the question: members can certainly end up being invoked on an object during its construction. They can potentially end up being invoked on multiple threads during construction. This is basically under the control of the constructors in the type hierarchy though.

In particular, I believe that the .NET memory model is stricter than the ECMA specification in terms of threading: I believe a constructor will have completed (and all its writes retired) before the reference returned by the constructor can be published to another variable, which was a concern in double-checked locking. It's a valid concern to consider though.

Alternative conclusion: almost nothing is really as simple as it appears to be.

Posted by skeet | 13 comment(s)
Filed under:

Code and data

In a recent Stack Overflow question, I answered a question which started off with a broken XPath expression by suggesting that that poster might be better off using LINQ to XML instead. The discussion which followed in the comments (around whether or not this was an appropriate answer) led me to think about the nature of code and data, and how important context is.

I don't think there's any particularly deep insight in this post - so I'll attempt to keep it relatively short. However, you might like to think about how code and data interact in your own experience, and what the effects of this can be.

Code is data

Okay, so let's start off with the obvious: all code is data, at some level. If it's compiled code, it's just binary data which a machine can execute. Put it on another machine with no VM, and there's nothing remarkable about it. It's just a load of 1s and 0s. As source code, most languages are just plain text. Open up some source code written in C#, Ruby, Python, Java, C++ etc in Notepad and it'll be readable. You may miss the syntax highlighting and so forth, but it's still just text.

Code in the right context is more than just data

So what makes this data different to (say) a CSV file or a plain text story? It's all in the context. When you load it into the right editor, or pass it to the right compiler, you get more information: in an editor you may see the aforementioned syntax highlighting, autocompletion, documentation for members you're using; a compiler will either produce errors or a binary file. For something like Python or Ruby, you may want to feed the source into an interpreter instead of a compiler, but the principle is the same: the data takes on more meaning.

Code in the wrong code-related context is just data again

Now let's think about typical places where you might put code (or something with similar characteristics) into the "wrong" context:

  • SQL statements
  • XSLT transformations
  • XPath expressions
  • XML or HTML text
  • Regular expressions

All of these languages have editors which understand them, and will help you avoid problems. All of these are also possible to embed in other code - C#, for example. Indeed, almost all the regular expressions I've personally written have ended up in Java or C# code. At that point, there are two problems:

  • You may want to include text which doesn't embed easily within the "host" language's string literals (particularly double quotes, backslashes and newlines)
  • The code editor doesn't understand the additional meaning to the text

The first problem is at least somewhat mitigated by C#'s support for verbatim string literals - only double quotes remain as a problem. But the second problem is the really big one. Visual Studio isn't going to check that your regular expression or XPath expression looks valid. It's not going to give you syntax highlighting for your SQL statement, much less IntelliSense on the columns present in your database. Admittedly such a thing might be possible, if the IDE looked ahead to find out where the text was going to be used - but I haven't seen any IDE that advanced yet. (The closest I've seen is ReSharper noticing when you're using a format string with the wrong number of parameters - that's primitive but still really useful.)

Of course, you could write your SQL (or XPath etc) in a dedicated editor, and then either copy and paste it into your code or embed it into your eventual binary and load it at execution time. Neither of these is particularly appealing. Copy and paste works well once, but then when you're reading or modifying the code you lose the advantages you had unless you copy and paste it again. Embedding the file can work well in some cases - I use it liberally for test data in unit tests, for example - but I wouldn't want it all over production code. It means that when reading the code, you have to refer to the external resource to work out what's going to happen. In some cases that's not too bad - it's only like opening another class or method, I guess - but in other cases the shift of gears is too distracting.

When code is data, it's easy to mix it with other data - badly

Within C# code, it's easy to see the bits of data which sometimes occur in your code: string or numeric literals, typically. Maybe you subscribe to the "no magic values" philosophy, and only ever have literals (other than 0 or 1, typically) as values for constants. Well, that's just a level of indirection - which in some ways hides the fact that you've still got magic values. If you're only going to use a piece of data once, including it directly in-place actually adds to readability in my view. Anyway, without wishing to dive into that particular debate too deeply, the point is that the compiler (or whatever) will typically stop you from using that data as code - at least without being explicit about it. It will make sure that if you're using a value, it really is a value. If you're trying to use a variable, it had better be a variable. Putting a variable name in quotes means it's just text, and using a word without the quotes will make the compiler complain unless you happen to have a variable with the right name.

Now compare that with embedding XPath within C#, where you might have:

var node = doc.SelectSingleNode("//foo/bar[@baz=xyz]");

Now it may be obvious to you that "xyz" is meant to be a value here, not the name of an attribute, an element, a function or anything like that... but it's not obvious to Visual Studio, which won't give you any warnings. This is only a special case of the previous issue of invalid code, of course, but it does lead onto a related issue... SQL injection attacks.

When you've already got your "code" as a simple text value - a string literal containing your SQL statement, as an obvious example - it's all too easy to start mixing that code/data with genuine data data: a value entered by a user, for example. Hey, let's just concatenate the two together. Or maybe use a format string, effectively mixing three languages (C#, SQL, the primitive string formatting "language" of string.Format) into a single statement. We all know the results, of course: nothing differentiates between the code/data and the genuine data, so if the user-entered value happens to look like SQL to drop a database table, we end up with Little Bobby Tables.

I'm sure 99% of my blog readers know the way to avoid SQL injection attacks: use parameterized SQL statements. Keep the data and the code separate, basically.

Expressing the same ideas, but back in the "native" language

Going back to the start of all this, the above is why I like LINQ to XML. When I express a query using LINQ to XML, it's often a lot longer than it would have been in the equivalent XPath - but I can tell where the data goes. I know where I'm using an element name, where I'm using an attribute name, and where I'm comparing or extracting values. If I miss out some quotes, chances are pretty high that the resulting code will be invalid, and it'll be obvious where the problem is. I'm prepared to sacrifice brevity for the fact that I only work in a single language + library, instead of trying to embed one language within another.

Likewise building XML using LINQ to XML is much better than concatenating strings - I don't need to worry about any nasty escaping issues, for example. LINQ to XML has been so nicely design, it makes all kinds of things incredibly easy.

Regular expressions can sometimes be replaced by simple string operations. Where they can, I will often do so. I'd rather use a few IndexOf and Substring calls over a regular expression in general - but where the patterns I need get too tricky, I will currently fall back to regular expressions. I'm aware of ReadableRex but I haven't looked at it in enough detail to say whether it can take the place of "normal" regular expressions in the way that LINQ to XML can so often take the place of XPath.

Of course, LINQ to SQL (and the Entity Framework) do something similar for SQL... although that's slightly different, and has its own issues around predictability.

In all of these cases, however, the point is that by falling back to more verbose but more native-feeling code, some of the issues of embedding one language within another are removed. Code is still code, data is data again, and the two don't get mixed up with each other.

Conclusion

If I ever manage to organize these thoughts in a more lucid way, I will probably just rewrite them as another (shorter) post. In the meantime, I'd urge you to think about where your code and data get uncomfortably close.

Posted by skeet | 11 comment(s)
Filed under: ,

Writing the perfect question

Update: now that I've actually posted this, I've added a tinyurl to it for easy reference: http://tinyurl.com/so-hints. Nice and easy to remember for comments :)

A while ago, I wrote a blog entry on how to answer questions helpfully on sites like Stack Overflow. Recently I saw a meta question about bad questions and thought it would be worth following up with another blog post on asking questions. For the sake of convenience - and as Stack Overflow is so popular - I will assume the question is going to be asked on Stack Overflow or a similar Stack Exchange site. Most of the post doesn't actually depend on that, but if you're asking elsewhere you may need to tweak the advice a little.

There are plenty of similar resources around, of course - in particular, Eric Raymond's How to Ask Questions the Smart Way is a perennial favourite. Still, I think I can bring something to the table.

The Golden Rule: Imagine You're Trying To Answer The Question

If you don't remember anything else from this post, remember this bit. Everything else follows from here. (And yes, this does smack somewhat of Matthew 7:12.)

Once you've finished writing your question, read it through. Imagine you were coming to it fresh, with no context other than what's on the screen. Does it make sense? Is it clear what's being asked? Is it easy to read and understand? Are there any obvious areas you'd need to ask about before providing an answer? You can usually do this pretty well however stuck you are on the actual question. Just apply common sense. If there's anything wrong with the question when you're reading it, obviously that will be a problem for whoever's actually trying to answer it. So fix the problems. Improve the question until you can read it and think, "If I only knew the answer to the question, it would be a pleasure to provide that answer." At that point, post and wait for the answers to come rolling in.

Obviously this is somewhat easier to do if you have a certain amount of experience answering questions, particularly on the forum where you're about to post. So, what should you be looking out for?

Context

In most cases, anyone answering the question will need to know what language and platform you're using. The basics should usually be communicated through tags, but it may very well be worth providing more information:

  • Language version (e.g. C# 4)
  • Platform version (e.g. .NET 3.5; note that this isn't always implicit from the language version, or vice versa)
  • Operating system, if it could be relevant (e.g. particular permissions issues)
  • Any other relevant software (e.g. database type and version, the IDE you're using, web server you're connecting to)
  • Any other constraints. This is particularly important. It's really annoying to give a perfectly good answer to a question, only to be told that you're not allowed to use feature X or Y which provide the obvious solution.
    • If you have unusual constraints, it's worth explaining why. Not only does this answer the obvious follow-up comment, but it also gives more information about what other solutions may not be applicable.

Describe what you've already tried and the results of any research. (You have searched for a solution to your problem before asking it, haven't you? Stack Overflow isn't meant to replace basic search skills.) Often there will be other people in a similar situation, but the answers didn't quite match your situation. Just like the above point about unusual constraints, it saves time if you can point out differences between your situation and other common ones. It's even worth referring to other related questions explicitly - particularly if they're on the same forum. Aside from anything else, this shows a certain amount of "due diligence" - people are generally more willing to help you if can show you've already put some effort in.

You should absolutely make sure that you tag the question appropriately. If you're not sure which exact tags are appropriate, see what gets auto-suggested and look at samples for each one. If that sounds like a lot of work, just remember how much time you may be able to save yourself in the long run. It gets easier over time, of course.

Problem statement

Make sure it's obvious what you're trying to get out of the question. Too many "questions" are actually just statements: when I do X, something goes wrong.

Well, what did you expect it to do? What are you trying to accomplish? What have you already tried? What happened in those attempts? Be detailed: in particular, if something didn't work, don't just state that: tell us how it failed. If it threw an exception, what was the exception? (Don't just give the type - give the error message and say which line threw it. If there's a nested exception, post that too.)

If at all possible, write a sort of "executive summary" at the start of your question, followed by a more detailed description. Remember that on the list of questions, the first few sentences will appear as a snippet. If you can get a sense of the question across in that snippet, you're more likely to attract views from people who can answer the question.

One trap that many posters fall into is to ask how to achieve some "small" aim, but never say what the larger aim is. Often the smaller aim is either impossible or rarely a good idea - instead, a different approach is needed. Again, if you provide more context when writing your problem statement, we can suggest better designs. It's fine to specify how you're currently trying to solve your bigger problem, of course - that's likely to be necessary detail - but include the bigger goal too.

Sample code and data

I may be biased on this one. I'm a huge believer in sample code, both for questions and answers... and I probably use it in an unconventional way. I usually paste it into a text editor, and try to compile it from the command line. If that's not likely to work (and the problem isn't obvious by inspection), I'm unlikely to bother too much with it. Firing up Eclipse or Visual Studio and finding an existing project I don't care about or starting a new one is going to take much more time.

That means if you want me to look at code, it should:

  • Be standalone. Don't try to talk to a database unless you really have to. (Obviously for database questions, that's kinda hard :) If you use sample XML, provide a short but complete XML file for us to reproduce the issue with. (And the same for other file types, obviously.)
  • Be complete. If there are missing imports or using directives, that's really annoying
  • Actually compile (unless the compilation error is the reason for the question). Don't give me code which is "something like" the real code but which clearly isn't the real code, and may well not exhibit the same symptoms by the time I've fixed it so that it compiles.
  • Ideally not bring up a UI. Unless your code is about a UI issue, don't bring one up. Console apps are simpler, and simplicity is a huge benefit when trying to hunt down a problem.
  • Demonstrate the problem. You should be able to say, "I expected the result to be X, it's actually Y." (You should actually say that too, so that we can check that we get the same results.)
  • Be as short as possible. If I have to wade through hundreds of lines of code to find the problem, I'm doing work that you should be doing. Often if you work hard to reduce the problem to a short but complete program, you'll find the issue yourself. You can absolutely do this without knowing what the problem is; you should be looking to the community for their expertise, not their willingness to spend time on your problem doing the work that you can do yourself.

Yes, this is a relatively onerous list. It doesn't all apply to every problem, but it does apply in a great many situations. While I get put off by reams of irrelevant, badly formatted code, some of which clearly won't compile, the inverse is true as well: if I can tell by looking at the question that the code can go through a copy/paste/compile/run cycle really quickly, I'm much more likely to pay the question significant attention.

In data-oriented questions, it's very often helpful to give some sample data. Cut out anything irrelevant (if your real table has 50 columns, you only need to include relevant ones) but make sure that you give enough sample input for it to be meaningful. For example, if you're trying to group some data by a PersonID column, it's pretty useless if there's only one PersonID given, or if each PersonID only appears once. If you are giving examples of expected input and output, make sure it's clear why that's the expected output. Often I see questions which give a small number of samples, and there are various ways they could be interpreted. This is one area where it's particularly important to reread the question from a stranger's point of view: while a brief summary of the desired results may well make sense to someone who already knows what your app is trying to achieve, it may be gobbledygook to those trying to answer your question.

Spelling, grammar and formatting

I know not everyone speaks English natively. My own command of non-English languages is lamentably poor - I'm incredibly lucky that my native tongue happens to be the lingua franca of the technical world. However, if you're trying to communicate on an English-language forum, you owe it to yourself to make an effort to write at least reasonably correct English.

  • Please use capital letters where appropriate. It really can make a big difference in the readability of text.
  • Please split your text into paragraphs. Imagine this blog post as one big paragraph - it would be almost impossible to read.
  • Please write actual words. There are undoubtedly some abbreviations which are acceptable to most readers - IMO, IIRC etc -  there's no reason to switch into text-speak with "gr8", "bcoz", "u" and so forth. It's unlikely that you're actually writing your question on a phone with only a primitive keyboard; show your readers respect by writing properly. It may take you a few more seconds, but if it means you get an answer quicker, it's surely worth the extra effort.
  • Most browsers have built-in spelling checkers these days, or at least have plug-ins or extensions available to check your text. Technical text often creates a lot of false positives for checkers, but if your spelling isn't generally great, it's worth looking at the suggestions.

Having said all of this, you're not trying to create a literary masterpiece. You're trying to communicate your question as effectively as possible. If you're faced with the choice between an unambiguous but ugly sentence, or a phrase which stirs the soul but leaves the reader confused about exactly what you mean, go for the unambiguous option every time.

One way a huge number of questions can be improved with very little effort is simply formatting them properly. Stack Overflow's markdown editor is very good - the preview below your input box is almost always accurate in terms of the eventual result, and you can always edit the question later if anything doesn't quite work. The exact details of the markdown is beyond the scope of this article - Stack Overflow has a detailed guide though - if you're new to the site, I'd recommend you at least skim through it.

By far the most important kind of formatting is making code look like code. Within a text paragraph, simply surround the code with backticks `like this`. For blocks of code, just indent everything by four spaces. If you're cutting and pasting code, it may already be indented (for example if you're copying code within a class) but if not, the easiest way to indent everything is to paste it (with a blank line between the text before the code and the code itself, select the whole code block, and then press Ctrl-K or the 101/010 button just above the editor.

One of the important things about code formatting is that it means angle brackets (and some other symbols) are preserved instead of being swallowed by the markdown formatter. In some cases this can mean all the difference between a question which is easy to answer and one which doesn't make any sense, particularly in terms of generics in Java and C# or templates in C++. For example, like this

Why can't I convert an expression of type List<string> to List<object>?

makes no sense at all if the type arguments are removed:

Why can't I convert an expression of type List to List?

Often experienced members of the site will recognise what's going on and edit your question for you, but obviously it's better if they don't have to.

Making a good impression

Leaving aside the main body of the question, there are a few simple ways to get the community "on your side" and therefore more likely to give you a useful answer quickly.

  • Register as a user and give yourself a meaningful name. It doesn't have to be your real name, but frankly names like "Top Coder" or "Coding Guru" look pretty silly when you're asking a question which others find simple. That's still better than leaving yourself as "user154232" or whatever identifier is assigned to you by default though. Aside from anything else, it shows a certain amount of commitment to the question and/or site: if you've bothered to give yourself a name, you're less likely to be an "ask-and-run" questioner.
  • Keep an eye on your question. There may well be requests for clarification - and of course, answers! If you receive an answer which wasn't quite what you were looking for, explain carefully (and politely) why it's not suitable for your purposes. Consider going back and editing your question to make it clearer for subsequent users.
  • Don't add your own answer unless it really is an answer. Often users add extra details in an "answer" when they should really have just edited their question. Likewise editing your question is generally a better idea than adding a long comment to an existing answer - particularly if that comment contains a block of code (which won't work well in a comment). If you do change the question in response to an answer though, it's worth adding a comment to the answer just to let the user know that you've updated it though... you may well find they quickly edit their answer to match the revised question.
  • There's no need to include greetings and sign-offs such as "Hi everyone!" and "Thanks - hope to get an answer soon" in the question. These will often be edited out by other users, as they're basically a distraction. Greetings at the start of a question are particularly useless as they can take up valuable space in the snippet displayed in the question list.
  • Above all, be polite. Remember that no-one is getting paid to answer your question. Users are giving up their time to help you - so please be appreciative of that. If you're asking a homework question, explain why you're asking for help with something that traditionally you'd have to answer all by yourself. If a user suggests that your general approach is wrong and that there's a better way of doing things, don't take it personally: they're trying to help you improve your code. By all means disagree robustly, but don't start into ad hominem arguments. (This advice applies to answerers as well, of course.)
  • (Somewhat specific to Stack Overflow.) If an answer is particularly helpful or solves your problem, accept it by clicking on the tick mark by it. After you've asked a certain number of questions, an accept rate will be shown by your username in the question. Some users get annoyed with those with low accept rates (i.e. people who ask a lot of questions but rarely accept answers). Personally I'm not overly bothered by accept rates, but accepting an answer is still a generally good thing to do if it solved your problem. I certainly wouldn't suggest accepting answers just to get your accept rate, if they didn't help you.

Conclusion and feedback

Stack Overflow is an amazing resource (along with other Q&A sites, of course). The idea that you can get a good answer to a wide range of questions within minutes is pretty staggering... but there's an obvious correlation between the quality of a question and the likelihood that you'll get quick, helpful answers. Put that extra bit of effort in yourself, and it will probably pay for itself very quickly.

I'm hoping to keep this blog post up to date with suggestions received - if I've missed out anything, over- or under-emphasized a specific point, or generally gone off track, let me know either in the comments here or mail me (skeet@pobox.com). If this document ends up elsewhere, then that copy may end up being the "canonical" one which is edited over time - in which case I'll indicate that here.

Posted by skeet | 22 comment(s)
Filed under: ,

Reflecting on presentation skills: The Guathon, August 13th 2010

(I apologise for the unstructured nature of this post. I honestly don't know how to structure it. I've thought of a few ways of breaking it up by heading, and none of them really work. Particular apologies to Simon Stewart, who has requested more brevity in my blog. Just for Simon, the executive summary is: Scott Guthrie is a really good speaker. I want to be more like him.)

Yesterday I had the good fortune (well, good friends - thanks Phil!) to attend the Guathon in London. This was a free, day-long event with Scott Guthrie and Mike Ormond, talking about MVC 2 and 3, Visual Studio 2010, Windows Phone 7 and more. This was my encounter with Scott - and indeed the first time I'd seen him present. (I value videos of presentations, but rarely find time to actually watch them, more's the pity.)

Obviously I was interested in hearing about the technologies they were talking about, but I confess I was more interested in watching how Scott went about presenting. (I've seen Mike present before - but clearly Scott was the "big name" here. No offence meant to Mike whatsoever, who did a great job talking about Windows Phone 7.) Scott is a legend in the industry, and as I'm very interested in improving my public speaking skills, I felt I had at least as much to learn in that area as anything else.

I was really impressed. In some ways, Scott didn't present in a way I'd expected him to... but what he did was so much better. Not having seen him before, I'd sort of expected an utterly polished sort of talk - almost like a Steve Jobs presentation. I was hoping to get some insight into what sort of polish I could add to my presentations: where does it make sense to have photos, where do simpler visuals work, where are words important? How do you present against an enormous screen without losing the audience's focus? Do jokes enhance a presentation or detract from it?

In retrospect, this was hopelessly naïve. I think Scott's secret sauce is actually pretty simple: he knows what he's talking about, and talks about it honestly and openly. He's completely authentic, obviously passionate about what he does, good humoured (we had a few bits of mild Google/Bing banter), and interested in the audience.

At almost every turn, Scott asked the audience how many of us had used a certain feature, or developed in a certain way. This was then reflected in the level at which he pitched the next section, as well as giving a few opportunities for jokes. There were questions throughout - particularly in Mike's talk, actually - to the extent that I'd say a good quarter to a third of the time was spent answering the audience. This was a very good thing, in my view - I can't remember finding any of the questions irrelevant or obvious (I should state for the record that I probably asked more questions than anyone else; apologies if other attendees found my questions to be boring). Questions from the audience are always a good reality check, because they're clearly addressing real concerns rather than the ones in the speaker's imagination. But the best thing about the questions was Scott's way of answering, which could broadly be divided into three types of answer:

  • A known answer: "Yes, you can do X - and you can do Y as well. But you can't do Z."
  • An unknown answer which was easily testable: "Hmm. I'm not sure. Let's try it. Ah yes, the code does X." (There were fewer of these, just due to the nature of the questions.)
  • An answer which was unknown but needed further investigation: "Send me a mail and I'll get back to you about it."

The last one is most interesting - because I have absolutely no doubt that Scott will get back to anyone who sent him a mail. (I've sent him two.) Now don't forget that Scott is a Corporate Vice President (Dev Div). He's clearly a busy man... but his openness and passion make an enormously positive impression, suggesting that he's the kind of guy who doesn't think of himself as being above such questions. Assuming this is what he says at all his presentations (and I suspect it is), I dread to think how much time he spends every day answering emails... but I also suspect that it's of enormous benefit to the products for which he's responsible, by keeping the executive level in touch with grass-roots developers.

So, what have I learned from the whole experience, in terms of presentation skills?

  • You can definitely give awesome presentations without fancy graphics. Content is king.
  • There's no substitute for knowing your stuff, and being honest about when you don't know the answer.
  • Interaction with the audience is beneficial to everyone.
  • Sitting down and just writing out code - particularly with audience participation to make the demo "belong" to them - is absolutely fine.
  • Scott's an incredibly nice guy, and it shines through very clearly. I really hope to see him again soon.
  • If you speak clearly, speed doesn't matter too much: Scott talks really fast, but is very easy to listen to.
  • If you lose a vital file in the middle of a presentation, check the recycle bin. It's the virtual equivalent of checking down the back of the sofa.
  • Don't worry if you have more material than you have time to present, particularly if that's due to audience questions.

Whether I'll be able to apply this myself remains to be seen... although I've already been acutely aware of how much more comfortable I am when presenting on "home topics" (e.g. C# language features) than areas where I have a lot less expertise (e.g. Reactive Extensions).

Posted by skeet | 4 comment(s)
Filed under: ,

There's a hole in my abstraction, dear Liza, dear Liza

I had an interesting day at work today. I thought my code had broken... but it turns out it was just a strange corner case which made it work very slowly. Usually when something interesting happens in my code it's quite hard to blog about it, because of all the confidentiality issues involved. In this case, it's extremely easy to reproduce the oddity in an entirely vanilla manner. All we need is the Java collections API.

I have a set - a HashSet, in fact. I want to remove some items from it... and many of the items may well not exist. In fact, in our test case, none of the items in the "removals" collection will be in the original set. This sounds - and indeed is - extremely easy to code. After all, we've got Set<T>.removeAll to help us, right?

Let's make this concrete, and look at a little test. We specify the size of the "source" set and the size of the "removals" collection on the command line, and build both of them. The source set contains only non-negative integers; the removals set contains only negative integers. We measure how long it takes to remove all the elements using System.currentTimeMillis(), which isn't the world most accurate stopwatch but is more than adequate in this case, as you'll see. Here's the code:

import java.util.*;

public class Test
{
    public static void main(String[] args)
    {
        int sourceSize = Integer.parseInt(args[0]);
        int removalsSize = Integer.parseInt(args[1]);
        
        Set<Integer> source = new HashSet<Integer>();
        Collection<Integer> removals = new ArrayList<Integer>();
        
        for (int i = 0; i < sourceSize; i++)
        {
            source.add(i);
        }
        for (int i = 1; i <= removalsSize; i++)
        {
            removals.add(-i);
        }
        
        long start = System.currentTimeMillis();
        source.removeAll(removals);
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end - start) + "ms");
    }
}

Let's start off by giving it an easy job: a source set of 100 items, and 100 to remove:

c:\Users\Jon\Test>java Test 100 100
Time taken: 1ms

Okay, so we hadn't expected it to be slow... clearly we can ramp things up a bit. How about a source of one million items1 and 300,000 items to remove?

c:\Users\Jon\Test>java Test 1000000 300000
Time taken: 38ms

Hmm. That still seems pretty speedy. Now I feel I've been a little bit cruel, asking it to do all that removing. Let's make it a bit easier - 300,000 source items and 300,000 removals:

c:\Users\Jon\Test>java Test 300000 300000
Time taken: 178131ms

Excuse me? Nearly three minutes? Yikes! Surely it ought to be easier to remove items from a smaller collection than the one we managed in 38ms? Well, it does all make sense, eventually. HashSet<T> extends AbstractSet<T>, which includes this snippet in its documentation for the removeAll method:

This implementation determines which is the smaller of this set and the specified collection, by invoking the size method on each. If this set has fewer elements, then the implementation iterates over this set, checking each element returned by the iterator in turn to see if it is contained in the specified collection. If it is so contained, it is removed from this set with the iterator's remove method. If the specified collection has fewer elements, then the implementation iterates over the specified collection, removing from this set each element returned by the iterator, using this set's remove method.

Now that sounds reasonable on the surface of it - iterate through the smaller collection, check for the presence in the bigger collection. However, this is where the abstraction is leaky. Just because we can ask for the presence of an item in a large collection doesn't mean it's going to be fast. In our case, the collections are the same size - but checking for the presence of an item in the HashSet is O(1) whereas checking in the ArrayList is O(N)... whereas the cost of iterating is going to be the same for each collection. Basically by choosing to iterate over the HashSet and check for presence in the ArrayList, we've got an O(M * N) solution overall instead of an O(N) solution. Ouch. The removeAll method is making an "optimization" based on assumptions which just aren't valid in this case.

Then fix it, dear Henry, dear Henry, dear Henry

There are two simple ways of fixing the problem. The first is to simply change the type of the collection we're removing from. Simply changing ArrayList<Integer> to HashSet<Integer> gets us back down to the 34ms range. We don't even need to change the declared type of removals.

The second approach is to change the API we use: if we know we want to iterate over removals and perform the lookup in source, that's easy to do:

for (Integer value : removals)
{
    source.remove(value);
}

In fact, on my machine that performs slightly better than removeAll - it doesn't need to check the return value of remove on each iteration, which removeAll does in order to return whether or not any items were removed. The above runs in about 28ms. (I've tested it with rather larger datasets, and it really is faster than the dual-hash-set approach.)

However, both of these approaches require comments in the source code to explain why we're not using the most obvious code (a list and removeAll). I can't complain about the documentation here - it says exactly what it will do. It's just not obvious that you need to worry about it, until you run into such a problem.

So what should the implementation do? Arguably, it really needs to know what's cheap in each of the collections it's dealing with. The idea of probing for performance characteristics before you decide on a strategy is completely anathema to clean abstraction we like to consider with frameworks like Java collections... but maybe in this case it would be a good idea.


1 Please perform Dr Evil impression while reading this. I'm watching you through your webcam, and I'll be disappointed if I don't see you put your little finger to your mouth.

Posted by skeet | 15 comment(s)

Iterate, damn you!

Do you know the hardest thing about presenting code with surprising results? It's hard to do so without effectively inviting readers to look for the trick. Not that that's always enough - I failed the last of Neal and Eric's C# puzzlers at NDC, for example. (If you haven't already watched the video, please do so now. It's far more entertaining than this blog post.) Anyway, this one may be obvious to some of you, but there are some interesting aspects even when you've got the twist, as it were.

What does the following code print?

using System;
using System.Collections.Generic;

public class WeirdIterators
{
    static void ShowNext(IEnumerator<int> iterator)
    {
        if (iterator.MoveNext())
        {
            Console.WriteLine(iterator.Current);
        }
        else
        {
            Console.WriteLine("Done");
        }
    }
    
    static void Main()
    {
        List<int> values = new List<int> { 1, 2 };
        using (var iterator = values.GetEnumerator())
        {
            ShowNext(iterator);
            ShowNext(iterator);
            ShowNext(iterator);
        }
    }
}

If you guessed "1, 2, Done" despite the title of the post and the hints that it was surprising, then you're at least brave and firm in your beliefs. I suspect most readers will correctly guess that it prints "1, 1, 1" - but I also suspect some of you won't have worked out why.

Let's look at the signature of List<T>.GetEnumerator(). We'd expect it to be

public IEnumerator<T> GetEnumerator()

right? That's what IEnumerable<T> says it'll look like. Well, no. List<T> uses explicit interface implementation for IEnumerable<T>. The signature actually looks like this:

public List<T>.Enumerator GetEnumerator()

Hmm... that's an unfamiliar type. Let's have another look in MSDN...

[SerializableAttribute]
public struct Enumerator : IEnumerator<T>, 
    IDisposable, IEnumerator

(It's nested in List<T> of course.) Now that's odd... it's a struct. You don't see many custom structs around, beyond the familiar ones in the System namespace. And hang on, don't iterators fundamentally have to be mutable.

Ah. "Mutable value type" - a phrase to strike terror into the hearts of right-headed .NET developers everywhere.

So what's going on? If we're losing all the changes to the value, why is it printing "1, 1, 1" instead of throwing an exception due to printing out Current without first moving?

Well, we're fetching the iterator into a variable of type List<int>.Enumerator, and then calling ShowNext() three times. On each call, the value is boxed (creating a copy), and the reference to the box is passed to ShowNext().

Within ShowNext(), the value within the box changes when we call MoveNext() - which is how it's able to get the real first element with Current. So that mutation isn't lost... until we return from the method. The box is now eligible for garbage collection, and no change has been made to the iterator variable's value. On the next call to ShowNext(), a new box is created and we see the first item again...

How can we fix it?

There are various things we can do to fix the code - or at least, to make it display "1, 2, Done". We can then find other ways of breaking it again :)

Change the type of the values variable

How does the compiler work out the type of the iterator variable? Why, it looks at the return type of values.GetEnumerator(). And how does it find that? It looks at the type of the values variable, and then finds the GetEnumerator() method. In this case it finds List<int>.GetEnumerator(), so it makes the iterator variable type List<int>.Enumerator.

If suppose just change values to be of type IList<int> (or IEnumerable<int>, or ICollection<int>):

IList<int> values = new List<int> { 1, 2 };

The compiler uses the interface implementation of GetEnumerator() on List<T>. Now that could return a different type entirely - but it actually returns a boxed List<T>.Enumerator. We can see that by just printing out iterator.GetType().

So if it's just returning the same value as before, why does it work?

Well, this time we're boxing once - the iterator gets boxed on its way out of the GetEnumerator() method, and the same box is used for all three calls to ShowNext(). No extra copies are created, and the changes within the box don't get lost.

Change the type of the iterator variable

This is exactly the same as the previous fix - except we don't need to change the type of values. We can just explicitly state the type of iterator:

using (IEnumerator<int> iterator = values.GetEnumerator())

The reason this works is the same as before - we box once, and the changes within the box don't get lost.

Pass the iterator variable by reference

The initial problem was due to the mutations involved in ShowNext() getting lost due to repeated boxing. We've seen how to solve it by reducing the number of boxing operations down to one, but can we remove them entirely?

Well yes, we can. If we want changes to the value of the parameter in ShowNext() to be propagated back to the caller, we just need to pass the variable by reference. When passing by reference the parameter and argument types have to match exactly of course, so we can't leave the iterator variable being type List<T>.Enumerator without changing the parameter type. Now we could explicitly change the type of the parameter to List<T>.Enumerator - but that would tie our implementation down rather a lot, wouldn't it? Let's use generics instead:

static void ShowNext<T>(ref T iterator)
    where T : IEnumerator<int>

Now we can pass iterator by reference and the compiler will infer the type. The interface members (MoveNext() and Current) will be called using constrained calls, so there's no boxing involved...

... except that when you try to just change the method calls to use ref, it doesn't work - because apparently you can't pass a "using variable" by reference. I'd never come across that rule before. Interesting. Fortunately, we can (roughly) expand out the using statement ourselves, like this:

var iterator = values.GetEnumerator();
try
{
    ShowNext(ref iterator);
    ShowNext(ref iterator);
    ShowNext(ref iterator);
}
finally
{
    iterator.Dispose();
}

Again, this fixes the problem - and this time there's no boxing involved.

Let's quickly look at one more example of it not working, before I finish...

Dynamic typing to the anti-rescue

What happens if we change the type of iterator to dynamic (and set everything else back the way it was)? I'll readily admit, I really didn't know what was going to happen here. There are two competing forces:

  • The dynamic type is often really just object behind the scenes... so it will be boxed once, right? That means the changes within the box won't get lost. (This would give "1, 2, Done")
  • The dynamic type is in many ways meant to act as if you'd declared a variable of the type which it actually turns out to be at execution time - so in this case it should work as if the variable was of type List<int>.Enumerator, just like our original code. (This would give "1, 1, 1")

What actually happens? I believe it actually boxes the value returned from GetEnumerator() - and then the C# binder DLR makes sure that the value type behaviour is preserved by copying the box before passing it to ShowNext(). In other words, both bits of intuition are right, but the second effectively overrules the first. Wow. (See the comment below from Chris Burrows for more information about this. I'm sure he's right that it's the only design that makes sense. This is a pretty pathological example in various ways.)

Conclusion

Just say "no" to mutable value types. They do weird things.

(Fortunately the vast majority of the time this particular one won't be a problem - it's rare to use iterators explicitly in the first place, and when you do you very rarely pass them to another method.)

Posted by skeet | 15 comment(s)
Filed under: ,

Degrees of reality in sample code

Yesterday I tweeted a link to an article about overloading that I'd just finished. In that article, all my examples look a bit like this:

using System;

class Test
{
    static void Foo(int x, int y = 5)
    {
        Console.WriteLine("Foo(int x, int y = 5)");
    }
    
    static void Foo(double x)
    {
        Console.WriteLine("Foo(double x)");
    }

    static void Main()
    {
        Foo(10);
    }
}

Each example is followed by an explanation of the output.

Fairly soon afterwards, I received an email from a reader who disagreed with my choices for sample code. ere are a few extracts from the email exchange. Please read them carefully - they really form the context of the rest of this post.

This is really not proper. When a method can do more than one thing, you might offer what are called 'convenience overloads', which make it easier for the consuming developer. When you start swaying away so much that you have wildly different arguments, then it's probably time to refactor and consider creating a second method. With your example with "Foo", it's hard to tell which is the case.

My point is, the 'convenience overloads' should all directly or indirectly call the one REAL method. I'm not a fan of "test", "foo", and "bar", because they rarely make the point clearer, and often make it more confusing. So let me use something more realistic. So let me use something more realistic. This nonsensical example, but hopefully is clear: [code snipped, but it was an OrderProcessor, referring to an OrderDetail class]

...

The point here was to make you aware of the oversight. I do what I can to try to stop bad ideas from propagating, particularly now that you're writing books. When developers read your book and consider it an "authority" on the topic, they take your example as if it's a model for what they should do. I just hope your more mindful of that in your code samples in the future.

...

Specific to this overload issue, this has come up many times for me. Developers will write 3 overloads that do wildly different things or worse, will have 98% of the same code repeated. We try to catch this in a code review, but sometimes we will get pushback because they read it in a book (hence, my comments).

...

I assume your audience is regular developer, right? In other words, the .NET Framework developers at Microsoft perhaps aren't the ones reading your books, but it's thousands of App Developer I and App Developer II that do business development? I just mean that there are far, far more "regular developers" than seasoned, expert developers who will be able to discern the difference and know what is proper. You are DEFINING what is proper in your book, you become an authority on the matter!

Anyhow, all my point was it to realize how far your influence goes once you become an author. Even the simplest, throwaway example can be seen as a best-practice beacon.

Now, this gave me pause for thought. Indeed, I went back and edited the overloading article - not to change the examples, but to make the article's scope clearer. It's describing the mechanics of overloading, rather than suggesting when it is and isn't appropriate to use overloading at all.

I don't think I'm actually wrong here, but I wanted to explore it a little more in this post, and get feedback. First I'd like to suggest a few categorizations - these aren't the only possible ones, of course, but I think they divide the spectrum reasonably. Here I'll give example examples in another area: overriding and polymorphism. I'll just describe the options first, and then we can talk about the pros and cons afterwards.

Totally abstract - no code being presented at all

Sometimes we talk about code without actually giving any examples at all. In order to override a member, it has to be declared as `virtual` in a base class, and then the overriding member uses the `override `modifier. When the virtual member is called, it is dispatched to the most specific implementation which overrides it, even if the caller is unaware of the existence of the implementation class.

Working but pointless code

This is the level my overloading article worked at. Here, you write code whose sole purpose is to demonstrate the mechanics of the feature you're describing. So in this case we might have:

using System;

public class C1
{
    public virtual void M()
    {
        Console.WriteLine("C1.M");
    }
}

public class C2 : C1
{
    public override void M()
    {
        Console.WriteLine("C2.M");
    }
}

public class C3
{
    static void Main()
    {
        C1 c = new C2();
        c.M();
    }
}

Now this is a reasonably extreme example; as a matter of personal preference I tend to use class names like "Test" or "Program" as the entry point, perhaps "BaseClass" and "DerivedClass" where "C1" and "C2" are used here, and "Foo" instead of "M" for the method name. Obviously "Foo" has no more real meaning than "M" as a name - I just get uncomfortable for some reason around single character identifiers other than for local variables. Arguably "M" is better as it stands for "method" and I could use "P" for a property etc. Whatever we choose, we're talking about metasyntactic variables really.

Complete programs indicative of design in a non-business context

This is the level at which I would probably choose to demonstrate overriding. It's certainly the one I've used for talking about generic variance. Here, the goal is to give the audience a flavour of the purpose of the feature as well as demonstrating the mechanics, but to stay in the simplistic realm of non-business examples. To adapt one of my normal examples - where I'd actually use an interface instead of an abstract class - we might end up with an example like this:

using System;
using System.Collections.Generic;

public abstract class Shape
{
    public abstract double Area { get; }
}

public class Square : Shape
{
    private readonly double side;
    
    public Square(double side)
    {
        this.side = side;
    }
    
    public override double Area { get { return side * side; } }
}

public class Circle : Shape
{
    public readonly double radius;
    
    public Circle(double radius)
    {
        this.radius = radius;
    }
    
    public override double Area { get { return Math.PI * radius * radius; } }
}

public class ShapeDemo
{
    static void Main()
    {
        List<Shape> shapes = new List<Shape>
        {
            new Square(10),
            new Circle(5)
        };
        foreach (Shape shape in shapes)
        {
            Console.WriteLine(shape.Area);
        }
    }
}

Now these are pretty tame shapes - they don't even have a location. If I were really going to demonstrate an abstract class I might try to work out something I could do in the base class to make it sensibly a non-interface... but at least we're demonstrating the property being overridden.

Business-like partial example

Here we'll use classes which sound like they could be in a real business application... but we won't fill in all the useful logic, or worry about any properties that aren't needed for the demonstation.

using System;
using System.Collections.Generic;

public abstract class Employee
{
    private readonly DateTime joinDate;
    private readonly decimal salary;
    
    // Most employees don't get bonuses any more
    public virtual int BonusPercentage { get { return 0; } }
    
    public decimal Salary { get { return salary; } }
    public DateTime JoinDate { get { return joinDate; } }
    
    public int YearsOfService
    {
        // TODO: Real calculation
        get { return DateTime.Now.Year - joinDate.Year; }
    }
    
    public Employee(decimal salary, DateTime joinDate)
    {
        this.salary = salary;
        this.joinDate = joinDate;
    }
}

public abstract class Manager : Employee
{
    // Managers always get a 15% bonus
    public override int BonusPercentage { get { return 15; } }
}

public abstract class PreIpoContract : Employee
{
    // The old style contracts were really generous
    public override int BonusPercentage
    {
        get { return YearsOfService * 2; }
    }
}

Now this particular code sample won't even compile: we haven't provided the necessary constructors in the derived classes. Note how the employees don't have names, and there are no relationships between employees and their managers, either.

Obviously we could have filled in all the rest of the code, ending up with a complete solution to an imaginary business need. Other examples at this level may well include customers and orders. One interesting thing to note here: admittedly I've only been working in the industry for 16 years, and only 12 years full time, but I don't think I've ever written a Customer or Order class as part of my job.

Full application example

No, I'm not going to provide an example of this. Usually this is the sort of thing which a book might work up to over the course of the complete text, and you'll end up with a wiki, or an e-commerce site, or an indexed library of books with complete web site around it. If you think I'm going to spend days or even weeks coding something like that just for this blog post, you'll be disappointed :)

Anyway, the idea of this is that it does something genuinely useful, and you can easily lift whole sections of it into other projects - or at least the design of it.

Which approach is best?

I'm sure you know what's coming here: it depends. In particular, I believe it depends on:

Your readership

Are they likely to copy and paste your example into production code without further thought? Arguably in that case the first option might be the best: they may not understand it, but at least it means your code won't be injuring a project.

Simply put, didactic code is not production code. The parables in the Bible aren't meant to be gripping stories with compelling characterization: they're meant to make a point. Scales aren't meant to sound like wonderful music: they're meant to help you improve your abilities to make a nice sound when you're playing real music.

The point you're trying to put across

If I'm trying to explain the mechanics of a feature, I find the second option to be useful. The reader doesn't need to try to take in the context of what the code is trying to accomplish, because it's explicitly not trying to do anything of any use. It's just demonstrating how the language or platform behaves in a particular scenario.

If, on the other hand, you're trying to explain a design principle, then the third or fourth options are useful. The third option can also be useful for the mechanics of a feature which is particularly abstract - like generic variance, as I mentioned earlier. That goes somewhere between "complete guide to where this feature should be used" and "no guidance whatsoever" - a sort of "here's a hint at the kind of situation where it could be useful."

If you're trying to explore a technology for fun, I find the third option works very well for that situation too. For example, while looking at Reactive Extensions, I've written programs to:

  • Group lines in a file by length
  • Give the results of a UK general election
  • Simulate the 1998 Brazil vs Norway world cup football match
  • Implement drag and drop using event filtering

None of these is likely to be directly useful in a real business app - but they were more appealing than solely demonstrating a sequence of numbers being generated (although with an appropriate marble diagram generator, that can be quite fun too).

The technology you're demonstrating

This is clearly related to the previous point, but I think it bears a certain amount of separation. I believe that language topics are fairly easily demonstrated with the second and third options. Library topics often deserve a slightly higher level of abstraction - and if you're going to try to demonstrate that a whole platform is worth investing time and energy in, it's useful to have something pretty real-world to show off.

Your time and skills

You know what? I suck at the fourth and fifth options here. I can't remember writing any complete, independent systems as a software engineer, and none of them have been in line-of-business applications anyway. The closest I've come is writing standalone tools which certainly have been useful, but often take shortcuts in terms of design which I wouldn't countenance in other applications. (And yes, I'm sure there's some discussion to be had around that as well, but it's not the point of this article.)

You may think my employee example above was lousy - and I'd agree with you. It's not really a great fit for inheritance, in my view - and the bonus calculation is certainly a dubious way of forcing in some polymorphism. But it was the best I could come up with in the time available to me. This wasn't some attempt to make it appear less worthy than the other options; I really am that bad at coming up with business-like examples. Other authors (by which I mean anyone writing at all, not just book authors) may well have found much better examples, either by spending more time on them, being more experienced with line-of-business apps, or having a better imagination. Or all three.

I'm not too proud to admit the things I suck at :) If I spent many extra hours coming up with examples for everything I write about, I would get a lot less written. I'm doing this in notional "spare time" after all. So even if you would prefer the fourth option over the third, would you rather have that but see less of my (ahem) "wisdom"? Personally I think everyone's better off with me braindumping using examples in forms which I'm better at.

How to read examples

Most of this post has been from the point of view of an author. Briefly, I'd like to suggest what this might mean for readers. The onus is on the author to make this clear, of course, but I think it's worth trying to be actively better readers ourselves.

  • Understand what the author is trying to achieve. Don't assume that every example will fit nicely in your application. Example code often doesn't come with any argument validation or error handling - and very rarely does it have an appropriate set of unit tests. If you're reading about how something works, don't assume that the examples are in any way realistic. They may well be simplified to demonstrate the behaviour as clearly as possible without the extra "fluff" of useful functionality.
  • Think about what may be missing, particularly if the context is an evangelical one. If someone is trying to sell you on a particular technology, then of course they'll try to show it in its best possible light. Where are the pitfalls? Where does it not stack up?
  • Don't assume authority means anything. I was quite happy to take Jeffrey Richter to task on boxing for example. Jeffrey Richter is a fabulous author and clearly a smart cookie, but that doesn't mean he's right about everything... and I really, really don't like the idea of anyone appealing to my supposed abilities to justify some bad decision. Judge any argument on its merits... find out what people think and why they think it, but then see how well their reasoning actually hangs together.

Conclusion

This was always going to be a somewhat biased look at this topic, because I hold a certain viewpoint which is clearly contrary to the one held by the chap who emailed me. That's why I included a reasonable chunk of his emails - to give at least some representation to the alternatives. This post has effectively been a longwinded justification of the form my examples have taken... but does it ring true?

I can't guarantee to change my writing style drastically on this front - at least not quickly - but I would very much appreciate your thoughts on this. I'm reluctant to exaggerate, but I think it may be even more important than working out whether "Jedi" was meant to be plural or singular - and I certainly received a lot of feedback on that topic.

Posted by skeet | 18 comment(s)

How many Jedi?

(There's no technical content in this post... but you may get a bit of a giggle from it. When I get the second edition web site notes together I'll include this as well... but I thought it was fun enough to deserve a blog post too.)

The second edition of C# in Depth is nearing the end of its technical review cycle, as performed by the great Eric Lippert. Yesterday I received the comments for chapter 13, which includes this section heading:

The revenge of optional parameters and named arguments

Now, my copy editor (Ben) wasn't too keen on this. He suggested an alternative form:

I think "have their revenge" has more of a ring to it than "the revenge of"

Personally I'm quite fond of the original, but Eric suggested another alternative, with customary flair:

I'm not buying it Ben. Your way vs Jon's way:
"The Clones Attack"       / "Attack of the Clones"
"The Sith Have Revenge"   / "The Revenge of the Sith"
"The Empire Strikes Back" / "The Counter-attack of the Empire"
"The Jedi Return"         / "The Return of the Jedi"

I would argue - I have before and I will again - that the proper title for episode two is not the passive, wimpy "Attack of the Clones" but rather the aggressive, dynamic, active "The Clones Attack", preferably with an exclamation point, "The Clones Attack!"

"The Sith Have Revenge" has that awkward verb form. "The Counter-attack of the Empire" is too wordy. And "The Jedi Return" is just... wrong.  So I would score these as the winners being Ben / Jon / Ben / Jon.

I say "the revenge of" is superior to "have their revenge", but that the best would be "Optional and named parameters strike back".

Also, NOOOOOO! You're not my father!

This intrigued me mightily, so I dashed off an email to Eric:

Hi Eric,

I'm just going through your notes for chapter 13, and they've brought up an issue which I think would bother me if I didn't consult you about it.

You suggested that the alternative to "Return of the Jedi" (1) would be "The Jedi Return." That implies multiple Jedi returning - does this include Anakin returning from the Dark Side? Leia's nascent ability being revealed? I had always imagine it to only refer to Luke's return, suggesting "The Jedi Returns" as the parallel title. This could change everything.

Jon

----

(1) There's no leading "The" in the English title, as far as I can tell - although in French it's "Le retour du Jedi." Does this alter your argument at all?

Eric's reply was as prompt as ever:

First off, you’re right, there’s no leading “The”. I had not realized that.

I had always assumed that the “Jedi” of the title “Return of the Jedi” referred to the beginning of the restoration of the Jedi as an institution. With the downfall of the Emperor and Lord Vader, the Jedi are back. Even though technically there’s only one of them alive in the club right now, there will be more.

However, I must admit that in light of episodes one through three, it now seems plausible that in fact the Jedi referred to in the title is neither the Jedi as a class nor Luke as an individual, but rather the redemption of Anakin.

Beyond the dialogue...

So, that's the end of that, right? We can't really tell what Lucas was thinking... except that when I relayed all of this at the office over breakfast, someone suggested that we have a look at some other translations - and that we pay more attention to the French than just the inclusion of "Le" to start with.

The fact that the French version uses "du" suggests it's the return of a singular Jedi rather than many individual Jedi knights... but apparently the singular form can also be used for a collective institution, in line with Eric's "Jedi as a class" theory.

The German version is still ambiguous, as far as I can tell interesting: "Die Rückkehr der Jedi-Ritter" - a colleague suggested that this implies knights plural, but "the return of the knight" and "the return of the knights" translate the same way in Google Translate. The fact that "ritter" is both plural and singular (like sheep in English) looks like it foils us. EDIT: As noted in comments, the genetive form would be "des" for a singular knight. So it really is "knights". I was misled by automated translation - I should have trusted my colleague :) But does this mean "the return of several individual Jedi knights" or "the return of an institution of Jedi knights"? Without knowing the subtleties of the German language, I certainly can't tell for sure.

There's a whole host of titles of course - if any reader gifted in languages wishes to analyse some more of them, I'd be very grateful. One thing I will point out is the alternative US working title: "Revenge of the Jedi." Who really had their revenge in episode VI? Arguably Luke avenged Han by killing Jabba... and perhaps Anakin himself took revenge on the Emperor? Given that Obi Wan effectively started Luke on the crusade against Vader, perhaps it's his revenge from beyond the grave?

These are serious matters which I lament I am unable to explore adequately in this post - but comments are more welcome than ever.

Conclusion

So what happened to the heading in the end? Well, for the moment I've left it as it is. It may yet change before printing though - we'll see. Possibly I should take this opportunity to make Eric's dream change apply in a different context... how about "Attack of the optional parameters and named arguments!" Perhaps not.

Anyway, I'm sure you will all be glad to see that such weighty technical matters are being given the thorough attention they deserve. Yes, the book really will come out sometime reasonably soon.

Posted by skeet | 23 comment(s)
Filed under:

Epicenter 2010: quick plug and concessionary tickets

Just a quick update to mention that I'm speaking at Epicenter 2010 in Dublin on Wednesday, on Noda Time and C# Corner Cases. There are concessionary tickets available, so if you're on the right landmass, please do come along. Don't be put off by the fact that I'm speaking - there are some genuinely good speakers too. (Stephen Colebourne will be talking about Joda Time and JSR-310, in a session which I'm personally sad to miss - I'll be talking about C# at the same time.)

While I'm busy plugging events, I'm also extremely excited about NDC 2010 next week in Oslo. Neal Gafter and Eric Lippert will be doing a C# Puzzler session, Mads Torgersen will be talking about C# 4, I'll be presenting a wish-list for C# 5, and then all four of us will be doing a Q&A session. Should be heaps of fun. (I'll also be presenting C# 4's variance features, and Noda Time again.)

As ever, I'm somewhat late in putting the final touches to all of these talks, so if you've got any suggestions for my C# 5 wish-list or any particularly evil corner cases which have caught you out, add them as comments and I'll try to squeeze 'em in.

Posted by skeet | 12 comment(s)
Filed under: ,

How do we raise our game?

A couple of weeks ago, I was in the Seattle area for work reasons. I took the opportunity to meet up with a lot of smart folks, including some working on the Reactive Extensions team and the C# team. I asked pretty much the same question of almost everyone:

How is Microsoft intending to make developers smarter?

Let me explain what I mean a bit more clearly...

The problem

Now, let me be quite clear: I only have visibility of a small section of the community. In particular, I get to see:

  • Comments from blog readers
  • Questions and answers on Stack Overflow
  • MSDN forum posts (typically the Code Contracts and Reactive Extensions forums, and then only occasionally)
  • Random emails, often from C# in Depth readers
  • User group and conference attendees

Intuition would suggest that these are some of the more interested developers. Not necessarily the smartest, but ones who are at least engaged with the community. It's been a while since I've worked in a regular company - I don't think Google engineers are representative of software engineers as a whole - so I'm not going to suggest I have a good grasp of how able the "disengaged" engineers are. I doubt that they're significantly smarter though.

My concern is that some of the new technologies that I'm getting excited by (in particular Parallel Extensions, Code Contracts and Reactive Extensions) may make it easier for awesome people to write awesome code - but leave everyone else behind. Even though I'm excited about Reactive Extensions, I still regularly get confused by it - this is partly a matter of thin documentation (a natural corollary of an API which is still under development; I'm sure it will improve markedly over time) but partly because a push model is simply harder to think about.

What can we do about it?

Taking Reactive Extensions as an example, how is anyone going to learn about Reactive Extensions to such a degree that they can really use it productively, rather than just for experimentation?

As I see it there are currently five main ways of disseminating information:

User groups and conferences

These are good for getting people interested - but they're not really great at deep dives. To be absolutely cynical about it, I think they're better at entertaining than educating. To be slightly more positive, they can be good at inspiring people to look further for themselves... but unless you have a relatively long session or start off with developers who already have a fair amount of experience in the topic, I don't think they're a great way of imparting detailed information. (Organisers of such events, please note that this is in no way a condemnation of events in general. I'm still interested in speaking at them - I just question their ability to create experts.)

Training courses

I have little experience of training courses - it could be that they're highly effective - but they clearly don't scale in terms of getting information to a wide audience. As training companies make their money by getting people to the courses in person, they're unlikely to provide videos of the training to let everyone take advantage of an individual tutor's experience. We can hope for a network effect here and in some of the other options, of course - an expert trains a novice, the novice gains experience and then trains other novices in their company. There's a risk of truth dilution, but at least this has the possibility of reaching those who would otherwise never voluntarily learn about new technologies.

Blogs

Blogs can be effective, but are difficult to navigate and often outdated. The "navigation" problem is one of picking the right course through the posts in order to try to get a well-rounded knowledge of the subject in an appropriate order.

To give a concrete example, a large amount of the information available in C# in Depth is covered in significantly more detail in Eric Lippert's blog - which is great for those who already know enough to read it, but it doesn't provide a 1, 2, 3 experience for learning C# 2-4. (Aside: Over time I've been coming to the conclusion that treating a subject "in the round" with a carefully considered ordering is the primary feature of tech books these days - you can almost always find more detailed, accurate and timely information online if you know what you're looking for.)

I've been tremendously impressed at the Microsoft developer blog output in the last few years. There are many well-written, informative and enlightening blogs out there, which can certainly go a long way to filling an educational void. Right at this minute, it feels like they're probably the most effective teaching tool in this list... but it somehow feels wrong to try to learn a new technology from scratch from a blog. It's easy to get caught up in details and miss the big picture.

Is there a market for third parties collating blog posts into effective teaching material? I don't just mean "link blogs" - but well-maintained and organised sequences of posts designed to teach particular topics... possibly with extra explanatory posts to provide extra cohesion. Is there sufficient coverage of basic topics in blogs, or are most bloggers aiming at developers who are already experienced in the topic in question? Answers on a postcard...

Books

Books have problems of freshness and detail, as mentioned before. There's also a problem of readership - I'm not convinced that books have a wide enough reach these days. I suspect this is largely because blogs do a good enough job in many areas, and also because technology often moves faster than a book market can keep up. Of course this ends up being a cycle to some extent - with fewer and fewer readers, top authors will have more incentive to blog than to write books, so the overall quality could drop, leading to fewer book readers etc. I'm not going to pretend to know enough about the book market to make concrete predictions - but it does concern me.

(I'm not particularly bothered in terms of my own income - I've never been writing C# in Depth for the money, although of course that provides more incentive for "polish". I wouldn't have put in as much editorial work in a purely amateur capacity. There's a very strange motivational force going on when it comes to tech writing and publishing. That's possibly a seed for another post some time.)

Will there be a book teaching how to think about push sequences and use Reactive Extensions effectively? Maybe. Heck, I'd potentially be interested in writing one - if only to understand it better myself. Would it reach all the people who could benefit from it though? I'm more dubious about that.

Documentation and tutorials

This is probably the most traditional and widespread route to knowledge of new technologies. Does it cut it? I simply don't know. I can't remember the last time I read docs in a "from scratch" mode - my experience is much more in reference mode. I've occasionally tried - but found either my patience or the documentation to be lacking. It's entirely possible that my expectations and methods of learning have changed, and that that's what's gone wrong... but this may be a reasonably common phenomenon.

Again there's a problem of reach - although it's possible that as the most traditional form of learning, it's possible that it has a greater reach into the non-community, if you see what I mean.

Non-conclusion

This has been somewhat rambly, partly because I'm demob-happy: I'm about to go to Athens on holiday with Holly for a few days. (Don't expect any comments to be approved until Sunday.) It's a real concern though, and one which goes way beyond the Microsoft technologies mentioned here.

The challenges faced in computing are growing, and so are the technologies trying to up us meet those challenges... but we need to grow too, in order to take advantage of what's available.

I'm not pronouncing doom on the industry - but I'd like your thoughts on how we can keep up, and how we can help others to do likewise.

Posted by skeet | 30 comment(s)
Filed under: ,

Book review: "Confessions of a public speaker" by Scott Berkun

Resources

Introduction

A couple of weeks ago I was giving a presentation on Reactive Extensions at VBUG 4Thought spring conference, and there was an O'Reilly stand. I picked up CLR via C# 3rd edition (I now have all three editions, which is kinda crazy) and I happened to spot this book too.

I've been doing a reasonable amount of public speaking recently, with more to come in the near future (and local preaching roughly once a month), so I figure it would probably be a good idea to find out how to actually do it properly instead of bumbling along in the way I've been doing so far. This looked as good a starting point as any.

It's been a while since I've had a lot of time for reading, unfortunately - C# in Depth is still sucking my time - but this is a quick read, and I finished it on the plane today. I should point out that I'm currently flying to Seattle for meetings in the Google Kirkland office. The book itself is in the overhead locker, so obviously I could reach it down - but I'd rather not. Surely a book like this should at least largely be judged by the impression it makes on the reader; if I couldn't find enough to talk about when I only finished it a few hours ago, that would be a bad sign. It does mean that I'm not going to be as concrete in my notes as I would usually be - but that's probably reasonably appropriate for a non-technical book anyway.

Content

The book covers various different topics, from preparation to delivery and evaluation. The book is clearly divided into chapters - but a lot of the time the topics seem to leak into chapters other than the ones you might expect them to crop up in. If this were a technical book, I would view that as a bad thing - but here, it just worked. In some ways the book mirrors an actual presentation in terms of fluidity, narration and imagery. Sometimes this is explicitly mentioned, but often in retrospect and never in a self-aggrandising manner.

Although steps for designing your overall talk are examined, there's little guidance on how to design slides themselves: that's delegated to other books. (I'm reasonably happy with my slide style at the moment, and as it's somewhat uncommon, it may well not benefit much from "conventional wisdom" - there are plenty of bigger picture things I want to concentrate on improving first, anyway.)

There are suggestions for audience activity - from moving people to the front to make an underpopulated venue feel more friendly, to trying to make the audience actively use what they've been told for a better learning experience. While I'd like to think I'm a pretty friendly speaker, I could definitely improve here.

While there are some mentions of financial matters, there's no discussion of getting an agent involved, or what kind of events are likely to be the most lucrative and so on. There is the recommendation that you either need to be famous or an expert to make money - which sounds like very good advice to me. I have no particular desire to go into this for money (and I think I have to speak for free under my current contract with Google) so this was fine by me.

Anecdotes abound: they're part of the coverage of pretty much every topic. At the end there's a whole section of gaffes made by Scott and other speakers, as a sort of "you think you've had it bad?" form of encouragement. There's never a sense of the stories being inserted with a crowbar, fortunately - that's a trait which usually annoys me intensely in sermons.

Evaluation

As you can probably tell already, I liked the book a lot. Scott is a good writer, and I strongly suspect he's a great presenter too - I'll be looking out for his name in conferences I'm going to, with the hope of hearing him first hand.

The real trick is actually applying the book to my own speaking though. It would be hard to miss one central point where I fail badly: practising. I simply don't go through the whole "talking in the living room" thing. For a couple of talks I've gone through a dry-run at Google first as a small-scale tech talk, but usually I just put the slides (and code) together, make sure I know roughly what I'm going to say on each topic, and wing it. Assuming the book is accurate, this puts me firmly in the same camp as most speakers - which is somewhat reassuring, but doesn't actually make my talks any better.

So, I'm going to try it. I'm expecting it to be hugely painful, but I'll give it a go. I feel I somehow owe Scott that much - even though he makes it very clear that he expects most readers not to bother. Possibly putting it as a sort of challenge to exceed expectations is a deliberate ploy. More seriously, he convincingly makes the point that I owe the audience that much. We'll see how it goes.

There are plenty of other aspects of the book which I expect to put to good use - particularly in terms of approaching the relevant topic to start with, writing down a big list of possible points, and whittling it down. I'm not going to promise to write a follow-up post trying to work out what's helped and what hasn't... I know perfectly well that I'd be unlikely to get round to writing it.

Conclusion

If you speak in public (which includes "internal" talks at work) I can heartily recommend this book as an entertaining and potentially incredibly helpful read.

We'll see what happens next...

Documentation with Sandcastle - a notebook

(Posted to both my main code blog and the Noda Time blog.)

I apologise in advance if this blog post becomes hard to get good information from. It's a record of trying to get Sandcastle to work for Noda Time; as much as anything it's meant to be an indication of how smooth or otherwise the process of getting started with Sandcastle is. My aim is to be completely honest. If I make stupid mistakes, they'll be documented here. If I have decisions to make, they'll be documented here.

I should point out that I considered using NDoc (it just didn't make sense to use a dead project) and docu (I'm not keen on the output style, and it threw an exception when I tried running it on Noda Time anyway). I didn't try any other options as I'm unaware of them. Hmm.

Starting point and aims

My eventual aim is to include "build the docs" as a task in the build procedure for Noda Time. I don't have much in the way of preconceived ideas of what the output should be: my guess is a CHM file and something to integrate into MSDN, as well as some static web pages. Ideally I'd like to be able to put the web pages on the Google Code project page, but I don't know how feasible that will be. If the best way forward turns out to be something completely different, that's fine.

(I've mailed Scott Hanselman and Matt Hawley about the idea of having an ASP.NET component of some form which could dynamically generate all this stuff on the fly - you'd just need to upload the .xml and .dll files, and let it do the rest. I'm not expecting that idea to be useful for Noda Time in the near future, but you never know.)

Noda Time has some documentation, but there are plenty of public members which don't have any XML documentation at all at the moment. Obviously there's a warning available for this so we'll be able to make sure that eventually everything's documented, but we also need to be able to build documentation before we reach that point.

Step 0: building the XML file

The build project doesn't currently even create the .xml file, so that's the first port of call - just a case of ticking a box and then changing the default filename slightly... because for some bizarre reason, Visual Studio defaults to creating a ".XML" file instead of ".xml". Why? Where else are capitals used in file extensions?

Rebuild the solution, gaze in fear at the 496 warnings generated, and we have everything we should need from Visual Studio. My belief is that I should now be able to close Visual Studio and not reopen it (with the Noda Time solution, anyway) during the course of this blog post.

Step 1: building Sandcastle

First real choice: which version of Sandcastle do I go for? There was a binary release on May 29th 2008, a source release on July 2nd 2008, and three commits to source control since then, the latest of which was in July 2009. Personally I like the idea of not having to actually install anything: tools which can just be run without installation are nicer for Open Source projects, particularly if you can check the binaries into source control with appropriate licence files. That way anyone can build after just fetching. On the other hand, I'm not sure how well the Sandcastle licence fits in with the Apache 2 licence we're using for Noda Time. I can investigate that later.

What the heck, let's try building it from source. It's probably easier to go from that to the installed version than the other way round. Change set 26202 downloaded and unpacked... now how do we build it, and what do we need to build? Okay, there's a solution file, which opens up in VS2008 (unsurprising and not a problem). Gosh, what a lot of projects (but no unit tests?) - still, everything builds with nary a warning. I've no idea what to do with it now, but it's a start. It looks like it's copied four executables and a bunch of DLLs into the ProductionTools directory, which is promising.

Shockingly, it's only just occurred to me to check for some documentation to see whether or not I'm doing the right thing. Looking at the Sandcastle web page, it seems I'm not missing much. Well, I was aware that this might be the case.

Step 2: Sandcastle Help File Builder

I've heard about SHFB from a few places, and it certainly sounds like it's the way to go - it even has a getting started guide and installation instructions! It looks like there's a heck of a lot of documentation for something sounds like it should be simple, but hey, let's dive in. (I know it sounds inconsistent to go from complaining about no documentation to complaining about too much - but I'm really going from complaining about no documentation to complaining about something being so complicated that it needs a lot of documentation. I'm very grateful to the SHFB team for documenting everything, even if I plan to read it on a Just-In-Time basis.)

A few notes from the requirements page:

  • It looks like I'll need to install the HTML Help Workshop if I want CHM files; the Help 2 compiler should already be part of the VS2008 SDK which I'm sure is already installed. I have no idea where Help 3 fits into this :(
  • It looks like I need a DXROOT environment variable pointing at my Sandcastle "installation". I wonder what that means in my home-built version? I'll assume it just means the Development directory containing ProductionTools and ProductionTransforms.
  • There's a further set of patches available in the Sandcastle Styles project. Helpfully, this says it includes all the updates in the July 2009 source code, and can be applied to the binary installation from May 2008. It's not clear, however, whether it can also be applied to a home-built version of Sandcastle. Given that I can get all the latest stuff in conjunction with an installed version, it sounds like it's worth installing the binary release after all. (Done, and patches installed.)
  • It sounds like I need to install the H2 Viewer and H2Reg. (I suspect that H2Reg will be something we direct our users at rather than shipping and running ourselves; I don't intend to have an MSI-style "installer" for Noda Time at the moment, although the recent CoApp announcement sounds interesting. It's too early to worry about that for the moment though.)
  • We're not documenting a web site project, so I'm not bothering with "Custom Web Code Providers". I've installed quite enough by this point, thank you very much. Oh, except I haven't installed SHFB itself yet. I'd better do that now...

Step 3: creating a Help File Builder project

This feels like it could be reasonably straightforward, so long as I don't try to do anything fancy. Let's follow (roughly) the instructions. (I'm doing it straight to Noda Time rather than using the example project.)

Open the GUI, create a new project, add a documentation source of NodaTime.csproj... and hit Build Project. Wow, this takes quite a while - and this is a pretty beefy laptop. However, it seems to work! I have a CHM file which looks like it includes all the right stuff. Hoorah! It's a pretty huge CHM file (just over 3MB) for a relatively small project, but never mind.

Let's build it again, this time with all the output enabled (Help 1, Help 2, MSHelpViewer and Website).

Hmm... no MS Help 2 compiler found. Maybe I didn't have the VS2008 SDK installed after all. After a bit of hunting, it's here. Time to install it - and make sure it doesn't mess up the Sandcastle installation, as the SHFB docs warned me about. Yikes - 109MB. Ah well.

Okay, so after the SDK installation, rebuild the help... which will take even longer of course, as it's now building four different output formats. 3 minutes 18 seconds in the end... not too bad, but not something I'll want to do after every build :)

Step 4: checking the results

  • Help 1 (CHM): looks fine, if old-fashioned :)
  • Help 2 (HxS): via H2Viewer, looks fine - I'm not sure whether I dare to integrate it with MSDN just yet though.
  • ASP.NET web site: works even in Chrome
  • Static HTML: causes Chrome to flicker, constantly reloading. Works fine in Firefox. Maybe I need to submit a bug report.

I'm not entirely sure which output option corresponds to which result here; in particular, is "Website" the static one or the ASP.NET one? What's MSHelpViewer? It's easy enough to find out of course - I'll just experiment at a later date.

Step 5: building from the command line

I can't decide whether this is crucial (as it should be part of a continuous build server) or irrelevant (as there are so many tools to install, I may never get the ability to run a CB server with everything installed). However, it would certainly be nice.

Having set SHFBROOT appropriately, running msbuild gives this error:

SHFB : error BE0067: Unable to obtain assembly name from project file '[...]' using Configuration 'Debug', Platform 'X64'

Using Debug is definitely correct, but X64 sounds wrong... I suspect I want AnyCPU instead. Fortunately, this can be set in the SHFB project file (it was previously just defaulting). Once that's been fixed, the build works with one warning: BHT0001: Unable to get executing project: Unable to obtain internal reference. Supposedly this may indicate a problem in SHFB itself... I shall report it later on. It doesn't seem to affect the ability to produce help though.

Conclusion

That wasn't quite as painful as I'd feared. I'm nearly ready to check in the SHFB project file now - but I need to work out a few other things first, and probably create a specific "XML" configuration for the main project itself. I'm somewhat alarmed at the number of extra bits and pieces that I had to install though - and the lack of any mention of Help 3 is also a bit worrying.

I've just remembered one other option that I haven't tried, too - MonoDoc. I may have another look at that at a later date, although the fact that it needs a GTK# help viewer isn't ideal.

I still think the Open Source community for .NET has left a hole at the moment. It may be that SHFB + Sandcastle is as good as it gets, given the limitations of how much needs to be installed to build MS help files. I'd still like to see a better way of providing docs for web sites though... ideally one which doesn't involve shipping hundreds of files around when only two are really required.

Posted by skeet | 20 comment(s)
Filed under: , ,

Just how lazy are you?

I've been reviewing chapter 10 of C# in Depth, which is about extension methods. This is where I start introducing some of the methods in System.Linq.Enumerable, such as Where and Reverse. I introduce a few pieces of terminology in callouts - and while I believe I'm using this terminology correctly according to MSDN, I suspect that some of it isn't quite what many developers expect... in particular, what does it mean for something to be "lazy"?

Let's start off with a question: is Enumerable.Reverse lazy? Just so we're all on the same page, here are the interesting bits of the behaviour of Reverse:

  • The call to Reverse doesn't fetch any items - it merely checks that you've not given it a null sequence, stores a reference to that sequence, and returns a new sequence.
  • Once the first item is returned from the returned sequence, all the items in the original sequence are fetched and buffered. Obviously this is required, as the first item in the reversed sequence is the last item in the original sequence.

So, is that lazy or not?

One simple definition of lazy

This morning I tweeted on the somewhat ambiguous notion of something being "lazy" - and the responses I received were along the lines of "it's deferred execution". You could potentially sum up this notion of laziness as:

An operation is lazy if it defers work until it is required in order to return a result.

By that definition, Reverse is lazy. Assuming we don't want to perform special optimisations for IList<T> (which change the exact behaviour), Reverse does as little work as it can - it just so happens that when the first item is requested, it has to drain the source sequence.

The MSDN definition of lazy

MSDN describes deferred execution, lazy evaluation and eager evaluation somewhat differently. Admittedly the page I'm taking these definitions from is in the context of LINQ to XML, but that effectively means it's describing LINQ to Objects. It defines them like this:

Deferred execution means that the evaluation of an expression is delayed until its realized value is actually required.

In lazy evaluation, a single element of the source collection is processed during each call to the iterator. This is the typical way in which iterators are implemented.

In eager evaluation, the first call to the iterator will result in the entire collection being processed. A temporary copy of the source collection might also be required. For example, the OrderBy method has to sort the entire collection before it returns the first element.

Now, I take slight issue with the definition of lazy evaluation here as it specifies that a single element of the source collection is processed on each call. That's fine for Cast, OfType, Select and a few other operators - but it would preclude Where, which might have to pull several source elements before it finds one which matches the specified filter. I still think of Where as being lazy.

My definition of lazy

Thinking about this more, I believe the following definition of laziness is helpful:

(This space left intentionally blank.)

I don't believe lazy is a useful term, basically. It's like "strong typing" - you get some sort of woolly feeling about how something will behave if it's lazy, but chances are you'll need something more precise anyway.

For the purposes of LINQ to Objects-style operators which return IEnumerable<T> or any interface derived from it (including IOrderedEnumerable<T>), I propose the following definitions:

An operation uses deferred execution if no elements are fetched from the source sequence until the first element is fetched from the result sequence. This applies to basically all LINQ to Objects operators which return a sequence interface. (It doesn't apply to ToArray or ToList of course.)

An operation is streaming if it only fetches elements from the source sequence as it requires them, and does not store those elements internally unless otherwise specified.

An operation is buffering if it drains the source sequence (i.e. fetches all the elements) when the first element of the result sequence is fetched, and stores the items internally.

I'm not even entirely comfortable with this: you could claim that Reverse is "streaming but with internal storage" - but that's against the spirit of the definitions. Why did I mention the internal storage at all? Well, consider Distinct... that streams data in that it will the result sequence will return the first element immediately after reading the first element from the source sequence, and so on - but it has to remember all the elements it's already returned, for obvious reasons.

Some operations are buffering in one input and streaming in another - for example, Join will read all of the "inner" sequence as soon as it's asked for its first element, but streams the "outer" sequence.

Why does this matter?

Is this just another example of my pedantry and desire for precision? Not really. Consider my old favourite example of LINQ: processing huge log files. Suppose each log entry contains a user ID, and we've got a handy log reader which will iterate over all the log files, yielding one log entry at a time.

  • Using entries.Reverse() would be disastrous if we ever actually used the result. We really, really don't want to load everything into memory.
  • Using entries.Select(x => x.UserId) would be fine, so long as we used the result without buffering it ourselves.
  • Using entries.Select(x => x.UserId).Distinct() might be fine - it would depend on how many users we have. If you're processing some company-internal application logs, that's probably okay even if you've generated a huge number of log entries. If you're processing FaceBook logs, you could have more of a problem.

Basically, you need to know what an operation will do with its input before you can decide whether or not it's usable for a particular situation.

The best answer for this (IMO) is to document any such methods meticulously. Yes, you then lose the flexibility of changing the behaviour at a later date - but at least you've then provided something that can be used with confidence.

Note that Reactive Extensions has a similar problem, but in a slightly different form - in that case, the distinction between "hot" and "cold" observables can make a big difference, along with scheduling etc. Again, documentation is the key in my view.

Posted by skeet | 16 comment(s)
Filed under: , ,

Speaking of which…

I'm delighted to announce that I'm going to be speaking at the Norwegian Developer Conference 2010 in Oslo in June. Rune Grothaug announced this with the very modest claim that my talk (combined with a Q&A with Mads Torgersen afterwards) could "alter the future of C# altogether". Well, I don't know about that - but I'm very much looking forward to it nonetheless.

As I'm doing quite a bit of this public speaking lark at the moment, I thought it might be worth keeping an up-to-date list of my speaking engagements - and what better way than to have a Google Calendar for the job? You can browse the embedded version below, or subscribe to the ical feed from your own calendaring system.

I'll try to keep this up-to-date, but you should be aware that some events may well be tentative - it's probably best to check on the event's web site, which will usually be linked in the description for the event.  Also note that I don't always know which days I'll be at an event - in order to keep a reasonable home life, I'll often just be popping in for a day or two within a longer conference.

Posted by skeet | 3 comment(s)
Filed under: ,

Optimisations in LINQ to Objects

(Edited on February 11th, 2010 to take account of a few mistakes and changes in the .NET 4.0 release candidate.)

I've just been fiddling with the first appendix of C# in Depth, which covers the standard query operators in LINQ, and describes a few details of the LINQ to Objects implementations. As well as specifying which operators stream and which buffer their results, and immediate vs deferred execution, I've where LINQ optimises for different collection types - or where it doesn't, but could. I'm not talking about optimisations which require knowledge of the projection being applied or an overall picture of the query (e.g. seq.Reverse().Any() being optimised to seq.Any()) - this is just about optimisations which can be done on a pretty simple basis.

There are two main operations which can be easily optimised in LINQ: random access to an element by index, and the count of a collection. The tricky thing about optimisations like this is that we do have to make assumptions about implementation: I'm going to assume that any implementation of an interface with a Count property will be able to return the count very efficiently (almost certainly straight from a field) and likewise that random access via an indexer will be speedy. Given that both these operations already have their own LINQ operators (when I say "operator" in this blog post, I mean "LINQ query method" rather than an operator at the level of C# as a language) let's look at those first.

Count

Count() should be pretty straightforward. Just to be clear, I'm only talking about the overload which doesn't take a predicate: it's pretty hard to see how you can do better than iterating through and counting matches when there is a predicate involved.

There are actually two common interfaces which declare a Count property: ICollection<T> and ICollection. While many implemenations of ICollection<T> will also implement ICollection (including List<T> and T[]), it's not guaranteed: they're independent interfaces, unrelated other than by the fact that they both extend IEnumerable.

The MSDN documentation for Enumerable.Count() states:

If the type of source implements ICollection<T>, that implementation is used to obtain the count of elements. Otherwise, this method determines the count.

This is accurate for .NET 3.5, but in .NET 4.0 it does optimise for ICollection as well. (In beta 2 it only optimised for ICollection, skipping the generic interface.)

ElementAt

The equivalent of an indexer in LINQ is the ElementAt operator. Note that it can't really be an indexer as there's no such thing as an "extension indexer" which is arguably a bit of a pity, but off-topic. Anyway, the obvious interface to look for here is IList<T>... and that's exactly what ElementAt does. It ignores the possibility that you've only implemented the nongeneric IList - but I think that's fairly reasonable. After all, the extension method extends IEnumerable<T>, so your collection has to be aware of generics - why would you implement IList but not IList<T>? Also, using the implementation of IList would involve a conversion from object to T, which would at the very least be ugly.

So ElementAt doesn't actually do too badly. Now that we've got the core operations, what else could be optimised?

SequenceEqual

If you were going to write a method to compare two lists, you might end up with something like this (ignoring nullity concerns for the sake of brevity):

public static bool ListEqual<T>(List<T> first, List<T> second, IEqualityComparer<T> comparer)
{
    // Optimise for reflexive comparison
    if (first == second)
    {
        return true;
    }
    // If the counts are different we know the lists are different
    if (first.Count != second.Count)
    {
        return false;
    }
    // Compare each pair of elements in turn
    for (int i = 0; i < first.Count; i++)
    {
        if (!comparer.Equals(first[i], second[i]))
        {
            return false;
        }
    }
    return true;
}

Note the two separate optimisations. The first is always applicable, unless you really want to deal with sequences which will yield different results if you call GetEnumerator() on them twice. You could certainly argue that that would be a legitimate implementation, but I'd be interested to see a situation in which it made sense to try to compare such a sequence with itself and return false. SequenceEqual perform this optimisation.

The second optimisation - checking for different counts - is only really applicable in the case where we know that Count is optimised for both lists. In particular, I always make a point of only iterating through each source sequence once when I write a custom LINQ operator - you never know when you'll be given a sequence which reads a huge log file from disk, yielding one line at a time. (Yes, that is my pet example, but it's a good one and I'm sticking to it.) But we can certainly tell if both sequences implement ICollection or ICollection<T>, so it would make sense to have an "early negative" in that situation.

Last(predicate)

(All of this applies to LastOrDefault as well, by the way.) The implementation of Last which doesn't take a predicate is already optimised for the IList<T> case: in that situation the method finds out the count, and returns list[count - 1] as you might expect. We certainly can't do that when we've been given a predicate, as the last value might not match that predicate. However, we could walk backwards from the end of the list... if you have a list which contains a million items, and the last-but-one matches the predicate, you don't really want to test the first 999998 items, do you? Again, this assumes that we can keep using random access on the list, but I think that's reasonable for IList<T>.

Reverse

Reverse is an interesting case, because it uses deferred execution and streams data. In reality, it always takes a complete copy of the sequence (which in itself does optimise for the case where it implements ICollection<T>; in that situation you know the count to start with and can use source.CopyTo(destinationArray) to speed things up). You might consider an optimisation which uses random access if the source is an implementation of IList<T> - you could just lazily yield the elements in reverse order using random access. However, that would change behaviour. Admittedly the behaviour of Reverse may not be what people expect in the first place. What would you predict that this code does?

string[] array = { "a", "b", "c", "d" };

var query = array.Reverse();
array[0] = "a1";
        
var iterator = query.GetEnumerator();
array[0] = "a2";
        
// We'll assume we know when this will stop :)
iterator.MoveNext();
Console.WriteLine(iterator.Current);
array[0] = "a3";
        
iterator.MoveNext();
Console.WriteLine(iterator.Current);
array[0] = "a4";

iterator.MoveNext();
Console.WriteLine(iterator.Current);
array[0] = "a5";
        
iterator.MoveNext();
Console.WriteLine(iterator.Current);

After careful thought, I accurately predicted the result (d, c, b, a2) - but you do need to take deferred execution *and* eager buffering into account. If nothing else, this should be a lesson in not changing the contents of query sources while you're iterating over the query unless you're really sure of what you're doing.

With the candidate "optimisation" in place, we'd see (d, c, b, a5), but only when working on array directly. Working on array.Select(x => x) would have to give the original results, as it would have to iterate through all the initial values before finding the last one.

LongCount

This is an interesting one... LongCount really doesn't make much sense unless you expect your sequence to have more than 2^31 elements, but there's no optimisation present. The contract for IList<T> doesn't state what Count should do if the list has more than Int32.MaxValue elements, so that can't really be used - but potentially Array.LongValue could be used for large arrays.

A bigger question is when this would actually be useful. I haven't tried timing Enumerable.Range(0, int.MaxValue) to see how long it would take to become relevant, but I suspect it would be a while. I can see how LongCount could be useful in LINQ to SQL - but does it even make sense in LINQ to Objects? Maybe it will be optimised in a future version with ILongList<T> for large lists...

EDIT: In fact, given comments, it sounds like the time taken to iterate over int.MaxValue items isn't that high after all. That'll teach me to make assumptions about running times without benchmarking... I still can't say I've seen LongCount used in anger in LINQ to Objects, but it's not quite as silly as I thought :)

Conclusion

The optimisations I've described here all have the potential to take a long-running operation down to almost instantaneous execution, in the "right" situation. There may well be other opportunities lurking - things I haven't thought of. The good news is that missing optimisations could be applied in future releases without breaking any sane code. I do wonder whether supporting methods (e.g. TryFastCount and TryFastElementAt) would be useful to encourage other people writing their own LINQ operators to optimise appropriately - but that's possibly a little too much of a niche case.

Blog post frequency may well change in either direction for the near future - I'm going to be very busy with last-minute changes, fixes, indexing etc for the book, which will give me even less time for blogging. On the other hand, it can be mind-numbingly tedious, so I may resort to blogging as a form of relief...

Posted by skeet | 26 comment(s)
Filed under: , ,

The irritation of bad names

A couple of days ago I accidentally derailed the comments on Eric Lippert's blog post about unused "using" directives. The reason that redundant code doesn't generate a warning in Visual Studio is that it's what you get to start with in Visual Studio. This led me to rant somewhat about other aspects of Visual Studio's behaviour which sacrifice long term goodness in favour of short term efficiency. Almost all the subsequent comments (at the time of writing this post) are concerned with my rant rather than Eric's post. Some agree with me, some don't - but it's only now that I've spotted the bigger picture behind my annoyances.

All of them are to do with names and the defaults provided. I've blogged before about how hard it is to find a good name - it's a problem I run into time and time again, and the ability to rename something is one of the most important refactorings around.

If you don't know, ask

Now if it's hard to find a good name, it stands to reason that anything the IDE can generate automatically is likely to be a poor name... such as "Form1", "textBox1" or "button1_Click". And yet, in various situations, Visual Studio will happily generate such names, and it can sometimes be a small but significant pain to correct it.

The situation which causes me personally a lot of pain is copying a file. For C# in Depth, I have a lot of very small classes, each with a Main method. When I'm evolving an example, I often want to take the existing code and just change it slightly, but in a new file. So I might have a file called OrderByName.cs containing a class called OrderByName. (I agree this would normally be a bad class name, but in the context of a short but complete example it's okay.) I want to just select the file, hit copy and paste, and be asked for a new filename. The class within the file would then be renamed for me as well. As an aside, this is the behaviour Eclipse has in its Java tooling.

In reality, I'd end up with a new file called "Copy of OrderByName.cs", still containing a class called OrderByName. Renaming the file wouldn't offer to rename the class, as the filename wouldn't match the class name. Renaming the class by just changing it and then hitting Ctrl-. would also rename the original class, which is intensely annoying. You're basically stuck doing it manually with find and replace, as far as I can see. There may be some automated aid available, but at the very least it's non-obvious.

Now the question is: why would I ever want a file called "Copy of OrderByName.cs"? That's always going to be the wrong name, so why doesn't Visual Studio ask me for the right name? It could provide a default so I can skip through if I really want to (and probably an "Apply to all items" if I'm copying multiple files) but at least give me the chance to specify the right filename at the crucial point. Once it knows the right new filename before it's got a broken build, I would hope it would be easy to then apply the new name to the class too.

The underlying point is that if you absolutely have to have a name for something, and there's no way a sensible suggestion can be provided, the user should be consulted. I know there's a lot of discussion these days about not asking the user pointless questions, but this isn't a pointless question... at least when it comes to filenames.

If you don't need a name, don't use one

I'm not a UI person, so some of this section may be either outdated or at least only partially applicable. In particular, I believe WPF does a better job than the Windows Forms designer.

Names have two main purposes, in my view. They can provide semantic meaning to aid the reader, even if a name isn't strictly required (think of the "introduce local variable" refactoring) and they can be used for identification.

Now suppose I'm creating a label on a form. If I'm using the designer, I can probably see the text on the label - its meaning is obvious. I quite possibly don't have to refer to the label anywhere in code, unless I'm changing the value programmatically... so why does it need a name? If you really think it needs a name, is "label1" ever going to be the right name - the one you'd have come up with as the most meaningful one you could think of?

In the comments in Eric's blog, someone pointed out that being prompted for a name every time you dragged a control onto the designer would interrupt workflow... and I quite agree. Many of those controls won't need names. However, as soon as they do need a name, prompting for the name at that point (or just typing it into the property view) isn't nearly such a distraction... indeed, I'd suggest it's actually guiding the developer in question to crystallize their thoughts about the purpose of that control.

Conclusion

Okay, this has mostly been more ranting - but at least it's now on my blog, and I've been able to give a little bit more detail about the general problem I see in Visual Studio - a problem which leads to code containing utterly useless names.

The fundamental principle is that I want every name in my code to be a meaningful one. The IDE should use two approaches to help me with that goal:

  • Don't give a name to anything that doesn't deserve or need one
  • If a name is really necessary, and you can't guess it from the rest of the context, ask the user

I don't expect anything to change, but it's good to have it off my chest.

Posted by skeet | 28 comment(s)
Filed under: ,
More Posts Next page »