December 2010 - Posts

Reimplementing LINQ to Objects: Part 19 - Join

You might expect that as joining is such a fundamental concept in SQL, it would be a complex operation to both describe and implement in LINQ. Fortunately, nothing could be further from the truth...

What is it?

Join has a mere two overloads, with the familiar pattern of one taking a custom equality comparer and the other not:

public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector)

public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)

These are daunting signatures to start with - four type parameters, and up to six normal method parameters (including the first "this" parameter indicating an extension method). Don't panic. It's actually quite simple to understand each individual bit:

  • We have two sequences (outer and inner). The two sequences can have different element types (TOuter and TInner).
  • For each sequence, there's also a key selector - a projection from a single item to the key for that item. Note that although the key type can be different from the two sequence types, there's only one key type - both key selectors have to return the same kind of key (TKey).
  • An optional equality comparer is used to compare keys.
  • A delegate (resultSelector) is used to project a pair of items whose keys are equal (one from each sequence) to the result type (TResult)

The idea is that we look through the two input sequences for pairs which correspond to equal keys, and yield one output element for each pair. This is an equijoin operation: we can only deal with equal keys, not pairs of keys which meet some arbitrary condition. It's also an inner join in database terms - we will only see an item from one sequence if there's a "matching" item from the other sequence. I'll talk about mimicking left joins when I implement GroupJoin.

The documentation gives us details of the order in which we need to return items:

Join preserves the order of the elements of outer, and for each of these elements, the order of the matching elements of inner.

For the sake of clarity, it's probably worth including a naive implementation which at least gives the right results in the right order:

// Bad implementation, only provided for the purposes of discussion
public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(
    this IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    // Argument validation omitted
    foreach (TOuter outerElement in outer)
    {
        TKey outerKey = outerKeySelector(outerElement);
        foreach (TInner innerElement in inner)
        {
            TKey innerKey = innerKeySelector(innerElement);
            if (comparer.Equals(outerKey, innerKey))
            {
                yield return resultSelector(outerElement, innerElement);
            }
        }
    }
}

Aside from the missing argument validation, there are two important problems with this:

  • It iterates over the inner sequence multiple times. I always advise anyone implementing a LINQ-like operator to only iterate over any input sequence once. There are sequences which are impossible to iterate over multiple times, or which may give different results each time. That's bad news.
  • It always has a complexity of O(N * M) for N items in the inner sequence and M items in the outer sequence. Eek. Admittedly that's always a possible complexity - two sequences which have the same key for all elements will always have that complexity - but in a typical situation we can do considerably better.

The real Join operator uses the same behaviour as Except and Intersect when it comes to how the input sequences are consumed:

  • Argument validation occurs eagerly - both sequences and all the "selector" delegates have to be non-null; the comparer argument can be null, leading to the default equality comparer for TKey being used.
  • The overall operation uses deferred execution: it doesn't iterate over either input sequence until something starts iterating over the result sequence.
  • When MoveNext is called on the result sequence for the first time, it immediately consumes the whole of the inner sequence, buffering it.
  • The outer sequence is streamed - it's only read one element at a time. By the time the result sequence has started yielding results from the second element of outer, it's forgotten about the first element.

We've started veering towards an implementation already, so let's think about tests.

What are we going to test?

I haven't bothered with argument validation tests this time - even with cut and paste, the 10 tests required to completely check everything feels like overkill.

However, I have tested:

  • Joining two invalid sequences, but not using the results (i.e. testing deferred execution)
  • The way that the two sequences are consumed
  • Using a custom comparer
  • Not specifying a comparer
  • Using sequences of different types

See the source code for more details, but here's a flavour - the final test:

[Test]
public void DifferentSourceTypes()
{
    int[] outer = { 5, 3, 7 };
    string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

    var query = outer.Join(inner,
                           outerElement => outerElement,
                           innerElement => innerElement.Length,
                           (outerElement, innerElement) => outerElement + ":" + innerElement);
    query.AssertSequenceEqual("5:tiger", "3:bee", "3:cat", "3:dog", "7:giraffe");
}

To be honest, the tests aren't very exciting. The implementation is remarkably simple though.

Let's implement it!

Trivial decision 1: make the comparer-less overload call the one with the comparer.

Trivial decision 2: use the split-method technique to validate arguments eagerly but defer the rest of the operation

That leaves us with the actual implementation in an iterator block, which is all I'm going to provide the code for here. So, what are we going to do?

Well, we know that we're going to have to read in the whole of the "inner" sequence - but let's wait a minute before deciding how to store it. We're then going to iterate over the "outer" sequence. For each item in the outer sequence, we need to find the key and then work out all the "inner" sequence items which match that key. Now, the idea of finding all the items in a sequence which match a particular key should sound familiar to you - that's exactly what a lookup is for. If we build a lookup from the inner sequence as our very first step, the rest becomes easy: we can fetch the sequence of matches, then iterate over them and yield the return value of calling result selector on the pair of elements.

All of this is easier to see in code than in words, so here's the method:

private static IEnumerable<TResult> JoinImpl<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = inner.ToLookup(innerKeySelector, comparer);
    foreach (var outerElement in outer)
    {
        var key = outerKeySelector(outerElement);
        foreach (var innerElement in lookup[key])
        {
            yield return resultSelector(outerElement, innerElement);
        }
    }
}

Personally, I think this is rather beautiful... in particular, I like the way that it uses every parameter exactly once. Everything is just set up to work nicely.

But wait, there's more... If you look at the nested foreach loops, that should remind you of something: for each outer sequence element, we're computing a nested sequence, then applying a delegate to each pair, and yielding the result. That's almost exactly the definition of SelectMany! If only we had a "yield foreach" or "yield!" I'd be tempted to use an implementation like this:

private static IEnumerable<TResult> JoinImpl<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = inner.ToLookup(innerKeySelector, comparer);
    // Warning: not really valid C#
    yield foreach outer.SelectMany(outerElement => lookup[outerKeySelector(outerElement)],
                                   resultSelector);
}

Unfortunately there's no such thing as a "yield foreach" statement. We can't just call SelectMany and return the result directly, because then we wouldn't be deferring execution. The best we can sensibly do is loop:

private static IEnumerable<TResult> JoinImpl<TOuter, TInner, TKey, TResult>(
    IEnumerable<TOuter> outer,
    IEnumerable<TInner> inner,
    Func<TOuter, TKey> outerKeySelector,
    Func<TInner, TKey> innerKeySelector,
    Func<TOuter, TInner, TResult> resultSelector,
    IEqualityComparer<TKey> comparer)
{
    var lookup = inner.ToLookup(innerKeySelector, comparer);
    var results = outer.SelectMany(outerElement => lookup[outerKeySelector(outerElement)],
                                   resultSelector);
    foreach (var result in results)
    {
        yield return result;
    }
}

At that stage, I think I prefer the original form - which is still pretty simple, thanks to the lookup.

While I have avoided using other operators for implementations in the past, in this case it feels so completely natural, it would be silly not to use ToLookup to implement Join.

One final point before I wrap up for the night...

Query expressions

Most of the operators I've been implementing recently don't have any direct mapping in C# query expressions. Join does, however. The final test I showed before can also be written like this:

int[] outer = { 5, 3, 7 };
string[] inner = { "bee", "giraffe", "tiger", "badger", "ox", "cat", "dog" };

var query = from x in outer
            join y in inner on x equals y.Length
            select x + ":" + y;

query.AssertSequenceEqual("5:tiger", "3:bee", "3:cat", "3:dog", "7:giraffe");

The compiler will effectively generate the same code (although admittedly I've used shorter range variable names here - x and y instead of outerElement and innerElement respectively). In this case the resultSelector delegate it supplies is simply the final projection from the "select" clause - but if we had anything else (such as a where clause) between the join and the select, the compiler would introduce a transparent identifier to propagate the values of both x and y through the query. It looks like I haven't blogged explicitly about transparent identifiers, although I cover them in C# in Depth. Maybe once I've finished actually implementing the operators, I'll have a few more general posts on this sort of thing.

Anyway, the point is that for simple join clauses (as opposed to join...into) we've implemented everything we need to. Hoorah.

Conclusion

I spoiled some of the surprise around how easy Join would be to implement by mentioning it in the ToLookup post, but I'm still impressed by how neat it is. Again I should emphasize that this is due to the design of LINQ - it's got nothing to do with my own prowess.

This won't be the end of our use of lookups though... the other grouping constructs can all use them too. I'll try to get them all out of the way before moving on to operators which feel a bit different...

Addendum

It turns out this wasn't quite as simple as I'd expected. Although ToLookup and GroupBy handle null keys without blinking, Join and GroupJoin ignore them. I had to write an alternative version of ToLookup which ignores null keys while populating the lookup, and then replace the calls of "ToLookup" in the code above with calls to "ToLookupNoNullKeys". This isn't documented anywhere, and is inconsistent with ToLookup/GroupBy. I've opened up a Connect issue about it, in the hope that it at least gets documented properly. (It's too late to change the behaviour now, I suspect.)

Posted by skeet | 6 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 18 - ToLookup

I've had a request to implement Join, which seems a perfectly reasonable operator to aim towards. However, it's going to be an awful lot easier to implement if we write ToLookup first. That will also help with GroupBy and GroupJoin, later on.

In the course of this post we'll create a certain amount of infrastructure - some of which we may want to tweak later on.

What is it?

ToLookup has four overloads, with these signatures:

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)

public static ILookup<TKey, TElement> ToLookup<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector)

public static ILookup<TKey, TElement> ToLookup<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)

Essentially these boil down to two required parameters and two optional ones:

  • The source is required, and must not be null
  • The keySelector is required, and must not be null
  • The elementSelector is optional, and defaults to an identity projection of a source element to itself. If it is specified, it must not be null.
  • The comparer is optional, and defaults to the default equality comparer for the key type. It may be null, which is equivalent to specifying the default equality comparer for the key type.

Now we can just consider the most general case - the final overload. However, in order to understand what ToLookup does, we have to know what ILookup<TKey, TElement> means - which in turn means knowing about IGrouping<TKey, TElement>:

public interface IGrouping<out TKey, out TElement> : IEnumerable<TElement>
{
    TKey Key { get; }
}

public interface ILookup<TKey, TElement> : IEnumerable<IGrouping<TKey, TElement>>
{
    int Count { get; }
    IEnumerable<TElement> this[TKey key] { get; }
    bool Contains(TKey key);
}

The generics make these interfaces seem somewhat scary at first sight, but they're really not so bad. A grouping is simply a sequence with an associated key. This is a pretty simple concept to transfer to the real world - something like "Plays by Oscar Wilde" could be an IGrouping<string, Play> with a key of "Oscar Wilde". The key doesn't have to be "embedded" within the element type though - it would also be reasonable to have an IGrouping<string, string> representing just the names of plays by Oscar Wilde.

A lookup is essentially a map or dictionary where each key is associated with a sequence of values instead of a single one. Note that the interface is read-only, unlike IDictionary<TKey, TValue>. As well as looking up a single sequence of values associated with a key, you can also iterate over the whole lookup in terms of groupings (instead of the key/value pair from a dictionary). There's one other important difference between a lookup and a dictionary: if you ask a lookup for the sequence corresponding to a key which it doesn't know about, it will return an empty sequence, rather than throwing an exception. (A key which the lookup does know about will never yield an empty sequence.)

One slightly odd point to note is that while IGrouping is covariant in TKey and TElement, ILookup is invariant in both of its type parameters. While TKey has to be invariant, it would be reasonable for TElement to be covariant - to go back to our "plays" example, an IGrouping<string, Play> could be sensibly regarded as an IGrouping<string, IWrittenWork> (with the obvious type hierarchy). However, the interface declarations above are the ones in .NET 4, so that's what I've used in Edulinq.

Now that we understand the signature of ToLookup, let's talk about what it actually does. Firstly, it uses immediate execution - the lookup returned by the method is effectively divorced from the original sequence; changes to the sequence after the method has returned won't change the lookup. (Obviously changes to the objects within the original sequence may still be seen, depending on what the element selector does, etc.) The rest is actually fairly straightforward, when you consider what parameters we have to play with:

  • The keySelector is applied to each item in the input sequence. Keys are always compared for equality using the "comparer" parameter.
  • The elementSelector is applied to each item in the input sequence, to project the item to the value which will be returned within the lookup.

To demonstrate this a little further, here are two applications of ToLookup - one using the first overload, and one using the last. In each case we're going to group plays by author and then display some information about them. Hopefully this will make some of the concepts I've described a little more concrete:

ILookup<string, Play> lookup = allPlays.ToLookup(play => play.Author);
foreach (IGrouping<string, Play> group in lookup)
{
    Console.WriteLine("Plays by {0}", group.Key);
    foreach (Play play in group)
    {
        Console.WriteLine("  {0} ({1})", play.Name, play.CopyrightDate);
    }
}

// Or...

ILookup<string, string> lookup = allPlays.ToLookup(play => play.Author,
                                                   play => play.Name,
                                                   StringComparer.OrdinalIgnoreCase);
foreach (IGrouping<string, string> group in lookup)
{
    Console.WriteLine("Plays by {0}:", group.Key);
    foreach (string playName in group)
    {
        Console.WriteLine("  {0}", playName);
    }
}
        
// And demonstrating looking up by a key:
IEnumerable<string> playsByWilde = lookup["Oscar Wilde"];
Console.WriteLine("Plays by Oscar Wilde: {0}",
                  string.Join("; ", playsByWilde));

Note that although I've used explicit typing for all the variables here, I would usually use implicit typing with var, just to keep the code more concise.

Now we just have the small matter of working out how to go from a key/element pair for each item, to a data structure which lets us look up sequences by key.

Before we move onto the implementation and tests, there are two aspects of ordering to consider:

  • How are the groups within the lookup ordered?
  • How are the elements within each group ordered?

The documentation for ToLookup is silent on these matters - but the docs for the closely-related GroupBy operator are more specific:

The IGrouping<TKey, TElement> objects are yielded in an order based on the order of the elements in source that produced the first key of each IGrouping<TKey, TElement>. Elements in a grouping are yielded in the order they appear in source.

Admittedly "in an order based on..." isn't as clear as it might be, but I think it's reasonable to make sure that our implementation yields groups such that the first group returned has the same key as the first element in the source, and so on.

What are we going to test?

I've actually got relatively few tests this time. I test each of the overloads, but not terribly exhaustively. There are three tests worth looking at though. The first two show the "eager" nature of the operator, and the final one demonstrates the most complex overload by grouping people into families.

[Test]
public void SourceSequenceIsReadEagerly()
{
    var source = new ThrowingEnumerable();
    Assert.Throws<InvalidOperationException>(() => source.ToLookup(x => x));
}

[Test]
public void ChangesToSourceSequenceAfterToLookupAreNotNoticed()
{
    List<string> source = new List<string> { "abc" };
    var lookup = source.ToLookup(x => x.Length);
    Assert.AreEqual(1, lookup.Count);

    // Potential new key is ignored
    source.Add("x");
    Assert.AreEqual(1, lookup.Count);

    // Potential new value for existing key is ignored
    source.Add("xyz");
    lookup[3].AssertSequenceEqual("abc");
}


[Test]
public void LookupWithComparareAndElementSelector()
{
    var people = new[] {
        new { First = "Jon", Last = "Skeet" },
        new { First = "Tom", Last = "SKEET" }, // Note upper-cased name
        new { First = "Juni", Last = "Cortez" },
        new { First = "Holly", Last = "Skeet" },
        new { First = "Abbey", Last = "Bartlet" },
        new { First = "Carmen", Last = "Cortez" },                
        new { First = "Jed", Last = "Bartlet" }
    };

    var lookup = people.ToLookup(p => p.Last, p => p.First, StringComparer.OrdinalIgnoreCase);

    lookup["Skeet"].AssertSequenceEqual("Jon", "Tom", "Holly");
    lookup["Cortez"].AssertSequenceEqual("Juni", "Carmen");
    // The key comparer is used for lookups too
    lookup["BARTLET"].AssertSequenceEqual("Abbey", "Jed");

    lookup.Select(x => x.Key).AssertSequenceEqual("Skeet", "Cortez", "Bartlet");
}

I'm sure there are plenty of other tests I could have come up with. lf you want to see any ones in particular (either as an edit to this post if they already exist in source control, or as entirely new tests), please leave a comment.

EDIT: It turned out there were a few important tests missing... see the addendum.

Let's implement it!

There's quite a lot of code involved in implementing this - but it should be reusable later on, potentially with a few tweaks. Let's get the first three overloads out of the way to start with:

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)
{
    return source.ToLookup(keySelector, element => element, EqualityComparer<TKey>.Default);
}

public static ILookup<TKey, TSource> ToLookup<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IEqualityComparer<TKey> comparer)
{
    return source.ToLookup(keySelector, element => element, comparer);
}

public static ILookup<TKey, TElement> ToLookup<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector)
{
    return source.ToLookup(keySelector, elementSelector, EqualityComparer<TKey>.Default);
}

Now we can just worry about the final one. Before we implement that though, we're definitely going to need an implementation of IGrouping. Let's come up with a really simple one to start with:

internal sealed class Grouping<TKey, TElement> : IGrouping<TKey, TElement>
{
    private readonly TKey key;
    private readonly IEnumerable<TElement> elements;

    internal Grouping(TKey key, IEnumerable<TElement> elements)
    {
        this.key = key;
        this.elements = elements;
    }

    public TKey Key { get { return key; } }

    public IEnumerator<TElement> GetEnumerator()
    {
        return elements.GetEnumerator();
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

How do we feel about this? Well, groupings should be immutable. We have no guarantee that the "values" sequence won't be changed after creation - but it's an internal class, so we've just got to be careful about how we use it. We also have to be careful that we don't use a mutable sequence type which allows a caller to get from the iterator (the IEnumerator<TElement> returned by GetEnumerator) back to the sequence and then mutate it. In our case we're actually going to use List<T> to provide the sequences, and while List<T>.Enumerator is a public type, it doesn't expose the underlying list. Of course a caller could use reflection to mess with things, but we're not going to try to protect against that.

Okay, so now we can pair a sequence with a key... but we still need to implement ILookup. This is where there are multiple options. We want our lookup to be immutable, but there are various degrees of immutability we could use:

  • Mutable internally, immutable in public API
  • Mutable privately, immutable to the internal API
  • Totally immutable, even within the class itself

The first option is the simplest to implement, and it's what I've gone for at the moment. I've created a Lookup class which is allows a key/element pair to be added to it from within the Edulinq assembly. It uses a Dictionary<TKey, List<TElement>> to map the keys to sequences efficiently, and a List<TKey> to remember the order in which we first saw the keys. Here's the complete implementation:

internal sealed class Lookup<TKey, TElement> : ILookup<TKey, TElement>
{
    private readonly Dictionary<TKey, List<TElement>> map;
    private readonly List<TKey> keys;

    internal Lookup(IEqualityComparer<TKey> comparer)
    {
        map = new Dictionary<TKey, List<TElement>>(comparer);
        keys = new List<TKey>();
    }

    internal void Add(TKey key, TElement element)
    {
        List<TElement> list;
        if (!map.TryGetValue(key, out list))
        {
            list = new List<TElement>();
            map[key] = list;
            keys.Add(key);
        }
        list.Add(element);
    }

    public int Count
    {
        get { return map.Count; }
    }

    public IEnumerable<TElement> this[TKey key]
    {
        get
        {
            List<TElement> list;
            if (!map.TryGetValue(key, out list))
            {
                return Enumerable.Empty<TElement>();
            }
            return list.Select(x => x);
        }
    }

    public bool Contains(TKey key)
    {
        return map.ContainsKey(key);
    }

    public IEnumerator<IGrouping<TKey, TElement>> GetEnumerator()
    {
        return keys.Select(key => new Grouping<TKey, TElement>(key, map[key]))
                   .GetEnumerator();
    }
        
    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

All of this is really quite straightforward. Note that we provide an equality comparer to the constructor, which is then passed onto the dictionary - that's the only thing that needs to know how to compare keys.

There are only two points of interest, really:

  • In the indexer, we don't return the list itself - that would allow the caller to mutate it, after casting it to List<TElement>. Instead, we just call Select with an identity projection, as a simple way of inserting a sort of "buffering" layer between the list and the caller. There are other ways of doing this, of course - including implementing the indexer with an iterator block.
  • In GetEnumerator, we're retaining the key order by using our list of keys and performing a lookup on each key.

We're currently creating the new Grouping objects lazily - which will lead to fewer of them being created if the caller doesn't actually iterate over the lookup, but more of them if the caller iterates over it several times. Again, there are alternatives here - but without any good information about where to make the trade-off, I've just gone for the simplest code which works for the moment.

One last thing to note about Lookup - I've left it internal. In .NET, it's actually public - but the only way of getting at an instance of it is to call ToLookup and then cast the result. I see no particular reason to make it public, so I haven't.

Now we're finally ready to implement the last ToLookup overload - and it becomes pretty much trivial:

public static ILookup<TKey, TElement> ToLookup<TSource, TKey, TElement>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    Func<TSource, TElement> elementSelector,
    IEqualityComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    if (elementSelector == null)
    {
        throw new ArgumentNullException("elementSelector");
    }

    Lookup<TKey, TElement> lookup = new Lookup<TKey, TElement>(comparer ?? EqualityComparer<TKey>.Default);

    foreach (TSource item in source)
    {
        TKey key = keySelector(item);
        TElement element = elementSelector(item);
        lookup.Add(key, element);
    }
    return lookup;
}

When you look past the argument validation, we're just creating a Lookup, populating it, and then returning it. Simple.

Thread safety

Something I haven't addressed anywhere so far is thead safety. In particular, although all of this is nicely immutable when viewed from a single thread, I have a nasty feeling that theoretically, if the return value of our implementation of ToLookup was exposed to another thread, it could potentially observe the internal mutations we're making here, as we're not doing anything special in terms of the memory model here.

I'm basically scared by lock-free programming these days, unless I'm using building blocks provided by someone else. While investigating the exact guarantees offered here would be interesting, I don't think it would really help with our understanding of LINQ as a whole. I'm therefore declaring thread safety to be out of scope for Edulinq in general :)

Conclusion

So that's ToLookup. Two new interfaces, two new classes, all for one new operator... so far. We can reuse almost all of this in Join though, which will make it very simple to implement. Stay tuned...

Addendum

It turns out that I missed something quite important: ToLookup has to handle null keys, as do various other LINQ operators (GroupBy etc). We're currently using a Dictionary<TKey, TValue> to organize the groups... and that doesn't support null keys. Oops.

So, first steps: write some tests proving that it fails as we expect it to. Fetch by a null key of a lookup. Include a null key in the source of a lookup. Use GroupBy, GroupJoin and Join with a null key. Watch it go bang. That's the easy part...

Now, we can do all the special-casing in Lookup itself - but it gets ugly. Our Lookup code was pretty simple before; it seems a shame to spoil it with checks everywhere. What we really need is a dictionary which does support null keys. Step forward NullKeyFriendlyDictionary, a new internal class in Edulinq. Now you might expect this to implement IDictionary<TKey, TValue>, but it turns out that's a pain in the neck. We hardly use any of the members of Dictionary - TryGetValue, the indexer, ContainsKey, and the Count property. That's it! So those are the only members I've implemented.

The class contains a Dictionary<TKey, TValue> to delegate most requests to, and it just handles null keys itself. Here's a quick sample:

internal sealed class NullKeyFriendlyDictionary<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> map;
    private bool haveNullKey = false;
    private TValue valueForNullKey;

    internal NullKeyFriendlyDictionary(IEqualityComparer<TKey> comparer)
    {
        map = new Dictionary<TKey, TValue>(comparer);
    }

    internal bool TryGetValue(TKey key, out TValue value)
    {
        if (key == null)
        {
            // This will be default(TValue) if haveNullKey is false,
            // which is what we want.
            value = valueForNullKey;
            return haveNullKey;
        }
        return map.TryGetValue(key, out value);
    }

    // etc
}

There's one potential flaw here, that I can think of: if you provide an IEqualityComparer<TKey> which treats some non-null key as equal to a null key, we won't spot that. If your source then contains both those keys, we'll end up splitting them into two groups instead of keeping them together. I'm not too worried about that - and I suspect there are all kinds of ways that could cause problems elsewhere anyway.

With this in place, and Lookup adjusted to use NullKeyFriendlyDictionary instead of just Dictionary, all the tests pass. Hooray!

At the same time as implementing this, I've tidied up Grouping itself - it now implements IList<T> itself, in a way which is immutable to the outside world. The Lookup now contains groups directly, and can return them very easily. The code is generally tidier, and anything using a group can take advantage of the optimizations applied to IList<T> - particularly the Count() operator, which is often applied to groups.

Posted by skeet | 8 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 17 - Except

I'm really pleased with this one. Six minutes ago (at the time of writing this), I tweeted about the Intersect blog post. I then started writing the tests and implementation for Except... and I'm now done.

The tests are cut/paste/replace with a few tweaks - but it's the implementation that I'm most pleased with. You'll see what I mean later - it's beautiful.

What is it?

Except is our final set operator, with the customary two overloads:

public static IEnumerable<TSource> Except<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)

public static IEnumerable<TSource> Except<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)

The result of calling Except is the elements of the first sequence which are not in the second sequence.

Just for completeness, here's a quick summary of the behaviour:

  • The first overload uses the default equality comparer for TSource, and the second overload does if you pass in "null" as the comparer, too.
  • Neither "first" nor "second" can be null
  • The method does use deferred execution
  • The result sequence only contains distinct elements; even if the first sequence contains duplicates, the result sequence won't

This time, the documentation doesn't say how "first" and "second" will be used, other than to describe the result as a set difference. In practice though, it's exactly the same as Intersect: when we ask for the first result, the "second" input sequence is fully evaluated, then the "first" input sequence is streamed.

What are we going to test?

I literally cut and paste the tests for Intersect, did a find/replace on Intersect/Except, and then looked at the data in each test. In particular, I made sure that there were potential duplicates in the "first" input sequence which would have to be removed in the result sequence. I also tweaked the details of the data in the final tests shown before (which proved the way in which the two sequences were read) but the main thrust of the tests are the same.

Nothing to see here, move on...

Let's implement it!

I'm not even going to bother showing the comparer-free overload this time. It just calls the other overload, as you'd expect. Likewise the argument validation part of the implementation is tedious. Let's focus on the part which does the work. First, we'll think back to Distinct and Intersect:

  • In Distinct, we started with an empty set and populated it as we went along, making sure we never returned anything already in the set
  • In Intersect, we started with a populated set (from the second input sequence) and removed elements from it as we went along, only returning elements where an equal element was previously in the set

Except is simply a cross between the two: from Distinct we keep the idea of a "banned" set of elements that we add to; from Intersect we take the idea of starting off with a set populated from the second input element. Here's the implementation:

private static IEnumerable<TSource> ExceptImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    HashSet<TSource> bannedElements = new HashSet<TSource>(second, comparer);
    foreach (TSource item in first)
    {
        if (bannedElements.Add(item))
        {
            yield return item;
        }
    }
}

The only differences between this and Intersect are:

  • The name of the method (ExceptImpl instead of IntersectImpl)
  • The name of the local variable holding the set (bannedElements instead of potentialElements)
  • The method called in the loop (Add instead of Remove)

Isn't that just wonderful? Perhaps it shouldn't make me quite as happy as it does, but hey...

Conclusion

That concludes the set operators - and indeed my blogging for the night. It's unsurprising that all of the set operators have used a set implementation internally... but I've been quite surprised at just how simple the implementations all were. Again, the joy of LINQ resides in the ability for such simple operators to be combined in useful ways.

Posted by skeet | 2 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 16 - Intersect (and build fiddling)

Okay, this is more like it - after the dullness of Union, Intersect has a new pattern to offer... and one which we'll come across repeatedly.

First, however, I should explain some more changes I've made to the solution structure...

Building the test assembly

I've just had an irritating time sorting out something I thought I'd fixed this afternoon. Fed up of accidentally testing against the wrong implementation, my two project configurations ("Normal LINQ" and "Edulinq implementation") now target different libraries from the test project: only the "Normal LINQ" configuration refers to System.Core, and only the "Edulinq implementation" configuration refers to the Edulinq project itself. Or so I thought. Unfortunately, msbuild automatically adds System.Core in unless you're careful. I had to add this into the "Edulinq implementation" property group part of my project file to avoid accidentally pulling in System.Core:

<AddAdditionalExplicitAssemblyReferences>false</AddAdditionalExplicitAssemblyReferences>

Unfortunately, at that point all the extension methods I'd written within the tests project - and the references to HashSet<T> - failed. I should have noticed them not failing before, and been suspicious. Hey ho.

Now I'm aware that you can add your own version of ExtensionAttribute, but I believe that can become a problem if at execution time you do actually load System.Core... which I will end up doing, as the Edulinq assembly itself references it (for HashSet<T> aside from anything else).

There may well be multiple solutions to this problem, but I've come up with one which appears to work:

  • Add an Edulinq.TestSupport project, and refer to that from both configurations of Edulinq.Tests
  • Make Edulinq.TestSupport refer to System.Core so it can use whatever collections it likes, as well as extension methods
  • Put all the non-test classes (those containing extension methods, and the weird and wonderful collections) into the Edulinq.TestSupport project.
  • Add a DummyClass to Edulinq.TestSupport in the namespace System.Linq, so that using directives within Edulinq.Tests don't need to be conditionalized

All seems to be working now - and finally, if I try to refer to an extension method I haven't implemented in Edulinq yet, it will fail to compile instead of silently using the System.Core one. Phew. Now, back to Intersect...

What is it?

Intersect has a now-familiar pair of overloads:

public static IEnumerable<TSource> Intersect<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)

public static IEnumerable<TSource> Intersect<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)

Fairly obviously, it computes the intersection of two sequences: the elements that occur in both "first" and "second". Points to note:

  • Again, the first overload uses the default equality comparer for TSource, and the second overload does if you pass in "null" as the comparer, too.
  • Neither "first" nor "second" can be null
  • The method does use deferred execution - but in a way which may seem unusual at first sight
  • The result sequence only contains distinct elements - even if an element occurs multiple times in both input sequences, it will only occur once in the result

Now for the interesting bit - exactly when the two sequences are used. MSDN claims this:

When the object returned by this method is enumerated, Intersect enumerates first, collecting all distinct elements of that sequence. It then enumerates second, marking those elements that occur in both sequences. Finally, the marked elements are yielded in the order in which they were collected.

This is demonstrably incorrect. Indeed, I have a test which would fail under LINQ to Objects if this were the case. What actually happens is this:

  • Nothing happens until the first result element is requested. I know I've said so already, but it's worth repeating.
  • As soon as the first element of the result is requested, the whole of the "second" input sequence is read, as well as the first (and possibly more) elements of the "first" input sequence - enough to return the first result, basically.
  • Results are read from the "first" input sequence as and when they are required. Only elements which were originally in the "second" input sequence and haven't already been yielded are returned.

We'll see this pattern of "greedily read the second sequence, but stream the first sequence" again when we look at Join... but let's get back to the tests.

What are we going to test?

Obviously I've got simple tests including:

  • Argument validation
  • Elements which occur multiple times in each sequence
  • The overload without a comparer specified
  • Specifying a null comparer
  • Specifying a "custom" comparer (I'm using the case-insensitive string comparer again; news at 11)

However, the interesting tests are the ones which show how the sequences are actually consumed. Here they are:

[Test]
public void NoSequencesUsedBeforeIteration()
{
    var first = new ThrowingEnumerable();
    var second = new ThrowingEnumerable();
    // No exceptions!
    var query = first.Union(second);
    // Still no exceptions... we're not calling MoveNext.
    using (var iterator = query.GetEnumerator())
    {
    }
}

[Test]
public void SecondSequenceReadFullyOnFirstResultIteration()
{
    int[] first = { 1 };
    var secondQuery = new[] { 10, 2, 0 }.Select(x => 10 / x);

    var query = first.Intersect(secondQuery);
    using (var iterator = query.GetEnumerator())
    {
        Assert.Throws<DivideByZeroException>(() => iterator.MoveNext());
    }
}

[Test]
public void FirstSequenceOnlyReadAsResultsAreRead()
{
    var firstQuery = new[] { 10, 2, 0, 2 }.Select(x => 10 / x);
    int[] second = { 1 };

    var query = firstQuery.Intersect(second);
    using (var iterator = query.GetEnumerator())
    {
        // We can get the first value with no problems
        Assert.IsTrue(iterator.MoveNext());
        Assert.AreEqual(1, iterator.Current);

        // Getting at the *second* value of the result sequence requires
        // reading from the first input sequence until the "bad" division
        Assert.Throws<DivideByZeroException>(() => iterator.MoveNext());
    }
}

The first test just proves that execution really is deferred. That's straightforward.

The second test proves that the "second" input sequence is completely read as soon as we ask for our first result. If the operator had really read "first" and then started reading "second", it could have yielded the first result (1) without throwing an exception... but it didn't.

The third test proves that the "first" input sequence isn't read in its entirety before we start returning results. We manage to read the first result with no problems - it's only when we ask for the second result that we get an exception.

Let's implement it!

I have chosen to implement the same behaviour as LINQ to Objects rather than the behaviour described by MSDN, because it makes more sense to me. In general, the "first" sequence is given more importance than the "second" sequence in LINQ: it's generally the one which is streamed, with the second sequence being buffered if necessary.

Let's start off with the comparer-free overload - as before, it will just call the other one:

public static IEnumerable<TSource> Intersect<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    return Intersect(first, second, EqualityComparer<TSource>.Default);
}

Now for the more interesting part. Obviously we'll have an argument-validating method, but what should we do in the guts? I find the duality between this and Distinct fascinating: there, we started with an empty set of elements, and tried to add each source element to it, yielding the element if we successfully added it (meaning it wasn't there before).

This time, we've effectively got a limited set of elements which we can yield, but we only want to yield each of them once - so as we see items, we can remove them from the set, yielding only if that operation was successful. The initial set is formed from the "second" input sequence, and then we just iterate over the "first" input sequence, removing and yielding appropriately:

public static IEnumerable<TSource> Intersect<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{           
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }
    return IntersectImpl(first, second, comparer ?? EqualityComparer<TSource>.Default);
}

private static IEnumerable<TSource> IntersectImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    HashSet<TSource> potentialElements = new HashSet<TSource>(second, comparer);
    foreach (TSource item in first)
    {
        if (potentialElements.Remove(item))
        {
            yield return item;
        }
    }
}

Next time we'll see a cross between the two techniques.

Ta-da - it all works as expected. I don't know whether this is how Intersect really works in LINQ to Objects, but I expect it does something remarkably similar.

Conclusion

After suffering from some bugs earlier today where my implementation didn't follow the documentation, it's nice to find some documentation which doesn't follow the real implementation :)

Seriously though, there's an efficiency point to be noted here. If you have two sequences, one long and one short, then it's much more efficient (mostly in terms of space) to use longSequence.Intersect(shortSequence) than shortSequence.Intersect(longSequence). The latter will require the whole of the long sequence to be in memory at once.

Next up - and I might just manage it tonight - our final set operator, Except.

Posted by skeet | 8 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 15 - Union

I'm continuing the set-based operators with Union. I may even have a couple of hours tonight - possibly enough to finish off Intersect and Except as well... let's see.

What is it?

Union is another extension method with two overloads; one taking an equality comparer and one just using the default:

public static IEnumerable<TSource> Union<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)

public static IEnumerable<TSource> Union<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)

This "two overloads, one taking an equality comparer" pattern is familiar from Distinct; we'll see that and Intersect and Except do exactly the same thing.

Simply put, Union returns the union of the two sequences - all items that are in either input sequence. The result sequence has no duplicates in even if one of the input sequences contains duplicates. (I'm using the term "duplicate" here to mean an element which is equal to another according to the equality comparer we're using in the operator.)

Characteristics:

  • Union uses deferred execution: argument validation is basically all that happens when the method is first called; it only starts iterating over the input sequences when you iterate over the result sequence
  • Neither first nor second can be null; the comparer argument can be null, in which case the default equality comparer is used
  • The input sequences are only read as and when they're needed; to return the first result element, only the first input element is read

It's worth noting that the documentation for Union specifies a lot more than the Distinct documentation does:

When the object returned by this method is enumerated, Union enumerates first and second in that order and yields each element that has not already been yielded.

To me, that actually looks like a guarantee of the rules I proposed for Distinct. In particular, it's guaranteeing that the implementation iterates over "first" before "second", and if it's yielding elements as it goes, that guarantees that distinct elements will retain their order from the original input sequences. Whether others would read it in the same way or not, I can't say... input welcome.

What are we going to test?

I've written quite a few tests for Union - possibly more than we really need, but they demonstrate a few points of usage. The tests are:

  • Arguments are validated eagerly
  • Finding the union of two sequences without specifying a comparer; both inputs have duplicate elements, and there's one element in both
  • The same test as above but explicitly specifying null as the comparer, to force the default to be used
  • The same test as above but using a case-insensitive string comparer
  • Taking the union of an empty sequence with a non-empty one
  • Taking the union of a non-empty sequence with an empty one
  • Taking the union of two empty sequences
  • Proving that the first sequence isn't used until we start iterating over the result sequence (using ThrowingEnumerable)
  • Proving that the second sequence isn't used until we've exhausted the first

No new collections or comparers needed this time though - it's all pretty straightforward. I haven't written any tests for null elements this time - I'm convinced enough by what I saw when implementing Distinct to believe they won't be a problem.

Let's implement it!

First things first: we can absolutely implement the simpler overload in terms of the more complex one, and I'll do the same for Except and Intercept. Here's the Union method:

public static IEnumerable<TSource> Union<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    return Union(first, second, EqualityComparer<TSource>.Default);
}

So how do we implement the more complex overload? Well, I've basically been a bit disappointed by Union in terms of its conceptual weight. It doesn't really give us anything that the obvious combination of Concat and Distinct doesn't - so let's implement it that way first:

public static IEnumerable<TSource> Union<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    return first.Concat(second).Distinct(comparer);
}

The argument validation can be implemented by Concat with no problems here - the parameters have the same name, so any exceptions thrown will be fine in every way.

That's how I think about Union, but as I've mentioned before, I'd rather not actually see Concat and Distinct showing up in the stack trace - so here's the fuller implementation.

public static IEnumerable<TSource> Union<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }
    return UnionImpl(first, second, comparer ?? EqualityComparer<TSource>.Default);
}

private static IEnumerable<TSource> UnionImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    HashSet<TSource> seenElements = new HashSet<TSource>(comparer);
    foreach (TSource item in first)
    {
        if (seenElements.Add(item))
        {
            yield return item;
        }
    }
    foreach (TSource item in second)
    {
        if (seenElements.Add(item))
        {
            yield return item;
        }
    }
}

That feels like an absurd waste of code when we can achieve the same result so simply, admittedly. This is the first time my resolve against implementing one operator in terms of completely different ones has wavered. Just looking at it in black and white (so to speak), I'm close to going over the edge...

Conclusion

Union was a disappointingly bland operator in my view. (Maybe I should start awarding operators marks out of ten for being interesting, challenging etc.) It doesn't feel like it's really earned its place in LINQ, as calls to Concat/Distinct can replace it so easily. Admittedly as I've mentioned in several other places, a lot of operators can be implemented in terms of each other - but rarely quite so simply.

Still, I think Intersect and Except should be more interesting.

Posted by skeet | 5 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 14 - Distinct

I'm going to implement the set-based operators next, starting with Distinct.

What is it?

Distinct has two overloads, which differ only in the simplest possible way:

public static IEnumerable<TSource> Distinct<TSource>(
    this IEnumerable<TSource> source)

public static IEnumerable<TSource> Distinct<TSource>(
    this IEnumerable<TSource> source,
    IEqualityComparer<TSource> comparer)

The point of the operator is straightforward: the result sequence contains the same items as the input sequence, but with the duplicates removed - so an input of { 0, 1, 3, 1, 5 } will give a result sequence of { 0, 1, 3, 5 } - the second occurrence of 1 is ignored.

This time I've checked and double-checked the documentation - and in this case it really is appropriate to think of the first overload as just a simplified version of the second. If you don't specify an equality comparer, the default comparer for the type will be used. The same will happen if you pass in a null reference for the comparer. The default equality comparer for a type can be obtained with the handy EqualityComparer<T>.Default property.

Just to recap, an equality comparer (represented by the IEqualityComparer<T> interface) is able to do two things:

  • Determine the hash code for a single item of type T
  • Compare any two items of type T for equality

It doesn't have to give any sort of ordering - that's what IComparer<T> is for, although that doesn't have the ability to provide a hash code.

One interesting point about IEqualityComparer<T> is that the GetHashCode() method is meant to throw an exception if it's provided with a null argument, but in practice the EqualityComparer<T>.Default implementations appear not to. This leads to an interesting question about Distinct: how should it handle null elements? It's not documented either way, but in reality both the LINQ to Objects implementation and the simplest way of implementing it ourselves simply throws a NullReferenceException if you use a not-null-safe comparer and have a null element present. Note that the default equality comparer for any type (EqualityComparer<T>.Default) does cope with nulls.

There are other undocumented aspects of Distinct, too. Both the ordering of the result sequence and the choice of which exact element is returned when there are equal options are unspecified. In the case of ordering, it's explicitly unspecified. From the documentation: "The Distinct method returns an unordered sequence that contains no duplicate values." However, there's a natural approach which answers both of these questions. Distinct is specified to use deferred execution (so it won't look at the input sequence until you start reading from the output sequence) but it also streams the results to some extent: to return the first element in the result sequence, it only needs to read the first element from the input sequence. Some other operators (such as OrderBy) have to read all their data before yielding any results.

When you implement Distinct in a way which only reads as much data as it has to, the answer to the ordering and element choice is easy:

  • The result sequence is in the same order as the input sequence
  • When there are multiple equal elements, it is the one which occurs earliest in the input sequence which is returned as part of the result sequence.

Remember that it's perfectly possible to have elements which are considered equal under a particular comparer, but are still clearly different when looked at another way. The simplest example of this is case-insensitive string equality. Taking the above rules into account, the distinct sequence returned for { "ABC", "abc", "xyz" } with a case-insensitive comparer is { "ABC", "xyz" }.

What are we going to test?

All of the above :)

All the tests use sequences of strings for clarity, but I'm using four different comparers:

  • The default string comparer (which is a case-sensitive ordinal comparer)
  • The case-insensitive ordinal comparer
  • A comparer which uses object identity (so will treat two equal but distinct strings as different)
  • A comparer which explicitly doesn't try to cope with null values

The tests assume that the undocumented aspects listed above are implemented with the rules that I've given. This means they're over-sensitive, in that an implementation of Distinct which matches all the documented behaviour but returns elements in a different order would fail the tests. This highlights an interesting aspect of unit testing in general... what exactly are we trying to test? I can think of three options in our case:

  • Just the documented behaviour: anything conforming to that, however oddly, should pass
  • The LINQ to Objects behaviour: the framework implementation should pass all our tests, and then our implementation should as well
  • Our implementation's known (designed) behaviour: we can specify that our implementation will follow particular rules above and beyond the documented contracts

In production projects, these different options are valid in different circumstances, depending on exactly what you're trying to do. At the moment, I don't have any known differences in behaviour between LINQ to Objects and Edulinq, although that may well change later in terms of optimizations.

None of the tests themselves are particularly interesting - although I find it interesting that I had to implement a deliberately fragile (but conformant) implementation of IEqualityComparer<T> in order to test Distinct fully.

Let's implement it!

I'm absolutely confident in implementing the overload that doesn't take a custom comparer using the one that does. We have two options for how to specify the custom comparer in the delegating call though - we could pass null or EqualityComparer<T>.Default, as the two are explicitly defined to behave the same way in the second overload. I've chosen to pass in EqualityComparer<T>.Default just for the sake of clarity - it means that anyone reading the first method doesn't need to check the behavior of the second to understand what it will do.

We need to use the "private iterator block method" approach again, so that the arguments can be evaluated eagerly but still let the result sequence use deferred execution. The real work method uses HashSet<T> to keep track of all the elements we've already returned - it takes an IEqualityComparer<T> in its constructor, and the Add method adds an element to the set if there isn't already an equal one, and returns whether or not it really had to add anything. All we have to do is iterate over the input sequence, call Add, and yield the item as part of the result sequence if Add returned true. Simple!

public static IEnumerable<TSource> Distinct<TSource>(
    this IEnumerable<TSource> source)
{
    return source.Distinct(EqualityComparer<TSource>.Default);
}

public static IEnumerable<TSource> Distinct<TSource>(
    this IEnumerable<TSource> source,
    IEqualityComparer<TSource> comparer)
{
    if (source == null
    {
        throw new ArgumentNullException("source");
    }
    return DistinctImpl(source, comparer ?? EqualityComparer<TSource>.Default);
}

private static IEnumerable<TSource> DistinctImpl<TSource>(
    IEnumerable<TSource> source,
    IEqualityComparer<TSource> comparer)
{
    HashSet<TSource> seenElements = new HashSet<TSource>(comparer);
    foreach (TSource item in source)
    {
        if (seenElements.Add(item))
        {
            yield return item;
        }
    }
}

So what about the behaviour with nulls? Well, it seems that HashSet<T> just handles that automatically, if the comparer it uses does. So long as the comparer returns the same hash code each time it's passed null, and considers null and null to be equal, it can be present in the sequence. Without HashSet<T>, we'd have had a much uglier implementation - especially as Dictionary<TKey, TValue> doesn't allow null keys.

Conclusion

I'm frankly bothered by the lack of specificity in the documentation for Distinct. Should you rely on the ordering rules that I've given here? I think that in reality, you're reasonably safe to rely on it - it's the natural result of the most obvious implementation, after all. I wouldn't rely on the same results when using a different LINQ provider, mind you - when fetching the results back from a database, for example, I wouldn't be at all surprised to see the ordering change. And of course, the fact that t documentation explicitly states that the result is unordered should act as a bit of a deterrent from relying on this.

We'll have to make similar decisions for the other set-based operators: Union, Intersect and Except. And yes, they're very likely to use HashSet<T> too...

Posted by skeet | 7 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 13 - Aggregate

EDIT: I've had to edit this quite a bit now that a second bug was discovered... basically my implementation of the first overload was completely broken :(

Last night's tweet asking for suggestions around which operator to implement next resulted in a win for Aggregate, so here we go.

What is it?

Aggregate has three overloads, effectively allow two defaults:

public static TSource Aggregate<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, TSource, TSource> func)

public static TAccumulate Aggregate<TSource, TAccumulate>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func)

public static TResult Aggregate<TSource, TAccumulate, TResult>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func,
    Func<TAccumulate, TResult> resultSelector)

Aggregate is an extension method using immediate execution, returning a single result. The generalised behaviour is as follows:

  • Start off with a seed. For the first overload, this defaults to the first value of the input sequence. The seed is used as the first accumulator value.
  • For each item in the list, apply the aggregation function, which takes the current accumulator value and the newly found item, and returns a new accumulator value.
  • Once the sequence has been exhausted, optionally apply a final projection to obtain a result. If no projection has been specified, we can imagine that the identity function has been provided.

The signatures make all of this look a bit more complicated because of the various type parameters involved. You can consider all the overloads as dealing with three different types, even though the first two actually have fewer type parameters:

  • TSource is the element type of the sequence, always.
  • TAccumulate is the type of the accumulator - and thus the seed. For the first overload where no seed is provided, TAccumulate is effectively the same as TSource.
  • TResult is the return type when there's a final projection involved. For the first two overloads, TResult is effectively the same as TAccumulate (again, think of a default "identity projection" as being used when nothing else is specified)

In the first overload, which uses the first input element as the seed, an InvalidOperationException is thrown if the input sequence is empty.

What are we going to test?

Obviously the argument validation is reasonably simple to test - source, func and resultSelector can't be null. But there are two different approaches to testing the "success" cases.

We could work out exactly when each delegate should be called and with what values - effectively mock every step of the iteration. This would be a bit of a pain, but a very robust way of proceeding.

The alternative approach is just to take some sample data and aggregation function, work out what the result should be, and assert that result. If the result is sufficiently unlikely to be achieved by chance, this is probably good enough - and it's a lot simpler to implement. Here's a sample from the most complicated test, where we have a seed and a final projection:

[Test]
public void SeededAggregationWithResultSelector()
{
    int[] source = { 1, 4, 5 };
    int seed = 5;
    Func<int, int, int> func = (current, value) => current * 2 + value;
    Func<int, string> resultSelector = result => result.ToInvariantString();
    // First iteration: 5 * 2 + 1 = 11
    // Second iteration: 11 * 2 + 4 = 26
    // Third iteration: 26 * 2 + 5 = 57
    // Result projection: 57.ToString() = "57"
    Assert.AreEqual("57", source.Aggregate(seed, func, resultSelector));
}

Now admittedly I'm not testing this to the absolute full - I'm using the same types for TSource and TAccumulate - but frankly it gives me plenty of confidence that the implementation is correct.

EDIT: My result selector now calls ToInvariantString. It used to just call ToString, but as I've now been persuaded that there are some cultures where that wouldn't give us the right results, I've implemented an extension method which effectively means that x.ToInvariantString() is equivalent to x.ToString(CultureInfo.InvariantCulture) - so we don't need to worry about cultures with different numeric representations etc.

Just for the sake of completeness (I've convinced myself to improve the code while writing this blog post), here's an example which sums integers, but results in a long - so it copes with a result which is bigger than Int32.MaxValue. I haven't bothered with a final projection though:

[Test]
public void DifferentSourceAndAccumulatorTypes()
{
    int largeValue = 2000000000;
    int[] source = { largeValue, largeValue, largeValue };
    long sum = source.Aggregate(0L, (acc, value) => acc + value);
    Assert.AreEqual(6000000000L, sum);
    // Just to prove we haven't missed off a zero...
    Assert.IsTrue(sum > int.MaxValue); 
}

Since I first wrote this post, I've also added tests for empty sequences (where the first overload should throw an exception) and a test which relies on the first overload using the first element of the sequence as the seed, rather than the default value of the input sequence's element type.

Okay, enough about the testing... what about the real code?

Let's implement it!

I'm still feeling my way around when it's a good idea to implement one method by using another, but at the moment my gut feeling is that it's okay to do so when:

  • You're implementing one operator by reusing another overload of the same operator; in other words, no unexpected operators will end up in the stack trace of callers
  • There are no significant performance penalties for doing so
  • The observed behaviour is exactly the same - including argument validation
  • The code ends up being simpler to understand (obviously)

Contrary to an earlier version of this post, the first overload can't be implemented in terms of the second or third ones, because of its behaviour regarding the seed and empty sequences.

public static TSource Aggregate<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, TSource, TSource> func)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (func == null)
    {
        throw new ArgumentNullException("func");
    }

    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Source sequence was empty");
        }
        TSource current = iterator.Current;
        while (iterator.MoveNext())
        {
            current = func(current, iterator.Current);
        }
        return current;
    }
}

It still makes sense to share an implementation for the second and third overloads though. There's a choice around whether to implement the second operator in terms of the third (giving it an identity projection) or to implement the third operator in terms of the second (by just calling the second overload and then applying a projection). Obviously applying an unnecessary identity projection has a performance penalty in itself - but it's a tiny penalty. So which is more readable? I'm in two minds about this. I like code where various methods call one other "central" method where all the real work occurs (suggesting implementing the second overload using the third) but equally I suspect I really think about aggregation in terms of getting the final value of the accumulator... with just a twist in the third overload, of an extra projection. I guess it depends on whether you think of the final projection as part of the general form or an "extra" step.

For the moment, I've gone with the "keep all logic in one place" approach:

public static TAccumulate Aggregate<TSource, TAccumulate>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func)
{
    return source.Aggregate(seed, func, x => x);
}

public static TResult Aggregate<TSource, TAccumulate, TResult>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func,
    Func<TAccumulate, TResult> resultSelector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (func == null)
    {
        throw new ArgumentNullException("func");
    }
    if (resultSelector == null)
    {
        throw new ArgumentNullException("resultSelector");
    }
    TAccumulate current = seed;
    foreach (TSource item in source)
    {
        current = func(current, item);
    }
    return resultSelector(current);
}

The bulk of the "real work" method is argument validation - the actual iteration is almost painfully simple.

Conclusion

The moral of today's story is to read the documentation carefully - sometimes there's unexpected behaviour to implement. I still don't really know why this difference in behaviour exists... it feels to me as if the first overload really should behave like the second one, just with a default initial seed. EDIT: it seems that you need to read it really carefully. You know, every word of it. Otherwise you could make an embarrassing goof in a public blog post. <sigh>

The second moral should really be about the use of Aggregate - it's a very generalized operator, and you can implement any number of other operators (Sum, Max, Min, Average etc) using it. In some ways it's the scalar equivalent of SelectMany, just in terms of its diversity. Maybe I'll show some later operators implemented using Aggregate...

Next up, there have been requests for some of the set-based operators - Distinct, Union, etc - so I'll probably look at those soon.

Posted by skeet | 6 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 12 - DefaultIfEmpty

After the masses of code required for all the permutations of First/Last/etc, DefaultIfEmpty is a bit of a relief.

What is it?

Even this simple operator has two overloads:

public static IEnumerable<TSource> DefaultIfEmpty<TSource>(
    this IEnumerable<TSource> source)

public static IEnumerable<TSource> DefaultIfEmpty<TSource>(
    this IEnumerable<TSource> source,
    TSource defaultValue)

The behaviour is very simple to describe:

  • If the input sequence is empty, the result sequence has a single element in it, the default value. This is default(TSource) for the overload without an extra parameter, or the specified value otherwise.
  • If the input sequence isn't empty, the result sequence is the same as the input sequence
  • The source argument must not be null, and this is validated eagerly
  • The result sequence itself uses deferred execution - the input sequence isn't read at all until the result sequence is read
  • The input sequence is streamed; any values read are yielded immediately; no buffering is used

Dead easy.

What are we going to test?

Despite being relatively late in the day, I decided to test for argument validation - and a good job too, as my first attempt failed to split the implementation into an argument validation method and an iterator block method for the real work! It just shows how easy it is to slip up.

Other than that, I can really only see four cases worth testing:

  • No default value specified, empty input sequence
  • Default value specified, empty input sequence
  • No default value specified, non-empty input sequence
  • Default value specified, non-empty input sequence

So I have tests for all of those, and that's it. I don't have anything testing for streaming, lazy evaluation etc.

Let's implement it!

Despite my reluctance to implement one operator in terms of another elsewhere, this felt like such an obvious case, I figured it would make sense just this once. I even applied DRY to the argument validation aspect. Here's the implementation in all its brief glory:

public static IEnumerable<TSource> DefaultIfEmpty<TSource>(
    this IEnumerable<TSource> source)
{
    // This will perform an appropriate test for source being null first.
    return source.DefaultIfEmpty(default(TSource));
}

public static IEnumerable<TSource> DefaultIfEmpty<TSource>(
    this IEnumerable<TSource> source,
    TSource defaultValue)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    return DefaultIfEmptyImpl(source, defaultValue);
}

private static IEnumerable<TSource> DefaultIfEmptyImpl<TSource>(
    IEnumerable<TSource> source,
    TSource defaultValue)
{
    bool foundAny = false;
    foreach (TSource item in source)
    {
        yield return item;
        foundAny = true;
    }
    if (!foundAny)
    {
        yield return defaultValue;
    }
}

Of course, now that I've said how easy it was, someone will find a bug...

Aside from the use of default(TSource) to call the more complex overload from the simpler one, the only aspect of any interest is the bottom method. It irks me slightly that we're assigning "true" to "foundAny" on every iteration... but the alternative is fairly unpleasant:

private static IEnumerable<TSource> DefaultIfEmptyImpl<TSource>(
    IEnumerable<TSource> source,
    TSource defaultValue)
{
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            yield return defaultValue;
            yield break; // Like a "return"
        }
        yield return iterator.Current;
        while (iterator.MoveNext())
        {
            yield return iterator.Current;
        }
    }
}

This may be slightly more efficient, but it feels a little clumsier. We could get rid of the "yield break" by putting the remainder of the method in an "else" block, but I'm not dead keen on that, either. We could use a do/while loop instead of a simple while loop - that would at least remove the repetition of "yield return iterator.Current" but I'm not really a fan of do/while loops. I use them sufficiently rarely that they cause me more mental effort to read than I really like.

If any readers have suggestions which are significantly nicer than either of the above implementations, I'd be interested to hear them. This feels a little inelegant at the moment. It's far from a readability disaster - it's just not quite neat.

Conclusion

Aside from the slight annoyance at the final lack of elegance, there's not much of interest here. However, we could now implement FirstOrDefault/LastOrDefault/SingleOrDefault using DefaultIfEmpty. For example, here's an implementation of FirstOrDefault:

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source)
{
    return source.DefaultIfEmpty().First();
}

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Can't just use source.DefaultIfEmpty().First(predicate)
    return source.Where(predicate).DefaultIfEmpty().First();
}

Note the comment in the predicated version - the defaulting has to be the very last step after we've applied the predicate... otherwise if we pass in an empty sequence and a predicate which doesn't match with default(TSource), we'll get an exception instead of the default value. The other two ...OrDefault operators could be implemented in the same way, of course. (I haven't done so yet, but the above code is in source control.)

I'm currently unsure what I'll implement next. I'll see whether I get inspired by any particular method in the morning.

Posted by skeet | 9 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 11 - First/Single/Last and the ...OrDefault versions

Today I've implemented six operators, each with two overloads. At first I expected the implementations to be very similar, but they've all turned out slightly differently...

What are they?

We've got three axes of permutation here: {First, Last, Single}, {with/without OrDefault }, {with/without a predicate}. That gives us these twelve signatures:

public static TSource First<TSource>(
    this IEnumerable<TSource> source)

public static TSource First<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source)

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static TSource Last<TSource>(
    this IEnumerable<TSource> source)

public static TSource Last<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static TSource LastOrDefault<TSource>(
    this IEnumerable<TSource> source)

public static TSource LastOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static TSource Single<TSource>(
    this IEnumerable<TSource> source)

public static TSource Single<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static TSource SingleOrDefault<TSource>(
    this IEnumerable<TSource> source)

public static TSource SingleOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

The shared behaviour is as follows:

  • They're all extension methods with a single generic type argument
  • They're all implemented with immediate execution
  • They all validate that their parameters are non-null
  • The overloads with a predicate are equivalent to calling source.Where(predicate).SameOperator() - in other words, they just add a filter before applying the operator.

With those rules applied, we simply need to consider three possibilities for each operator: what happens if the source sequence is empty, contains a single element, or contains multiple elements. (This is after applying the filter if a predicate is specified, of course.) We can draw the results up in a simple table:

Operator Empty sequence Single element Multiple elements
First Throws exception Returns element Returns first element
FirstOrDefault Returns default(TSource) Returns element Returns first element
Last Throws exception Returns element Returns last element
LastOrDefault Returns default(TSource) Returns element Returns last element
Single Throws exception Returns element Throws exception
SingleOrDefault Returns default(TSource) Returns element Throws exception

As you can see, for an input sequence with a single element, the results are remarkably uniform :) Likewise, for an empty input sequence, any operator without "OrDefault" throws an exception (InvalidOperationException, in fact) and any operator with "OrDefault" returns the default value for the element type (null for reference types, 0 for int etc). The operators really differ if the (potentially filtered) input sequence contains multiple elements - First and Last do the obvious thing, and Single throws an exception. It's worth noting that SingleOrDefault also throws an exception - it's not like it's saying, "If the sequence is a single element, return it - otherwise return the default value." If you want an operator which handles multiple elements, you should be using First or Last, with the "OrDefault" version if the sequence can legitimately have no elements. Note that if you do use an "OrDefault" operator, the result is exactly the same for an empty input sequence as for an input sequence containing exactly one element which is the default value. (I'll be looking at the DefaultIfEmpty operator next.)

Now we know what the operators do, let's test them.

What are we going to test?

This morning I tweeted that I had written 72 tests before writing any implementation. In fact I ended up with 80, for reasons we'll come to in a minute. For each operator, I tested the following 12 cases:

  • Null source (predicate-less overload)
  • Null source (predicated overload)
  • Null predicate
  • Empty sequence, no predicate
  • Empty sequence, with a predicate
  • Single element sequence, no predicate
  • Single element sequence where the element matched the predicate
  • Single element sequence where the element didn't match the predicate
  • Multiple element sequence, no predicate
  • Multiple element
  • Multiple element sequence where one element matched the predicate
  • Multiple element sequence where multiple elements matched the predicate

These were pretty much cut and paste jobs - I used the same data for each test against each operator, and just changed the expected results.

There are two extra tests for each of First and FirstOrDefault, and two for each of Last and LastOrDefault:

  • First/FirstOrDefault should return as soon as they've seen the first element, when there's no predicate; they shouldn't iterate over the rest of the sequence
  • First/FirstOrDefault should return as soon as they've seen the first matching element, when there is a predicate
  • Last/LastOrDefault are optimized for the case where the source implements IList<T> and there's no predicate: it uses Count and the indexer to access the final element
  • Last/LastOrDefault is not optimized for the case where the source implements IList<T> but there is a predicate: it iterates through the entire sequence

The last two tests involved writing a new collection called NonEnumerableList which implements IList<T> by delegating everything to a backing List<T>, except for GetEnumerator() (both the generic and nongeneric forms) which simply throws a NotSupportedException. That should be a handy one for testing optimizations in the future. I'll discuss the optimization for Last when we get there.

Let's implement them!

These operators were more interesting to implement than I'd expect, so I'm actually going to show all twelve methods. It was rarely just a matter of cutting and pasting, other than for the argument validation.

Of course, if we chose to implement the predicated versions using "Where" and the non-predicated form, and the "OrDefault" versions by using "DefaultIfEmpty" followed by the non-defaulting version, we would only have had the three non-predicated, non-defaulting versions to deal with... but as I've said before, there are some virtues to implementing each operator separately.

For the sake of avoiding fluff, I've removed the argument validation from each method - but obviously it's there in the real code. Let's start with First:

public static TSource First<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (iterator.MoveNext())
        {
            return iterator.Current;
        }
        throw new InvalidOperationException("Sequence was empty");
    }
}

public static TSource First<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            return item;
        }
    }
    throw new InvalidOperationException("No items matched the predicate");
}

These look surprisingly different - and that was actually a deliberate decision. I could easily have implemented the predicate-less version with a foreach loop as well, just returning unconditionally from its body. However, I chose to emphasize the fact that we're not looping in First: we simply move to the first element if we can, and return it or throw an exception. There's no hint that we might ever call MoveNext again. In the predicated form, of course, we have to keep looping until we find a matching value - only throwing the exception when we've exhausted all possibilities.

Now let's see how it looks when we return a default for empty sequences:

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        return iterator.MoveNext() ? iterator.Current : default(TSource);
    }
}

public static TSource FirstOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            return item;
        }
    }
    return default(TSource);
}

Here the predicated form looks very similar to First, but the predicate-less one is slightly different: instead of using an if block (which would be a perfectly valid approach, of course) I've used the conditional operator. We're going to return something, whether we manage to move to the first element or not. Arguably it would be nice if the conditional operator allowed the second or third operands to be "throw" expressions, taking the overall type of the expression from the other result operand... but it's no great hardship.

Next up we'll implement Single, which is actually closer to First than Last is, in some ways:

public static TSource Single<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        TSource ret = iterator.Current;
        if (iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence contained multiple elements");
        }
        return ret;
    }
}

public static TSource Single<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    TSource ret = default(TSource);
    bool foundAny = false;
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            if (foundAny)
            {
                throw new InvalidOperationException("Sequence contained multiple matching elements");
            }
            foundAny = true;
            ret = item;
        }
    }
    if (!foundAny)
    {
        throw new InvalidOperationException("No items matched the predicate");
    }
    return ret;
}

This is already significantly more complex than First. The predicate-less version starts off the same way, but if we manage to move to the first element, we have to remember that value (as we're hoping to return it) and then try to move to the second element. This time, if the move succeeds, we have to throw an exception - otherwise we can return our saved value.

The predicated version is even hairier. We still need to remember the first matching value we find, but this time we're looping - so we need to keep track of whether we've already seen a matching value or not. If we see a second match, we have to throw an exception... and we also have to throw an exception if we reach the end without finding any matches at all. Note that although we assign an initial value of default(TSource) to ret, we'll never reach a return statement without assigning a value to it. However, the rules around definite assignment aren't smart enough to cope with this, so we need to provide a "dummy" value to start with... and default(TSource) is really the only value available. There is an alternative approach without using a foreach statement, where you loop until you find the first match, assign it to a local variable which is only declared at that point, followed by a second loop ensuring that there aren't any other matches. I personally think that's a bit more complex, which is why I've just used the foreach here.

The difference when we implement SingleOrDefault isn't quite as pronounced this time though:

public static TSource SingleOrDefault<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            return default(TSource);
        }
        TSource ret = iterator.Current;
        if (iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence contained multiple elements");
        }
        return ret;
    }
}

public static TSource SingleOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    TSource ret = default(TSource);
    bool foundAny = false;
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            if (foundAny)
            {
                throw new InvalidOperationException("Sequence contained multiple matching elements");
            }
            foundAny = true;
            ret = item;
        }
    }
    return ret;
}

This time we've just replaced a "throw" statement in the predicate-less method with a "return" statement, and removed the test for no matches being found in the predicated method. Here our assignment of default(TSource) to ret really works in our favour - if we don't end up assigning anything else to it, we've already got the right return value!

Next up is Last:

public static TSource Last<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    IList<TSource> list = source as IList<TSource>;
    if (list != null)
    {
        if (list.Count == 0)
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        return list[list.Count - 1];
    }

    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        TSource last = iterator.Current;
        while (iterator.MoveNext())
        {
            last = iterator.Current;
        }
        return last;
    }
}

public static TSource Last<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    bool foundAny = false;
    TSource last = default(TSource);
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            foundAny = true;
            last = item;
        }
    }
    if (!foundAny)
    {
        throw new InvalidOperationException("No items matched the predicate");
    }
    return last;
}

Let's start off with the optimization at the beginning of the predicate-less method. If we find out we're dealing with a list, we can simply fetch the count, and then either throw an exception or return the element at the final index. As an extra bit of optimization I could store the count in a local variable - but I'm assuming that the count of an IList<T> is cheap to compute. If there are significant objections to that assumption, I'm happy to change it :) Note that this is another situation where I'm assuming that anything implementing IList<T> will only hold at most Int32.MaxValue items - otherwise the optimization will fail.

If we don't follow the optimized path, we simply iterate over the sequence, updating a local variable with the last-known element on every single iteration. This time I've avoided the foreach loop for no particularly good reason - we could easily have had a foundAny variable which was just set to "true" on every iteration, and then tested at the end. In fact, that's exactly the pattern the predicated method takes. Admittedly that decision is forced upon us to some extent - we can't just move once and then take the first value as the "first last-known element", because it might not match the predicate.

There's no optimization for the predicated form of Last. This follows LINQ to Objects, but I don't honestly know the reason for it there. We could easily iterate backwards from the end of the sequence using the indexer on each iteration. One possible reason which makes a certain amount of sense is that when there's a predicate, that predicate could throw an exception for some values - and if we just skipped to the end if the collection implements IList<T>, that would be an observable difference. I'd be interested to know whether or not that is the reason - if anyone has any inside information which they can share, I'll update this post.

From here, we only have one more operator to implement - LastOrDefault:

public static TSource LastOrDefault<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    IList<TSource> list = source as IList<TSource>;
    if (list != null)
    {
        return list.Count == 0 ? default(TSource) : list[list.Count - 1];
    }

    TSource last = default(TSource);
    foreach (TSource item in source)
    {
        last = item;
    }
    return last;
}

public static TSource LastOrDefault<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    // Argument validation elided
    TSource last = default(TSource);
    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            last = item;
        }
    }
    return last;
}

This time, aside from the optimization, the predicated and non-predicated forms look very similar... more so than for any of the other operators. In each case, we start with a return value of default(TSource), and iterate over the whole sequence, updating it - only doing so when it matches the predicate, if we've got one.

Conclusion

This was a longer post than I anticipated when I got up this morning, but I hope the slight differences between implementations - and the mystery of the unoptimized predicated "Last/LastOrDefault" operators have made it worth slogging through.

As a contrast - and because I've already mentioned it in this post - I'll implement DefaultIfEmpty next. I reckon I can still do that this evening, if I hurry...

Addendum

It turns out I was missing some tests for Single and SingleOrDefault: what should they do if evaluating the sequence fully throws an exception? It turns out that in LINQ to Objects, the overloads without a predicate throw InvalidOperationException as soon as they see a second element, but the overloads with a predicate keep iterating even when they've seen a second element matching a predicate. This seems ludicrously inconsistent to me - I've opened a Connect issue about it; we'll see what happens.

Posted by skeet | 9 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 10 - Any and All

Another day, another blog post. I should emphasize that this rate of posting is likely to be short-lived... although if I get into the habit of writing a post on the morning commute when I go back to work after the Christmas holidays, I could keep ploughing through until we're done.

Anyway, today we have a pair of operators: Any and All.

What are they?

"Any" has two overloads; there's only one for "All":

public static bool Any<TSource>(
    this IEnumerable<TSource> source)

public static bool Any<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static bool All<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

The names really are fairly self-explanatory:

  • "Any" without a predicate returns whether there are any elements in the input sequence
  • "Any" with a predicate returns whether any elements in the input sequence match the predicate
  • "All" returns whether all the elements in the input sequence match the given predicate

Both operators use immediate execution - they don't return until they've got the answer, basically.

Importantly, "All" has to read through the entire input sequence to return true, but can return as soon as it's found a non-matching element; "Any" can return true as soon as it's found a matching element, but has to iterate over the entire input sequence in order to return false. This gives rise to one very simple LINQ performance tip: it's almost never a good idea to use a query like

// Don't use this
if (query.Count() != 0)

That has to iterate over *all* the results in the query... when you really only care whether or not there are any results. So use "Any" instead:

// Use this instead
if (query.Any())

If this is part of a bigger LINQ to SQL query, it may not make a difference - but in LINQ to Objects it can certainly be a huge boon.

Anyway, let's get on to testing the three methods...

What are we going to test?

Feeling virtuous tonight, I've even tested argument validation again... although it's easy to get that right here, as we're using immediate execution.

Beyond that, I've tested a few scenarios:

  • An empty sequence will return false with Any, but true with All. (Whatever the predicate is for All, there are no elements which fail it.)
  • A sequence with any elements at all will make the predicate-less Any return true.
  • If all the elements don't match the predicate, both Any and All return false.
  • If some elements match the predicate, Any will return true but All will return false.
  • If all elements match the predicate, All will return true.

Those are all straightforward, so I won't give the code. One final test is interesting though: we prove that Any returns as soon as it's got the result by giving it a query which will throw an exception if it's iterated over completely. The easiest way of doing this is to start out with a sequence of integers including 0, and then use Select with a projection which divides some constant value by each element. In this test case, I've given it a value which will match the predicate before the value which will cause the exception to be thrown:

[Test]
public void SequenceIsNotEvaluatedAfterFirstMatch()
{
    int[] src = { 10, 2, 0, 3 };
    var query = src.Select(x => 10 / x);
    // This will finish at the second element (x = 2, so 10/x = 5)
    // It won't evaluate 10/0, which would throw an exception
    Assert.IsTrue(query.Any(y => y > 2));
}

There's an equivalent test for All, where a non-matching element occurs before the exceptional one.

So, with all the tests written, let's get on with the interesting bit:

Let's implement them!

The first thing to note is that all of these could be implemented in terms of either Any-with-a-predicate or All. For example, given All, we could implement Any as:

public static bool Any<TSource>(
    this IEnumerable<TSource> source)
{
    return source.Any(x => true);
}

public static bool Any<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return !source.All(x => !predicate(x));
}

It's simplest to implement the predicate-less Any in terms of the predicated one - using a predicate which returns true for any element means that Any will return true for any element at all, which is what we want.

The inversions in the call to All take a minute to get your head round, but it's basically De Morgan's law in LINQ form: we effectively invert the predicate to find out if all of the elements don't match the original predicate... then return the inverse. Due to the inversion, this still returns early in all the appropriate situations, too.

While we could do that, I've actually preferred a straightforward implementation of all of the separate methods:

public static bool Any<TSource>(
    this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
            
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        return iterator.MoveNext();
    }
}

public static bool Any<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }

    foreach (TSource item in source)
    {
        if (predicate(item))
        {
            return true;
        }
    }
    return false;
}


public static bool All<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }

    foreach (TSource item in source)
    {
        if (!predicate(item))
        {
            return false;
        }
    }
    return true;
}

Aside from anything else, this makes it obvious where the "early out" comes in each case - and also means that any stack traces generated are rather easier to understand. It would be quite odd from a client developer's point of view to call Any but see All in the stack trace, or vice versa.

One interesting point to note is that I don't actually use a foreach loop in Any - although I could, of course. Instead, I just get the iterator and then return whether the very first call to MoveNext indicates that there are any elements. I like the fact that reading this method it's obvious (at least to me) that we really couldn't care less what the value of the first element is - because we never ask for it.

Conclusion

Probably the most important lesson here is the advice to use Any (without a predicate) instead of Count when you can. The rest was pretty simple - although it's always fun to see one operator implemented in terms of another.

So, what next? Possibly Single/SingleOrDefault/First/FirstOrDefault/Last/LastOrDefault. I might as well do them all together - partly as they're so similar, and partly to emphasize the differences which do exist.

Posted by skeet | 9 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 9 - SelectMany

The next operator we'll implement is actually the most important in the whole of LINQ. Most (all?) of the other operators returning sequences can be implemented via SelectMany. We'll have a look at that at the end of this post, but let's implement it first.

What is it?

SelectMany has 4 overloads, which look gradually more and more scary:

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TResult>> selector)

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, int, IEnumerable<TResult>> selector)

public static IEnumerable<TResult> SelectMany<TSource, TCollection, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TCollection>> collectionSelector,
    Func<TSource, TCollection, TResult> resultSelector)

public static IEnumerable<TResult> SelectMany<TSource, TCollection, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, int, IEnumerable<TCollection>> collectionSelector,
    Func<TSource, TCollection, TResult> resultSelector)

These aren't too bad though. Really these are just variations of the same operation, with two "optional" bits.

In every case, we start with an input sequence. We generate a subsequence from each element in the input sequence using a delegate which can optionally take a parameter with the index of the element within the original collection.

Now, we either return each element from each subsequence directly, or we apply another delegate which takes the original element in the input sequence and the element within the subsequence.

In my experience, uses of the overloads where the original selector delegate uses the index are pretty rare - but the others (the first and the third in the list above) are fairly common. In particular, the C# compiler uses the third overload whenever it comes across a "from" clause in a query expression, other than the very first "from" clause.

It helps to put this into a bit more context. Suppose we have a query expression like this:

var query = from file in Directory.GetFiles("logs")
            from line in File.ReadLines(file)
            select Path.GetFileName(file) + ": " + line;

That would be converted into a "normal" call like this:

var query = Directory.GetFiles("logs")
                     .SelectMany(file => File.ReadLines(file),
                                 (file, line) => Path.GetFileName(file) + ": " + line);

In this case the compiler has used our final "select" clause as the projection; if the query expression had continued with "where" clauses etc, it would have created a projection to just pass along "file" and "line" in an anonymous type. This is probably the most confusing bit of the query translation process, involving transparent identifiers. For the moment we'll stick with the simple version above.

So, the SelectMany call above has three arguments really:

  • The source, which is a list of strings (the filenames returned from Directory.GetFiles)
  • An initial projection which converts from a single filename to a list of the lines of text within that file
  • A final projection which converts a (file, line) pair into a single string, just by separating them with ": ".

The result is a single sequence of strings - every line of every log file, prefixed with the filename in which it appeared. So writing out the results of the query might give output like this:

test1.log: foo
test1.log: bar
test1.log: baz
test2.log: Second log file
test2.log: Another line from the second log file

It can take a little while to get your head round SelectMany - at least it did for me - but it's a really important one to understand.

A few more details of the behaviour before we go into testing:

  • The arguments are validated eagerly - everything has to be non-null.
  • Everything is streamed. So only one element of the input is read at a time, and then a subsequence is produced from that. Only one element is then read from the subsequence at a time, yielding the results as we go, before we move onto the next input element and thus the next subsequence etc.
  • Every iterator is closed when it's finished with, just as you'd expect by now.

What are we going to test?

I'm afraid I've become lazy by this point. I can't face writing yet more tests for null arguments. I've written a single test for each of the overloads. I found it hard to come up with a clear way of writing the tests, but here's one example, for the most complicated overload:

[Test]
public void FlattenWithProjectionAndIndex()
{
    int[] numbers = { 3, 5, 20, 15 };
    var query = numbers.SelectMany((x, index) => (x + index).ToString().ToCharArray(),
                                   (x, c) => x + ": " + c);
    // 3 => "3: 3"
    // 5 => "5: 6"
    // 20 => "20: 2", "20: 2"
    // 15 => "15: 1", "15: 8"
    query.AssertSequenceEqual("3: 3", "5: 6", "20: 2", "20: 2", "15: 1", "15: 8");
}

So, to give a bit more explanation to this:

  • Each number is summed with its index (3+0, 5+1, 20+2, 15+3)
  • Each sum is turned into a string, and then converted into a char array. (We don't really need to ToCharArray call as string implements IEnumerable<char> already, but I thought it made it clearer.)
  • We combine each subsequence character with the original element it came from, in the form: "original value: subsequence character"

The comment shows the eventual results from each input, and the final test shows the complete result sequence.

Clear as mud? Hopefully it's not to bad when you look at each step in turn. Okay, now let's make it pass...

Let's implement it!

We could implement the first three overloads in terms of calls to the final one - or more likely, a single "Impl" method without argument validation, called by all four public methods. For example, the simplest method could be implemented like this:

public static IEnumerable<TResult> SelectMany<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TResult>> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    return SelectManyImpl(source,
                          (value, index) => selector(value),
                          (originalElement, subsequenceElement) => subsequenceElement);
}

However, I've decided to implement each of the methods separately - splitting them into the public extension method and a "SelectManyImpl" method with the same signature each time. I think that would make it simpler to step through the code if there were ever any problems... and it allows us to see the differences between the simplest and most complicated versions, too:

// Simplest overload
private static IEnumerable<TResult> SelectManyImpl<TSource, TResult>(
    IEnumerable<TSource> source,
    Func<TSource, IEnumerable<TResult>> selector)
{
    foreach (TSource item in source)
    {
        foreach (TResult result in selector(item))
        {
            yield return result;
        }
    }
}

// Most complicated overload:
// - Original projection takes index as well as value
// - There's a second projection for each original/subsequence element pair
private static IEnumerable<TResult> SelectManyImpl<TSource, TCollection, TResult>(
    IEnumerable<TSource> source,
    Func<TSource, int, IEnumerable<TCollection>> collectionSelector,
    Func<TSource, TCollection, TResult> resultSelector)
{
    int index = 0;
    foreach (TSource item in source)
    {
        foreach (TCollection collectionItem in collectionSelector(item, index++))
        {
            yield return resultSelector(item, collectionItem);
        }
    }
}

The correspondence between the two methods is pretty clear... but I find it helpful to have the first form, so that if I ever get confused about the fundamental point of SelectMany, it's really easy to understand it based on the simple overload. It's then not too big a jump to apply the two extra "optional" complications, and end up with the final method. The simple overload acts as a conceptual stepping stone, in a way.

Two minor points to note:

  • The first method could have been implemented with a "yield foreach selector(item)" if such an expression existed in C#. Using a similar construct in the more complicated form would be harder, and involve another call to Select, I suspect... probably more hassle than it would be worth.
  • I'm not explicitly using a "checked" block in the second form, even though "index" could overflow. I haven't looked to see what the BCL does in this situation, to be honest - I think it unlikely that it will come up. For consistency I should probably use a checked block on every method which uses an index like this... or just turn arithmetic checking on for the whole assembly.

Reimplementing operators using SelectMany

I mentioned early on in this post that many of the LINQ operators can be implemented via SelectMany. Just as a quick example of this, here are alternative implementations of Select, Where and Concat:

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    return source.SelectMany(x => Enumerable.Repeat(selector(x), 1));
}

public static IEnumerable<TSource> Where<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }
    return source.SelectMany(x => Enumerable.Repeat(x, predicate(x) ? 1 : 0));
}

public static IEnumerable<TSource> Concat<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }
    return new[] { first, second }.SelectMany(x => x);
}

Select and Where use Enumerable.Repeat as a convenient way of creating a sequence with either a single element or none. You could alternatively create a new array instead. Concat just uses an array directly: if you think of SelectMany in terms of its flattening operation, Concat is a really natural fit. I suspect that Empty and Repeat are probably feasible with recursion, although the performance would become absolutely horrible.

Currently the above implementations are in the code using conditional compilation. If this becomes a popular thing for me to implement, I might consider breaking it into a separate project. Let me know what you think - my gut feeling is that we won't actually gain much more insight than the above methods give us... just showing how flexible SelectMany is.

SelectMany is also important in a theoretical way, in that it's what provides the monadic nature of LINQ. It's the "Bind" operation in the LINQ monad. I don't intend to say any more than that on the topic - read Wes Dyer's blog post for more details, or just search for "bind monad SelectMany" for plenty of posts from people smarter than myself.

Conclusion

SelectMany is one of LINQ's fundamental operations, and at first sight it's a fearsome beast. As soon as you understand that the basic operation is a flattening projection just with a couple of optional twiddles, it's easily tamed.

Next up I'll implement All and Any - which are nice and easy to describe by comparison.

Posted by skeet | 4 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 8 - Concat

After our quick visit to scalar return types with Count and LongCount, we're back to an operator returning a sequence: Concat.

What is it?

Concat only has a single signature, which makes life simple:

public static IEnumerable<TSource> Concat<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)

The return value is simply a sequence containing the elements of the first sequence followed by the elements of the second sequence - the concatenation of the two sequences.

I sometimes think it's a pity that there aren't Prepend/Append methods which do the same thing but for a single extra element - this would be quite useful in situations such as having a drop down list of countries with an extra option of "None". It's easy enough to use Concat for this purpose by creating a single-element array, but specific methods would be more readable in my view. MoreLINQ has extra Concat methods for this purpose, but Edulinq is only meant to implement the methods already in LINQ to Objects.

As ever, some notes on the behaviour of Concat:

  • Arguments are validated eagerly: they must both be non-null
  • The result uses deferred execution: other than validation, the arguments aren't used when the method is first called
  • Each sequence is only evaluated when it needs to be... if you stop iterating over the output sequence before the first input has been exhausted, the second input will remain unused

That's basically it.

What are we going to test?

The actual concatenation part of the behaviour is very easy to test in a single example - we could potentially also demonstrate concatenation using empty sequences, but there's no reason to suspect they would fail.

The argument validation is tested in the same way as normal, by calling the method with invalid arguments but not attempting to use the returned query.

Finally, there are a couple of tests to indicate the point at which each input sequence is used. This is achieved using the ThrowingEnumerable we originally used in the Where tests:

[Test]
public void FirstSequenceIsntAccessedBeforeFirstUse()
{
    IEnumerable<int> first = new ThrowingEnumerable();
    IEnumerable<int> second = new int[] { 5 };
    // No exception yet...
    var query = first.Concat(second);
    // Still no exception...
    using (var iterator = query.GetEnumerator())
    {
        // Now it will go bang
        Assert.Throws<InvalidOperationException>(() => iterator.MoveNext());
    }
}

[Test]
public void SecondSequenceIsntAccessedBeforeFirstUse()
{
    IEnumerable<int> first = new int[] { 5 };
    IEnumerable<int> second = new ThrowingEnumerable();
    // No exception yet...
    var query = first.Concat(second);
    // Still no exception...
    using (var iterator = query.GetEnumerator())
    {
        // First element is fine...
        Assert.IsTrue(iterator.MoveNext());
        Assert.AreEqual(5, iterator.Current);
        // Now it will go bang, as we move into the second sequence
        Assert.Throws<InvalidOperationException>(() => iterator.MoveNext());
    }
}

I haven't written tests to check that iterators are disposed, etc - but each input sequence's iterator should be disposed appropriately. In particular, it's natural for the first sequence's iterator to be disposed before the second sequence is iterated over at all.

Let's impement it!

The implementation is reasonably simple, but it does make me hanker after F#... it's the normal split between argument validation and iterator block implementation, but each part is really simple:

public static IEnumerable<TSource> Concat<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }
    return ConcatImpl(first, second);
}

private static IEnumerable<TSource> ConcatImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    foreach (TSource item in first)
    {
        yield return item;
    }
    foreach (TSource item in second)
    {
        yield return item;
    }
}

It's worth just remembering at this point that this would still have been very annoying to implement without iterator blocks. Not really difficult as such, but we'd have had to remember which sequence we were currently iterating over (if any) and so on.

However, using F# we could have made this even simpler with the yield! expression, which yields a whole sequence instead of a single item. Admittedly in this case there aren't significant performance benefits to using yield! (which there certainly can be in recursive situations) but it would just be more elegant to have the ability to yield an entire sequence in one statement. (Spec# has a similar construct called nested iterators, expressed using yield foreach.) I'm not going to pretend to know enough about the details of either F# or Spec# to draw more detailed comparisons, but we'll see the pattern of "foreach item in a collection, yield the item" several more times before we're done. Remember that we can't extract that into a library method, as the "yield" expression needs special treatment by the C# compiler.

Conclusion

Even when presented with a simple implementation, I can still find room to gripe :) It would be nice to have nested iterators in C#, but to be honest the number of times I find myself frustrated by their absence is pretty small.

Concat is a useful operator, but it's really only a very simple specialization of another operator: SelectMany. After all, Concat just flattens two sequences into one... whereas SelectMany can flatten a whole sequence of sequences, with even more generality available when required. I'll implement SelectMany next, and show a few examples of how other operators can be implemented simply in terms of SelectMany. (We'll see the same sort of ability for operators returning a single value when we implement Aggregate.)

Addendum: avoiding holding onto references unnecessarily

A comment suggested that we should set first to "null" after we've used it. That way, as soon as we've finished iterating over the collection, it may be eligible for garbage collection. That leads to an implementation like this:

private static IEnumerable<TSource> ConcatImpl<TSource>(
    IEnumerable<TSource> first,
    IEnumerable<TSource> second)
{
    foreach (TSource item in first)
    {
        yield return item;
    }
    // Avoid hanging onto a reference we don't really need
    first = null;
    foreach (TSource item in second)
    {
        yield return item;
    }
}

Now normally I'd say this wouldn't actually help - setting a local variable to null when it's not used in the rest of the method doesn't actually make any difference when the CLR is running in optimized mode, without a debugger attached: the garbage collector only cares about variables whih might still be accessed in the rest of the method.

In this case, however, it makes a difference - because this isn't a normal local variable. It ends up as an instance variable in the hidden class generated by the C# compiler... and the CLR can't tell that the instance variable will never be used again.

Arguably we could remove our only reference to "first" at the start of GetEnumerator. We could write a method of the form:

public static T ReturnAndSetToNull<T>(ref T value) where T : class
{
    T tmp = value;
    value = null;
    return tmp;
}

and then call it like this:

foreach (TSource item in ReturnAndSetToNull(ref first))

I would certainly consider that overkill, particularly as it seems very likely that the iterator will still have a reference to the collection itself - but simply setting "first" to null after iterating over it makes sense to me.

I don't believe that the "real" LINQ to Objects implementation does this, mind you. (At some point I'll test it with a collection which has a finalizer.)

Posted by skeet | 7 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 7 - Count and LongCount

Today's post covers two operators in one, because they're so incredibly similar... to the point cut and paste of implementation, merely changing the name, return type, and a couple of variables.

What are they?

Count and LongCount each have two overloads: one with a predicate, and one without. Here are all four signatures:

public static int Count<TSource>(
    this IEnumerable<TSource> source)

public static int Count<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

public static long LongCount<TSource>(
    this IEnumerable<TSource> source)

public static long LongCount<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)

As you can see, the LongCount signatures are identical to Count except in terms of their return types, which are long (Int64) instead of int (Int32).

The overloads without a predicate parameter simply return the number of items in the source collection; the ones with a predicate return the number of items for which that predicate returns true.

Interesting aspects of the behaviour:

  • These are all extension methods on IEnumerable<T> - you might argue that for the versions without a predicate, it would have been better to extend the non-generic IEnumerable, as nothing actually requires the element type.
  • Where there's no predicate, Count is optimized for ICollection<T> and (in .NET 4) ICollection - both of which have Count properties which are expected to be faster than iterating over the entire collection. LongCount is not optimized in the same way in the .NET implementation - I'll discuss this in the implementation section.
  • No optimization is performed in the overloads with predicates, as basically there's no way of telling how many items will "pass" the predicate without testing them.
  • All methods use immediate execution - nothing is deferred. (If you think about it, there's nothing which can be deferred here, when we're just returning an int or a long.)
  • All arguments are validated simply by testing they're non-null
  • Both methods should throw OverflowException when given a collection with more items than they can return the count of... though this is a considerably larger number in the case of LongCount than Count, of course.

What are we going to test?

In some senses, there are only two "success" tests involved here: one without a predicate and one with. Those are easy enough to deal with, but we also want to exercise the optimized paths. That's actually trickier than it might sound, as we want to test four situations:

  • A source which implements both ICollection<T> and ICollection (easy: use List<T>)
  • A source which implements ICollection<T> but not ICollection (reasonably easy, after a little work finding a suitable type: use HashSet<T>)
  • A source which implements ICollection but not ICollection<T> but still implements IEnumerable<T> (so that we can extend it) - tricky...
  • A source which doesn't implement ICollection or ICollection<T> (easy: use Enumerable.Range which we've already implemented)

The third bullet is the nasty one. Obviously there are plenty of implementations of ICollection but not ICollection<T> (e.g. ArrayList) but because it doesn't implement IEnumerable<T>, we can't call the Count extension method on it. In the end I wrote my own SemiGenericCollection class.

Once we've got sources for all those tests, we need to decide what we're actually testing about them. Arguably we should test that the result is optimized, for example by checking that we never really enumerate the collection. That would require writing custom collections with GetEnumerator() methods which threw exceptions, but still returned a count from the Count property. I haven't gone this far, but it's another step we certainly could take.

For the overloads which take predicates, we don't need to worry about the various collection interfaces as we're not optimizing anyway.

The failure cases for null arguments are very simple, but there's one other case to consider: overflow. For Count, I've implemented a test case to verify the overflow behaviour. Unfortunately we can't run this test in the Edulinq implementation yet, as it requires Enumerable.Concat, but here it is for the record anyway:

[Test]
[Ignore("Takes an enormous amount of time!")]
public void Overflow()
{
    var largeSequence = Enumerable.Range(0, int.MaxValue)
                                  .Concat(Enumerable.Range(0, 1));
    Assert.Throws<OverflowException>(() => largeSequence.Count());
}

This guards against a bad implementation which overflows by simply wrapping the counter round to Int32.MinValue instead of throwing an exception.

As you can see, this test will be disabled even when it's uncommented after we implement Concat, as it requires counting up to 2 billion - not great for a quick set of unit tests. Even that isn't too bad, however, compared with the equivalent in LongCount which would have to count 2^63 items. Generating such a sequence isn't difficult, but iterating over it all would take a very long time. We also need an equivalent test for the overload with a predicate - something I neglected until writing up this blog post, and finding a bug in the implementation as I did so :)

For LongCount, I merely have an equivalent test to the above which checks that the same sequence can have its length expressed as a long value.

Let's implement them!

We'll look at the overload which does have a predicate first - as it's actually simpler:

public static int Count<TSource>(this IEnumerable<TSource> source,
    Func<TSource, bool> predicate)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (predicate == null)
    {
        throw new ArgumentNullException("predicate");
    }

    // No way of optimizing this
    checked
    {
        int count = 0;
        foreach (TSource item in source)
        {
            if (predicate(item))
            {
                count++;
            }
        }
        return count;
    }
}

Note that this time we're not using an iterator block (we're not returning a sequence), so we don't need to split the implementation into two different methods just to get eager argument validation.

After the argument validation, the main part of the method is reasonably simple, with one twist: we're performing the whole iteration within a "checked" context. This means that if the increment of count overflows, it will throw an OverflowException instead of wrapping round to a negative number. There are some other alternatives here:

  • We could have made just the increment statement checked instead of the whole second part of the method
  • We could have explicitly tested for count == int.MaxValue before incrementing, and thrown an exception in that case
  • We could just build the whole assembly in a checked context

I think it's useful for this section of code to be explicitly checked - it makes it obvious that it really is a requirement for general correctness. You may well prefer to make only the increment operation checked - I personally believe that the current approach draws more attention to the checked-ness, but it's definitely a subjective matter. It's also possible that an explicit check could be faster, although I doubt it - I haven't benchmarked either approach.

Other than the predicate-specific parts, all the above code also appears in the optimized Count implementation - so I won't discuss those again. Here's the full method:

public static int Count<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }

    // Optimization for ICollection<T>
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        return genericCollection.Count;
    }

    // Optimization for ICollection
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        return nonGenericCollection.Count;
    }

    // Do it the slow way - and make sure we overflow appropriately
    checked
    {
        int count = 0;
        using (var iterator = source.GetEnumerator())
        {
            while (iterator.MoveNext())
            {
                count++;
            }
        }
        return count;
    }
}

The only "new" code here is the optimization. There are effectively two equivalent blocks, just testing for different collection interface types, and using whichever one it finds first (if any). I don't know whether the .NET implementation tests for ICollection or ICollection<T> first - I could test it by implementing both interfaces but returning different counts from each, of course, but that's probably overkill. It doesn't really matter for well-behaved collections other than the slight performance difference - we want to test the "most likely" interface first, which I believe is the generic one.

To optimize or not to optimize?

The LongCount implementations are exactly the same as those for Count, except using long instead of int.

Notably, I still use optimizations for ICollection and ICollection<T> - but I don't believe the .NET implementation does so. (It's easy enough to tell by creating a huge list of bytes and comparing the time taken for Count and LongCount.)

There's an argument for using Array.GetLongLength when the source is an array... but I don't think the current CLR supports arrays with more than Int32.MaxValue elements anyway, so it's a bit of a non-issue other than for future-proofing. Beyond that, I'm not sure why the .NET implementation isn't optimized. It's not clear what an ICollection/ICollection<T> implementation is meant to return from its Count property if it has more than Int32.MaxValue elements anyway, to be honest.

Suggestions as to what I should have done are welcome... but I should probably point out that LongCount is more likely to be used against Queryable than Enumerable - it's easy to imagine a service representing a collection (such as a database table) which can quickly tell you the count even when it's very large. I would imagine that there are relatively few cases where you have a collection to evaluate in-process where you really just want to iterate through the whole lot just to get the count.

Conclusion

These are our first LINQ operators which return scalar values instead of sequences - with the natural consequence that they're simpler to understand in terms of control flow and timing. The methods simply execute - possibly with some optimization - and return their result. Nice and simple. Still, we've seen there can still be a few interesting aspects to consider, including questions around optimization which don't necessarily have a good answer.

Next time, I think I'll implement Concat - mostly so that I can uncomment the overflow tests for Count. That's going back to an operator which returns a sequence, but it's a really simple one...

Posted by skeet | 18 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 6 - Repeat

A trivial method next, with even less to talk about than "Empty"... "Repeat". This blog post is merely a matter of completeness.

What is it?

"Repeat" is a static, generic non-extension method with a single overload:

public static IEnumerable<TResult> Repeat<TResult>(
    TResult element,
    int count)

It simple returns a sequence which contains the specified element, repeated "count" times. The only argument validation is that "count" has to be non-negative.

What are we going to test?

There's really not a lot to test here. I've thought of 4 different scenarios:

  • A vanilla "repeat a string 3 times" sequence
  • An empty sequence (repeat an element 0 times)
  • A sequence containing null values (just to prove that "element" can be null)
  • A negative count to prove that argument validation occurs, and does so eagerly.

None of this is remotely exciting, I'm afraid.

Let's implement it!

Just about the only thing we could do wrong here is to put the argument validation directly in an iterator block... and we've implemented the "split method" pattern so many times already that we wouldn't fall into that trap. So, here's the code in all its tedious lack of glory:

public static IEnumerable<TResult> Repeat<TResult>(TResult element, int count)
{
    if (count < 0)
    {
        throw new ArgumentOutOfRangeException("count");
    }
    return RepeatImpl(element, count);
}

private static IEnumerable<TResult> RepeatImpl<TResult>(TResult element, int count)
{
    for (int i = 0; i < count; i++)
    {
        yield return element;
    }
}

That's it. Um, interesting points to note... none.

Conclusion

There's no sense in dragging this out. That's the lot. Next up, Count and LongCount - which actually do have a couple of interesting points.

Posted by skeet | 2 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 5 - Empty

Continuing with the non-extension methods, it's time for possibly the simplest LINQ operator around: "Empty".

What is it?

"Empty" is a generic, static method with just a single signature and no parameters:

public static IEnumerable<TResult> Empty<TResult>()

It returns an empty sequence of the appropriate type. That's all it does.

There's only one bit of interesting behaviour: Empty is documented to cache an empty sequence. In other words, it returns a reference to the same empty sequence every time you call it (for the same type argument, of course).

What are we going to test?

There are really only two things we can test here:

  • The returned sequence is empty
  • The returned sequence is cached on a per type argument basis

I'm using the same approach as for Range to call the static method, but this time with an alias of EmptyClass. Here are the tests:

[Test]
public void EmptyContainsNoElements()
{
    using (var empty = EmptyClass.Empty<int>().GetEnumerator())
    {
        Assert.IsFalse(empty.MoveNext());
    }
}

[Test]
public void EmptyIsASingletonPerElementType()
{
    Assert.AreSame(EmptyClass.Empty<int>(), EmptyClass.Empty<int>());
    Assert.AreSame(EmptyClass.Empty<long>(), EmptyClass.Empty<long>());
    Assert.AreSame(EmptyClass.Empty<string>(), EmptyClass.Empty<string>());
    Assert.AreSame(EmptyClass.Empty<object>(), EmptyClass.Empty<object>());

    Assert.AreNotSame(EmptyClass.Empty<long>(), EmptyClass.Empty<int>());
    Assert.AreNotSame(EmptyClass.Empty<string>(), EmptyClass.Empty<object>());
}

Of course, that doesn't verify that the cache isn't per-thread, or something like that... but it'll do.

Let's implement it!

The implementation is actually slightly more interesting than the description so far may suggest. If it weren't for the caching aspect, we could just implement it like this:

// Doesn't cache the empty sequence
public static IEnumerable<TResult> Empty<TResult>()
{
    yield break;
}

... but we want to obey the (somewhat vaguely) documented caching aspect too. It's not really hard, in the end. There's a very handy fact that we can use: empty arrays are immutable. Arrays always have a fixed size, but normally there's no way of making an array read-only... you can always change the value of any element. But an empty array doesn't have any elements, so there's nothing to change. So, we can reuse the same array over and over again, returning it directly to the caller... but only if we have an empty array of the right type.

At this point you may be expecting a Dictionary<Type, Array> or something similar... but there's another useful trick we can take advantage of. If you need a per-type cache and the type is being specific as a type argument, you can use static variables in a generic class, because each constructed type will have a distinct set of static variables.

Unfortunately, Empty is a generic method rather than a non-generic method in a generic type... so we've got to create a separate generic type to act as our cache for the empty array. That's easy to do though, and the CLR takes care of initializing the type in a thread-safe way, too. So our final implementation looks like this:

public static IEnumerable<TResult> Empty<TResult>()
{
    return EmptyHolder<TResult>.Array;
}
        
private static class EmptyHolder<T>
{
    internal static readonly T[] Array = new T[0];       
}

That obeys all the caching we need, and is really simple in terms of lines of code... but it does mean you need to understand how generics work in .NET reasonably well. In some ways this is the opposite of the situation in the previous post - this is a sneaky implementation instead of the slower but arguably simpler dictionary-based one. In this case, I'm happy with the trade-off, because once you do understand how generic types and static variables work, this is simple code. It's a case where simplicity is in the eye of the beholder.

Conclusion

So, that's Empty. The next operator - Repeat - is likely to be even simpler, although it'll have to be another split implementation...

Addendum

Due to the minor revolt over returning an array (which I still think is fine), here's an alternative implementation:

public static IEnumerable<TResult> Empty<TResult>()
{
    return EmptyEnumerable<TResult>.Instance;
}

#if AVOID_RETURNING_ARRAYS
private class EmptyEnumerable<T> : IEnumerable<T>, IEnumerator<T>
{
    internal static IEnumerable<T> Instance = new EmptyEnumerable<T>();

    // Prevent construction elsewhere
    private EmptyEnumerable()
    {
    }

    public IEnumerator<T> GetEnumerator()
    {
        return this;
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return this;
    }

    public T Current
    {
        get { throw new InvalidOperationException(); }
    }

    object IEnumerator.Current
    {
        get { throw new InvalidOperationException(); }
    }

    public void Dispose()
    {
        // No-op
    }

    public bool MoveNext()
    {
        return false// There's never a next entry
    }

    public void Reset()
    {
        // No-op
    }
}

#else
private static class EmptyEnumerable<T>
{
    internal static readonly T[] Instance = new T[0];       
}
#endif

Hopefully now everyone can build a version they're happy with :)

Posted by skeet | 11 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 4 - Range

This will be a short post, and there'll probably be some more short ones coming up too. I think it makes sense to only cover multiple operators in a single post where they're really similar. (Count and LongCount spring to mind.) I'm in your hands though - if you would prefer "chunkier" posts, please say so in the comments.

This post will deal with the Range generation operator.

What is it?

Range only has a single signature:

public static IEnumerable<int> Range(
    int start,
    int count)

Unlike most of LINQ, this isn't an extension method - it's a plain old static method. It returns an iterable object which will yield "count" integers, starting from "start" and incrementing each time - so a call to Enumerable.Range(6, 3) would yield 6, then 7, then 8.

As it doesn't operate on an input sequence, there's no sense in which it could stream or buffer its input, but:

  • The arguments need to be validated eagerly; the count can't be negative, and it can't be such that any element of the range could overflow Int32.
  • The values will be yielded lazily - Range should be cheap, rather than creating (say) an array of "count" elements and returning that.

How are we going to test it?

Testing a plain static method brings us a new challenge in terms of switching between the "normal" LINQ implementation and the Edulinq one. This is an artefact of the namespaces I'm using - the tests are in Edulinq.Tests, and the implementation is in Edulinq. "Parent" namespaces are always considered when the compiler tries to find a type, and they take priority over anything in using directives - even a using directive which tries to explicitly alias a type name.

The (slightly ugly) solution to this that I've chosen is to include a using directive to create an alias which couldn't otherwise be resolved - in this case, RangeClass. The using directive will either alias RangeClass to System.Linq.Enumerable or Edulinq.Enumerable. The tests then all use RangeClass.Range. I've also changed how I'm switching between implementations - I now have two project configurations, one of which defines the NORMAL_LINQ preprocessor symbol, and the other of which doesn't. The RangeTest class therefore contains:

#if NORMAL_LINQ
using RangeClass = System.Linq.Enumerable;
#else
using RangeClass = Edulinq.Enumerable;
#endif

There are alternatives to this approach, of course:

  • I could move the tests to a different namespace
  • I could make the project references depend on the configuration... so the "Normal LINQ" configuration wouldn't reference the Edulinq implementation project, and the "Edulinq implementation" configuration wouldn't reference System.Core. I could then just use Enumerable.Range with an appropriate using directive for System.Linq conditional on the NORMAL_LINQ preprocessor directive, as per the other tests.

I like the idea of the second approach, but it means manually tinkering with the test project file - Visual Studio doesn't expose any way of conditionally including a reference. I may do this at a later date... thoughts welcome.

What are we going to test?

There isn't much we can really test for ranges - I only have eight tests, none of which are particularly exciting:

  • A simple valid range should look right when tested with AssertSequenceEqual
  • The start value should be allowed to be negative
  • Range(Int32.MinValue, 0) is an empty range
  • Range(Int32.MaxValue, 1) yields just Int32.MaxValue
  • The count can't be negative
  • The count can be zero
  • start+count-1 can't exceed Int32.MaxValue (so Range(Int32.MaxValue, 2) isn't valid)
  • start+count-1 can be Int32.MaxValue (so Range(Int32.MaxValue, 1) is valid)

The last two are tested with a few different examples each - a large start and a small count, a small start and a large count, and "fairly large" values for both start and count.

Note that I don't have any tests for lazy evaluation - while I could test that the returned value doesn't implement any of the other collection interfaces, it would be a little odd to do so. On the other hand, we do have tests which have an enormous count - such that anything which really tried to allocate a collection of that size would almost certainly fail...

Let's implement it!

It will surely be no surprise by now that we're going to use a split implementation, with a public method which performs argument validation eagerly and then uses a private method with an iterator block to perform the actual iteration.

Having validated the arguments, we know that we'll never overflow the bounds of Int32, so we can be pretty casual in the main part of the implementation.

public static IEnumerable<int> Range(int start, int count)
{
    if (count < 0)
    {
        throw new ArgumentOutOfRangeException("count");
    }
    // Convert everything to long to avoid overflows. There are other ways of checking
    // for overflow, but this way make the code correct in the most obvious way.
    if ((long)start + (long)count - 1L > int.MaxValue)
    {
        throw new ArgumentOutOfRangeException("count");
    }
    return RangeImpl(start, count);
}

private static IEnumerable<int> RangeImpl(int start, int count)
{
    for (int i = 0; i < count; i++)
    {
        yield return start + i;
    }
}

Just a few points to note here:

  • Arguably it's the combination of "start" and "count" which is invalid in the second check, rather than just count. It would possibly be nice to allow ArgumentOutOfRangeException (or ArgumentException in general) to blame multiple arguments rather than just one. However, using "count" here matches the framework implementation.
  • There are other ways of performing the second check, and I certainly didn't have to make all the operands in the expression longs. However, I think this is the simplest code which is clearly correct based on the documentation. I don't need to think about all kinds of different situations and check that they all work. The arithmetic will clearly be valid when using the Int64 range of values, so I don't need to worry about overflow, and I don't need to consider whether to use a checked or unchecked context.
  • There are also other ways of looping in the private iterator block method, but I think this is the simplest. Another obvious and easy alternative is to keep two values, one for the count of yielded values and the other for the next value to yield, and increment them both on each iteration. A more complex approach would be to use just one loop variable - but you can't use "value < start + count" in case the final value is exactly Int32.MaxValue, and you can't use "value <= start + count - 1" in case the arguments are (int.MinValue, 0). Rather than consider all the border cases, I've gone for an obviously-correct solution. If you really, really cared about the performance of Range, you'd want to investigate various other options.

Prior to writing up this post, I didn't have good tests for Range(Int32.MaxValue, 1) and Range(Int32.MinValue, 0)... but as they could easily go wrong as mentioned above, I've now included them. I find it interesting how considering alternative implementations suggests extra tests.

Conclusion

"Range" was a useful method to implement in order to test some other operators - "Count" in particular. Now that I've started on the non-extension methods though, I might as well do the other two (Empty and Repeat). I've already implemented "Empty", and will hopefully be able to write it up today. "Repeat" shouldn't take much longer, and then we can move on to "Count" and "LongCount".

I think this code is a good example of situations where it's worth writing "dumb" code which looks like the documentation, rather than trying to write possibly shorter, possibly slightly more efficient code which is harder to think about. No doubt there'll be more of that in later posts...

Posted by skeet | 12 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 3 - "Select" (and a rename...)

It's been a long time since I wrote part 1 and part 2 of this blog series, but hopefully things will move a bit more quickly now.

The main step forward is that the project now has a source repository on Google Code instead of just being a zip file on each blog post. I had to give the project a title at that point, and I've chosen Edulinq, hopefully for obvious reasons. I've changed the namespaces etc in the code, and the blog tag for the series is now Edulinq too. Anyway, enough of the preamble... let's get on with reimplementing LINQ, this time with the Select operator.

What is it?

Like Where, Select has two overloads:

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector)

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, int, TResult> selector)

Again, they both operate the same way - but the second overload allows the index into the sequence to be used as part of the projection.

Simple stuff first: the method projects one sequence to another: the "selector" delegate is applied to each input element in turn, to yield an output element. Behaviour notes, which are exactly the same as Where (to the extent that I cut and paste these from the previous blog post, and just tweaked them):

  • The input sequence is not modified in any way.
  • The method uses deferred execution - until you start trying to fetch items from the output sequence, it won't start fetching items from the input sequence.
  • Despite deferred execution, it will validate that the parameters aren't null immediately.
  • It streams its results: it only ever needs to look at one result at a time.
  • It will iterate over the input sequence exactly once each time you iterate over the output sequence.
  • The "selector" function is called exactly once per yielded value.
  • Disposing of an iterator over the output sequence will dispose of the corresponding iterator over the input sequence.

What are we going to test?

The tests are very much like those for Where - except that in cases where we tested the filtering aspect of Where, we're now testing the projection aspect of Select.

There are a few tests of some interest. Firstly, you can tell that the method is generic with 2 type parameters instead of 1 - it has type parameters of TSource and TResult. They're fairly self-explanatory, but it means it's worth having a test for the case where the type arguments are different - such as converting an int to a string:

[Test]
public void SimpleProjectionToDifferentType()
{
    int[] source = { 1, 5, 2 };
    var result = source.Select(x => x.ToString());
    result.AssertSequenceEqual("1", "5", "2");
}

Secondly, I have a test that shows what sort of bizarre situations you can get into if you include side effects in your query. We could have done this with Where as well of course, but it's clearer with Select:

[Test]
public void SideEffectsInProjection()
{
    int[] source = new int[3]; // Actual values won't be relevant
    int count = 0;
    var query = source.Select(x => count++);
    query.AssertSequenceEqual(0, 1, 2);
    query.AssertSequenceEqual(3, 4, 5);
    count = 10;
    query.AssertSequenceEqual(10, 11, 12);
}

Notice how we're only calling Select once, but the results of iterating over the results change each time - because the "count" variable has been captured, and is being modified within the projection. Please don't do things like this.

Thirdly, we can now write query expressions which include both "select" and "where" clauses:

[Test]
public void WhereAndSelect()
{
    int[] source = { 1, 3, 4, 2, 8, 1 };
    var result = from x in source
                 where x < 4
                 select x * 2;
    result.AssertSequenceEqual(2, 6, 4, 2);
}

There's nothing mind-blowing about any of this, of course - hopefully if you've used LINQ to Objects at all, this should all feel very comfortable and familiar.

Let's implement it!

Surprise surprise, we go about implementing Select in much the same way as Where. Again, I simply copied the implementation file and tweaked it a little - the two methods really are that similar. In particular:

  • We're using iterator blocks to make it easy to return sequences
  • The semantics of iterator blocks mean that we have to separate the argument validation from the real work. (Since I wrote the previous post, I've learned that VB11 will have anonymous iterators, which will avoid this problem. Sigh. It just feels wrong to envy VB users, but I'll learn to live with it.)
  • We're using foreach within the iterator blocks to make sure that we dispose of the input sequence iterator appropriately - so long as our output sequence iterator is disposed or we run out of input elements, of course.

I'll skip straight to the code, as it's all so similar to Where. It's also not worth showing you the version with an index - because it's such a trivial difference.

public static IEnumerable<TResult> Select<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    return SelectImpl(source, selector);
}

private static IEnumerable<TResult> SelectImpl<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector)
{
    foreach (TSource item in source)
    {
        yield return selector(item);
    }
}

Simple, isn't it? Again, the real "work" method is even shorter than the argument validation.

Conclusion

While I don't generally like boring my readers (which may come as a surprise to some of you) this was a pretty humdrum post, I'll admit. I've emphasized "just like Where" several times to the point of tedium very deliberately though - because it makes it abundantly clear that there aren't really as many tricky bits to understand as you might expect.

Something slightly different next time (which I hope will be in the next few days). I'm not quite sure what yet, but there's an awful lot of methods still to choose from...

Posted by skeet | 9 comment(s)
Filed under: , ,

A Model/View to a Kill (Naked came the null delegate, part 5)

(I suggest you read the earlier parts of the story first. I'm not claiming it'll make any more sense afterwards, mind you.)

Even though Seymour Sharpton's brain was in a spinlock, a low-level interrupt brought him out of his stupor - namely, an enormous motorcycle bursting through the floor near the daemon. It was impossible to tell the form of the rider under the leather and helmet. When the biker spoke, the voice was digitally disguised but its authority was clear:

"Sharpton. Here, now. The rest of you: you know me. Follow us, and there'll be trouble."

Algol hissed sharply, and then started cajoling Seymour: "Don't go. Stay with us. My master didn't mean what he said about, um, deletion. It was just a little joke. You're safe here..."

But Seymour was already running towards the motorcycle. The biker had never stopped revving the engine, and the moment Seymour jumped on the rear seat, the wheels span briefly, kicking up clouds of dust before the bike raced through the warehouse and through the (fortunately still open) door.

The ride was fast, bumpy, and terrifying. Seymour hadn't felt like this since he'd given up C, years ago. The roar of the engine drowned out any conversation, and he was left holding on for dear life until the bike came to a screeching halt in a deserted street. The biker dismounted and offered a hand to Seymour, who shook it nervously.

"Who are you?" he murmured, almost afraid to find out.

"My name is unimportant," replied the metallic voice, still masked by the helmet.

"It's not Slartibartfast, is it?" Seymour had visions of being whisked away to see fjords. That would just about fit in with the rest of his strange evening.

"No, it's... unspeakable."

"Ah, so you're an anonymous type of guy, huh?"

"Anonymous, yes... guy, no." The biker removed her helmet and shook her head. Her long blonde hair swooshed from side to side, and time seemed to slow for Seymour as he watched her. She was a model he could view forever... although the idea of trying to control her seemed out of the question. Then several strands of hair were caught in the anonymous girl's gently pouting mouth, and she spat them out hurriedly. "Damn it, I hate it when that happens. Anyway, you are lucky to be alive. You have no idea what our shady underworld contains... those zombies aren't the worst of it by a long chalk."

"There's more?" Seymour gulped.

"Worse than you can imagine. We're lucky it's a new moon tonight, for example. Otherwise this place would be heaving with were-clauses. Most of the month they just filter as normal, but come the full moon... urgh." She shuddered, and Seymour didn't want to ask any further. The biker paused, and then continued.

"Then there's the mutants. They're harmless enough, but not for want of trying. They'll lope after you, but they mutate almost without thinking about it. Totally dysfunctional. A quick kick to the monads will usually despatch them... But tonight, we have something else to fear." She looked around, cautiously. "The word on the street is that the Vimpires are in town. Every few years we think we've got rid of them... and then they come back, with their long and surprisingly dexterous fingers. You know how you can tell when the Vimpires are around?"

Seymour was spellbound. "How?"

"The mice all leave, in droves. The rats don't care, but a Vimpires will torture a mouse just for the fun of it. But this time, there are rumours. There's talk of a bigger plan afoot. The one thing the Vimpires are still afraid of is bright light. During the day, we're safe... but imagine if there were no more days? Perpetual twilight - like "Breaking Dawn part 1" but forever."

"They wouldn't!" Seymour gasped. He remembered that long night in the cinema only too well.

"They would. And they have allies... for the first time, the Eclipse posse and the Vimpires are joining forces. So we have to fight them. You're not the first innocent man I've rescued tonight, and you won't be the last. But I need to be sure of you... do you have references?"

"Um, what kind of references?"

"Anything to prove your identity. It's a class war out there, Seymour... now what type of man are you? Where do your values lie? Oh, never mind... I'll trust you for now. But Seymour, you need to be ready. Brace yourself. Are you in The Zone?"

"I don't know what you mean... what zone are you talking about?"

"Ah, true, that was ambiguous. UTC or not UTC, that is the question. Whether 'tis nobler in in the mind to suffer the leap seconds and missing hours of outrageous chronology, or to take ARM against a sea of doubles, and by opposing end them?"

"What on earth are you babbling about?"

"No matter. All you need to know is this... the Vimpires are trying to extinguish the sun, but we're going to stop them. It's daylight saving time."

Continued in part 6 - The Great Destructor...

Posted by skeet | 4 comment(s)
Filed under: ,

Creative freedom, control, and the balance of power

Stephen Colebourne's comment on my last blog post (adding 1 month -1 day to January 29th) have knocked me for six. To avoid thinking about how I might implement the preferred behaviour in Noda Time while still using Joda Time's "engine" I've decided to write about something else which has been niggling at me.

For a long time, I've espoused the idea of "design for inheritance or prohibit it" - in other words, default to sealing classes and making methods non-virtual unless you have a good reason to do otherwise. I've usually attributed this phrase to Josh Bloch writing in Effective Java, but it could well have been round for a long time.

Whenever I've espoused this in public, it's caused disagreement. Not universal disagreement (which would be easy to cope with; if everyone else thinks I'm wrong, that's a very strong indicator that I'm really missing something) but a fairly even distribution of "absolutely" and "hell no". Most people seem to feel passionately one way or the other. This has led me to water down my line of "the C# designers should have made classes sealed by default" to "the C# designers shouldn't have included a default for classes being sealed or not - make the developer specify it".

One thing I've found interesting is that the split between "make everything non-virtual" and "make everything virtual" isn't one of "smart" vs "dumb". There are plenty of publically admired developers on both sides of the fence, and my colleagues are reasonably evenly split too. However, I have detected a correlation in terms of programming preferences around type systems: I've generally found that those who are in favour of making everything virtual by default are generally more likely to also be enthusiastic around dynamic typing. That won't be universally true of course, but I think one is likely to be a reasonably good predictor of the other.

Ultimately I think it's about a balance, and different people place different amounts of value on the various pros and cons. It's also about the relationship between different parties. Different pros and cons affect different parties in different ways.

A relatively inflexible API with a flexible implementation

I'm happy when I know everything that is going on in my code. I interact with other code through obvious dependencies: they are provided to me explicitly. You're welcome to modify my code's visible behaviour by implementing those dependencies in different ways, but my code should be fine as long as you abide by the contracts expressed in the dependencies (typically interfaces).

If I call one of my own non-virtual methods from within my code, I know what it's going to do. If I have two non-virtual methods which could be implemented by one calling the other either way round, then it doesn't matter which option I pick. I can change my mind later on, and no-one will be any the wiser. All the externally visible behaviour will be exactly the same. I don't need to document which method calls which - just what the final results are.

If I create an immutable type and seal it, then all the immutability is within my control. If I've picked immutable types for my member variables, have Object as a base class, and make sure I don't mutate anything myself, I'm fine. I can rely on my values being unchanging, and so can my callers. They can cache a value with impunity.

Basically, everything is simple... unless you want to make one of my types behave slightly differently.

Flexibility with the risk of breakage

The above section sounds all nice and rosy... but what if you want to just tweak my type slightly? You only want to override one method - how much harm can that do? You've looked at the implementation and seen that nothing actually relies on it working exactly the way it does... and it doesn't call any other public members. If my type implements an interface including the member you want to tweak, then you could potentially implement the interface and delegate almost all calls to an instance of the original type, but implement that one call differently. Of course, delegation is great in many cases - but can fail when there are complex relationships involved (such as when the delegated instance passes itself to something else). Basically there are identity issues.

It would be much simpler in this case if you could override my method. That might help your testing, or allow you to use my type in a way I'd never anticipated, achieving fabulous things. As Stroustrup wrote, "I wouldn't like to build a tool that could only do what I had been able to imagine for it." Now I believe there's a big difference between imagining a "big picture" which my component may be some tiny part of, and imagining a crazy use for the type itself, but the sentiment is still worth considering. Creative freedom is a nice thing to have, after all... who am I to stop you from achieving greatness?

The downside is that you're opening yourself to the risk of your code breaking if I change my implementation details in another release. Maybe it would only mean your tests breaking - maybe it's your production code. (While I don't want to put too many words in the mouths of those who hold the opposite opinion to me, I believe a lot of their reason for wanting to be able to override methods is to make testing easier. Personally I prefer to construct test doubles which implement interfaces directly, but I do understand that's not always feasible - especially if the component in question hasn't been designed with testability in mind to start with.)

In many cases there's genuinely little risk of that actually happening... but how tolerant are you of that risk?

Risk evaluation and propagation

When I wrote about what might break if the library producer changes their code, I mentioned your production code and your test code. There's a much nastier risk though: you break someone else's code which is relying on yours.

Suppose I produce library X, and you use it in library Y. Now Joe Developer uses both of our libraries in his application... except he uses a new version of X. Maybe it's a bug-fix version, which is meant to have no externally-visible changes... except it changes how it uses its own methods, in a way which will break if you've overridden some methods in a particular way... and you've done so in library Y. As far as Joe Developer is concerned, his combination of X + Y is broken. Who's fault is it?

  • Mine for changing the behaviour of X in a seemingly sensible way?
  • Yours for overriding a member of X in Y in a way I hadn't anticipated?
  • Joe's for using a version of X which you hadn't developed Y against?

Maybe all three. The trouble is, as the developer of Y you have no way of knowing how likely it is that I'll change the details of my implementation in "seemingly harmless" ways. Indeed, you may even have performed some testing of Y against the version of X that Joe is using... but maybe Joe's overriding some other members of the types from X and Y in ways that neither you nor I expected... and the combination could be complex to work out.

Now this all sounds like doom and gloom - but you need to remember that there must have been reasons for overriding those members to start with. Achieving the same goals without using inheritance could certainly have been considerably more complex, and introduced other bugs along the way. Using inheritance could have been a big win all round, right up until the point where everything screwed up... at which point it may still be recoverable, or it could be a knot which can't be untied. You probably won't know until the breakage happens, and you probably can't accurately gauge the likelihood of it happening in the first place. It may well never happen.

Three options as library providers and consumers

It seems to me that when you're building an API, there are three broad options available (obviously with nuanced positions between them):

  • Make every type unsealed, and every method virtual - but don't make any guarantees about what will happen if someone overrides methods.
  • Make every type unsealed and every method virtual - but document/guarantee every internal interaction, so that anyone deriving from your class can predict how it will behave.
  • Make everything sealed or non-virtual unless you can foresee a reason for overriding it. Document what sort of overriding you expect to handle, and where the overridden methods will be called.

As the consumer of an API, you have various choices too:

  • Rely on undocumented behaviour, betting that you'll save more time by doing and fixing breakage later
  • Only rely on documented behaviour when calling code, but rely on undocumented behaviour when overriding code, as this is typically less well documented anyway (very few APIs will specify exactly what's deemed acceptable)
  • Only rely on documented behaviour

While these options are reasonably easy to describe, they again miss the oh-so-common situation: I'm consuming someone else's types, but providing my own types to other developers. This mixed behaviour is where a lot of the complexity comes in, increasing the risk of breakage and increasing the cost of fixing the problem.

Conclusion

I still believe that designing for inheritance or prohibiting it makes sense if you want to provide a robust library which makes it hard for the consumer to abuse. However, I appreciate that others want the ability to abuse a library - and are willing to pay the price for that down the line. I'm concerned by the "3 party" scenario though - where developer X can shoot your foot off by abusing my code.

Sadly, I can't see this long-running argument coming any closer to resolution. Better mix-in support within C# would at least help, I believe - but delegation is no panacea either.

I want to be a pragmatic developer: I dislike the dogmatic prohibition of convenient practices for the sake of purely theoretical reasons as much as the next person... and I genuinely can see where it can be a pain not being able to override behaviour at times. However, I have yet to be convinced that a gung-ho, "It probably won't break, honest!" attitude is really a good option in the long term. I hope I'm gaining an increasingly open mind though - and I hope that at least by discussing this from slightly less religious viewpoints from normal, both camps can learn something from each other.

Posted by skeet | 12 comment(s)
Filed under: ,

The joys of date/time arithmetic

(Cross-posted to my main blog and the Noda Time blog, in the hope that the overall topic is still of interest to those who aren't terribly interested in Noda Time per se.)

I've been looking at the "period" part of Noda Time recently, trying to redesign the API to simplify it somewhat. This part of the API is what we use to answer questions such as:

  • What will the date be in 14 days?
  • How many hours are there between now and my next birthday?
  • How many years, months and days have I been alive for?

I've been taking a while to get round to this because there are some tricky choices to make. Date and time arithmetic is non-trivial - not because of complicated rules which you may be unaware of, but simply because of the way calendaring systems work. As ever, time zones make life harder too. This post won't talk very much about the Noda Time API details, but will give the results of various operations as I currently expect to implement them.

The simple case: arithmetic on the instant time line

One of the key concepts to understand when working with time is that the usual human "view" on time isn't the only possible one. We don't have to break time up into months, days, hours and so on. It's entirely reasonable (in many cases, at least) to consider time as just a number which progresses linearly. In the case of Noda Time, it's the number of ticks (there are 10 ticks in a microsecond, 10,000 ticks in a millisecond, and 10 million ticks in a second) since midnight on January 1st 1970 UTC.

Leaving relativity aside, everyone around the world can agree on an instant, even if they disagree about everything else. If you're talking over the phone (using a magic zero-latency connection) you may think you're in different years, using different calendar systems, in different time zones - but still both think of "now" as "634266985845407773 ticks".

That makes arithmetic really easy - but also very limited. You can only add or subtract numbers of ticks, effectively. Of course you can derive those ticks from some larger units which have a fixed duration - for example, you could convert "3 hours" into ticks - but some other concepts don't really apply. How would you add a month? The instant time line has no concept of months, and in most calendars different months have different durations (28-31 days in the ISO calendar, for example). Even the idea of a day is somewhat dubious - it's convenient to treat a day as 24 hours, but you need to at least be aware that when you translate an instant into a calendar that a real person would use, days don't always last for 24 hours due to daylight savings.

Anyway, the basic message is that it's easy to do arithmetic like this. In Noda Time we have the Instant structure for the position on the time line, and the Duration structure as a number of ticks which can be added to an Instant. This is the most appropriate pair of concepts to use to measure how much time has passed, without worrying about daylight savings and so on: ideal for things like timeouts, cache purging and so on.

Things start to get messy: local dates, times and date/times

The second type of arithmetic is what humans tend to actually think in. We talk about having a meeting in a month's time, or how many days it is until Christmas (certainly my boys do, anyway). We don't tend to consciously bring time zones into the equation - which is a good job, as we'll see later.

Now just to make things clear, I'm not planning on talking about recurrent events - things like "the second Tuesday and the last Wednesday of every month". I'm not planning on supporting recurrences in Noda Time, and having worked on the calendar part of Google Mobile Sync for quite a while, I can tell you that they're not fun. But even without recurrences, life is tricky.

Introducing periods and period arithmetic

The problem is that our units are inconsistent. I mentioned before that "a month" is an ambiguous length of time... but it doesn't just change by the month, but potentially by the year as well: February is either 28 or 29 days long depending on the year. (I'm only considering the ISO calendar for the moment; that gives enough challenges to start with.)

If we have inconsistent units, we need to keep track of those units during arithmetic, and even request that the arithmetic be performed using specific units. So, it doesn't really make sense to ask "how long is the period between June 10th 2010 and October 13th 2010" but it does make sense to ask "how many days are there between June 10th 2010 and October 13th 2010" or "how many years, months and days are there between June 10th 2010 and October 13th 2010".

Once you've got a period - which I'll describe as a collection of unit/value pairs, e.g. "0 years, 4 months and 3 days" (for the last example above) you can still give unexpected behaviour. If you add that period to your original start date, you should get the original end date... but if you advance the start date by one day, you may not advance the end date by one day. It depends on how you handle things like "one month after January 30th 2010" - some valid options are:

  • Round down to the end of the month: February 28th
  • Round up to the start of the next month: March 1st
  • Work out how far we've overshot, and apply that to the next month: March 2nd
  • Throw an exception

All of these are justifiable. Currently, Noda Time will always take the first approach. I believe that JSR-310 (the successor to Joda Time) will allow the behaviour to be resolved according to a strategy provided by the user... it's unclear to me at the moment whether we'll want to go that far in Noda Time.

Arithmetic in Noda Time is easily described, but the consequences can be subtle. When adding or subtracting a period from something like a LocalDate, we simply iterate over all of the field/value pairs in the period, starting with the most significant, and add each one in turn. When finding the difference between two LocalDate values with a given set of field types (e.g. "months and days") we get as close as we can without overshooting using the most significant field, then the next field etc.

The "without overshooting" part means that if you add the result to the original start value, the result will always either be the target end value (if sufficiently fine-grained fields are available) or somewhere between the original start and the target end value. So "June 2nd 2010 to October 1st 2010 in months" gives a result of "3 months" even though if we chose "4 months" we'd only overshoot by a tiny amount.

Now we know what approach we're taking, let's look at some consequences.

Asymmetry and other oddities

It's trivial to show some assymetry just using a period of a single month. For example:

  • January 28th 2010 + 1 month = February 28th 2010
  • January 29th 2010 + 1 month = February 28th 2010
  • January 30th 2010 + 1 month = February 28th 2010
  • February 28th 2010 - 1 month = January 28th 2010

It gets even more confusing when we add days into the mix:

  • January 28th 2010 + 1 month + 1 day = March 1st 2010
  • January 29th 2010 + 1 month + 1 day = March 1st 2010
  • March 1st 2010 - 1 month - 1 day = January 31st 2010

And leap years:

  • March 30th 2013 - 1 year - 1 month - 10 days = February 19th 2012 (as "February 30th 2012" is truncated to February 29th 2012)
  • March 30th 2012 - 1 year - 1 month - 10 days = February 18th 2012 (as "February 30th 2011" is truncated to February 28th 2011)

Then we need to consider how rounding works when finding the difference between days... (forgive the pseudocode):

  • Between(January 31st 2010, February 28th 2010, Months & Days) = ?
  • Between(February 28th 2010, January 31st 2010, Months & Days) = -28 days

The latter case is relatively obvious - because if you take a whole month of February 28th 2010 you end up with January 28th 2010, which is an overshoot... but what about the first case?

Should we return the determine the number of months by "the largest number such that start + period <= end"? If so, we get a result of "1 month" - which makes sense given the first set of results in this section.

What worries me most about this situation is that I honestly don't know offhand what the current implementation will do. I think it would be best to return "28 days" as there isn't genuinely a complete month between the two... <tappety tappety>

Since writing the previous paragraph, I've tested it, and it returns 1 month and 0 days. I don't know how hard it would be to change this behaviour or whether we want to. Whatever we do, however, we need to document it.

That's really at the heart of this: we must make Noda Time predictable. Where there are multiple feasible results, there should be a simple way of doing the arithmetic by hand and getting the same results as Noda Time. Of course, picking the best option out of the ones available would be good - but I'd rather be consistent and predictable than "usually right" be unpredictably so.

Think it's bad so far? It gets worse...

ZonedDateTime: send in the time zones... (well maybe next year?)

I've described the "instant time line" and its simplicity.

I've described the local date/time complexities, where there's a calendar but there's no time zone.

So far, the two worlds have been separate: you can't add a Duration to a LocalDateTime (etc), and you can't add a Period to an Instant. Unfortunately, sooner or later many applications will need ZonedDateTime.

Now, you can think of ZonedDateTime in two different ways:

  • It's an Instant which knows about a calendar and a time zone
  • It's a LocalDateTime which knows about a time zone and the offset from UTC

The "offset from UTC" part sounds redundant at first - but during daylight saving transitions the same LocalDateTime occurs at two different instants; the time zone is the same in both cases, but the offset is different.

The latter way of thinking is how we actually represent a ZonedDateTime internally, but it's important to know that a ZonedDateTime still unambiguously maps to an Instant.

So, what should we be able to do with a ZonedDateTime in terms of arithmetic? I think the answer is that we should be able to add both Periods and Durations to a ZonedDateTime - but expect them to give different results.

When we add a Duration, that should work out the Instant represented by the current DateTime, advance it by the given duration, and return a new ZonedDateTime based on that result with the same calendar and time zone. In other words, this is saying, "If I were to wait for the given duration, what date/time would I see afterwards?"

When we add a Period, that should add it to the LocalDateTime represented by the ZonedDateTime, and then return a new ZonedDateTime with the result, the original time zone and calendar, and whatever offset is suitable for the new LocalDateTime. (That's deliberately woolly - I'll come back to it.) This is the sort of arithmetic a real person would probably perform if you asked them to tell you what time it would be "three hours from now". Most people don't take time zones into account...

In most cases, where a period can be represented as a duration (for example "three hours") the two forms of addition will give the same result. Around daylight saving transitions, however, they won't. Let's consider some calculations on Sunday November 7th 2010 in the "Pacific/Los_Angeles" time zone. It had a daylight saving transition from UTC-7 to UTC-8 at 2am local time. In other words, the clock went 1:58, 1:59, 1:00. Let's start at 12:30am (local time, offset = -7) and add a few different values:

  • 12:30am + 1 hour duration = 1:30am, offset = -7
  • 12:30am + 2 hours duration = 1:30am, offset = -8
  • 12:30am + 3 hours duration = 2:30am, offset = -8
  • 12:30am + 1 hour period = 1:30am, offset = ???
  • 12:30am + 2 hour period = 2:30am, offset = -8
  • 12:30am + 3 hour period = 3:30am, offset = -8

The ??? value is the most problematic one, because 1:30 occurs twice... when thinking of the time in a calendar-centric way, what should the result be? Options here:

  • Always use the earlier offset
  • Always use the later offset
  • Use the same offset as the start date/time
  • Use the offset in the direction of travel (so adding one hour from 12:30am would give 1:30am with an offset of -7, but subtracting one hour from 2:30am would give 1:30am with an offset of -8)
  • Throw an exception
  • Allow the user to pass in an argument which represents a strategy for resolving this

This is currently unimplemented in Noda Time, so I could probably choose whatever behaviour I want, but frankly none of them has much appeal.

At the other daylight saving transition, when the clocks go forward, we have the opposite problem: adding one hour to 12:30am can't give 1:30am because that time never occurs. Options in this case include:

  • Return the first valid time after the transition (this has problems if we're subtracting time, where we'd presumably want to return the latest valid time before the transition... but the transition has an exclusive lower bound, so there's no such "latest valid time" really)
  • Add the offset difference, so we'd skip to 2:30am
  • Throw an exception
  • Allow the user to pass in a strategy

Again, nothing particularly appeals.

All of this is just involved in adding a period to a ZonedDateTime - then the same problems occur all over again when trying to find the period between them. What's the difference (as a Period rather than a simple Duration) between 1:30am with an offset of -7 and 1:30am with an offset of -8? Nothing, or an hour? Again, at the moment I really don't know the best course of action.

Conclusion

This post has ended up being longer than I'd expected, but hopefully you've got a flavour of the challenges we're facing. Even without time zones getting involved, date and time arithmetic is pretty silly - and with time zones, it becomes very hard to reason about - and to work out what the "right" result to be returned by an API should be, let alone implement it.

Above all, it's important to me that Noda Time is predictable and clearly documented. Very often, if a library doesn't behave exactly the way you want it to, but you can tell what it's going to do, you can work around that - but if you're having to experiment to guess the behaviour, you're on a hiding to nothing.

Posted by skeet | 26 comment(s)
Filed under: