Reimplementing LINQ to Objects: Part 26b - OrderBy{,Descending}/ThenBy{,Descending}

Last time we looked at IOrderedEnumerable<TElement> and I gave an implementation we could use in order to implement the public extension methods within LINQ. I'm still going to do that in this post, but it's worth mentioning something else that's coming up in another part (26d) - I'm going to revisit my OrderedEnumerable implementation.

There may be trouble ahead...

A comment on the previous post mentioned how my comparer executes the keySelector on each element every time it makes a comparison. I didn't think of that as a particularly awful problem, until I thought of this sample query to rank people's favourite colours:

var query = people.GroupBy(p => p.FavouriteColour)
                  .OrderByDescending(g => g.Count())
                  .Select(g => g.Key);

Eek. Now every time we compare two elements, we have to count everything in a group. Ironically, I believe that counting the items in a group is fast using the LINQ to Objects implementation, but not in mine - something I may fix later on. But with LINQ to Objects, this wouldn't cause a problem in the first place!

There are ways to make this use an efficient key selector, of course - a simple Select before the OrderByDescending call would do fine... but it would be nicer if it wasn't a problem in the first place. Basically we want to extract the keys for each element once, and then compare them repeatedly when we need to. This would also allow us to shuffle a sequence using code such as this:

Random rng = new Random(); // Or get it from elsewhere...
var shuffled = collection.OrderBy(x => rng.NextDouble());

I'm not advocating that way of shuffling, admittedly - but it would be nice if it didn't cause significant problems, which it currently would, as the key selector is non-deterministic.

The interesting thing is that when I've finished today's post, I believe the code will obey all the documented behaviour of LINQ to Objects: there's nothing in the documentation about how often the key selector will be called. That doesn't mean it's a good idea to ignore this problem though, which is why I'll revisit OrderedEnumerable later. However, that's going to complicate the code somewhat... so while we're still getting to grips with how everything hangs together, I'm going to stick to my inefficient implementation.

Meanwhile, back to the actual LINQ operators for the day...

What are they?

OrderBy, OrderByDescending, ThenBy and ThenByDescending all have very similar overloads:

public static IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)

public static IOrderedEnumerable<TSource> OrderByDescending<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static IOrderedEnumerable<TSource> OrderByDescending<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)

public static IOrderedEnumerable<TSource> ThenBy<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static IOrderedEnumerable<TSource> ThenBy<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)

public static IOrderedEnumerable<TSource> ThenByDescending<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static IOrderedEnumerable<TSource> ThenByDescending<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)

They're all extension methods, but ThenBy/ThenByDescending are extension methods on IOrderedEnumerable<T> instead of IEnumerable<T>.

We've already talked about what they do to some extent - each of them returns a sequence which is ordered according to the specified key. However, in terms of details:

  • The source and keySelector parameters can't be null, and are validated eagerly.
  • The comparer parameter (where provided) can be null, in which case the default comparer for the key type is used.
  • They use deferred execution - the input sequence isn't read until it has to be.
  • They read and buffer the entire input sequence when the result is iterated. Or rather, they buffer the original input sequence - as I mentioned last time, when a compound ordered sequence (source.OrderBy(...).ThenBy(...).ThenBy(...)) is evaluated, the final query will go straight to the source used for OrderBy, rather than sorting separately for each key.

What are we going to test?

I have tests for the following:

  • Deferred execution (using ThrowingEnumerable)
  • Argument validation
  • Ordering stability
  • Simple comparisons
  • Custom comparers
  • Null comparers
  • Ordering of null keys

In all of the tests which don't go bang, I'm using an anonymous type as the source, with integer "Value" and "Key" properties. I'm ordering using the key, and then selecting the value - like this:

[Test]
public void OrderingIsStable()
{
    var source = new[]
    {
        new { Value = 1, Key = 10 },
        new { Value = 2, Key = 11 },
        new { Value = 3, Key = 11 },
        new { Value = 4, Key = 10 },
    };
    var query = source.OrderBy(x => x.Key)
                      .Select(x => x.Value);
    query.AssertSequenceEqual(1, 4, 2, 3);
}

For ThenBy/ThenByDescending I have multiple key properties so I can test the interaction between the primary and secondary orderings. For custom key comparer tests, I have an AbsoluteValueComparer which simply compares the absolute values of the integers provided.

The "Value" property is always presented in ascending order (from 1) to make it easier to keep track of, and the "Key" properties are always significantly larger so we can't get confused between the two. I originally used strings for the keys in all tests, but then I found out that the default string comparer was culture-sensitive and didn't behave how I expected it to. (The default string equality comparer uses ordinal comparisons, which are rather less brittle...) I still use strings for the keys in nullity tests, but there I'm specifying the ordinal comparer.

I wouldn't claim the tests are exhaustive - by the time you've considered multiple orderings with possibly equal keys, different comparers etc the possibilities are overwhelming. I'm reasonably confident though (particularly after the tests found some embarrassing bugs in the implementation). I don't think they're hugely readable either - but I was very keen to keep the value separated from the key, rather than just ordering by "x => x" in tests. If anyone fancies cloning the repository and writing better tests, I'd be happy to merge them :)

What I deliberately don't have yet is a test for how many times the key selector is executed: I'll add one before post 26d, so I can prove we're doing the right thing eventually.

Let's implement them!

We've got two bits of implementation to do before we can run the tests:

  • The extension methods
  • The GetEnumerator() method of OrderedEnumerable

The extension methods are extremely easy. All of the overloads without comparers simply delegate to the ones with comparers (using Comparer<TKey>.Default) and the remaining methods look like this:

public static IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    return new OrderedEnumerable<TSource>(source,
        new ProjectionComparer<TSource, TKey>(keySelector, comparer));
}

public static IOrderedEnumerable<TSource> OrderByDescending<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    IComparer<TSource> sourceComparer = new ProjectionComparer<TSource, TKey>(keySelector, comparer);
    sourceComparer = new ReverseComparer<TSource>(sourceComparer);
    return new OrderedEnumerable<TSource>(source, sourceComparer);
}

public static IOrderedEnumerable<TSource> ThenBy<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    return source.CreateOrderedEnumerable(keySelector, comparer, false);
}

(To get ThenByDescending, just change the name of the method and change the last argument of CreateOrderedEnumerable to true.)

All very easy. I'm pretty sure I'm going to want to change the OrderedEnumerable constructor to accept the key selector and key comparer in the future (in 26d), which will make the above code even simpler. That can wait a bit though.

Now for the sorting part in OrderedEnumerable. Remember that we need a stable sort, so we can't just delegate to List<T>.Sort - at least, not without a bit of extra fiddling. (We could project to a type which contained the index, and add that onto the end of the comparer as a final tie-breaker.)

For the minute - and I swear it won't stay like this - here's the horribly inefficient (but easy to understand) implementation I've got:

public IEnumerator<TElement> GetEnumerator()
{
    // This is a truly sucky way of implementing it. It's the simplest I could think of to start with.
    // We'll come back to it!
    List<TElement> elements = source.ToList();
    while (elements.Count > 0)
    {
        TElement minElement = elements[0];
        int minIndex = 0;
        for (int i = 1; i < elements.Count; i++)
        {
            if (currentComparer.Compare(elements[i], minElement) < 0)
            {
                minElement = elements[i];
                minIndex = i;
            }
        }
        elements.RemoveAt(minIndex);
        yield return minElement;
    }
}

We simply copy the input to a list (which is something we may well do in the final implementation - we certainly need to suck it all in somehow) and then repeatedly find the minimum element (favouring earlier elements over later ones, in order to achieve stability), removing them as we go. It's an O(n2) approach, but hey - we're going for correctness first.

Conclusion

This morning, I was pretty confident this would be an easy and quick post to write. Since then, I've been found pain in the following items:

  • Calling key selectors only once per element is more important than it might sound at first blush
  • The default sort order for string isn't what I'd have guessed
  • My (committed!) extension methods were broken, because I hadn't edited them properly after a cut and paste
  • Writing tests for situations where there are lots of combinations is irritating

So far these have only extended my estimated number of posts for this group of operators to 4 (26a-26d) but who knows what the next few days will bring...

Published Wed, Jan 5 2011 19:18 by skeet
Filed under: , ,

Comments

# re: Reimplementing LINQ to Objects: Part 26b - OrderBy{,Descending}/ThenBy{,Descending}

There's a relatively easy way to achieve stable sort using a non-stable algorithm : just associate an index with each item, and use that index as the last sort key:

       public IEnumerator<TElement> GetEnumerator()

       {

           var list = _source.Select((x, i) => new IndexedElement(x, i)).ToList();

           var elementComparer = new ProjectionComparer<IndexedElement, TElement>(i => i.Element, _comparer);

           var indexComparer = new ProjectionComparer<IndexedElement, int>(i => i.Index, null);

           var actualComparer = new CompoundComparer<IndexedElement>(elementComparer, indexComparer);

           list.Sort(actualComparer);

           return list.Select(i => i.Element).GetEnumerator();

       }

       private class IndexedElement

       {

           private readonly TElement _element;

           private readonly int _index;

           public IndexedElement(TElement element, int index)

           {

               _element = element;

               _index = index;

           }

           public int Index { get { return _index; } }

           public TElement Element { get { return _element; } }

       }

That's probably not very efficient, but probably better than the naive sort algorithm you posted ;)

Wednesday, January 05, 2011 3:32 PM by Thomas Levesque

# re: Reimplementing LINQ to Objects: Part 26b - OrderBy{,Descending}/ThenBy{,Descending}

@Thomas: that was what I meant by: "We could project to a type which contained the index, and add that onto the end of the comparer as a final tie-breaker."

I suspect it is indeed faster than my algorithm - but that won't be there for long :)

Wednesday, January 05, 2011 3:40 PM by skeet

# re: Reimplementing LINQ to Objects: Part 26b - OrderBy{,Descending}/ThenBy{,Descending}

Sorry, I missed the part where you mentioned that approach... In my hurry to see the code, I must have read the text a little too fast ;)

Wednesday, January 05, 2011 3:50 PM by Thomas Levesque

# re: Reimplementing LINQ to Objects: Part 26b - OrderBy{,Descending}/ThenBy{,Descending}

You can always delegate to List.Sort by using ToLookup

public IEnumerator<TElement> GetEnumerator() {

 var list = source.ToLookup(comparer.KeySelector).ToList();

 list.Sort(comparer);

 return list.SelectMany(group => group);

}

Now you just have to modify your comparers to have an option to get a KeySelector (the original one for projection comparer, a composite of all the ones for compound comparer, and the original comparer's key selector for reverse comparers), and your sorting is done I think.

Wednesday, January 05, 2011 7:25 PM by configurator

# re: Reimplementing LINQ to Objects: Part 26b - OrderBy{,Descending}/ThenBy{,Descending}

@configurator: Nope - your code assumes that any two values which compare equal under the comparer will also compare equal under the default IEqualityComparer for the key type. You can't build a sensible IEqualityComparer for an arbitrary IComparer, as you can't get a hash code. (You can compare for equality easily enough, but you'd have to return a constant hash code, making the whole thing pointlessly slow.)

Thursday, January 06, 2011 1:45 AM by skeet

# re: Reimplementing LINQ to Objects: Part 26b - OrderBy{,Descending}/ThenBy{,Descending}

You write: "I'm pretty sure I'm going to want to change the OrderBy constructor to accept the key selector and key comparer in the future (in 26d)"

I assume you mean the OrderedEnumerable constructor? The amount of passes it took me to sort that one out would make Knuth cry. ;-)

Thursday, January 06, 2011 3:00 AM by Patrick Huizinga

# re: Reimplementing LINQ to Objects: Part 26b - OrderBy{,Descending}/ThenBy{,Descending}

@Patrick: Yes, that was a typo. Fixing now.

Thursday, January 06, 2011 3:29 AM by skeet

# re: Reimplementing LINQ to Objects: Part 26b - OrderBy{,Descending}/ThenBy{,Descending}

Good point about the hash code; I thought at first using an identity comparer would be enough but that's obviously wrong.

The annoying thing about my solution is that it would work great for the most common case but break horribly for anything else...

Thursday, January 06, 2011 7:48 AM by configurator

# re: Reimplementing LINQ to Objects: Part 26b - OrderBy{,Descending}/ThenBy{,Descending}

I was thinking again about the issue of calling the key selector multiple times on the same item. Actually, there's a very easy way to work around this issue: memoization.

You just need to modify the ProjectionComparer so that it memoizes the key selector, with a method like this:

   private static Func<TElement, TKey> Memoize(Func<TElement, TKey> keySelector)

   {

       var cache = new Dictionary<TElement, TKey>();

       return element =>

                   {

                       TKey comparisonKey;

                       if (!cache.TryGetValue(element, out comparisonKey))

                       {

                           comparisonKey = keySelector(element);

                           cache[element] = comparisonKey;

                       }

                       return comparisonKey;

                   };

   }

That way, the key selector is only called once for a given item: subsequent access to the key will fetch it from the dictionary.

I'm not sure of the effects of this technique on performance, but at least it doesn't require a heavy refactoring of OrderedEnumerable...

Thursday, January 06, 2011 1:49 PM by Thomas Levesque

# re: Reimplementing LINQ to Objects: Part 26b - OrderBy{,Descending}/ThenBy{,Descending}

@Thomas: I could certainly do that (or even use `Lazy<T>`) but I think I'd rather do it up front. Aside from anything else, given that we definitely *will* need all the keys at some point, it makes sense to compute them eagerly. Hopefully there will be other benefits too... we'll see how it looks :)

Thursday, January 06, 2011 1:54 PM by skeet