Reimplementing LINQ to Objects: Part 26b - OrderBy{,Descending}/ThenBy{,Descending}
Last time we looked at IOrderedEnumerable<TElement> and I gave an implementation we could use in order to implement the public extension methods within LINQ. I'm still going to do that in this post, but it's worth mentioning something else that's coming up in another part (26d) - I'm going to revisit my OrderedEnumerable implementation.
There may be trouble ahead...
A comment on the previous post mentioned how my comparer executes the keySelector on each element every time it makes a comparison. I didn't think of that as a particularly awful problem, until I thought of this sample query to rank people's favourite colours:
var query = people.GroupBy(p => p.FavouriteColour)
.OrderByDescending(g => g.Count())
.Select(g => g.Key);
Eek. Now every time we compare two elements, we have to count everything in a group. Ironically, I believe that counting the items in a group is fast using the LINQ to Objects implementation, but not in mine - something I may fix later on. But with LINQ to Objects, this wouldn't cause a problem in the first place!
There are ways to make this use an efficient key selector, of course - a simple Select before the OrderByDescending call would do fine... but it would be nicer if it wasn't a problem in the first place. Basically we want to extract the keys for each element once, and then compare them repeatedly when we need to. This would also allow us to shuffle a sequence using code such as this:
Random rng = new Random();
var shuffled = collection.OrderBy(x => rng.NextDouble());
I'm not advocating that way of shuffling, admittedly - but it would be nice if it didn't cause significant problems, which it currently would, as the key selector is non-deterministic.
The interesting thing is that when I've finished today's post, I believe the code will obey all the documented behaviour of LINQ to Objects: there's nothing in the documentation about how often the key selector will be called. That doesn't mean it's a good idea to ignore this problem though, which is why I'll revisit OrderedEnumerable later. However, that's going to complicate the code somewhat... so while we're still getting to grips with how everything hangs together, I'm going to stick to my inefficient implementation.
Meanwhile, back to the actual LINQ operators for the day...
What are they?
OrderBy, OrderByDescending, ThenBy and ThenByDescending all have very similar overloads:
public static IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector)
public static IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer)
public static IOrderedEnumerable<TSource> OrderByDescending<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector)
public static IOrderedEnumerable<TSource> OrderByDescending<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer)
public static IOrderedEnumerable<TSource> ThenBy<TSource, TKey>(
this IOrderedEnumerable<TSource> source,
Func<TSource, TKey> keySelector)
public static IOrderedEnumerable<TSource> ThenBy<TSource, TKey>(
this IOrderedEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer)
public static IOrderedEnumerable<TSource> ThenByDescending<TSource, TKey>(
this IOrderedEnumerable<TSource> source,
Func<TSource, TKey> keySelector)
public static IOrderedEnumerable<TSource> ThenByDescending<TSource, TKey>(
this IOrderedEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer)
They're all extension methods, but ThenBy/ThenByDescending are extension methods on IOrderedEnumerable<T> instead of IEnumerable<T>.
We've already talked about what they do to some extent - each of them returns a sequence which is ordered according to the specified key. However, in terms of details:
- The source and keySelector parameters can't be null, and are validated eagerly.
- The comparer parameter (where provided) can be null, in which case the default comparer for the key type is used.
- They use deferred execution - the input sequence isn't read until it has to be.
- They read and buffer the entire input sequence when the result is iterated. Or rather, they buffer the original input sequence - as I mentioned last time, when a compound ordered sequence (source.OrderBy(...).ThenBy(...).ThenBy(...)) is evaluated, the final query will go straight to the source used for OrderBy, rather than sorting separately for each key.
What are we going to test?
I have tests for the following:
- Deferred execution (using ThrowingEnumerable)
- Argument validation
- Ordering stability
- Simple comparisons
- Custom comparers
- Null comparers
- Ordering of null keys
In all of the tests which don't go bang, I'm using an anonymous type as the source, with integer "Value" and "Key" properties. I'm ordering using the key, and then selecting the value - like this:
[Test]
public void OrderingIsStable()
{
var source = new[]
{
new { Value = 1, Key = 10 },
new { Value = 2, Key = 11 },
new { Value = 3, Key = 11 },
new { Value = 4, Key = 10 },
};
var query = source.OrderBy(x => x.Key)
.Select(x => x.Value);
query.AssertSequenceEqual(1, 4, 2, 3);
}
For ThenBy/ThenByDescending I have multiple key properties so I can test the interaction between the primary and secondary orderings. For custom key comparer tests, I have an AbsoluteValueComparer which simply compares the absolute values of the integers provided.
The "Value" property is always presented in ascending order (from 1) to make it easier to keep track of, and the "Key" properties are always significantly larger so we can't get confused between the two. I originally used strings for the keys in all tests, but then I found out that the default string comparer was culture-sensitive and didn't behave how I expected it to. (The default string equality comparer uses ordinal comparisons, which are rather less brittle...) I still use strings for the keys in nullity tests, but there I'm specifying the ordinal comparer.
I wouldn't claim the tests are exhaustive - by the time you've considered multiple orderings with possibly equal keys, different comparers etc the possibilities are overwhelming. I'm reasonably confident though (particularly after the tests found some embarrassing bugs in the implementation). I don't think they're hugely readable either - but I was very keen to keep the value separated from the key, rather than just ordering by "x => x" in tests. If anyone fancies cloning the repository and writing better tests, I'd be happy to merge them :)
What I deliberately don't have yet is a test for how many times the key selector is executed: I'll add one before post 26d, so I can prove we're doing the right thing eventually.
Let's implement them!
We've got two bits of implementation to do before we can run the tests:
- The extension methods
- The GetEnumerator() method of OrderedEnumerable
The extension methods are extremely easy. All of the overloads without comparers simply delegate to the ones with comparers (using Comparer<TKey>.Default) and the remaining methods look like this:
public static IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer)
{
if (source == null)
{
throw new ArgumentNullException("source");
}
if (keySelector == null)
{
throw new ArgumentNullException("keySelector");
}
return new OrderedEnumerable<TSource>(source,
new ProjectionComparer<TSource, TKey>(keySelector, comparer));
}
public static IOrderedEnumerable<TSource> OrderByDescending<TSource, TKey>(
this IEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer)
{
if (source == null)
{
throw new ArgumentNullException("source");
}
if (keySelector == null)
{
throw new ArgumentNullException("keySelector");
}
IComparer<TSource> sourceComparer = new ProjectionComparer<TSource, TKey>(keySelector, comparer);
sourceComparer = new ReverseComparer<TSource>(sourceComparer);
return new OrderedEnumerable<TSource>(source, sourceComparer);
}
public static IOrderedEnumerable<TSource> ThenBy<TSource, TKey>(
this IOrderedEnumerable<TSource> source,
Func<TSource, TKey> keySelector,
IComparer<TKey> comparer)
{
if (source == null)
{
throw new ArgumentNullException("source");
}
if (keySelector == null)
{
throw new ArgumentNullException("keySelector");
}
return source.CreateOrderedEnumerable(keySelector, comparer, false);
}
(To get ThenByDescending, just change the name of the method and change the last argument of CreateOrderedEnumerable to true.)
All very easy. I'm pretty sure I'm going to want to change the OrderedEnumerable constructor to accept the key selector and key comparer in the future (in 26d), which will make the above code even simpler. That can wait a bit though.
Now for the sorting part in OrderedEnumerable. Remember that we need a stable sort, so we can't just delegate to List<T>.Sort - at least, not without a bit of extra fiddling. (We could project to a type which contained the index, and add that onto the end of the comparer as a final tie-breaker.)
For the minute - and I swear it won't stay like this - here's the horribly inefficient (but easy to understand) implementation I've got:
public IEnumerator<TElement> GetEnumerator()
{
List<TElement> elements = source.ToList();
while (elements.Count > 0)
{
TElement minElement = elements[0];
int minIndex = 0;
for (int i = 1; i < elements.Count; i++)
{
if (currentComparer.Compare(elements[i], minElement) < 0)
{
minElement = elements[i];
minIndex = i;
}
}
elements.RemoveAt(minIndex);
yield return minElement;
}
}
We simply copy the input to a list (which is something we may well do in the final implementation - we certainly need to suck it all in somehow) and then repeatedly find the minimum element (favouring earlier elements over later ones, in order to achieve stability), removing them as we go. It's an O(n2) approach, but hey - we're going for correctness first.
Conclusion
This morning, I was pretty confident this would be an easy and quick post to write. Since then, I've been found pain in the following items:
- Calling key selectors only once per element is more important than it might sound at first blush
- The default sort order for string isn't what I'd have guessed
- My (committed!) extension methods were broken, because I hadn't edited them properly after a cut and paste
- Writing tests for situations where there are lots of combinations is irritating
So far these have only extended my estimated number of posts for this group of operators to 4 (26a-26d) but who knows what the next few days will bring...