January 2011 - Posts

Reimplementing LINQ to Objects: Part 42 - More optimization

A few parts ago, I jotted down a few thoughts on optimization. Three more topics on that general theme have occurred to me, one of them prompted by the comments.

User-directed optimizations

I mentioned last time that for micro-optimization purposes, we could derive a tiny benefit if there were operators which allowed us to turn off potential optimizations - effectively declare in the LINQ query that we believed the input sequence would never be an IList<T> or an ICollection<T>, so it wasn't worth checking it. I still believe that level of optimization would be futile.

However, going the other way is entirely possible. Imagine if we could say, "There are probably a lot of items in this collection, and the operations I want to perform on them are independent and thread-safe. Feel free to parallelize them."

That's exactly what Parallel LINQ gives you, of course. A simple call to AsParallel() somewhere in the query - often at the start, but it doesn't have to be - enables parallelism. You need to be careful how you use this, of course, which is why it's opt-in... and it gives you a fair amount of control in terms of degrees of potential parallelism, whether the results are required in the original order and so on.

In some ways my "TopBy" proposal is similar in a very small way, in that it gives information relatively early in the query, allowing the subsequent parts (ThenBy clauses) to take account of the extra information provided by the user. On the other hand, the effect is extremely localized - basically just for the sequence of clauses to do with ordering.

Related to the idea of parallelism is the idea of side-effects, and how they affect LINQ to Objects itself.

Side-effects and optimization

The optimizations in LINQ to Objects appear to make some assumptions about side-effects:

  • Iterating over a collection won't cause any side-effects
  • Predicates may cause side-effects

Without the first point, all kinds of optimizations would effectively be inappropriate. As the simplest example, Count() won't use an iterator - it will just take the count of the collection. What if this was an odd collection which mutated something during iteration, though? Or what if accessing the Count property itself had side-effects? At that point we'd be violating our principle of not changing observable behaviour by optimizing. Again, the optimizations are basically assuming "sensible" behaviour from collections.

There's a rather more subtle possible cause of side-effects which I've never seen discussed. In some situations - most obviously Skip - an operator can be implemented to move over an iterator for a time without taking each "current" value. This is due to the separation of MoveNext() from Current. What if we were dealing with an iterator which had side-effects only when Current was fetched? It would be easy to write such a sequence - but again, I suspect there's an implicit assumption that such sequences simply don't exist, or that it's reasonable for the behaviour of LINQ operators with respect to them to be left unspecified.

Predicates, on the other hand, might not be so sensible. Suppose we were computing "sequence.Last(x => 10 / x > 1)" on the sequence { 5, 0, 2 }. Iterating over the sequence forwards, we end up with a DivideByZeroException - whereas if we detected that the sequence was a list, and worked our way backwards from the end, we'd see that 10 / 2 > 1, and return that last element (2) immediately. Of course, exceptions aren't the only kind of side-effect that a predicate can have: it could mutate other state. However, it's generally easier to spot that and cry foul of it being a proper functional predicate than notice the possibility of an exception.

I believe this is the reason the predicated Last overload isn't optimized. It would be nice if these assumptions were documented, however.

Assumptions about performance

There's a final set of assumptions which the common ICollection<T>/IList<T> optimizations have all been making: that using the more "direct" members of the interfaces (specifically Count and the indexer) are more efficient than simply iterating. The interfaces make no such declarations: there's no requirement that Count has to be O(1), for example. Indeed, it's not even the case in the BCL. The first time you ask a "view between" on a sorted set for its count after the underlying set has changed, it has to count the elements again.

I've had this problem before, removing items from a HashSet in Java. The problem is that there's no way of communicating this information in a standardized way. We could use attributes for everything, but it gets very complicated, and I strongly suspect it would be a complete pain to use. Basically, performance is one area where abstractions just don't hold up - or rather, the abstractions aren't designed to include performance characteristics.

Even if we knew the complexity of (say) Count that still wouldn't help us necessarily. Suppose it's an O(n) operation - that sounds bad, until you discover that for this particular horrible collection, each iteration step is also O(n) for some reason. Or maybe there's a collection with an O(1) count but a horrible constant value, whereas iterating is really quick per item... so for small values of O(n), iteration would be faster. Then you've got to bear in mind how much processor time is needed trying to work out the fastest approach... it's all bonkers.

So instead we make these assumptions, and for the most part they're correct. Just be aware of their presence.

Conclusion

I have reached the conclusion that I'm tired, and need sleep. I might write about Queryable, IQueryable and query expressions next time.

Posted by skeet | 4 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 41 - How query expressions work

Okay, first a quick plug. This won't be in as much detail as chapter 11 of C# in Depth. If you want more, buy a copy. (Until Feb 1st, there's 43% off it if you buy it from Manning with coupon code j2543.) Admittedly that chapter has to also explain all the basic operators rather than just query expression translations, but that's a fair chunk of it.

If you're already familiar with query expressions, don't expect to discover anything particularly insightful here. However, you might be interested in the cheat sheet at the end, in case you forget some of the syntax occasionally. (I know I do.)

What is this "query expression" of which you speak?

Query expressions are a little nugget of goodness hidden away in section 7.16 of the C# specification. Unlike some language features like generics and dynamic typing, query expressions keep themselves to themselves, and don't impinge on the rest of the spec. A query expression is a bit of C# which looks a bit like a mangled version of SQL. For example:

from person in people
where person.FirstName.StartsWith("J")
orderby person.Age
select person.LastName

It looks somewhat unlike the rest of C#, which is both a blessing and a curse. On the one hand, queries stand out so it's easy to see they're queries. On the other hand... they stand out rather than fitting in with the rest of your code. To be honest I haven't found this to be an issue, but it can take a little getting used to.

Every query expression can be represented in C# code, but the reverse isn't true. Query expressions only take in a subset of the standard query operators - and only a limited set of the overloads, at that. It's not unusual to see a query expression followed by a "normal" query operator call, for example:

var list = (from person in people
            where person.FirstName.StartsWith("J")
            orderby person.Age
            select person.LastName)
           .ToList();

So, that's very roughly what they look like. That's the sort of thing I'm dealing with in this post. Let's start dissecting them.

Compiler translations

First it's worth introducing the general principle of query expressions: they effectively get translated step by step into C# which eventually doesn't contain any query expressions. To stick with our first example, that ends up be translated into this code:

people.Where(person => person.FirstName.StartsWith("J"))
      .OrderBy(person => person.Age)
      .Select(person => person.LastName)

It's important to understand that the compiler hasn't done anything apart from systematic translation to get to this point. In particular, so far we haven't depended on what "people" is, nor "Where", "OrderBy" or "Select".

Can you tell what this code does yet? You can probably hazard a pretty good guess, but you can't tell. Is it going to call Edulinq.Enumerable.Select, or System.Linq.Enumerable.Select, or something entirely different? It depends on the context. Heck, "people" could be the name of a type which has a static Where method. Or maybe it could be a reference to a class which has an instance method called Where... the options are open.

Of course, they don't stay open for long: the compiler takes that expression and compiles it applying all the normal rules. It converts the lambda expression into either a delegate or an expression tree, tries to resolve Where, OrderBy and Select as normal, and life continues. (Don't worry if you're not sure about expression trees yet - I'll come to them in another post.)

The important point is that the query expression translations don't know about System.Linq. The spec barely mentioned IEnumerable<T>, and certainly doesn't rely on it. The whole thing is pattern based. If you have an API which provides some or all of the operators used by the pattern, in an appropriate way, you can use query expressions with it. That's the secret sauce that allows you to use the same syntax for LINQ to Objects, LINQ to SQL, LINQ to Entities, Reactive Extensions, Parallel Extensions and more.

Range variables and the initial "from" clause

The first part of the query to look at is the first "from" clause, at the start of the query. It's worth mentioning upfront that this is handled somewhat differently to any later "from" clauses - I'll explain how they're translated later.

So we have an expression of the form:

from [type] identifier in expression

The "expression" part is just any expression. In most cases there isn't a type specified, in which case the translated version is simply the expression, but with the compiler remembering the identifier as a range variable. I'll do my best to explain what range variables are in a minute :)

If there is a type specified, that represents a call to Cast<type>(). So examples of the two translations so far are:

// Query (incomplete)
from x in people

// Translation (+ range variable of "x")
people


// Query (incomplete)
from Person x in people

// Translation (+ range variable of "x")
(people).Cast<Person>()

These aren't complete query expressions - queries have very precise rules about how they can start and end. They always start with a "from" clause like this, and always end either with a "group by" clause or a "select" clause.

So what's the point of the range variable? Well, that's what gets used as the name of the lambda expression parameter used in all the later clauses. Let's add a select clause to create a complete expression and demonstrate how the variable could be used.

A "select" clause

A select clause is usually translated into a call to Select, using the "body" of the clause as the body of the lambda expression... and the range variable as the parameter. So to expand our previous query, we might have this translation:

// Query
from x in people
select x.Name

// Translation
people.Select(x => x.Name)

That's all that range variables are used for: to provide placeholders within lambda expressions, effectively. They're quite unlike normal variables in most senses. It only makes sense to talk about the "value" of a range variable within a particular clause at a particular point in time when the clause is executing, for one particular value. Their nearest conceptual neighbour is probably the iteration variable declared in a foreach statement, but even that's not really the same - particularly given the way iteration variables are captured, often to the surprise of developers.

The body part has to be a single expression - you can't use "statement lambdas" in query expressions. For example, there's no query expression which would translate to this:

// Can't express this in a query expression
people.Select(x => { 
                     Console.WriteLine("Got " + x);
                     return x.Name;
                   })

That's a perfectly valid C# expression, it's just there's now way of expressing it directly as a query expression.

I mentioned that a select clause usually translates into a Select call. There are two cases where it doesn't:

  • If it's the sole clause after a secondary "from" clause, or a "group by", "join" or "join ... into" clause, the body is used in the translation of that clause
  • If it's an "identity" projection coming after another clause, it's removed entirely.

I'll deal with the first point when we reach the relevant clauses. The second point leads to these translations:

// Query
from x in people
where x.IsAdult
select x

// Translation: Select is removed
people.Where(x => x.IsAdult)


// Query
from x in people
select x

// Translation: Select is *not* removed
people.Select(x => x)

The point of including the "pointless" select in the second translation is to hide the original source sequence; it's assumed that there's no need to do this in the first translation as the "Where" call will already have protected the source sufficiently.

The "where" clause

This one's really simple - especially as we've already seen it! A where clause always just translates into a Where call. Sample translation, this time with no funny business removing degenerate query expressions:

// Query
from x in people
where x.IsAdult
select x.Name

// Translation
people.Where(x => x.IsAdult)
      .Select(x => x.Name)

Note how the range variable is propagated through the query.

The "orderby" clause

Here's a secret: I can never remember offhand whether it's "orderby" or "order by" - it's confusing because it really is "group by", but "orderby" is actually just a single word. Of course, Visual Studio gives a pretty unsubtle hint in terms of colouring.

In the simplest form, an orderby clause might look like this:

// Query
from x in people
orderby x.Age
select x.Name

// Translation
people.OrderBy(x => x.Age)
      .Select(x => x.Name)

There are two things which can add complexity though:

  • You can order by multiple expressions, separating them by commas
  • Each expression can be ordered ascending implicitly, ascending explicitly or descending explicitly.

The first sort expression is always translated into OrderBy or OrderByDescending; subsequent ones always become ThenBy or ThenByDescending. It makes no difference whether you explicitly specify "ascending" or not - I've very rarely seen it in real queries. Here's an example putting it all together:

// Query
from x in people
orderby x.Age, x.FirstName descending, x.LastName ascending
select x.LastName

// Translation
people.OrderBy(x => x.Age)
      .ThenByDescending(x => x.FirstName)
      .ThenBy(x => x.LastName)
      .Select(x => x.LastName)

Top tip: don't use multiple "orderby" clauses consecutively. This query is almost certainly not what you want:

// Don't do this!
from x in people 
orderby x.Age
orderby x.FirstName
select x.LastName

That will end up sorting by FirstName and then Age, and doing so rather slowly as it has to sort twice.

The "group by" clause

Grouping is another alternative to "select" as the final clause in a query. There are two expressions involved: the element selector (what you want to get in each group) and the key selector (how you want the groups to be organized). Unsurprisingly, this uses the GroupBy operator. So you might have a query to group people in families by their last name, with each group containing the first names of the family members:

// Query expression
from x in people 
group x.FirstName by x.LastName

// Translation
people.GroupBy(x => x.LastName, x => x.FirstName)

If the element selector is trivial, it isn't specified as part of the translation:

// Query expression
from x in people 
group x by x.LastName

// Translation
people.GroupBy(x => x.LastName)

Query continuations

Both "select" and "group by" can be followed by "into identifier". This is known as a query continuation, and it's really simple. Its translation in the specification isn't in terms of a method call, but instead it transforms one query expression into another, effectively nesting one query as the source of another. I find that translation tricky to think about, personally... I prefer to think of it as using a temporary variable, like this:

// Original query
var query = from x in people
            select x.Name into y
            orderby y.Length
            select y[0];

// Query continuation translation
var tmp = from x in people
          select x.Name;

var query = from y in tmp
            orderby y.Length
            select y[0];

// Final translation into methods
var query = people.Select(x => x.Name)
                  .OrderBy(y => y.Length)
                  .Select(y => y[0]);

Obviously that final translation could have been expressed in terms of two statements as well... they'd be equivalent. This is why it's important that LINQ uses deferred execution - you can split up a query as much as you like, and it won't alter the execution flow. The query wouldn't actually execute when the value is assigned to "tmp" - it's just preparing the query for execution.

Transparent identifiers and the "let" clause

The rest of the query expression clauses all introduce an extra range variable in some form or other. This is the part of query expression translation which is hardest to understand, because it affects how any usage of the range variable in the query expression is translated.

We'll start with probably the simplest of the remaining clauses: the "let" clause. This simply introduces a new range variable based upon a projection. It's a bit like a "select", but after a "let" clause both the original range variable and the new one are in scope for the rest of the query. They're typically used to avoid redundant computations, or simply to make the code simpler to read. For example, suppose computing an employee's tax is a complicated operation, and we want to display a list of employees and the tax they pay, with the higher tax-payer first:

from x in employees
let tax = x.ComputeTax()
orderby tax descending
select x.LastName + ": " + tax

That's pretty readable, and we've managed to avoid computing the tax twice (once for sorting and once for display).

The problem is, both "x" and "tax" are in scope at the same time... so what are we going to pass to the Select method at the end? We need one entity to pass through our query, which knows the value of both "x" and "tax" at any point (after the "let" clause, obviously). This is precisely the point of a transparent identifier. You can think of the above query as being translated into this:

// Translation from "let" clause to another query expression
from x in employees
select new { x, tax = x.ComputeTax() } into z
orderby z.tax descending
select z.x.LastName + ": " + z.tax

// Final translated query
employees.Select(x => new { x, tax = x.ComputeTax() })
         .OrderByDescending(z => z.tax)
         .Select(z => z.x.LastName + ": " + z.tax)

Here "z" is the transparent identifier - which I've made somewhat more opaque by giving it a name. In the specification, the query translations are performed in terms of "*" - which clearly isn't a valid identifier, but which stands in for the transparent one.

The good news about transparent identifiers is that most of the time you don't need to think of them at all. They simply let you have multiple range variables in scope at the same time. I find myself only bothering to think about them explicitly when I'm trying to work out the full translation of a query expression which uses them. It's worth knowing about them to avoid being stumped by the concept of (say) a select clause being able to use multiple range variables, but that's all.

Now that we've got the basic concept, we can move onto the final few clauses.

Secondary "from" clauses

We've seen that the introductory "from" clause isn't actually translated into a method call, but any subsequent ones are. The syntax is still the same, but the translation uses SelectMany. In many cases this is used just like a cross-join (Cartesian product) but it's more flexible than that, as the "inner" sequence introduced by the secondary "from" clause can depend on the current value from the "outer" sequence. Here's an example of that. with the call to SelectMany in the translation:

// Query expression
from parent in adults
from child in parent.Children
where child.Gender == Gender.Male
select child.Name + " is a son of " + parent.Name

// Translation (using z for the transparent identifier)
adults.SelectMany(parent => parent.Children,
                  (parent, child) => new { parent, child })
      .Where(z => z.child.Gender == Gender.Male)
      .Select(z => z.child.Name + " is a son of " + z.parent.Name;

Again we can see the effect of the transparent identifier - an anonymous type is introduced to propagate the { parent, child } tuple through the rest of the query.

There's a special case, however - if "the rest of the query" is just a "select" clause, we don't need the anonymous type. We can just apply the projection directly in the SelectMany call. Here's a similar example, but this time without the "where" clause:

// Query expression
from parent in adults
from child in parent.Children
select child.Name + " is a child of " + parent.Name

// Translation (using z for the transparent identifier)
adults.SelectMany(parent => parent.Children,
                  (parent, child) => child.Name + " is a child of " + parent.Name)

This same trick is used in GroupJoin and Join, but I won't go into the details there. It's simpler to just provide examples which use the shortcut, instead of including unnecessary extra clauses just to force the transparent identifier to appear in the translation.

Note that just like the introductory "from" clause, you can specify a type for the range variable, which forces a call to "Cast<>".

Simple "join" clauses (no "into")

A "join" clause without an "into" part corresponds to a call to the Join method, which represents an inner equijoin. In some ways this is like an extra "from" clause with a "where" clause to provide the relevant filtering, but there's a significant difference: while the "from" clause (and SelectMany) allow you to project each element in the outer sequence to an inner sequence, in Join you merely provide the inner sequence directly, once. You also have to specify the two key selectors - one for the outer sequence, and one for the inner sequence. The general syntax is:

join identifier in inner-sequence on outer-key-selector equals inner-key-selector

The identifier names the extra range variable introduced. Here's an example including the translation:

// Query expression
from customer in customers
join order in orders on customer.Id equals order.CustomerId
select customer.Name + ": " + order.Price

// Translation
customers.Join(orders,
               customer => customer.Id,
               order => order.CustomerId,
               (customer, order) => customer.Name + ": " + order.Price)

Note how if you put the key selectors the wrong way round, it's highly unlikely that the result will compile - the lambda expression for the outer sequence doesn't "know about" the inner sequence element, and vice versa. The C# compiler is even nice enough to guess the probable cause, and suggest the fix.

Group joins - "join ... into"

Group joins look exactly the same as inner joins, except they have an extra "into identifier" part at the end. Again, this introduces an extra range variable - but it's the identifier after the "into" which ends up in scope, not the one after "join"; that one is only used in the key selector. This is easier to see when we look at a sample translation:

// Query expression
from customer in customers
join order in orders on customer.Id equals order.CustomerId into customerOrders
select customer.Name + ": " + customerOrders.Count()

// Translation
customers.GroupJoin(orders,
                    customer => customer.Id,
                    order => order.CustomerId,
                    (customer, customerOrders) => customer.Name + ": " + customerOrders.Count())

If we had tried to refer to "order" in the select clause, the result would have been an error: it's not in scope any more. Note that this is not a query continuation unlike "select ... into" and "group ... into". It introduces a new range variable, but all the previous range variables are still in scope.

That's it! That's all the translations that the C# compiler supports. VB's query expressions are rather richer - but I suspect that's at least partly because it's more painful to write the "dot notation" syntax in VB, as the lambda expression syntax isn't as nice as C#'s.

Translation cheat sheet

I thought it would be useful to produce a short table of the kinds of clauses supported in query expressions, with the translation used by the C# compiler. The translation is given assuming a single range variable named "x" is in scope. I haven't given the alternative options where transparent identifiers are introduced - this table isn't meant to be a replacement for all the information above! (Likewise this doesn't mention the optimizations for degenerate query expressions or "identity projection" groupings.)

Query expression clause Translation
First "from [type] x in sequence" Just "sequence" or "sequence.Cast<type>()", but with the introduction of a range variable
Subsequent "from" clauses:
"from [type] y in projection"
SelectMany(x => projection, (x, y) => new { x, y })
or SelectMany(x => projection.Cast<type>(), (x, y) => new { x, y })
where predicate Where(x => predicate)
select projection Select(x => projection)
let y = projection Select(x => new { x, y = projection })
orderby o1, o2 ascending, o3 descending
(Each ordering may have descending or ascending specified explicitly; the default is ascending)
OrderBy(x => o1)
.ThenBy(x => o2)
.ThenByDescending(x => o3)
group projection by key-selector GroupBy(x => key-selector, x => projection)
join y in inner-sequece
on outer-key-selector equals inner-key-selector
Join(x => outer-key-selector,
    y => inner-key-selector,
    (x, y) => new { x, y })
join y in inner-sequece
on outer-key-selector equals inner-key-selector
into z
GroupJoin(x => outer-key-selector,
    y => inner-key-selector,
    (x, z) => new { x, z })
query1 into y
query2
(Translation in terms of a new query expression)
from y in (query1)
query2

Conclusion

Hopefully that's made a certain amount of sense out of a fairly complicated topic. I find it's one of those "aha!" things - at some point it clicks, and then seems reasonably simple (aside from transparent identifiers, perhaps). Until that time, query expressions can be a bit magical.

As an aside, I have a sneaking suspicion that one of my first blog posts consisted of my initial impressions of LINQ, written in a plane on the way to the MVP conference in Seattle in September 2005. I would check, but I'm finishing this post in another plane, this time on the way to San Francisco. I think I'd have been somewhat surprised to be told in 2005 that I'd still be writing blog posts about LINQ over five years later. Mind you, I can think of any number of things which have happened in the intervening years which would have astonished me to about the same degree.

Next time: some more thoughts on optimization. Oh, and I'm likely to update my wishlist of extra operators as well, but within the existing post.

Posted by skeet | 4 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 40 - Optimization

I'm not an expert in optimization, and most importantly I don't have any real-world benchmarks to support this post, so please take it with a pinch of salt. That said, let's dive into what optimizations are available in LINQ to Objects.

What do we mean by optimization?

Just as we think of refactoring as changing the internal structure of code without changing its externally visible behaviour, optimization is the art/craft/science/voodoo of changing the performance of code without changing its externally visible behaviour. Sort of.

This requires two definitions: "performance" and "externally visible behaviour". Neither are as simple as they sound. In almost all cases, performance is a trade-off, whether in speed vs memory, throughput vs latency, big-O complexity vs the factors within that complexity bound, and best case vs worst case.

In LINQ to Objects the "best case vs worst case" is the balance which we need to consider most often: in the cases where we can make a saving, how significant is that saving? How much does it cost in every case to make a saving some of the time? How often do we actually take the fast path?

Externally visible behaviour is even harder to pin down, because we definitely don't mean all externally visible behaviour. Almost all the optimizations in Edulinq are visible if you work hard enough - indeed, that's how I have unit tests for them. I can test that Count() uses the Count property of an ICollection<T> instead of iterating over it by creating an ICollection<T> implementation which works with Count but throws an exception if you try to iterate over it. I don't think we care about that sort of change to externally visible behaviour.

What we really mean is, "If we use the code in a sensible way, will we get the same results with the optimized code as we would without the optimization?" Much better. Nothing woolly about the term "sensible" at all, is there? More realistically, we could talk about a system where every type adheres to the contracts of every interface it implements - that would at least get rid of the examples used for unit testing. Still, even the performance of a system is externally visible... it's easy to tell the difference between an implementation of Count() which is optimized and one which isn't, if you've got a list of 10 million items.

How can we optimize in LINQ to Objects?

Effectively we have one technique for significant optimization in LINQ to Objects: finding out that a sequence implements a more capable interface than IEnumerable<T>, and then using that interface. (Or in the case of my optimization for HashSet<T> and Contains, using a concrete type in the same way.) Count() is the most obvious example of this: if the sequence implements ICollection or ICollection<T>, then we can use the Count property on that interface and we're done. No need to iterate at all.

These are generally good optimizations because they allow us to transform an O(n) computation into an O(1) computation. That's a pretty big win when it's applicable, and the cost of checking is reasonably small. So long as we hit the right type once every so often - and particularly if in those cases the sequences are long - then it's a net gain. The performance characteristics are likely to be reasonably fixed here for any one program - it's not like we'll sometimes win and sometimes lose for a specific query... it's that for some queries we win and some we won't. If we were micro-optimizing, we might want a way of calling a "non-optimized" version which didn't even bother trying the optimization, because we know it will always fail. I would regard such an approach as a colossal waste of effort in the majority of cases.

Some optimizations are slightly less obvious, particularly because they don't offer a change in the big-O complexity, but can still make a significant difference. Take ToArray, for example. If we know the sequence is an IList<T> we can construct an array of exactly the right size and ask the list to copy the elements into it. Chances are that copy can be very efficient indeed, basically copying a whole block of bits from one place to another - and we know we won't need any resizing. Compare that with building up a buffer, resizing periodically including copying all the elements we've already discovered. Every part of that process is going to be slower, but they're both O(n) operations really. This is a good example of where big-O notation doesn't tell the whole story. Again, the optimization is almost certainly a good one to make.

Then there are distinctly dodgy optimizations which can make a difference, but are unlikely to apply. My optimization for ElementAt and ElementAtOrDefault comes into play here. It's fine to check whether an object implements IList<T>, and use the indexer if so. That's an obvious win. But I have an extra optimization to exit quickly if we can find out that the given index is out of the bounds of the sequence. Unfortunately that optimization is only useful when:

  • The sequence implements ICollection<T> or ICollection (but remember it has to implement IEnumerable<T> - there aren't many collections implementing only the non-generic ICollection, but the generic IEnumerable<T>)
  • The sequence doesn't implement IList<T> (which gets rid of almost all implementations of ICollection<T>)
  • The given index is actually greater than or equal to the size of the collection

All that comes at the cost of a couple of type checks... not a great cost, and we do potentially save an O(n) check for being given an index out of the bounds of the collection... but how often are we really going to make that win? This is where I'd love to have something like Dapper, but applied to LINQ to Objects and running in a significant number of real-world projects, just logging in as light a way as possible how often we win, how often we lose, and how big the benefit is.

Finally, we come to the optimizations which don't make sense to me... such as the optimization for First in both Mono and LinqBridge. Both of these projects check whether the sequence is a list, so that they check the count and then use the indexer to fetch item 0 instead of calling GetEnumerator()/MoveNext()/Current. Now yes, there's a chance this avoids creating an extra object (although not always, as we've seen before) - but they're both O(1) operations which are likely to be darned fast. At this point not only is the payback very small (if it even exists) but the whole operation is likely to be so fast that the tiny check for whether the object implements IList<T> is likely to become more significant. Oh, and then there's the extra code complexity - yes, that's only relevant to the implementers, but I'd personally rather they spent their time on other things (like getting OrderByDescending to work properly... smirk). In other words, I think this is a bad target for optimization. At some point I'll try to do a quick analysis of just how often the collection has to implement IList<T> in order for it to be worth doing this - and whether the improvement is even measurable.

Of course there are other micro-optimizations available. When we don't need to fetch the current item (e.g. when skipping over items) let's just call MoveNext() instead of also assigning the return value of a property to a variable. I've done that in various places in Edulinq, but not as an optimization strategy, which I suspect won't make a significant difference, but for readability - to make it clearer to the reader that we're just moving along the iterator, not examining the contents as we go.

The only other piece of optimization I think I've performed in Edulinq is the "yield the first results before sorting the rest" part of my quicksort implementation. I'm reasonably proud of that, at least conceptually. I don't think it really fits into any other bucket - it's just a matter of thinking about what we really need and when, deferring work just in case we never need to do it.

What can we not optimize in LINQ to Objects?

I've found a few optimizations in both Edulinq and other implementations which I believe to be invalid.

Here's an example I happened to look at just this morning, when reviewing the code for Skip:

var list = source as IList<TSource>;
if (list != null)
{
    count = Math.Max(count, 0);
    // Note that "count" is the count of items to skip
    for (int index = count; index < list.Count; index++)
    {
        yield return list[index];
    }
    yield break;
}

If our sequence is a list, we can just skip straight to the right part of it and yield the items one at a time. That sounds great, but what if the list changes (or is even truncated!) while we're iterating over it? An implementation working with the simple iterator would usually throw an exception, as the change would invalidate the iterator. This is definitely a behavioural change. When I first wrote about Skip, I included this as a "possible" optimization - and actually turned it on in the Edulinq source code. I now believe it to be a mistake, and have removed it completely.

Another example is Reverse, and how it should behave. The documentation is fairly unclear, but when I ran the tests, the Mono implementation used an optimization whereby if the sequence is a list, it will just return items from the tail end using the indexer. (This has now been fixed - the Mono team is quick like that!) Again, that means that changes made to the list while iterating will be reflected in the reversed sequence. I believe the documentation for Reverse ought to be clear that:

  • Execution is deferred: the input sequence isn't read when the method is called.
  • When the result sequence is first read by the caller, a snapshot is taken, and that's what's used to return the data.
  • If the result sequence is read more than once (i.e. GetEnumerator is called more than once) then a new snapshot is created each time - so changes to the input sequence between calls to GetEnumerator on the result sequence will be observed.

Now this is still not as precise as it might be in terms of what "reading" a sequence entails - in particular, a simple implementation of Reverse (as per Edulinq) will actually take the snapshot on the first call to MoveNext() on the iterator returned by GetEnumerator() - but that's probably not too bad. The snapshotting behaviour itself is important though, and should be made explicit in my opinion.

The problem with both of these "optimizations" is arguably that they're applying list-based optimizations within an iterator block used for deferred execution. Optimizing for lists either upfront at the point of the initial method call or within an immediate execution operator (Count, ToList etc) is fine, because we assume the sequence won't change during the course of the method's execution. We can't make that assumption with an iterator block, because the flow of the code is very different: our code is visited repeatedly based on the caller's use of MoveNext().

Sequence identity

Another aspect of behaviour which isn't well-specified is that of identity. When is it valid for an operator to return the input sequence itself as the result sequence?

In the Microsoft implementation, this can occur in two operators: AsEnumerable (which always returns the input sequence reference, even if it's null) and Cast (which returns the original reference only if it actually implements IEnumerable<TResult>).

In Edulinq, I have two other operators which can return the input sequence: OfType (only if the original reference implements IEnumerable<TResult> and TResult is a non-nullable value type) and Skip (if you provide a count which is zero or negative). Are these valid optimizations? Let's think about why we might not want them to be...

If you're returning a sequence from one layer of your code to another, you usually want that sequence to be viewed only as a sequence. In particular, if it's backed by a List<T>, you don't want callers casting to List<T> and modifying the list. With any operator implemented by an iterator block, that's fine - the object returned from the operator has no accessible reference to its input, and the type itself only implements IEnumerable<T> (and IEnumerator<T>, and IDisposable, etc - but not IList<T>). It's not so good if the operator decides it's okay to return the original reference.

The C# language specification refers to this in the section about query expression translation: a no-op projection at the end of a query can be omitted if and only if there are other operators in the query. So a query expression of "from foo in bar select foo" will translate to "bar.Select(foo => foo)" but if we had a "where" clause in the query, the Select call would be removed. It's worth noting that the call to "Cast" generated when you explicitly specify the type of a range variable is not enough to prevent the "no-op" projection from being generated... it's almost as if the C# team "knows" that Cast can leak sequence identity whereas Where can't.

Personally I think that the "hiding" of the input sequence should be guaranteed where it makes sense to do so, and explicitly not guaranteed otherwise. We could also add an operator of something like "HideIdentity" which would simply (and unconditionally) add an extra iterator block into the pipeline. That way library authors wouldn't have to guess, and would have a clear way of expressing their intention. Using Select(x => x) or Skip(0) is not clear, and in the case of Skip it would even be pointless when using Edulinq.

As for whether my optimizations are valid - that's up for debate, really. It seems hard to justify why leaking sequence identity would be okay for Cast but not okay for OfType, whereas I think there's a better case for claiming that Skip should always hide sequence identity.

The Contains issue...

If you remember, I have a disagreement around what Contains should do when you don't provide an equality comparer, and when the sequence implements ICollection<T>. I believe it should be consistent with the rest of LINQ to Objects, which always uses the default equality comparer for the element type when it needs one but the user hasn't specified one. Everyone else (Microsoft, Mono, LinqBridge) has gone with delegating to the collection's implementation of ICollection<T>.Contains. That plays well in terms of consistency of what happens if you call Contains on that object, so that it doesn't matter what the compile-time type is. That's a debate to go into in another post, but I just want to point out that this is not an example of optimization. In some cases it may be faster (notably for HashSet<T>) but it stands a very good chance of changing the behaviour. There is absolutely nothing to suggest that the equality comparer used by ICollection<T> should be the default one for the type - and in some cases it definitely isn't.

It's therefore a matter of what result we want to get, not how to get that result faster. It's correctness, not optimization - but both the LinqBridge and Mono tests which fail for Edulinq are called "Contains_CollectionOptimization_ReturnsTrueWithoutEnumerating" - and I think that shows a mistaken way of thinking about this.

Can we go further?

I've been considering a couple of optimizations which I believe to be perfectly legitimate, but which none of the implementations I've seen have used. One reason I haven't implemented them myself yet is that they will reduce the effectiveness of all my unit tests. You see, I've generally used Enumerable.Range as a good way of testing a non-list-based sequence... but what's to stop Range and Repeat being implemented as IList<T> implementations?

All the non-mutating members are easy to implement, and we can just throw exceptions from the mutating members (as other read-only collections do).

Would this be more efficient? Well yes, if you ever performed a Count(), ElementAt(), ToArray(), ToList() etc operation on a range or a repeated element... but how often is that going to happen? I suspect it's pretty rare - probably rare enough not to make it worth my time, particularly when you then consider all the tests that would have to be rewritten to use something other than Range when I wanted a non-list sequence...

Conclusion

Surprise, surprise - doing optimization well is difficult. When it's obvious what can be done, it's not obvious what should be done... and sometimes it's not even what is valid in the first place.

Note that none of this has really talked about data structures and algorithms. I looked at some options when implementing ordering, and I'm still thinking about the best approach for implementing TopBy (probably either a heap or a self-balancing tree - something which could take advantage of the size being constant would be nice) - but in general the optimizations here haven't required any cunning knowledge of computer science. That's quite a good thing, because it's many years since I've studied CS seriously...

I suspect that with this post more than almost any other, I'm likely to want to add extra items in the future (or amend mistakes which reveal my incompetence). Watch this space.

Next up, I think it would be worth revisiting query expressions from scratch. Anyone who's read C# in Depth or has followed this blog for long enough is likely to be able to skip it, but I think the series would be incomplete without a quick dive into the compiler translations involved.

Posted by skeet | 8 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 39 - Comparing implementations

While implementing Edulinq, I only focused on two implementations: .NET 4.0 and Edulinq. However, I was aware that there were other implementations available, notably LinqBridge and the one which comes with Mono. Obviously it's interesting to see how other implementations behave, so I've now made a few changes in order to make the test code run in these different environments.

The test environments

I'm using Mono 2.8 (I can't remember the minor version number offhand) but I tend to think of it as "Mono 3.5" or "Mono 4.0" depending on which runtime I'm using and which base libraries I'm compiling against, to correspond with the .NET versions. Both runtimes ship as part of Mono 2.8. I will use these version numbers for this post, and ask forgiveness for my lack of precision: whenever you see "Mono 3.5" please just think "Mono 2.8 running against the 2.0 runtime, possibly using some of the class libraries normally associated with .NET 3.5".

LinqBridge is a bit like Edulinq - a clean room implementation of LINQ to Objects, but built against .NET 2.0. It contains its own Func delegate declarations and its own version of ExtensionAttribute for extension methods. In my experience this makes it difficult to use with the "real" .NET 3.5, so my build targets .NET 2.0 when running against LinqBridge. This means that tests using HashSet had to be disabled. The version of LinqBridge I'm running against is 1.2 - the latest binary available on the web site. This has AsEnumerable as a plain static method rather than an extension method; the code has been fixed in source control, but I wanted to run against a prebuilt binary, so I've just disabled my own AsEnumerable tests for LinqBridge. Likewise the tests for Zip are disabled both for LinqBridge and the "Mono 3.5" tests as Zip was only introduced in .NET 4.

The other issue of not having .NET 4 available in the tests is that the string.Join<T>(string, IEnumerable<T>) overload is unavailable - something I'd used quite a lot in the test code. I've created a new static class called "StringEx" and replaced string.Join with StringEx.Join everywhere.

There are batch files under a new "testing" directory which will build and run:

  • Microsoft's LINQ to Objects and Edulinq under .NET
  • LinqBridge, Mono 3.5's LINQ to Objects and Edulinq under Mono 3.5
  • Mono 4.0's LINQ to Objects and Edulinq under Mono 4.0

Although I have LinqBridge running under .NET 2.0 in Visual Studio, it's a bit of a pain building the tests from a batch file (at least without just calling msbuild). The failures running under Mono 3.5 are the same as those running under .NET 2.0 as far as I can tell, so I'm not too worried.

Note that while I have built the Mono tests under both the 3.5 and 4.0 profiles, the results were the same other than due to generic variance, so I've only included the results of the 4.0 profile below.

What do the tests cover?

Don't forget that the Edulinq tests were written in the spirit of investigation. They cover aspects of LINQ's behaviour which are not guaranteed, both in terms of optimization and simple correctness of behaviour. I have included a test which demonstrates the "issue" with calling Contains on an ICollection<T> which uses a non-default equality comparer, as well as the known issue with OrderByDescending using a comparer which returns int.MinValue. There are optimizations which are present in Edulinq but not in LINQ to Objects, and I have tests for those, too.

The tests which fail against Microsoft's implementation (for known reasons) are normally marked with an [Ignore] attribute to prevent them from alarming me unduly during development. NUnit categories would make more sense here, but I don't believe ReSharper supports them, and that's the way I run the tests normally. Likewise the tests which take a very long time (such as counting more than int.MaxValue elements) are normally suppressed.

In order to truly run all my tests, I now have a horrible hack using conditional compilation: if the ALL_TESTS preprocessor symbol is defined, I build my own IgnoreAttribute class in the Edulinq.Tests namespace, which effectively takes precedence over the NUnit one... so NUnit will ignore the [Ignore], so to speak. Frankly all this conditional compilation is pretty horrible, and I wouldn't use it for a "real" project, but this is a slightly unusual situation.

EDIT: It turns out that ReSharper does support categories. I'm not sure how far that support goes yet, but at the very least there's "Group by categories" available. I may go through all my tests and apply a category to each one: optimization, execution mode, time-consuming etc. We'll see whether I can find the energy for that :)

So, let's have a look at what the test results are...

Edulinq

Unsurprisingly, Edulinq passes all its own tests, with the minor exception of CastTest.OriginalSourceReturnedDueToGenericCovariance running under Mono 3.5, which doesn't include covariance. Arguably this test should be conditionalised to not even run in that situation, as it's not expected to work.

Microsoft's LINQ to Objects

8 failures, all expected:

  • Contains delegates to the ICollection<T>.Contains implementation if it exists, rather than using the default comparer for the type. This is a design and documentation issue which I've discussed in more detail in the Contains part of this series.
  • Optimization: ElementAt and ElementAtOrDefault don't validate the specified index eagerly when the input sequence implements ICollection<T> but not IList<T>.
  • Optimization: OfType always uses an intermediate iterator even when the input sequence already implements IEnumerable<T> and T is a non-nullable value type.
  • Optimization: SequenceEqual doesn't compare the counts of the sequences eagerly even when both sequences implement ICollection<T>
  • Correctness: OrderByDescending doesn't work if you use a key comparer which returns int.MinValue
  • Consistency: Single and SingleOrDefault (with a predicate) don't throw InvalidOperationException as soon as they encounter a second element matching the predicate; the predicate-less overloads do throw as soon as they see a second element.

All of these have been discussed already, so I won't go into them now.

LinqBridge

LinqBridge had a total of 33 failures. I haven't looked into them in detail, but just going from the test output I've broken them down into the following broad categories:

  • Optimization:
    • Cast never returns the original source, presumably always introducing an intermediate iterator.
    • All three of Microsoft's "missed opportunities" listed above are also missed in LinqBridge
  • Use of input sequences:
    • Except and Intersect appear to read the first sequence first (possibly completely?) and then the second sequence. Edulinq and LINQ to Objects read the second sequence completely and then stream the first sequence. This behaviour is undocumented.
    • Join, GroupBy and GroupJoin appear not to be deferred at all. If I'm right, this is a definite bug.
    • Aggregation accuracy: both Average and Sum over an IEnumerable<float> appear to use a float accumulator instead of a double. This is probably worth fixing for the sake of both range and accuracy, but isn't specified in the documentation.
    • OrderBy (etc) appears to apply the key selector multiple times while sorting. The behaviour here isn't documented, but as I mentioned before, it could produce performance issues unnecessarily.
  • Exceptions:
    • ToDictionary should throw an exception if you give it duplicate keys; it appears not to - at least when a custom comparer is used. (It's possible it's just not passing the comparer along.)
    • The generic Max and Min methods don't return the null value for the element type when that type is nullable. Instead, they throw an exception - which is the normal behaviour if the element type is non-nullable. This behaviour isn't well documented, but is consistent with the behaviour of the non-generic overloads. See the Min/Max post for more details.
  • General bugs:
    • The generic form of Min/Max appears not to ignore null values when the element type is nullable.
    • OrderByDescending appears to be broken in the same way as Microsoft's implementation
    • Range appears to be broken around its boundary testing.
    • Join, GroupJoin, GroupBy and ToLookup break when presented with null keys

Mono 4.0 (and 3.5, effectively)

Mono failed 18 of the tests. There are fewer definite bugs than in LinqBridge, but it's definitely not perfect. Here's the breakdown:

  • Optimization:
    • Mono misses the same three opportunities that LinqBridge and Microsoft miss.
    • Contains(item) delegates to ICollection<T> when it's implemented, just like in the Microsoft implementation. (I assume the authors would call this an "optimization", hence its location in this section.) I believe that LinqBridge has the same behaviour, but that test didn't run in the LinqBridge configuration as it uses HashSet.
  • Average/Sum accumulator types:
    • Mono appears to use float when working with float values, leading to more accumulator error than is necessary.
  • Average overflow for integer types
    • Mono appears to use checked arithmetic when summing a sequence, but not when taking the average of a sequence. So the average of { long.MaxValue, long.MaxValue, 2 } is 0. (This originally confused me into thinking it was using floating point types during the summation, but I now believe it's just a checked/unchecked issue.)
  • Bugs:
    • Count doesn't overflow either with or without a predicate
    • The Max handling of double.NaN isn't in line with .NET. I haven't investigated the reason for this yet.
    • OrderByDescending is broken in the same way as for LinqBridge and the Microsoft implementation.
    • Range is broken for both Range(int.MinValue, 0) and Range(int.MaxValue, 1). Test those boundary cases, folks :)
    • When reversing a list, Mono doesn't buffer the current contents. In other words, changes made while iterating over the reversed list are visible in the returned sequence. The documentation isn't very clear about the desired behaviour here, admittedly.
    • GroupJoin and Join match null keys, unlike Microsoft's implementation.

How does Edulinq fare against other unit tests?

It didn't seem fair to only test other implementations against the Edulinq tests. After all, it's only natural that my tests should work against my own code. What happens if we run the Mono and LinqBridge tests against my code?

The LinqBridge tests didn't find anything surprising. There were two failures:

  • I don't have the "delegate Contains to ICollection<T>.Contains" behaviour, which the tests check for.
  • I don't optimize First in the case of the collection implementing IList<T>. I view this as a pretty dubious optimization to be honest - I doubt that creating an iterator to get to the first item is going to be much slower than checking for IList<T>, fetching the count, and then fetching the first item via the indexer... and it means that all non-list implementations also have to check whether the sequence implements IList<T>. I don't intend to change Edulinq for this.

The Mono tests picked up the same two failures as above, and two genuine bugs:

  • By implementing Take via TakeWhile, I was iterating too far: in order for the condition to become false, we had to iterate to the first item we wouldn't return.
  • ToLookup didn't accept null keys - a fault which propagated to GroupJoin, Join and GroupBy too. (EDIT: It turns out that it's more subtle than that. Nothing should break, but the MS implementation ignores null keys for Join and GroupJoin. Edulinq now does the same, but I've raised a Connect issue to suggest this should at least be documented.)

I've fixed these in source control, and will add an addendum to each of the relevant posts (Take, ToLookup) when I have a moment spare.

There's one additional failure, trying to find the average of a sequence of two Int64.MaxValue values. That overflows on both Edulinq and LINQ to Objects - that's the downside of using an Int64 to sum the values. As mentioned, Mono suffers a degree of inaccuracy instead; it's all a matter of trade-offs. (A really smart implementation might use Int64 while possible, and then go up to using Double where necessary, I suppose.)

Unfortunately I don't have the tests for the Microsoft implementation, of course... I'd love to know whether there's anything I've failed with there.

Conclusion

This was very interesting - there's a mixture of failure conditions around, and plenty of "non-failures" where each implementation's tests are enforcing their own behaviour.

I do find it amusing that all three of the "mainstream" implementations have the same OrderByDescending bug though. Other than that, the clear bugs between Mono and LinqBridge don't intersect, which is slightly surprising.

It's nice to see that despite not setting out to create a "production-quality" implementation of LINQ to Objects, that's mostly what I've ended up with. Who knows - maybe some aspects of my implementation or tests will end up in Mono in the future :)

Given the various different optimizations mentioned in this post, I think it's only fitting that next time I'll discuss where we can optimize, where it's worth optimizing, and some more tricks we could still pull out of the bag...

Posted by skeet | 11 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 38 - What's missing?

I mentioned before that the Zip operator was only introduced in .NET 4, so clearly there's a little wiggle room for LINQ to Object's query operators to grow in number. This post mentions some of the ones I think are most sorely lack - either because I've wanted them myself, or because I've seen folks on Stack Overflow want them for entirely reasonable use cases.

There is an issue with respect to other LINQ providers, of course: as soon as some useful operators are available for LINQ to Objects, there will be people who want to apply them to LINQ to SQL, the Entity Framework and the like. Worse, if they're not included in Queryable with overloads based on expression trees, the LINQ to Objects implementation will silently get picked - leading to what looks like a lovely query performing like treacle while the client slurps over the entire database. If they are included in Queryable, then third party LINQ providers could end up with a nasty versioning problem. In other words, some care is needed and I'm glad I'm not the one who has to decide how new features are introduced.

I've deliberately not looked at the extra set of operators introduced in the System.Interactive part of Reactive Extensions... nor have I looked back over what we've implemented in MoreLINQ (an open source project I started specifically to create new operators). I figured it would be worth thinking about this afresh - but look at both of those projects for actual implementations instead of just ideas.

Currently there's no implementation of any of this in Edulinq - but I could potentially create an "Edulinq.Extras" assembly which made it all available. Let me know if any of these sounds particularly interesting to see in terms of implementation.

FooBy

I love OrderBy and ThenBy, with their descending cousins. They're so much cleaner than building a custom comparer which just performs a comparison between two properties. So why stop with ordering? There's a whole bunch of operators which could do with some "FooBy" love. For example, imagine we have a list of files, and we want to find the longest one. We don't want to perform a total ordering by size descending, nor do we want to find the maximum file size itself: we want the file with the maximum size. I'd like to be able to write that query as:

FileInfo biggestFile = files.MaxBy(file => file.Length);

Note that we can get a similar result by performing one pass to find the maximum length, and then another pass to find the file with that length. However, that's inefficient and assumes we can read the sequence twice (and get the same results both times). There's no need for that. We could get the same result using Aggregate with a pretty complicated aggregation, but I think this is a sufficiently common case to deserve its own operator.

We'd want to specify which value would be returned if multiple files had the same length (my suggestion would be the first one we encountered with that length) and we could also specify a key comparer to use. The signatures would look like this:

public static TSource MaxBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static TSource MaxBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)

Now it's not just Max and Min that gain from this "By" idea. It would be useful to apply the same idea to the set operators. The simplest of these to think about would be DistinctBy, but UnionBy, IntersectBy and ExceptBy would be reasonable too. In the case of ExceptBy and IntersectBy we could potentially take the key collection to indicate the keys of the elements we wanted to exclude/include, but it would probably be more consistent to force the two input sequences to be of the same type (as they would have to be for UnionBy and IntersectBy of course). ContainsBy might be useful, but that would effectively be a Select followed by a normal Contains - possibly not useful enough to merit its own operator.

TopBy and TopByDescending

These may sound like they belong in the FooBy section, but they're somewhat different: they're effectively specializations of OrderBy and OrderByDescending where you already know how many elements you want to preserve. The return type would be IOrderedEnumerable<T> so you could still use ThenBy/ThenByDescending as normal. That would make the following two queries equivalent - but the second might be a lot more efficient than the first:

var takeQuery = people.OrderBy(p => p.LastName)
                      .ThenBy(p => p.FirstName)
                      .Take(3);

var topQuery = people.TopBy(p => p.LastName, 3)
                     .ThenBy(p => p.FirstName);

An implementation could easily delegate to various different strategies depending on the number given - for example, if you asked for more than 10 values, it may not be worth doing anything more than a simple sort and restrict the output. If you asked for just the top 3 values, that could return an IOrderedEnumerable implementation specifically hard-coded to 3 values, etc.

Aside from anything else, if you were confident in what the implementation did (and that's a very big "if") you could use a potentially huge input sequence with such a query - larger than you could fit into memory in one go. That's fine if you're only keeping the top three values you've seen so far, but would fail for a complete ordering, even one which was able to yield results before performing all the ordering: if it doesn't know you're going to stop after three elements, it can't throw anything away.

Perhaps this is too specialized an operator - but it's an interesting one to think about. It's worth noting that this probably only makes sense for LINQ to Objects, which never gets to see the whole query in one go. Providers like LINQ to SQL can optimize queries of the form OrderBy(...).ThenBy(...).Take(...) because by the time they need to translate the query into SQL, they will have an expression tree representation which includes the "Take" part.

TryFastCount and TryFastElementAt

One of the implementation details of Edulinq is its TryFastCount method, which basically encapsulates the logic around attempting to find the count of a sequence if it implements ICollection or ICollection<T>. Various built-in LINQ operators find this useful, and anyone writing their own operators has a reasonable chance of bumping into it as well. It seems pointless to duplicate the code all over the place... why not expose it? The signatures might look something like this:

public static bool TryFastCount<TSource>(
    this IEnumerable<TSource> source,
    out int count)

public static bool TryFastElementAt<TSource>(
    this IEnumerable<TSource> source,
    int index,
    out TSource value)

I would expect TryFastElementAt to use the indexer if the sequence implemented IList<T> without performing any validation: that ought to be the responsibility of the caller. TryFastCount could use a Nullable<int> return type instead of the return value / out parameter split, but I've kept it consistent with the methods which exist elsewhere in the framework

Scan and SelectAdjacent

These are related operators in that they deal with wanting a more global view than just the current element. Scan would act similarly to Aggregate - except that it would yield the accumulator value after each element. Here's an example of keeping a running total:

// Signature:
public static IEnumerable<TAccumulate> Scan<TSource, TAccumulate>(
    this IEnumerable<TSource> source,
    TAccumulate seed,
    Func<TAccumulate, TSource, TAccumulate> func)
    

int[] source = new int[] { 3, 5, 2, 1, 4 };
var query = source.Scan(0, (current, item) => current + item);
query.AssertSequenceEqual(3, 8, 10, 11, 15);

There could be a more complicated overload with an extra conversion from TAccumulate to an extra TResult type parameter. That would let us write a Fibonacci sequence query in one line, if we really wanted to...

The SelectAdjacent operator would simply present a selector function with pairs of adjacent items. Here's a similar example, this time calculating the difference between each pair:

// Signature:
public static IEnumerable<TResult> SelectAdjacent<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TSource, TResult> selector)
    

int[] source = new int[] { 3, 5, 2, 1, 4 };
var query = source.SelectAdjacent((current, next) => next - current);
query.AssertSequenceEqual(2, -3, -1, 3);

One oddity here is that the result sequence always contains one item fewer than the source sequence. If we wanted to keep the length the same, there are various approaches we could take - but the best one would depend on the situation.

This sounds like a pretty obscure operator, but I've actually seen quite a few LINQ questions on Stack Overflow where it could have been useful. Is it useful often enough to deserve its own operator? Maybe... maybe not.

DelimitWith

This one is really just a bit of a peeve - but again, it's a pretty common requirement. We often want to take a sequence and create a single string which is (say) a comma-delimited version. Yay, String.Join does exactly what we need - particularly in .NET 4, where there's an overload taking IEnumerable<T> so you don't need to convert it to a string array first. However, it's still a static method on string - and the name "Join" also looks slightly odd in the context of a LINQ query, as it's got nothing to do with a LINQ-style join.

Compare these two queries: which do you think reads better, and feels more "natural" in LINQ?

// Current state of play...
var names = string.Join(",",
                        people.Where(p => p.Age < 18)
                              .Select(p => p.FirstName));

// Using DelimitWith
var names = people.Where(p => p.Age < 18)
                  .Select(p => p.FirstName)
                  .DelimitWith(",");

I know which I prefer :)

ToHashSet

(Added on February 23rd 2011.)

I'm surprised I missed this one first time round - I've bemoaned its omission in various places before now. It's easy to create a list, dictionary, lookup or array from an anonymous type, but you can't create a set that way. That's mad, given how simple the relevant operator is, even with an overload for a custom equality comparer:

public static HashSet<TSource> ToHashSet<TSource>(
    this IEnumerable<TSource> source)
{
    return source.ToHashSet(EqualityComparer<TSource>.Default);
}

public static HashSet<TSource> ToHashSet<TSource>(
    this IEnumerable<TSource> source,
    IEqualityComparer<TSource> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    return new HashSet<TSource>(source, comparer ?? EqualityComparer<TSource>.Default);
}

This also makes it much simpler to create a HashSet in a readable way from an existing query expression, without either wrapping the whole query in the constructor call or using a local variable.

Conclusion

These are just the most useful extra methods I thought of, based on the kinds of query folks on Stack Overflow have asked about. I think it's interesting that some are quite general - MaxBy, ExceptBy, Scan and so on - whereas others (TopBy, SelectAdjacent and particularly DelimitWith) are simply aimed at making some very specific but common situations simpler. It feels to me like the more general operators really are missing from LINQ - they would fit quite naturally - but the more specific ones probably deserve to be in a separate static class, as "extras".

This is only scratching the surface of what's possible, of course - System.Interactive.EnumerableEx in Reactive Extensions has loads of options. Some of them are deliberate parallels of the operators in Observable, but plenty make sense on their own too.

One operator you may have expected to see in this list is ForEach. This is a controversial topic, but Eric Lippert has written about it very clearly (no surprise there, then). Fundamentally LINQ is about querying a sequence, not taking action on it. ForEach breaks that philosophy, which is why I haven't included it here. Usually a foreach statement is a perfectly good alternative, and make the "action" aspect clearer.

Posted by skeet | 25 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 37 - Guiding principles

Now that I'm "done" reimplementing LINQ to Objects - in that I've implemented all the methods in System.Linq.Enumerable - I wanted to write a few posts looking at the bigger picture. I'm not 100% sure of what this will consist of yet; I want to avoid this blog series continuing forever. However, I'm confident it will contain (in no particular order):

  • This post: principles governing the behaviour of LINQ to Objects
  • Missing operators: what else I'd have liked to see in Enumerable
  • Optimization: where the .NET implementation could be further optimized, and why some obvious-sounding optimizations may be inappropriate
  • How query expression translations work, in brief (and with a cheat sheet)
  • The difference between IQueryable<T> and IEnumerable<T>
  • Sequence identity, the "Contains" issue, and other knotty design questions
  • Running the Edulinq tests against other implementations

If there are other areas you want me to cover, please let me know.

The principles behind the LINQ to Objects implementation

The design LINQ to Objects is built on a few guiding principles, both in terms of design and implementation details. You need to understand these, but also implementations should be clear about what they're doing in these terms too.

Extension method targets and argument validation

IEnumerable<T> is the core sequence type, not just for LINQ but for .NET as a whole. Almost everything is written in terms of IEnumerable<T> at least as input, with the following exceptions:

  • Empty, Range and Repeat don't have input sequences (these are the only non-extension methods)
  • OfType and Cast work on the non-generic IEnumerable type instead
  • ThenBy and ThenByDescending work on IOrderedEnumerable<T>

All operators other than AsEnumerable verify that any input sequence is non-null. This validation is performed eagerly (i.e. when the method is called) even if the operator uses deferred execution for the results. Any delegate used (typically a projection or predicate of some kind) must be non-null. Again, this validation is performed eagerly.

IEqualityComparer<T> is used for all custom equality comparisons. Any parameter of this type may be null, in which case the default equality comparer for the type is used. In most cases the default equality comparer for the type is also used when no custom equality comparer is used, but Contains has some odd behaviour around this. Equality comparers are expected to be able to handle null values. IComparer<T> is only used by the OrderBy/ThenBy operators and their descending counterparts - and only then if you want custom comparisons between keys. Again, a null IComparer<T> means "use the default for the type"

Timing of input sequence "opening"

Any operator with a return type of IEnumerable<T> or IOrderedEnumerable<T> uses deferred execution. This means that the method doesn't read anything from any input sequences until someone starts reading from the result sequence. It's not clearly defined exactly when input sequences will first be accessed - for some operators if may be when GetEnumerator() is called; for others it may be on the first call to MoveNext() on the resulting iterator. Callers should not depend on these slight variations. Deferred execution is common for operators in the middle of queries. Operators which use deferred execution effectively represent queries rather than the results of queries - so if you change the contents of the original source of the query and then iterate over the query itself again, you'll see the change. For example:

List<string> source = new List<string>();
var query = source.Select(x => x.ToUpper());
        
// This loop won't write anything out
foreach (string x in query)
{
    Console.WriteLine(x);
}
        
source.Add("foo");
source.Add("bar");

// This loop will write out "FOO" and "BAR" - even
// though we haven't changed the value of "query"
foreach (string x in query)
{
    Console.WriteLine(x);
}

Deferred execution is one of the hardest parts of LINQ to understand, but once you do, everything becomes somewhat simpler.

All other operators use immediate execution, fetching all the data they need from the input before they return a value... so that by the time they do return, they will no longer see or care about changes to the input sequence. For operators returning a scalar value (such as Sum and Average) this is blatantly obvious - the value of a variable of type double isn't going to change just because you've added something to a list. However, it's slightly less for the "ToXXX" methods: ToLookup, ToArray, ToList and ToDictionary. These do not return views on the original sequence, unlike the "As" methods: AsEnumerable which we've seen, and Queryable.AsQueryable which I didn't implement. Focus on the prefix part of the name: the "To" part indicates a conversion to a particular type. The "As" prefix indicates a wrapper of some kind. This is consistent with other parts of the framework, such as List<T>.AsReadOnly and Array.AsReadOnly<T>.

Very importantly, LINQ to Objects only iterates over any input sequence at most once, whether the execution is deferred or immediate. Some operators would be easier to implement if you could iterate over the input twice - but it's important that they don't do so. Of course if you provide the same sequence for two inputs, it will treat those as logically different sequences. Similarly if you iterate over a result sequence more than once (for operators that return IEnumerable<T> or a related interface, rather than List<T> or an array etc), that will iterate over the input sequence again.

This means it's fine to use LINQ to Objects with sequences which may only be read once (such as a network stream), or which are relatively expensive to reread (imagine a log file reader over a huge set of logs) or which give inconsistent results (imagine a sequence of random numbers). In some cases it's okay to use LINQ to Objects with an infinite sequence - in others it's not. It's usually fairly obvious which is the case.

Timing of input sequence reading, and memory usage

Where possible within deferred execution, operators act in a streaming fashion, only reading from the input sequence when they have to, and "forgetting" data as soon as they can. This allows for long - potentially infinite - sequences to be handled elegantly without memory running out.

Some operators naturally need to read all the data in before they can return anything. The most obvious example of this is Reverse, which will always yield the last element of the input stream as the first element in the result stream.

A third pattern occurs with operators such as Distinct, which yield data as they go, but accumulate elements too, taking more and more memory until the caller stops iterating (usually either by jumping out of the foreach loop, or letting it terminate naturally).

Where an operator takes two input sequences - such as Join - you need to understand the consumption of each one separately. For example, Join uses deferred execution, but as soon as you ask for the first element of the result set, it will read the "second" sequence completely and buffer it - whereas the "first" sequence is streamed. This isn't the case for all operators with two inputs, of course - Zip streams both input sequences, for example. Check the documentation - and the relevant Edulinq blog post - for details.

Obviously any operator which uses immediate execution has to read all the data it's interested in before it returns. This doesn't necessarily mean they will read to the end of the sequence though, and they may not need to buffer the data they read. (Simple examples are ToList which has to keep everything, and Sum which doesn't.)

Queries vs data

Closely related to the details of when the input is read is the concept of what the result of an operator actually represents. Operators which use deferred execution return queries: each time you iterate over the result sequence, the query will look at the input sequence again. The query itself doesn't contain the data - it just knows how to get at the data.

Operators which use immediate execution work the other way round: they read all the data they need, and then forget about the input sequence. For operators like Average and Sum this is obvious as it's just a simple scalar value - but for operators like ToList, ToDictionary, ToLookup and ToArray, it means that the operator has to make a copy of everything it needs. (This is potentially a shallow copy of course - depending on what user-defined projections are applied. The normal behaviour of mutable reference types is still valid.)

I realise that in many ways I've just said the same thing multiple times now - but hopefully that will help this crucial aspect of LINQ behaviour sink in, if you were still in any doubt.

Exception handling

I'm unaware of any situation in which LINQ to Objects will catch an exception. If your predicate or projection throws an exception, it will propagate in the obvious way.

However, LINQ to Objects does ensure that any iterator it reads from is disposed appropriately - assuming that the caller disposes of any result sequences properly, of course. Note that the foreach statement implicitly disposes of the iterator in a finally block.

Optimization

Various operators are optimized when they detect at execution time that the input sequence they're working on offers a shortcut.

The types most commonly detected are:

  • ICollection<T> and ICollection for their Count property
  • IList<T> for its random access indexer

I'll look at optimization in much more detail in a separate post.

Conclusion

This post has not been around the guiding principles behind LINQ itself - lambda calculus or anything like that. It's more been a summary of the various aspects of behaviour we've seen across the various operators we've implemented. They're the rules I've had to follow in order to make Edulinq reasonably consistent with LINQ to Objects.

Next time I'll talk about some of the operators which I think should have made it into the core framework, at least for LINQ to Objects.

Posted by skeet | 5 comment(s)
Filed under: , ,

Gotcha around iterator blocks

This will be a short interlude from Edulinq, although as it will become obvious, it was in the process of writing the next Edulinq post that I made this discovery. It may be known about elsewhere, but it's the first time I'd come across it - or rather, it's the first time I've considered the impact of the compiler's decisions in this area.

Sequences and iterators

This post is about details of iterator blocks and memory usage. It's going to be quite important that you can easily follow exactly what I'm talking about, so I'm going to define a few things up-front.

When I talk about a sequence, I mean an IEnumerable<T>. So a List<T> is a sequence, as is the return value of Enumerable.Range, or many of the other LINQ queries.

When I talk about an iterator, I mean the result of calling GetEnumerator() on a sequence - it's an IEnumerator<T>. In particular, there can be many iterators corresponding to a single sequence at the same time. A foreach loop will take a sequence and hide the mechanics of obtaining the iterator and calling MoveNext()/Current on it repeatedly.

The C# language specification calls a sequence an enumerable object and it calls an iterator an enumerator object. However, those terms are easy to get mixed up - partly because they differ in so few letters - hence my use of sequence and iterator.

An iterator block is a method declared to return IEnumerable, IEnumerator, IEnumerable<T> or IEnumerator<T>, and containing one or more "yield" statements. The compiler deals with iterator blocks differently to normal methods. It creates an extra type behind the scenes, effectively converting local variables of the iterator block into instance variables in the extra type. The method itself is stripped bare, so that it just creates an object of the new type, without running any of the original source code until MoveNext() is first called on the corresponding iterator (either immediately or after a call to GetEnumerator(), depending on whether the method is declared to return a sequence or an iterator).

Compiler optimizations

The details of exactly how the compiler transforms iterator blocks is outside the scope of the C# specification. It does talk about some details, and gives an implementation example, but there's plenty of scope for compiler-specific variation.

The Microsoft C# compiler does a "neat trick" which is the source of the gotcha I'm describing today. When it creates a hidden type to represent a sequence, the same type is used to represent the iterator, and the first time that GetEnumerator() is called on the same thread as the one which created the sequence, it returns a reference to "this" rather than creating a new object.

This is easier to follow with an example. Consider this method:

static IEnumerable<int> GetSimpleSequence(int count)
{
    for (int i = 0; i < count; i++)
    {
        yield return i;
    }
}

After compilation, my method ended up looking like this:

private static IEnumerable<int> GetSimpleSequence(int count)
{
    <GetSimpleSequence>d__0 d__ = new <GetSimpleSequence>d__0(-2);
    d__.<>3__count = count;
    return d__;
}

We can see that the <>3__count field of the generated type represents the initial value for the "count" parameter - although my code happens not to change the parameter value, it's important that it's preserved for all iterators generated from the same sequence. The -2 argument to the constructor is to represent the state of the object: -2 is used to represent a sequence rather than an iterator.

I'm not going into all the details of the generated type in this post (I have an article which goes into some more details) but the important part is the GetEnumerator() method:

[DebuggerHidden]
IEnumerator<int> IEnumerable<int>.GetEnumerator()
{
    Test.<GetSimpleSequence>d__0 d__;
    if ((Thread.CurrentThread.ManagedThreadId == this.<>l__initialThreadId
        && (this.<>1__state == -2))
    {
        this.<>1__state = 0;
        d__ = this;
    }
    else
    {
        d__ = new Test.<GetSimpleSequence>d__0(0);
    }
    d__.count = this.<>3__count;
    return d__;
}

The line of "d__ = this;" is the crucial one for the purpose of this post: a sequence's first iterator is itself.

We'll look into the negative ramifications of this in a minute, but first let's think about the upsides. Basically, it means that if we only need to iterate over the sequence once, we only need a single object. It's reasonable to suspect that the "fetch and iterate once" scenario is an extremely common one, so it's fair to optimize for it. I'm not sure it's actually worth the extra complexity introduced though - especially when there can be a heavy penalty in some odd cases. I assume Microsoft has performed more performance testing of this than I have, but they may not have considered the specific issue mentioned in this post. Let's look at it now.

The bulky iterator problem

I'll present the problem using the context in which I discovered it: the LINQ "Distinct" operator. First I'll demonstrate it using the LINQ to Objects implementation, then show a few fixes we can make if we use our own one.

Here's the sample program we'll be using throughout, sometimes with a few modifications:

using System;
using System.Collections.Generic;
using System.Linq;

class Test
{
    static void Main()
    {
        var source = Enumerable.Range(0, 1000000);
        var query = source.Distinct();
        
        ShowMemory("Before Count()");
        query.Count();
        ShowMemory("After Count()");
        GC.KeepAlive(query);
        ShowMemory("After query eligible for GC");
    }
    
    static void ShowMemory(string explanation)
    {
        long memory = GC.GetTotalMemory(true);
        Console.WriteLine("{0,-30} {1}",
                          explanation + ":",
                          memory);
    }
}

Obviously ShowMemory is just a diagnostic method. What we're interested in is what happens in Main:

  • We create a source sequence. In this case it's a range, but it could be anything that returns a lot of distinct elements.
  • We create a query by calling Distinct. So far, so good - we haven't done anything with the range yet, or iterated over anything. We dump the total memory used.
  • We count the elements in the query results: this has to iterate over the query, and will dispose of the iterator afterwards. We dump the total memory used.
  • We call GC.KeepAlive simply to make sure that the query sequence itself isn't eligible for garbage collection until this point. We no longer explicitly have a reference to the iterator used by Count, but we do have a reference to the sequence up until now.
  • We dump memory one final time... but because this occurs after the GC.KeepAlive call (and there are no later references to query) the sequence can be disposed.

Note that the GetTotalMemory(true) call performs a full GC each time. I dare say it's not foolproof, but it gives a pretty reasonable indication of memory that can't be collected yet.

Now in a sensible world, I would expect the memory to stay reasonably constant. The Count() method may well generate a lot of garbage as it iterates over the query, but that's fine - it should all be transient, right?

Wrong. Here's one sample set of results:

Before Count():                47296
After Count():                 16828480
After query eligible for GC:   51140

Until we allow the sequence to be garbage collected, we're left with all the garbage which was associated with the iterator used by Count. "All the garbage" is a set of elements which have previously been returned - Distinct needs to remember them all to make sure it doesn't return any duplicates.

The problem is the "optimization" performed by the C# compiler: the iterator used by Count() is the same object that "query" refers to, so it's got a reference to that large set of integers... a set we no longer actually care about.

As soon as the sequence can be garbage collected, we're back down to modest memory use - the difference between the first and third lines of output is probably due to the code being loaded and JIT compiled over the course of the program.

Note that if we extended our query to call Where, Select etc after Distinct, we'd have the same problem - each sequence in the chain would keep a reference to its own "source" sequence, and eventually you'd get back to the Distinct sequence and all its post-iteration garbage.

Okay, enough doom and gloom... how can we fix this?

Solution #1: undo the compiler optimization (caller code)

If we're aware of the problem, we can sometimes fix it ourselves, even if we only own the "client" code. Our sample program can be fixed with a single line:

using (query.GetEnumerator());

Just put that anywhere before the call to Count(), and we're in the clear. The compiler gives a warning due to a possibly-accidental empty block, mind you... why would we want to ask for a value only to dispose it?

If you remember, the GetEnumerator method only returns "this" the first time you call it. So we're just creating that iterating and immediately disposing of it. None of the code that was originally within the Distinct implementation will be run, so we won't generate any garbage.

Sure enough, the results bear out the theory:

Before Count():                47296
After Count():                 51228
After query eligible for GC:   51140

Unfortunately, this relies on two things:

  • Every caller has to know this is a potential problem. I've never heard anyone mention it before and I only discovered it myself yesterday. Note that it's only likely to be a problem for sequence types which are generated by the C# compiler... how much do you want to be at the mercy of the implementation details of whatever you're calling?
  • It only works if you can force GetEnumerator() to be called on the problematic sequence. If we add a call to Select(x => x) to the end of our query declaration, we're back to square one - because calling GetEnumerator() on the Select sequence doesn't force GetEnumerator() to be called on the Distinct sequence.

It's an interesting fix, but basically not a good idea in the long run.

Solution #2: clean up in the iterator block (implementation code)

Obviously I can't change LINQ to Objects, so at this point I'll resort to a very simplistic custom implementation of Distinct. I'm not going to bother about the overload with a custom equality comparer, and I won't even perform argument validation - although I will split the implementation into two methods as we'd have to for normal argument validation to work eagerly. That will be useful later. Here's the code in its broken state:

public static class CustomEnumerable
{
    public static IEnumerable<TSource> Distinct<TSource>(
        this IEnumerable<TSource> source)
    {
        // Argument validation elided
        return DistinctImpl(source);
    }
    
    private static IEnumerable<TSource> DistinctImpl<TSource>(
        IEnumerable<TSource> source)
    {
        HashSet<TSource> seen = new HashSet<TSource>();
        foreach (TSource item in source)
        {
            if (seen.Add(item))
            {
                yield return item;
            }
        }
    }
}

Readers who've been following my Edulinq series should be pretty familiar with this. It suffers from the same problem as the LINQ to Objects implementation, which is what I'd expect. Now let's do something really horrible - set a "local" variable to null when we know it will no longer be referenced within the method:

private static IEnumerable<TSource> DistinctImpl<TSource>(
    IEnumerable<TSource> source)
{
    HashSet<TSource> seen = new HashSet<TSource>();
    try
    {
        foreach (TSource item in source)
        {
            if (seen.Add(item))
            {
                yield return item;
            }
        }
    }
    finally
    {
        seen = null;
    }
}

This is horrible code. If I hadn't come across this problem, and someone presented me with the above code in a code review, I would by sighing and shaking my head. Setting variables to null at the end of the method for the sake of garbage collection normally shows a hideous misunderstanding of how variables and the GC work.

In this case, it fixes the problem... but at the cost of the ability to look our fellow developers in the eye.

Solution #3: Side-step the compiler optimization (implementation code)

The problem we're facing stems from the compiler optimization used when an iterator block returns a sequence type. What if it only returns an iterator? Well, we need something to implement IEnumerable<T>, so how about we create a type which just delegates the responsibility:

public sealed class DelegatingSequence<T> : IEnumerable<T>
{
    private readonly Func<IEnumerator<T>> func;
    
    public DelegatingSequence(Func<IEnumerator<T>> func)
    {
        this.func = func;
    }
    
    public IEnumerator<T> GetEnumerator()
    {
        return func();
    }
    
    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}

Now we know that the sequence will be separate from the iterator. Of course we could be given a delegate which does the wrong thing, but it's easy to use correctly. Here's the corresponding code for Distinct:

public static IEnumerable<TSource> Distinct<TSource>(
    this IEnumerable<TSource> source)
{
    // Argument validation elided
    return new DelegatingSequence<TSource>(() => DistinctImpl(source));
}
    
private static IEnumerator<TSource> DistinctImpl<TSource>(
    IEnumerable<TSource> source)
{
    // Original code without try/finally
}

This now solves the "bulky iterator problem", and it's a pretty easy fix to propagate wherever it's necessary... but again, only if you're aware of the issue to begin with.

Solution #4: Make the optimization optional (compiler code)

There are already a few attributes which change the generated IL: we could add another one to suppress this optimization, and apply it to any iterator blocks we want to behave differently:

[AliasSequenceAndIterator(false)]
private static IEnumerable<TSource> DistinctImpl<TSource>(
    IEnumerable<TSource> source)

I very much doubt that this will make it into the C# compiler any time soon - and frankly I'm not too happy with it anyway, especially as it still requires anyone using an iterator block to be aware of the issue.

Solution #5: Make the compiler perform solution #2 automatically (compiler code)

The compiler knows what "local variables" we're using. It's already implementing a Dispose method for various reasons, including setting the iterator state appropriately. Why can't it just set all the fields associated with local variables from the source code to their default values as part of disposal?

There's one reason I can think of, although it would come up rarely: the variables might be captured by anonymous functions within the iterator block. The delegates created from those anonymous functions may still be alive somewhere. Fortunately, the compiler is presumably aware of which variables have been captured. It could simply have a policy of turning off the optimization completely for iterator blocks which capture variables in anonymous functions, and clearing the appropriate variables for all the other situations.

At the same time, the generated code could be changed to clear the field which underlies the Current property - at the moment, if you ask an auto-generated iterator for Current after you've finished iterating over it, the result will be the last value yielded. This could be another subtle pseudo-leak, unlikely as it is. (Aside from anything else, it's just not neat at the moment.)

Solution #6: Just disable the optimization completely (compiler code)

I'd love to see the data on which Microsoft based the decision to optimize the generated code this way. I have confidence that there is such data somewhere - although I wonder whether it has been updated since the advent of LINQ, which I suspect is one of the heaviest users of autogenerated iterators. It's quite possible that this is still a "win" in most cases - just like the dodgy-sounding mutable struct iterator returned by the public List<T>.GetEnumerator is a win in very many cases, and only causes issues if you're being awkward. I think it's worth a re-evaluation though :)

If the optimization were removed, there would be two options:

  • Stick with a single generated type impementing both the sequence and iterator interfaces, but always create a new instance for iterators
  • Create one sequence type and one iterator type

Personally the second sounds cleaner to me - and it means that each instance would only need the fields relevant to it. But cleanliness isn't terribly important for generated code, so long as it doesn't have significant issues, so I wouldn't be too upset if the team took the first approach. I expect that in either case it would simplify the code - there'd be no need to remember the thread a sequence was created on, and so on.

Conclusion

The very fact that I've never heard of this problem before suggests that it's not very serious - but it feels serious. While it's possibly rare to hold onto a query for a long time, it ought to be reasonable to do so without incurring a memory penalty related to the size of the query the first time it was executed. This doesn't affect all query operators, of course - and I haven't investigated which ones it does affect. The ones I'd look at are those which have to buffer their data for some reason - including ones which operate on two sequences, buffering one and streaming the other.

Out of all the possible solutions, I would personally favour 5 or 6. This really ought to be a compiler fix - it's unreasonable to suggest that all developers should become aware of this idiosyncrasy and work round it. The part of me which longs for simplicity prefers solution 6; in reality I suspect that 5 is more likely. It's quite possible that the C# team will decide that it's a sufficiently harmless situation that they won't fix it at all - and I'm not going to claim to know their product and priorities better than they do. (Obviously I'm making them aware of my findings, but not in a fix-this-or-else way.)

Whatever happens, it's been a real blast investigating this... I never get tired of learning more about the details of source transformations and their (sometimes inadvertent) side-effects.

Posted by skeet | 14 comment(s)
Filed under:

Reimplementing LINQ to Objects: Part 36 - AsEnumerable

Our last operator is the simplest of all. Really, really simple.

What is it?

AsEnumerable has a single signature:

public static IEnumerable<TSource> AsEnumerable<TSource>(this IEnumerable<TSource> source)

I can describe its behaviour pretty easily: it returns source.

That's all it does. There's no argument validation, it doesn't create another iterator. It just returns source.

You may well be wondering what the point is... and it's all about changing the compile-time type of the expression. I'm going to take about IQueryable<T> in another post (although probably not implement anything related to it) but hopefully you're aware that it's usually used for "out of process" queries - most commonly in databases.

Now it's not entirely uncommon to want to perform some aspects of the query in the database, and then a bit more manipulation in .NET - particularly if there are aspects you basically can't implement in LINQ to SQL (or whatever provider you're using). For example, you may want to build a particular in-memory representation which isn't really amenable to the provider's model.

In that case, a query can look something like this:

var query = db.Context
              .Customers
              .Where(c => some filter for SQL)
              .OrderBy(c => some ordering for SQL)
              .Select(c => some projection for SQL)
              .AsEnumerable() // Switch to "in-process" for rest of query
              .Where(c => some extra LINQ to Objects filtering)
              .Select(c => some extra LINQ to Objects projection);

All we're doing is changing the compile-time type of the sequence which is propagating through our query from IQueryable<T> to IEnumerable<T> - but that means that the compiler will use the methods in Enumerable (taking delegates, and executing in LINQ to Objects) instead of the ones in Queryable (taking expression trees, and usually executing out-of-process).

Sometimes we could do this with a simple cast or variable declaration. However, for one thing that's ugly, whereas the above query is fluent and quite readable, so long as you appreciate the importance of AsEnumerable. The more important point is that it's not always possible, because we may very well be dealing with a sequence of an anonymous type. An extension method lets the compiler use type inference to work out what the T should be for IEnumerable<T>, but you can't actually express that in your code.

In short - it's not nearly as useless an operator as it seems at first sight. That doesn't make it any more complicated to test or implement though...

What are we going to test?

In the spirit of exhaustive testing, I have actually tested:

  • A normal sequence
  • A null reference
  • A sequence which would throw an exception if you actually tried to use it

The tests just assert that the result is the same reference as we've passed in.

I have one additional test which comes as close as I can to demonstrating the point of AsEnumerable without using Queryable:

[Test]
public void AnonymousType()
{
    var list = new[] { 
        new { FirstName = "Jon", Surname = "Skeet" },
        new { FirstName = "Holly", Surname = "Skeet" }
    }.ToList();

    // We can't cast to IEnumerable<T> as we can't express T.
    var sequence = list.AsEnumerable();
    // This will now use Enumerable.Contains instead of List.Contains
    Assert.IsFalse(sequence.Contains(new { FirstName = "Tom", Surname = "Skeet" }));
}

And finally...

Let's implement it!

There's not much scope for an interesting implementation here I'm afraid. Here it is, in its totality:

public static IEnumerable<TSource> AsEnumerable<TSource>(this IEnumerable<TSource> source)
{
    return source;
}

It feels like a fittingly simple end to the Edulinq implementation.

Conclusion

I think that's all I'm going to actually implement from LINQ to Objects. Unless I've missed something, that covers all the methods of Enumerable from .NET 4.

That's not the end of this series though. I'm going to take a few days to write up some thoughts about design choices, optimizations, other operators which might have been worth including, and a little bit about how IQueryable<T> works.

Don't forget that the source code is freely available on Google Code. I'll be happy to patch any embarrassing bugs :)

Posted by skeet | 2 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 35 - Zip

Zip will be a familiar operator to any readers who use Python. It was introduced in .NET 4 - it's not entirely clear why it wasn't part of the first release of LINQ, to be honest. Perhaps no-one thought of it as a useful operator until it was too late in the release cycle, or perhaps implementing it in the other providers (e.g. LINQ to SQL) took too long. Eric Lippert blogged about it in 2009, and I find it interesting to note that aside from braces, layout and names we've got exactly the same code. (I read the post at the time of course, but implemented it tonight without looking back at what Eric had done.) It's not exactly surprising, given how trivial the implementation is. Anyway, enough chit-chat...

What is it?

Zip has a single signature, which isn't terribly complicated:

public static IEnumerable<TResult> Zip<TFirst, TSecond, TResult>(
    this IEnumerable<TFirst> first,
    IEnumerable<TSecond> second,
    Func<TFirst, TSecond, TResult> resultSelector)

Just from the signature, the name, and experience from the rest of this blog series it should be easy enough to guess what Zip does:

  • It uses deferred execution, not reading from either sequence until the result sequence is read
  • All three parameters must be non-null; this is validated eagerly
  • Both sequences are iterated over "at the same time": it calls GetEnumerator() on each sequence, then moves each iterator forward, then reads from it, and repeats.
  • The result selector is applied to each pair of items obtained in this way, and the result yielded
  • It stops when either sequence terminates
  • As a natural consequence of how the sequences are read, we don't need to perform any buffering: we only care about one element from each sequence at a time.

There are really only two things that I could see might have been designed differently:

  • It could have just returned IEnumerable<Tuple<TFirst, TSecond>> but that would have been less efficient in many cases (in terms of the GC) and inconsistent with the rest of LINQ
  • It could have provided different options for what to do with sequences of differents lengths. For example:
    • Throw an exception
    • Use the default value of the shorter sequence type against the remaining items of the longer sequence
    • Use a specified default value of the shorter sequence in the same way

I don't have any problem with the design that's been chosen here though.

What are we going to test?

There are no really interesting test cases here. We test argument validation, deferred execution, and the obvious "normal" cases. I do have tests where "first" is longer than "second" and vice versa.

The one test case which is noteworthy isn't really present for the sake of testing at all - it's to demonstrate a technique which can occasionally be handy. Sometimes we really want to perform a projection on adjacent pairs of elements. Unfortunately there's no LINQ operator to do this naturally (although it's easy to write one) but Zip can provide a workaround, so long as we don't mind evaluating the sequence twice. (That could be a problem in some cases, but is fine in others.)

Obviously if you just zip a sequence with itself directly you get each element paired with the same one. We effectively need to "shift" or "delay" one sequence somehow. We can do this using Skip, as shown in this test:

[Test]
public void AdjacentElements()
{
    string[] elements = { "a", "b", "c", "d", "e" };
    var query = elements.Zip(elements.Skip(1), (x, y) => x + y);
    query.AssertSequenceEqual("ab", "bc", "cd", "de");
}

It always takes me a little while to work out whether I want to make first skip or second - but if we want the second element as the first element of second (try getting that right ten times in a row - it makes sense, honest!) means that we want to call Skip on the sequence used as the argument for second. Obviously it would work the other way round too - we'd just get the pairs presented with the values switched, so the results of the query above would be "ba", "cb" etc.

Let's implement it!

Guess what? It's yet another operator with a split implementation between the argument validation and the "real work". I'll skip argument validation, and get into the tricky stuff. Are you ready? Sure you don't want another coffee?

private static IEnumerable<TResult> ZipImpl<TFirst, TSecond, TResult>(
    IEnumerable<TFirst> first,
    IEnumerable<TSecond> second,
    Func<TFirst, TSecond, TResult> resultSelector)
{
    using (IEnumerator<TFirst> iterator1 = first.GetEnumerator())
    using (IEnumerator<TSecond> iterator2 = second.GetEnumerator())
    {
        while (iterator1.MoveNext() && iterator2.MoveNext())
        {
            yield return resultSelector(iterator1.Current, iterator2.Current);
        }
    }
}

Okay, so possibly "tricky stuff" was a bit of an overstatement. Just about the only things to note are:

  • I've "stacked" the using statements instead of putting the inner one in braces and indenting it. For using statements with different variable types, this is one way to keep things readable, although it can be a pain when tools try to reformat the code. (Also, I don't usually omit optional braces like this. It does make me feel a bit dirty.)
  • I've used the "symmetric" approach again instead of a using statement with a foreach loop inside it. That wouldn't be hard to do, but it wouldn't be as simple.

That's just about it. The code does exactly what it looks like, which doesn't make for a very interesting blog post, but does make for good readability.

Conclusion

Two operators to go, one of which I might not even tackle fully (AsQueryable - it is part of Queryable rather than Enumerable, after all).

AsEnumerable should be pretty easy...

Posted by skeet | 18 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 34 - SequenceEqual

Nearly there now...

What is it?

SequenceEqual has two overloads - the obvious two given that we're dealing with equality:

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second)

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)

The purpose of the operator is to determine if two sequences are equal; that is, if they consist of the same elements, in the same order. A custom equality comparer can be used to compare each individual pair of elements. Characteristics:

  • The first and second parameters mustn't be null, and are validated immediately.
  • The comparer parameter can be null, in which case the default equality comparer for TSource is used.
  • The overload without a comparer uses the default equality comparer for TSource (no funny discrepancies if ICollection<T> instances are involved, this time).
  • It uses immediate execution.
  • It returns as soon as it notices a difference, without evaluating the rest of either sequence.

So far, so good. Note that no optimizations are mentioned above. There are questions you might consider:

  • Should foo.SequenceEqual(foo) always return true?
  • If either or both of the sequences implements another collection interface, does that help us?

The first question sounds like it should be a no-brainer, but it's not as simple as it sounds. Suppose we have a sequence which always generates 10 random numbers. Is it equal to itself? If you iterate over it twice, you'll usually get different results. What about a sequence which explicitly changes each time you iterate over it, based on some side-effect? Both the .NET and Edulinq implementations say that these are non-equal. (The random case is interesting, of course - it could happen to yield the same elements as we iterate over the two sequences.)

The second question feels a little simpler to me. We can't take a shortcut to returning true, but it seems reasonably obvious to me that if you have two collections which allow for a fast "count" operation, and the two counts are different, then the sequences are unequal. Unfortunately, LINQ to Objects appears not to optimize for this case: if you create two huge arrays of differing sizes but equal elements as far as possible, it will take a long time for SequenceEqual to return false. Edulinq does perform this optimization. Note that just having one count isn't useful: you might expect it to, but it turns out that by the time we could realize that the lengths were different in that case, we're about to find that out in the "normal" fashion anyway, so there's no point in complicating the code to make use of the information.

What are we going to test?

As well as the obvious argument validation, I have tests for:

  • Collections of different lengths
  • Ranges of different lengths, both with first shorter than second and vice versa
  • Using a null comparer
  • Using a custom comparer
  • Using no comparer
  • Equal arrays
  • Equal ranges
  • The non-optimization of foo.SequenceEquals(foo) (using side-effects)
  • The optimization using Count (fails on LINQ to Objects)
  • Ordering: { 1, 2 } should not be equal to { 2, 1 }
  • The use of a HashSet<string> with a case-insensitive comparer: the default (case-sensitive) comparer is still used when no comparer is provided
  • Infinite first sequence with finite second sequence, and vice versa
  • Sequences which differ just before they would go bang

None of the test code is particularly interesting, to be honest.

Let's implement it!

I'm not going to show the comparer-less overoad, as it just delegates to the one with a comparer.

Before we get into the guts of SequenceEqual, it's time for a bit of refactoring. If we're going to optimize for count, we'll need to perform the same type tests as Count() twice. That would be horrifically ugly inline, so let's extract the functionality out into a private method (which Count() can then call as well):

private static bool TryFastCount<TSource>(
    IEnumerable<TSource> source,
    out int count)
{
    // Optimization for ICollection<T>
    ICollection<TSource> genericCollection = source as ICollection<TSource>;
    if (genericCollection != null)
    {
        count = genericCollection.Count;
        return true;
    }

    // Optimization for ICollection
    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        count = nonGenericCollection.Count;
        return true;
    }
    // Can't retrieve the count quickly. Oh well.
    count = 0;
    return false;
}

Pretty simple. Note that we always have to set the out parameter to some value. We use 0 on failure - which happens to work out nicely in the Count, as we can just start incrementing if TryFastCount has returned false.

Now we can make a start on SequenceEqual. Here's the skeleton before we start doing the real work:

public static bool SequenceEqual<TSource>(
    this IEnumerable<TSource> first,
    IEnumerable<TSource> second,
    IEqualityComparer<TSource> comparer)
{
    if (first == null)
    {
        throw new ArgumentNullException("first");
    }
    if (second == null)
    {
        throw new ArgumentNullException("second");
    }

    int count1;
    int count2;
    if (TryFastCount(first, out count1) && TryFastCount(second, out count2))
    {
        if (count1 != count2)
        {
            return false;
        }
    }

    comparer = comparer ?? EqualityComparer<TSource>.Default;

    // Main part of implementation goes here
}

I could have included the comparison between count1 and count2 within the single "if" condition, like this:

if (TryFastCount(first, out count1) && 
    TryFastCount(second, out count2) &&
    count1 != count2)
{
    return false;
}

... but I don't usually like using the values of out parameters like this. The behaviour is well-defined and correct, but it just feels a little ugly to me.

Okay, now let's implement the "Main part" which at the bottom of the skeleton. The idea is simple:

  • Get the iterators for both sequences
  • Use the iterators "in parallel" (not in the multithreading sense, but in the movement of the logical cursor down the sequence) to compare pairs of elements; we can return false if we ever see an unequal pair
  • If we ever see that one sequence has finished and the other hasn't, we can return false
  • If we get to the end of both sequences in the same iteration, we can return true

I've got three different ways of representing the basic algorithm in code though. Fundamentally, the problem is that we don't have a way of iterating over pairs of elements in two sequences with foreach - we can't use one foreach loop inside another, for hopefully obvious reasons. So we'll have to call GetEnumerator() explicitly on at least one of the sequences... and we could do it for both if we want.

The first implementation (and my least favourite) does use a foreach loop:

using (IEnumerator<TSource> iterator2 = second.GetEnumerator())
{
    foreach (TSource item1 in first)
    {
        // second is shorter than first
        if (!iterator2.MoveNext())
        {
            return false;
        }
        if (!comparer.Equals(item1, iterator2.Current))
        {
            return false;
        }
    }
    // If we can get to the next element, first was shorter than second.
    // Otherwise, the sequences are equal.
    return !iterator2.MoveNext();
}

I don't have a desperately good reason for picking them this way round (i.e. foreach over first, and GetEnumerator() on second) other than that it seems to still give primacy to first somehow... only first gets the "special treatment" of a foreach loop. (I can almost hear the chants now, "Equal rights for second sequences! Don't leave us out of the loop! Stop just 'using' us!") Although I'm being frivolous, I dislike the asymmetry of this.

The second attempt is a half-way house: it's still asymmetric, but slightly less so as we're explicitly fetching both iterators:

using (IEnumerator<TSource> iterator1 = first.GetEnumerator(),
                            iterator2 = second.GetEnumerator())
{
    while (iterator1.MoveNext())
    {
        // second is shorter than first
        if (!iterator2.MoveNext())
        {
            return false;
        }
        if (!comparer.Equals(iterator1.Current, iterator2.Current))
        {
            return false;
        }
    }
    // If we can get to the next element, first was shorter than second.
    // Otherwise, the sequences are equal.
    return !iterator2.MoveNext();
}

Note the use of the multi-variable "using" statement; this is equivalent nesting one statement inside another, of course.

The similarities between these two implementations are obvious - but the differences are worth pointing out. In the latter approach, we call MoveNext() on both sequences, and then we access the Current property on both sequences. In each case we use iterator1 before iterator2, but it still feels like they're being treated more equally somehow. There's still the fact that iterator1 is being used in the while loop condition, whereas iterator2 has to be used both inside and outside the while loop. Hmm.

The third implementation takes this even further, changing the condition of the while loop:

using (IEnumerator<TSource> iterator1 = first.GetEnumerator(),
       iterator2 = second.GetEnumerator())
{
    while (true)
    {
        bool next1 = iterator1.MoveNext();
        bool next2 = iterator2.MoveNext();
        // Sequences aren't of same length. We don't
        // care which way round.
        if (next1 != next2)
        {
            return false;
        }
        // Both sequences have finished - done
        if (!next1)
        {
            return true;
        }
        if (!comparer.Equals(iterator1.Current, iterator2.Current))
        {
            return false;
        }
    }

This feels about as symmetric as we can get. The use of next1 in the middle "if" condition is incidental - it could just as easily be next2, as we know the values are equal. We could switch round the order of the calls to MoveNext(), the order of arguments to comparer.Equals - the structure is symmetric.

I'm not generally a fan of while(true) loops, but in this case I think I rather like it. It makes it obvious that we're going to keep going until we've got a good reason to stop: one of the three return statements. (I suppose I should apologise to fans of the dogma around single exit points for methods, if any are reading. This must be hell for you...)

Arguably this is all a big fuss about nothing - but writing Edulinq has given me a new appreciation for diving into this level of detail to find the most readable code. As ever, I'd be interested to hear your views. (All three versions are in source control. Which one is active is defined with a #define.)

Conclusion

I really don't know why Microsoft didn't implement the optimization around different lengths for SequenceEqual. Arguably in the context of LINQ you're unlikely to be dealing with two materialized collections at a time - it's much more common to have one collection and a lazily-evaluated query, or possibly just two queries... but it's a cheap optimization and the benefits can be significant. Maybe it was just an oversight.

Our next operator also deals with pairs of elements, so we may be facing similar readability questions around it. It's Zip - the only new LINQ query operator in .NET 4.

Posted by skeet | 3 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 33 - Cast and OfType

More design decisions around optimization today, but possibly less controversial ones...

What are they?

Cast and OfType are somewhat unusual LINQ operators. They are extension methods, but they work on the non-generic IEnumerable type instead of the generic IEnumerable<T> type:

public static IEnumerable<TResult> Cast<TResult>(this IEnumerable source)
        
public static IEnumerable<TResult> OfType<TResult>(this IEnumerable source)

It's worth mentioning what Cast and OfType are used for to start with. There are two main purposes:

  • Using a non-generic collection (such as a DataTable or an ArrayList) within a LINQ query (DataTable has the AsEnumerable extension method too)
  • Changing the type of a generic collection, usually to use a more specific type (e.g. you have  List<Person> but you're confident they're all actually Employee instances - or you only want to query against the Employee instances)

I can't say that I use either operator terribly often, but if you're starting off from a non-generic collection for whatever reason, these two are your only easy way to get "into" the LINQ world.

Here's a quick rundown of the behaviour they have in common:

  • The source parameter must not be null, and this is validated eagerly
  • It uses deferred execution: the input sequence is not read until the output sequence is
  • It streams its data - you can use it on arbitrarily-long sequences and the extra memory required will be constant (and small :)

Both operators effectively try to convert each element of the input sequence to the result type (TResult). When they're successful, the results are equivalent (ignoring optimizations, which I'll come to later). The operators differ in how they handle elements which aren't of the result type.

Cast simply tries to cast each element to the result type. If the cast fails, it will throw an InvalidCastException in the normal way. OfType, however, sees whether each element is a value of the result type first - and ignores it if it's not.

There's one important case to consider where Cast will successfully return a value and OfType will ignore it: null references (with a nullable return type). In normal code, you can cast a null reference to any nullable type (whether that's a reference type or a nullable value type). However, if you use the "is" C# operator with a null value, it will always return false. Cast and OfType follow the same rules, basically.

It's worth noting that (as of .NET 3.5 SP1) Cast and OfType only perform reference and unboxing conversions. They won't convert a boxed int to a long, or execute user-defined conversions. Basically they follow the same rules as converting from object to a generic type parameter. (That's very convenient for the implementation!) In the original implementation of .NET 3.5, I believe some other conversions were supported (in particular, I believe that the boxed int to long conversion would have worked). I haven't even attempted to replicate the pre-SP1 behaviour. You can read more details in Ed Maurer's blog post from 2008.

There's one final aspect to discuss: optimization. If "source" already implements IEnumerable<TResult>, the Cast operator just returns the parameter directly, within the original method call. (In other words, this behaviour isn't deferred.) Basically we know that every cast will succeed, so there's no harm in returning the input sequence. This means you shouldn't use Cast as an "isolation" call to protect your original data source, in the same way as we sometimes use Select with an identity projection. See Eric Lippert's blog post on degenerate queries for more about protecting the original source of a query.

In the LINQ to Objects implementation, OfType never returns the source directly. It always uses an iterator. Most of the time, it's probably right to do so. Just because something implements IEnumerable<string> doesn't mean everything within it should be returned by OfType... because some elements may be null. The same is true of an IEnumerable<int?> - but not an IEnumerable<int>. For a non-nullable value type T, if source implements IEnumerable<T> then source.OfType<T>() will always contain the exact same sequence of elements as source. It does no more harm to return source from OfType() here than it does from Cast().

What are we going to test?

There are "obvious" tests for deferred execution and eager argument validation. Beyond that, I effectively have two types of test: ones which focus on whether the call returns the original argument, and ones which test the behaviour of iterating over the results (including whether or not an exception is thrown).

The iteration tests are generally not that interesting - in particular, they're similar to tests we've got everywhere else. The "identity" tests are more interesting, because they show some differences between conversions that are allowed by the CLR and those allowed by C#. It's obvious that an array of strings is going to be convertible to IEnumerable<string>, but a test like this might give you more pause for thought:

[Test]
public void OriginalSourceReturnedForInt32ArrayToUInt32SequenceConversion()
{
    IEnumerable enums = new int[10];
    Assert.AreSame(enums, enums.Cast<uint>());
}

That's trying to "cast" an int[] to an IEnumerable<uint>. If you try the same in normal C# code, it will fail - although if you cast it to "object" first (to distract the compiler, as it were) it's fine at both compile time and execution time:

int[] ints = new int[10];
// Fails with CS0030
IEnumerable<uint> uints = (IEnumerable<uint>) ints;
        
// Succeeds at execution time
IEnumerable<uint> uints = (IEnumerable<uint>)(object) ints;

We can have a bit more fun at the compiler's expense, and note its arrogance:

int[] ints = new int[10];
        
if (ints is IEnumerable<uint>)
{
    Console.WriteLine("This won't be printed");
}
if (((object) ints) is IEnumerable<uint>)
{
    Console.WriteLine("This will be printed");
}

This generates a warning for the first block "The given expression is never of the provided (...) type" and the compiler has the cheek to remove the block entirely... despite the fact that it would have worked if only it had been emitted as code.

Now, I'm not really trying to have a dig at the C# team here - the compiler is actually acting entirely reasonably within the rules of C#. It's just that the CLR has subtly different rules around conversions - so when the compiler makes a prediction about what would happen with a particular cast or "is" test, it can be wrong. I don't think this has ever bitten me as an issue, but it's quite fun to watch. As well as this signed/unsigned difference, there are similar conversions between arrays of enums and their underlying types.

There's another type of conversion which is interesting:

[Test]
public void OriginalSourceReturnedDueToGenericCovariance()
{
    IEnumerable strings = new List<string>();
    Assert.AreSame(strings, strings.Cast<object>());
}

This takes advantage of the generic variance introduced in .NET 4 - sort of. There is now a reference conversion from List<string> to IEnumerable<object> which wouldn't have worked in .NET 3.5. However, this isn't due to the fact that C# 4 now knows about variance; the compiler isn't verifying the conversion here, after all. It isn't due to a new feature in the CLRv4 - generic variance for interfaces and delegates has been present since generics were introduced in CLRv2. It's only due to the change in the IEnumerable<T> type, which has become IEnumerable<out T> in .NET 4. If you could make the same change to the standard library used in .NET 3.5, I believe the test above would pass. (It's possible that the precise CLR rules for variance changed between CLRv2 and CLRv4 - I don't think this variance was widely used before .NET 4, so the risk of it being a problematically-breaking change would have been slim.)

In addition to all these functional tests, I've included a couple of tests to show that the compiler uses Cast in query expressions if you give a range variable an explicit type. This works for both "from" and "join":

[Test]
public void CastWithFrom()
{
    IEnumerable strings = new[] { "first", "second", "third" };
    var query = from string x in strings
                select x;
    query.AssertSequenceEqual("first", "second", "third");
}

[Test]
public void CastWithJoin()
{
    var ints = Enumerable.Range(0, 10);
    IEnumerable strings = new[] { "first", "second", "third" };
    var query = from x in ints
                join string y in strings on x equals y.Length
                select x + ":" + y;
    query.AssertSequenceEqual("5:first", "5:third", "6:second");
}

Note how the compile-time type of "strings" is just IEnumerable in both cases. We couldn't use this in a query expression normally, because LINQ requires generic sequences - but by giving the range variables explicit types, the compiler has inserted a call to Cast which makes the rest of the translation work.

Let's implement them!

The "eager argument validation, deferred sequence reading" mode of Cast and OfType means we'll use the familiar approach of a non-iterator-block public method which finally calls an iterator block if it gets that far. This time, however, the optimization occurs within the public method. Here's Cast, to start with:

public static IEnumerable<TResult> Cast<TResult>(this IEnumerable source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    IEnumerable<TResult> existingSequence = source as IEnumerable<TResult>;
    if (existingSequence != null)
    {
        return existingSequence;
    }
    return CastImpl<TResult>(source);
}

private static IEnumerable<TResult> CastImpl<TResult>(IEnumerable source)
{
    foreach (object item in source)
    {
        yield return (TResult) item;
    }
}

We're using the normal as/null-test to check whether we can just return the source directly, and in the loop we're casting. We could have made the iterator block very slightly shorter here, using the behaviour of foreach to our advantage:

foreach (TResult item in source)
{
    yield return item;
}

Yikes! Where's the cast gone? How can this possibly work? Well, the cast is still there - it's just been inserted automatically by the compiler. It's the invisible cast that was present in almost every foreach loop in C# 1. The fact that it is invisible is the reason I've chosen the previous version. The point of the method is to cast each element - so it's pretty important to make the cast as obvious as possible.

So that's Cast. Now for OfType. First let's look at the public entry point:

public static IEnumerable<TResult> OfType<TResult>(this IEnumerable source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (default(TResult) != null)
    {
        IEnumerable<TResult> existingSequence = source as IEnumerable<TResult>;
        if (existingSequence != null)
        {
            return existingSequence;
        }
    }
    return OfTypeImpl<TResult>(source);
}

This is almost the same as Cast, but with the additional test of "default(TResult) != null" before we check whether the input sequence is an IEnumerable<TResult>. That's a simple way of saying, "Is this a non-nullablle value type." I don't know for sure, but I'd hope that when the JIT compiler looks at this method, it can completely wipe out the test, either removing the body of the if statement completely for nullable value types and reference types, or just go execute the body unconditionally for non-nullable value types. It really doesn't matter if JIT doesn't do this, but one day I may get up the courage to tackle this with cordbg and find out for sure... but not tonight.

Once we've decided we've got to iterate over the results ourselves, the iterator block method is quite simple:

private static IEnumerable<TResult> OfTypeImpl<TResult>(IEnumerable source)
{
    foreach (object item in source)
    {
        if (item is TResult)
        {
            yield return (TResult) item;
        }
    }
}

Note that we can't use the "as and check for null" test here, because we don't know that TResult is a nullable type. I was tempted to try to write two versions of this code - one for reference types and one for value types. (I've found before that using "as and check for null" is really slow for nullable value types. That may change, of course.) However, that would be quite tricky and I'm not convinced it would have much impact. I did a quick test yesterday testing whether an "object" was actually a "string", and the is+cast approach seemed just as good. I suspect that may be because string is a sealed class, however... testing for an interface or a non-sealed class may be more expensive. Either way, it would be premature to write a complicated optimization without testing first.

Conclusion

It's not clear to me why Microsoft optimizes Cast but not OfType. There's a possibility that I've missed a reason why OfType shouldn't be optimized even for a sequence of non-nullable value type values - if you can think of one, please point it out in the comments. My immediate objection would be that it "reveals" the source of the query... but as we've seen, Cast already does that sometimes, so I don't think that theory holds.

Other than that decision, the rest of the implementation of these operators has been pretty plain sailing. It did give us a quick glimpse into the difference between the conversions that the CLR allows and the ones that the C# specification allows though, and that's always fun.

Next up - SequenceEqual.

Posted by skeet | 1 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 32 - Contains

After the dubious optimizations of ElementAt/ElementAtOrDefault yesterday, we meet an operator which is remarkably good at defying optimization. Sort of. Depending on how you feel it should behave.

What is it?

Contains has two overloads, which only differ by whether or not they take an equality comparer - just like Distinct, Intersect and the like:

public static bool Contains<TSource>(
    this IEnumerable<TSource> source,
    TSource value)

public static bool Contains<TSource>(
    this IEnumerable<TSource> source,
    TSource value,
    IEqualityComparer<TSource> comparer)

The operator simply returns a Boolean indicating whether or not "value" was found in "source". The salient points of its behaviour should be predictable now:

  • It uses immediate execution (as it's returning a simple value instead of a sequence)
  • The source parameter cannot be null, and is validated immediately
  • The value parameter can be null: it's valid to search for a null value within a sequence
  • The comparer parameter can be null, which is equivalent to passing in EquailtyComparer<TSource>.Default.
  • The overload without a comparer uses the default equality comparer too.
  • If a match is found, the method returns immediately without reading the rest of the input sequence.
  • There's a documented optimization for ICollection<T> - but there are significant issues with it...

So far, so good.

What are we going to test?

Aside from argument validation, I have tests for the value being present in the source, and it not being present in the source... for the three options of "no comparer", "null comparer" and "specific comparer".

I then have one final test to validate that we return as soon as we've found a match, by giving a query which will blow up when the element after the match is computed.

Frankly none of the tests are earth-shattering, but in the spirit of giving you an idea of what they're like, here's one with a custom comparer - we use the same source and value for a "default comparer" test which doesn't find the value as the case differs:

[Test]
public void MatchWithCustomComparer()
{
    // Default equality comparer is ordinal
    string[] source = { "foo", "bar", "baz" };
    Assert.IsTrue(source.Contains("BAR", StringComparer.OrdinalIgnoreCase));
}

Currently I don't have a test for the optimization mentioned in the bullet points above, as I believe it's broken. More later.

Let's implement it!

To start with, let's dispense with the overload without a comparer parameter: that just delegates to the other one by specifying EqualityComparer<TSource>.Default. Trivial. (Or so we might think. There's more to this than meets the eye.)

I've got three implementations, but we'll start with just two of them. Which one you pick would depend on whether you're happy to use one operator to implement another. If you think that's okay, it's really simple:

public static bool Contains<TSource>(
    this IEnumerable<TSource> source,
    TSource value,
    IEqualityComparer<TSource> comparer)
{
    comparer = comparer ?? EqualityComparer<TSource>.Default;
    return source.Any(item => comparer.Equals(value, item));
}

"Any" has exactly the traits we want, including validation of the non-nullity of "source". It's hardly complicated code if we don't use Any though:

public static bool Contains<TSource>(
    this IEnumerable<TSource> source,
    TSource value,
    IEqualityComparer<TSource> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    comparer = comparer ?? EqualityComparer<TSource>.Default;
    foreach (TSource item in source)
    {
        if (comparer.Equals(value, item))
        {
            return true;
        }
    }
    return false;
}

Obviously there's a slight penalty in using Any just because of executing a delegate on each iteration - and the extra memory requirement of building an object to capture the comparer. I haven't measured the performance impact of this - again, it's a candidate for benchmarking.

Can't we optimize? (And why does LINQ to Objects think it can?)

The implementations above are all very well, but they feel ever so simplistic. With ElementAt, we were able to take advantage of the fact that an IList<T> allows us random access by index. Surely we've got similar collections which allow us to test for containment cheaply?

Well, yes and no. We've got IDictionary<TKey, TValue> which allows you to check for the presence of a particular key - but even it would be hard to test whether the sequence we're looking at is the key sequence for some IDictionary<TSource, TValue>, and somehow get back to the dictionary.

ICollection<T> has a Contains method, but that doesn't necessarily do the right thing. This is particularly troubling, as the MSDN documentation for the comparer-less overload has contradictory information:

(Summary)

Determines whether a sequence contains a specified element by using the default equality comparer.

(Remarks)

If the type of source implements ICollection<T>, the Contains method in that implementation is invoked to obtain the result. Otherwise, this method determines whether source contains the specified element.

Enumeration is terminated as soon as a matching element is found.

Elements are compared to the specified value by using the default equality comparer, Default.

Why is this troubling? Well, let's look at a test:

[Test]
public void SetWithDifferentComparer()
{
    HashSet<string> sourceAsSet = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "foo", "bar", "baz" };
    IEnumerable<string> sourceAsSequence = sourceAsSet;
    Assert.IsTrue(sourceAsSet.Contains("BAR"));
    Assert.IsFalse(sourceAsSequence.Contains("BAR"));
    Assert.IsFalse(sourceAsSequence.Contains("BAR", StringComparer.Ordinal));
}

(This exact code won't build in the Edulinq project configuration, as that doesn't have a reference to the System.Core assembly which contains HashSet<T>. I've got a hack which allows me to run effectively this code though. See the source for details.)

Now this test looks correct to me: while we're regarding the set as a set, it should use the set's comparer and find "BAR" with a case-insensitive match. However, when we use it as a sequence in LINQ, it should obey the rules of Enumerable.Contains - which means that the middle call should use the default equality comparer for string. Under that equality comparer, "BAR" isn't present.

It doesn't: the above test fails on that middle call in LINQ to Objects, because HashSet<T> implements ICollection<T>. To fit in with the implementation, the documentation summary should actually be worded as something like:

"Determines whether a sequence contains a specified element by using the default equality comparer if the sequence doesn't implement ICollection<T>, or whatever equality comparison the collection uses if it does implement ICollection<T>."

Now you may be saying to yourself that this is only like relying on IList<T> to fetch an item by index in a fashion consistent with iterating over with it - but I'd argue that any IList<T> implementation which didn't do that was simply broken... whereas ICollection<T>.Contains is specifically documented to allow custom comparisons:

Implementations can vary in how they determine equality of objects; for example, List<T> uses Comparer<T>.Default, whereas Dictionary<TKey, TValue> allows the user to specify the IComparer<T> implementation to use for comparing keys.

Let's leave aside the fact that those "Comparer<T>" and "IComparer<T>" should be "EqualityComparer<T>" and "IEqualityComparer<T>" respectively for the minute, and just note that it's entirely reasonable for an implementation not to use the default equality comparer. That makes sense - but I believe it also makes sense for source.Contains(value) to be more predictable in terms of the equality comparer it uses.

Now I would certainly agree that having a method call which changes semantics based on whether the compile-time type of the source is IEnumerable<T> or ICollection<T> is undesirable too... but I'm not sure there is any particularly nice solution. The options are:

  • The current LINQ to Objects implementation where the comparer used is hard to predict.
  • The Edulinq implementation where the type's default comparer is always used... if the compile-time type means that Enumerable.Contains is used in the first place.
  • Remove the comparer-less overload entirely, and force people to specify one. This is lousy for convenience.

Note that you might expect the overload which takes a comparer to work the same way if you pass in null as the comparer - but it doesn't. That overload never delegates to ICollection<T>.Contains.

So: convenience, predictability, consistency. Pick any two. Isn't API design fun? This isn't even thinking about performance, of course...

It's worth bearing in mind that even the current behaviour which is presumably meant to encourage consistency doesn't work. One might expect that the following would always be equivalent for any sensible collection:

var first = source.Contains(value);
var second = source.Select(x => x).Contains(value);

... but of course the second line will always use EqualityComparer<T>.Default whereas the first may or may not.

(Just for fun, think about Dictionary<TKey, TValue> which implements ICollection<KeyValuePair<TKey, TValue>>; its explicitly-implemented ICollection<T>.Contains method will use its own equality comparer for the key, but the default equality comparer for the value part of the pair. Yay!)

So can we really not optimize?

I can think of exactly one situation which we could legitimately optimize without making the behaviour hard to predict. Basically we're fine to ask the collection to do our work for us if we can guarantee it will use the right comparer. Ironically, List<T>.Contains has an overload which allows us to specify the equality comparer, so we could delegate to that - but it's not going to be significantly faster than doing it ourselves. It's still got to look through everything.

ISet<T> in .NET 4 doesn't help us much - its API doesn't talk about equality comparers. (This makes a certain amount of sense - consider SortedSet<T> which uses IComparer<T> instead of IEqualityComparer<T>. It wouldn't make sense to ask a SortedSet<T> for an equality comparer - it couldn't give you one, as it wouldn't know how to produce a hash code.)

However, HashSet<T> does give us something to work with. You can ask a HashSet<T> which equality comparer it uses, so we could delegate to its implementation if and only if it would use the one we're interested in. We can bolt that into our existing implementation pretty easily, after we've worked out the comparer to use:

HashSet<TSource> hashSet = source as HashSet<TSource>;
if (hashSet != null && comparer.Equals(hashSet.Comparer))
{
    return hashSet.Contains(value);
}

So is this worth including or not?

Pros:

  • It covers one of the biggest use cases for optimizing Contains; I suspect this is used more often than the LINQ implementation of Contains working over a dictionary.
  • So long as the comparer doesn't override Equals in a bizarre way, it should be a true optimization with no difference in behaviour.
  • The optimization is applied for both overloads of Enumerable.Contains, not just the comparer-less one.

Cons:

  • It's specific to HashSet<T> rather than an interface type. That makes it feel a little too specific to be a good target of optimization.
  • We've still got the issue of consistency in terms of sourceAsSet.Contains(value) vs sourceAsSequence.Contains(value)
  • There's a tiny bit of overhead if the source isn't a hash set, and a further overhead if it is a hash set but with the wrong comparer. I'm not too bothered about this.

It's not the default implementation in Edulinq at the moment, but I could possibly be persuaded to include it. Likewise I have a conditionally-compiled version of Contains which is compatible with LINQ to Objects, with the "broken" optimization for the comparer-less overload; this is turned off by default too.

Conclusion

Gosh! I hadn't expected Contains to be nearly this interesting. I'd worked out that optimization would be a pain, but I hadn't expected it to be such a weird design choice.

This is the first time I've deliberately gone against the LINQ to Objects behaviour, other than the MS bug around descending orderings using "extreme" comparers. The option for compatibility is there, but I feel fairly strongly that this was a bad design decision on Microsoft's part. A bad decision out of some fairly unpleasant alternatives, I grant you. I'm willing to be persuaded of its virtues, of course - and in particular I'd welcome discussion with the LINQ team around this. In particular, it's always fun to hear about the history of design decisions.

Next up, Cast and OfType.

Posted by skeet | 7 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 31 - ElementAt / ElementAtOrDefault

A nice easy pair of operators tonight. I should possibly have covered them at the same time as First/Last/Single and the OrDefault variants, but never mind...

What are they?

ElementAt and ElementAtOrDefault have a single overload each:

public static TSource ElementAt<TSource>(
    this IEnumerable<TSource> source,
    int index)

public static TSource ElementAtOrDefault<TSource>(
    this IEnumerable<TSource> source,
    int index)

Isn't that blissfully simple after the overload storm of the past few days?

The two operators work in very similar ways:

  • They use immediate execution.
  • The source parameter must not be null, and this is validated immediately.
  • They return the element at the specified zero-based index, if it's in the range 0 <= index < count.

The methods only differ in their handling of an index which falls outside the given bound. ElementAt will throw an ArgumentOutOfRangeException; ElementAtOrDefault will return the default value for TSource (e.g. 0, null, false). This is true even if index is negative. You might have expected some way to specify the default value to return if the index is out of bounds, but there isn't one. (This is consistent with FirstOrDefault() and so on, but not with Nullable<T>.GetValueOrDefault())

This behaviour leaves us some room for common code - for once I haven't used cut and paste for the implementation. Anyway, I'm getting ahead of myself.

What are we going to test?

As you can imagine, my tests for the two operators are identical except for the expected result in the case of the index being out of range. I've tested the following cases:

  • Null source
  • A negative index
  • An index which is too big on a NonEnumerableCollection
  • An index which is too big on a NonEnumerableList
  • An index which is too big on a lazy sequence (using Enumerable.Range)
  • A valid index in a NonEnumerableList
  • A valid index in a lazy sequence

The "non-enumerable" list and collection are to test that the optimizations we're going to perform are working. In fact, the NonEnumerableCollection test fails on LINQ to Objects - it's only optimized for IList<T>. You'll see what I mean in a minute... and why that might not be a bad thing.

None of the tests are very interesting, to be honest.

Let's implement it!

As I mentioned earlier, I've used some common code for once (although I admit the first implementation used cut and paste). As the only difference between the two methods is the handling of a particular kind of failure, I've used the TryXXX pattern which exists elsewhere in the framework. There's a common method which tries to retrieve the right element as an out parameter, and indicates whether or not it succeeded via the return value. Not every kind of failure is just returned, of course - we want to throw an ArgumentNullException if source is null in either case.

That leaves our public methods looking quite straightforward:

public static TSource ElementAt<TSource>(
    this IEnumerable<TSource> source,
    int index)
{
    TSource ret;
    if (!TryElementAt(source, index, out ret))
    {
        throw new ArgumentOutOfRangeException("index");
    }
    return ret;
}

public static TSource ElementAtOrDefault<TSource>(
    this IEnumerable<TSource> source,
    int index)
{
    TSource ret;
    // We don't care about the return value - ret will be default(TSource) if it's false
    TryElementAt(source, index, out ret);
    return ret;
}

TryElementAt will only return false if the index is out of bounds, so the exception is always appropriate. However, there is a disadvantage to this approach: we can't easily indicate in the exception message whether index was too large or negative. We could have specified a message which included the value of index itself, of course. I think it's a minor matter either way, to be honest.

The main body of the code is in TryElementAt, obviously. It would actually be very simple - just looping and counting up to index, checking as we went - except there are two potential optimizations.

The most obvious - and most profitable - optimization is if the collection implements IList<T>. If it does, we can efficiently obtain the count using the ICollection<T>.Count property (don't forget that IList<T> extends ICollection<T>), check that it's not too big, and then use the indexer from IList<T> to get straight to the right element. Brilliant! That's a clear win.

The less clear optimization is if the collection implements ICollection<T> but not IList<T>, or if it only implements the nongeneric ICollection. In those cases we can still get at the count - but we can't then get directly to the right element. In other words, we can optimize the failure case (possibly hugely), but at a very slight cost - the cost of checking whether the sequence implements either interface - for the success case, where the check won't do us any good.

This is the sort of optimization which is impossible to judge without real data. How often are these operators called with an invalid index? How often does that happen on a collection which implements ICollection<T> but not IList<T> (or implements ICollection)? How large are those collections (so how long would it take to have found our error the normal way)? What's the cost of performing the type check? I don't have the answers to any of these questions. I don't even have strong suspicions. I know that Microsoft doesn't use the same optimization, but I don't know whether that was due to hard data or a gut feeling.

For the moment, I've kept all the optimizations. Here's the code:

private static bool TryElementAt<TSource>(
    IEnumerable<TSource> source,
    int index,
    out TSource element)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    element = default(TSource);
    if (index < 0)
    {
        return false;
    }
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null)
    {
        int count = collection.Count;
        if (index >= count)
        {
            return false;
        }
        // If it's a list, we know we're okay now - just return directly...
        IList<TSource> list = source as IList<TSource>;
        if (list != null)
        {
            element = list[index];
            return true;
        }
    }

    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        int count = nonGenericCollection.Count;
        if (index >= count)
        {
            return false;
        }
    }
    // We don't need to fetch the current value each time - get to the right
    // place first.
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        // Note use of -1 so that we start off my moving onto element 0.
        // Don't want to use i <= index in case index == int.MaxValue!
        for (int i = -1; i < index; i++)
        {
            if (!iterator.MoveNext())
            {
                return false;
            }
        }
        element = iterator.Current;
        return true;
    }
}

As you can see, the optimized cases actually form the bulk of the code - part of me thinks it would be worth removing the non-IList<T> optimizations just for clarity and brevity.

It's worth looking at the "slow" case where we actually iterate. The for loop looks odd, until you think that to get at element 0, you have to call MoveNext() once. We don't want to just add one to index or use a less-than-or-equal condition: both of those would fail in the case where index is int.MaxValue; we'd either not loop at all (by incrementing index and it overflowing either causing an exception or becoming negative) or we'd loop forever, as every int is less than or equal to int.MaxValue.

Another way to look at it is that the loop counter ("i") is the "current index" within the iterator: the iterator starts before the first element, so it's reasonable to start at -1.

The reason I'm drawing attention to this is that I got all of this wrong first time... and was very grateful for unit tests to catch me out.

Conclusion

For me, the most interesting part of ElementAt is the decision about optimization. I'm sure I'm not the only one who optimizes without data at times - but it's a dangerous thing to do. The problem is that this isn't the normal micro-optimization quandary of "it's always a tiny bit better, but it's probably insignificant and makes the code harder to read". For the cases where this is faster, it could make an enormous difference - asking for element one million of a linked list which doesn't quite have enough elements could be very painful. But do failure cases need to be fast? How common are they? As you can tell, I'm dithering. I think it's at least worth thinking about what optimizations might make a difference - even if we later remove them.

Next time, I think I'll tackle Contains - an operator which you might expect to be really fast on a HashSet<T>, but which has some interesting problems of its own...

Posted by skeet | 11 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 30 - Average

This is the final aggregation operator, after which I suspect we won't need to worry about floating point difficulties any more. Between this and the unexpected behaviour of Comparer<string>.Default, I've covered two of my "big three" pain points. It's hard to see how I could get dates and times into Edulinq naturally; it's even harder to see how time zones could cause problems. I've still got a few operators to go though, so you never know...

What is it?

Average has 20 overloads, all like the following but for long, decimal, float and double as well as int:

public static double Average(this IEnumerable<int> source)

public static double Average<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int> selector)

public static double? Average(this IEnumerable<int?> source)

public static double? Average<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int?> selector)

The operators acting on float sequences return float, and likewise the operators acting on decimal sequences return decimal, with the same equivalent nullable types for the nullable sequences.

As before (for Min/Max/Sum), the overloads which take a selector are equivalent to just applying that selector to each element in the sequence.

General behaviour - pretty much as you'd expect, I suspect:

  • Each operator calculates the arithmetic mean of a sequence of values.
  • source and selector can't be null, and are validated immediately.
  • The operators all use immediate execution.
  • The operators all iterate over the entire input sequence, unless an exception is thrown (e.g. due to overflow).
  • The operators with a non-nullable return type throw InvalidOperationException if the input sequence is empty.
  • The operators with a nullable return type ignore any null input values, and return null if the input sequence is empty or contains only null values. If non-null values are present, the return value will be non-null.

It all sounds pretty simple, doesn't it? We just sum the numbers, and divide by the count. It's not too complicated, but we have a couple of things to consider:

  • How should we count items - which data type should we use? Do we need to cope with more than int.MaxValue elements?
  • How should we sum items? Should we be able to find the average of { int.MaxValue, int.MaxValue } even though the sum clearly overflows the bounds of int?

Given the behaviour of my tests, I believe I've made the same decisions. I use a long for the counter, always. I use a long total for the int/long overloads, a double total for the float/double overloads, and a decimal total for the decimal overloads. These aren't particularly tricky decisions once you'd realised that you need to make them, but it would be very easy to implement the operators in a simplistic way without thinking about such things. (I'd probably have done so if it weren't for the comments around Sum this morning.)

What are we going to test?

I've only got in-depth tests for the int overloads, covering:

  • Argument validation
  • Empty sequences for nullable and non-nullable types
  • Sequences with only null values
  • Sequences of perfectly normal values :)
  • Projections for all the above
  • A sequence with just over int.MaxValue elements, to test we can count properly

Then I have a few extra tests for interesting situations. First I check the overflow behaviour of each type, using a common pattern of averaging a sequence of (max, max, -max, -max) where "max" is the maximum value for the sequence type. The results are:

  • For int we get the correct result of 0 because we're accumulating over longs
  • For long we get an OverflowException when it tries to add the first two values together
  • For float we get the correct result of 0 because we're accumulating over doubles
  • For double we get PositiveInfinity because that's the result of the first addition
  • For decimal we get an OverflowException when it tries to add the first two values together

Additionally, I have a couple of floating-point-specific tests: namely further proof that we use a double accumulator when averaging floats, and the behaviour of Average in the presence of NaN values:

[Test]
public void SingleUsesDoubleAccumulator()
{
    // All the values in the array are exactly representable as floats,
    // as is the correct average... but intermediate totals aren't.
    float[] array = { 20000000f, 1f, 1f, 2f };
    Assert.AreEqual(5000001f, array.Average());
}

[Test]
public void SequenceContainingNan()
{
    double[] array = { 1, 2, 3, double.NaN, 4, 5, 6 };
    Assert.IsNaN(array.Average());
}

I'm sure someone can think of some other interesting scenarios I should be considering :)

Let's implement it!

This is another cut-and-paste job, but with more editing required - for each method, I needed to make sure I was using the right accumulator type, and I occasionally removed redundant casts. Still, the code follows pretty much the same pattern for all types. Here's the int implementation:

public static double Average(this IEnumerable<int> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    checked
    {
        long count = 0;
        long total = 0;
        foreach (int item in source)
        {
            total += item;
            count++;
        }
        if (count == 0)
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        return (double)total / (double)count;
    }
}

public static double Average<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int> selector)
{
    return source.Select(selector).Average();
}

public static double? Average(this IEnumerable<int?> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    checked
    {
        long count = 0;
        long total = 0;
        foreach (int? item in source)
        {
            if (item != null)
            {
                count++;
                total += item.Value;
            }
        }
        return count == 0 ? (double?)null : (double)total / (double)count;
    }
}

public static double? Average<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int?> selector)
{
    return source.Select(selector).Average();
}

Salient points:

  • Again I'm using Select to make the implementation of the overloads with selectors trivial
  • I've cast both operands of the division when calculating the average, just for clarity. We could get away with either of them.
  • In the case of the conditional operator, I could actually just cast one of the division operators to "double?" and then remove both of the other casts... again, I feel this version is clearer. (I could change my mind tomorrow, mind you...)
  • I've explicitly used checked blocks for int and long. For float and double we won't get overflow anyway, and for decimal the checked/unchecked context is irrelevant.

There's one optimization we can perform here. Consider this loop, for the nullable sequence:

long count = 0;
long total = 0;
foreach (int? item in source)
{
    if (item != null)
    {
        count++;
        total += item.Value; // This line can be optimized...
    }
}

The line I've highlighted seems perfectly reasonable, right? We're trying to add the "real" non-null value within a value type value, and we know that there is a real value, because we've checked it's not the null value already.

Now think about what the Value property actually does... it checks whether or not it's the null value, and then returns the real value or throws an exception. But we know it won't throw an exception, because we've checked it. We just want to get at the value - don't bother with any more checks. That's exactly what GetValueOrDefault() does. In the case where the value is non-null, GetValueOrDefault() and the Value property obviously do the same thing - but intuition tells me that GetValueOrDefault() can do it quicker, because it doesn't actually need to check anything. It can just return the value of the underlying field - which will be the default value of the underlying type for a null wrapper value anyway.

I've benchmarked this, and on my laptop it's about 5% faster than using Value. But... it's such a grotty hack. I would feel dirty putting it in. Surely Value is the more readable code here - it just happens to be slower. As always, I'm undecided. There's no behavioural difference, just a slight speed boost. Thoughts, folks?

Conclusion

I'm quite pleased to be shot of the Aggregate Operators Of Overload Doom. I've felt for a while that they've been hanging over me - I knew they'd be annoying in terms of cut and paste, but there's been more interesting situations to consider than I'd expected.

There's not a lot left now. According to my previous list, I've got:

  • Cast and OfType
  • ElementAt and ElementAtOrDefault
  • SequenceEqual
  • Zip (from .NET 4)
  • Contains

However, that doesn't include AsEnumerable and AsQueryable. I'm unsure at the moment what I'm doing with those... AsEnumerable is trivial, and probably worth doing... AsQueryable could prove interesting in terms of testing, as it requires expression trees (which are in System.Core; a library I'm not referencing from tests when testing the Edulinq implementation). I'll play around and see what happens :)

Not sure what I'll implement next, to be honest... we'll see tomorrow!

Posted by skeet | 13 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 29 - Min/Max

The second and third AOOOD operators today... if I'm brave enough to tackle Average tomorrow, I'll have done them all. More surprises here today, this time in terms of documentation...

What are they?

Min and Max are both extension methods with 22 overloads each. Min looks like this:

public static int Min(this IEnumerable<int> source)

public static int Min<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int> selector)

public static int? Min(this IEnumerable<int?> source)

public static int? Min<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int?> selector)

// Repeat the above four overloads for long, float, double and decimal,
// then add two more generic ones:

public static TSource Min<TSource>(this IEnumerable<TSource> source)

public static TResult Min<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector
)

(Max is exactly the same as Min; just replace the name.)

The more obvious aspects of the behaviour are as follows:

  • source and selector mustn't be null
  • All overloads use immediate execution
  • The minimum or maximum value within the sequence is returned
  • If a selector is present, it is applied to each value within source, and the maximum of the projected values is returned. (Note how the return type of these methods is TResult, not TSource.)

Some less obvious aspects - in all cases referring to the result type (as the source type is somewhat incidental when a selector is present; it doesn't affect the behaviour):

  • The type's IComparable<T> implementation is used when available, otherwise IComparable is used. An ArgumentException is thrown if values can't be compared. Fortunately, this is exactly the behaviour of Comparer<T>.Default.
  • For any nullable type (whether it's a reference type or a nullable value type), nulls within the sequence are ignored, and an empty sequence (or one which contains only null values) will cause a null value to be returned. If there are any non-null values in the sequence, the return value will be non-null. (Note that this is different from Sum, which will return the non-null zero value for empty sequences over nullable types.)
  • For any non-nullable value type, an empty sequence will cause InvalidOperationException to be thrown.

The first point is particularly interesting when you consider the double and float types, and their "NaN" (not-a-number) values. For example, Math.Max regards NaN as greater than positive infinity, but Enumerable.Max regards positive infinity as being the greater of the two. Math.Min and Enumerable.Min agree, however, that NaN is less than negative infinity. (It would actually make sense to me for NaN to be treated as the numeric equivalent of null here, but that would be strange in other ways...) Basically, NaN behaves oddly in all kinds of ways. I believe that IEEE-754-2008 actually specifies behaviour with NaNs which encourages the results we're getting here, but I haven't verified that yet. (I can't find a free version of the standard online, which is troubling in itself. Ah well.)

The behaviour of the nullable and non-nullable types is well documented for the type-specific overloads using int, Nullable<int> etc. However, the generic overloads (the ones using TSource) are poorly documented:

  • InvalidOperationException isn't in the list of possibly-thrown arguments for any of the overloads
  • The methods using selectors from TSource to TResult don't mention the possibility of nullity at all
  • The methods without selectors describe the behaviour of null values for reference types, but don't mention the possibility of empty sequences for non-nullable value types, or consider nullable value types at all.

(I should point out that ArgumentException isn't actually mentioned either for the case where values are incomparable, but that feels like a slightly less important offence for some reason. Possibly just because it didn't trip me up.)

If I remember, I'll open a Connect issue against this hole in the documentation when I find time. Unlike the optimizations and set ordering (where it's reasonably forgivable to deliberately omit implementation details from the contract) you simply can't predict the behaviour in a useful way from the documentation here. And yes, I'm going on about this because it bit me. I had to resort to writing tests and running them against LINQ to Objects to see if they were correct or not. (They were incorrect in various places.)

If you look at the behaviour of the non-generic methods, the generic ones are entirely consistent of course.

There are a couple of things which you might consider "missing" in terms of Max and Min:

  • The ability to find out the minimum/maximum value of a sequence by a projection. For example, consider a sequence of people. We may wish to find the youngest person in the sequence, in which case we'd like to be able to write something like:
    var oldest = people.MaxBy(person => person.Age);
    We can find the maximum age itself easily enough - but then we'd need a second pass to find the first person with that age. I've addressed this in MoreLINQ with the MaxBy and MinBy operators. The System.Interactive assembly in Reactive Extensions has the same methods too.
  • The ability to specify a custom IComparer<T> implementation, as we can in most of the operators using IEqualityComparer<T>. For example, we can't find the "maximum" string in a sequence, using a case-insensitive ordinal comparison.

Still, at least that means there's less to test...

What are we going to test?

I decided I really couldn't find the energy to replicate all the tests for every type involved here. Instead, I have a bunch of tests for int and Nullable<int>, a few tests exploring the oddness of doubles, and a bunch of tests around the generic methods. In particular, I know that I've implemented decimal, float etc by calling the same methods that the int overloads use.

The tests cover:

  • Argument validation
  • Empty sequences
  • Sequences of null values where applicable
  • Projections of the above
  • Generic tests for nullable and non-nullable value types, and reference types (with empty sequences etc)
  • Incomparable values

Let's implement them!

Okay, let's start off with the simplest detail: the order of implementation:

  • Max(int)
  • Max(generic)
  • Cut and paste Max implementations for other numeric types (replace the type name, basically)
  • Cut and paste the entirety of Max to Min:
    • Replace "Max" with "Min" everywhere
    • Replace " < " with " > " everywhere (only 4 occurrences; basically the results of calling Compare or ComparerTo and comparing with 0)

Just as with Sum, I could have used templating - but I don't think it would actually have saved me significant time.

This time, I thought I'd use Select internally for the overloads with selectors (unlike my approach for Sum which used identity projections). There's no particular reason for this - I just thought it would be interesting to try both approaches. Overall, I think I prefer this one, but I haven't done any benchmarking to find out the relative performance penalties.

Each set of numeric overloads calls into a single pair of generic "implementation" methods. These aren't the public general-purpose ones: they require that the types in use implement IComparable<T>, and I've added a "struct" constraint just for kicks. This is just one approach. Other options:

  • I could have implemented the code separately for each numeric type. That may well be faster than calling IComparable<T>.Compare (at least for most types) as the IL would have contained the appropriate operator directly. However, it would have meant more code and explicitly dealing with the headache of NaNs for double/float. If I ever write benchmarks, I'll investigate the difference that this can make.
  • I could have used the public generic overloads, which eventually call into Comparer<T>.Default. Again, the penalty for this (if any) is unknown to me at this point. Can the JIT inline deeply enough to make this as fast as a "native" implementation? I wouldn't like to guess without tests.

I've separated out the nullable implementations from the non-nullable ones, as the behaviour differs significantly between the two.

Here's the public code for int:

public static int Max(this IEnumerable<int> source)
{
    return PrimitiveMax(source);
}

public static int Max<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int> selector)
{
    // Select will validate the arguments
    return PrimitiveMax(source.Select(selector));
}

public static int? Max(this IEnumerable<int?> source)
{
    return NullablePrimitiveMax(source);
}

public static int? Max<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int?> selector)
{
    // Select will validate the arguments
    return NullablePrimitiveMax(source.Select(selector));
}

All the methods consider argument validation to be somebody else's problem - either Select or the generic method we're calling to find the maximum value. Part of me thinks this is lazy; part of me likes it in terms of not repeating code. All of me would prefer the ability to specify non-nullable parameters declaratively...

Here are the "primitive" methods called into above:

// These are uses by all the overloads which use a known numeric type.
// The term "primitive" isn't truly accurate here as decimal is not a primitive
// type, but it captures the aim reasonably well.
// The constraint of being a value type isn't really required, because we don't rely on
// it within the method and only code which already knows it's a comparable value type
// will call these methods anyway.
        
private static T PrimitiveMax<T>(IEnumerable<T> source) where T : struct, IComparable<T>
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    using (IEnumerator<T> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        T max = iterator.Current;
        while (iterator.MoveNext())
        {
            T item = iterator.Current;
            if (max.CompareTo(item) < 0)
            {
                max = item;
            }
        }
        return max;
    }
}

private static T? NullablePrimitiveMax<T>(IEnumerable<T?> source) where T : struct, IComparable<T>
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    T? max = null;
    foreach (T? item in source)
    {
        if (item != null &&
            (max == null || max.Value.CompareTo(item.Value) < 0))
        {
            max = item;
        }
    }
    return max;
}

The first method is interesting in terms of its approach to throwing an exception if the first element isn't present, and using that as an initial candidate otherwise.

The second method needs to consider nullity twice on each iteration:

  • Is the item from the sequence null? If so, we can ignore it.
  • Is our "current maximum" null? If so, we can replace it with the item from the sequence without performing a comparison.

Now there's one case which is ambiguous here: when both values are null. At that point we can choose to replace our "current maximum" with the item, or not... it doesn't matter as the values are the same anyway. It is important that we don't try to perform a comparison unless both values are non-null though... the short-circuiting && and || operators keep us safe here.

Having implemented the code above, all the interesting work lies in the generic forms. Here we don't have different public methods to determine which kind of behaviour we'll use: but I wrote two private methods instead, and just delegated to the right one from the public one. This seemed cleaner than putting the code all in one method:

public static TSource Max<TSource>(
    this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    // This condition will be true for reference types and nullable value types, and false for
    // non-nullable value types.
    return default(TSource) == null ? NullableGenericMax(source) : NonNullableGenericMax(source);
}

public static TResult Max<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector)
{
    return Max(source.Select(selector));
}

/// <summary>
/// Implements the generic behaviour for non-nullable value types.
/// </summary>
/// <remarks>
/// Empty sequences will cause an InvalidOperationException to be thrown.
/// Note that there's no *compile-time* validation in the caller that the type
/// is a non-nullable value type, hence the lack of a constraint on T.
/// </remarks>
private static T NonNullableGenericMax<T>(IEnumerable<T> source)
{
    Comparer<T> comparer = Comparer<T>.Default;

    using (IEnumerator<T> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        T max = iterator.Current;
        while (iterator.MoveNext())
        {
            T item = iterator.Current;
            if (comparer.Compare(max, item) < 0)
            {
                max = item;
            }
        }
        return max;
    }
}

/// <summary>
/// Implements the generic behaviour for nullable types - both reference types and nullable
/// value types.
/// </summary>
/// <remarks>
/// Empty sequences and sequences comprising only of null values will cause the null value
/// to be returned. Any sequence containing non-null values will return a non-null value.
/// </remarks>
private static T NullableGenericMax<T>(IEnumerable<T> source)
{
    Comparer<T> comparer = Comparer<T>.Default;

    T max = default(T);
    foreach (T item in source)
    {
        if (item != null &&
            (max == null || comparer.Compare(max, item) < 0))
        {
            max = item;
        }
    }
    return max;
}

As you can tell, there's a significant similarity between the "PrimitiveMax" and "NonNullableGenericMax" methods, and likewise between "NullablePrimitiveMax" and "NullableGenericMax". This should come as no surprise. Fundamentally the difference is just between using an IComparable<T> implementation, and using Comparer<T>.Default. (The argument validation occurs in a different place too, as we'll be going through a public entry point for the non-primitive code.)

Once I'd discovered the correct behaviour, this was reasonably simple. Of course, the above code wasn't my first implementation, where I'd completely forgotten about null values, and hadn't thought about how the nullability of the source type might affect the behaviour of empty sequences...

Conclusion

If you're ever in a company which rewards you for checking in lots of lines of code, offer to implement Sum/Min/Max. This weekend I've checked in about 2,500 lines of code in (split between production and test) and none of it's been terribly hard. Of course, if you're ever in such a company you should also consider looking for another job. (Have I mentioned that Google's hiring? Email me if you're interested. I'm serious.)

As you can tell, I was slightly irritated by the lack of clarity around the documentation in some places - but I find it interesting that even a simple-sounding function like "find the maximum value from a sequence" should need the kind of documentation that's missing here. I'm not saying it's a failure of the design - more just musing how a complete specification is almost always going to be longer than you might think at first glance. And if you think I was diligent here, think again: I didn't bother specifying which maximum or minimum value would be returned if there were two. For example, if a sequence consists of references to two equal but distinct strings, which reference should be returned? I have neither stated what my implementation (or the LINQ to Objects implementation) will do, nor tested for it.

Next up is Average - a single method with a mere 20 overloads. There are various corner cases to consider... but that's a post for another day.

Posted by skeet | 7 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 28 - Sum

Okay, I've bitten the bullet. The first of the four Aggregation Operators Of Overload Doom (AOOOD) that I've implemented is Sum. It was far from difficult to implement - just tedious.

What is it?

Sum has 20 overloads - a set of 4 for each of the types that it covers (int, long, float, double, decimal). Here are the overloads for int:

public static int Sum(this IEnumerable<int> source)

public static int? Sum(this IEnumerable<int?> source)

public static int Sum<T>(
    this IEnumerable<T> source,
    Func<T, int> selector)

public static int? Sum<T>(
    this IEnumerable<T> source,
    Func<T, int?> selector)

As you can see, there are basically two variations:

  • A source of the numeric type itself, or a source of an arbitrary type with a projection to the numeric type
  • The numeric type can be nullable or non-nullable

The behaviour is as follows:

  • All overloads use immediate execution: it will immediately iterate over the source sequence to compute the sum, which is obviously the return value.
  • source and selector must both be non-null
  • Where there's a selector, the operator is equivalent to source.Select(selector).Sum() - or you can think of the versions without a selector as using an identity selector
  • Where the numeric type is nullable, null values are ignored
  • The sum of an empty sequence is 0 (even for nullable numeric types)

The last point is interesting - because the overloads with nullable numeric types never return a null value. Initially I missed the fact that the return type even was nullable. I think it's somewhat misleading to be nullable but never null - you might have at least expected that the return value for an empty sequence (or one consisting only of null values) would be null.

For int, long and decimal, overflow within the sum will throw OverflowException. For single and double, the result will be positive or negative infinity. If the sequence contains a "not-a-number" value (NaN), the result will be NaN too.

What are we going to test?

A lot!

In total, I have 123 tests across the five types. The tests are mostly the same for each type, with the exception of overflow behaviour and not-a-number behaviour for single and double. Each overload is tested reasonably thoroughly:

  • Argument validation
  • Summing a simple sequence
  • Summing an empty sequence
  • (Nullable) Summing a simple sequence containing null values
  • (Nullable) Summing a sequence containing only null values
  • Positive overflow (either to an exception or infinity)
  • Negative overflow (only one test per type, rather than for each overload)
  • (Single/Double) Sequences containing NaN values
  • Projections resulting in all of the above

Most of this was done using cut and paste, leading to a 916-line source file. On Twitter, followers have suggested a couple of alternatives - templating (possibly for both the tests and the implementation), or using more advanced features of unit test frameworks. There's nothing wrong with these suggestions - but I'm always concerned about the balance within an elegant-but-complex solution to repetition. If it takes longer to get to the "neat" solution, and then each individual test is harder to read, is it worth it? It certainly makes it easier to add one test which is then applicable over several types, or to modify all "copies" of an existing test - but equally it makes it harder to make variations (such as overflow) fit within the pattern. I have no quarrel with the idea of using more advanced techniques here, but I've stuck to a primitive approach for the moment.

Let's implement it!

Again, I'll only demonstrate the "int" implementations - but talk about single/double later on. There are plenty of ways I could have implemented this:

  • Delegate everything to the simplest overload using "Where" for the nullable sequences and "Select" for the projection sequences
  • Delegate everything to the most complex overload using identity projections
  • Implement each method independently
  • Somewhere in-between :)

In the end I've implemented each non-projecting overload by delegating to the corresponding projection-based one with an identity projection. I've implemented the non-nullable and nullable versions separately though. Here's the complete implementation:

public static int Sum(this IEnumerable<int> source)
{
    return Sum(source, x => x);
}

public static int? Sum(this IEnumerable<int?> source)
{
    return Sum(source, x => x);
}

public static int Sum<T>(
    this IEnumerable<T> source,
    Func<T, int> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    checked
    {
        int sum = 0;
        foreach (T item in source)
        {
            sum += selector(item);
        }
        return sum;
    }
}

public static int? Sum<T>(
    this IEnumerable<T> source,
    Func<T, int?> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    checked
    {
        int sum = 0;
        foreach (T item in source)
        {
            sum += selector(item).GetValueOrDefault();
        }
        return sum;
    }
}

Note the use of Nullable<T>.GetValueOrDefault() to "ignore" null values - it felt easier to add zero than to use an "if" block here. I suspect it's also more efficient, as there's no need for any conditionality here: I'd expect the implementation of GetValueOrDefault() to just return the underlying "value" field within the Nullable<T>, without performing the check for HasValue which the Value property normally would.

Of course, if I were really bothered by performance I'd implement each operation separately, instead of using the identity projection.

Note the use of the "checked" block to make sure that overflow is handled appropriately. As I've mentioned before, it would quite possibly be a good idea to turn overflow checking on for the whole assembly, but here I feel it's worth making it explicit to show that we consider overflow as an important part of the behaviour of this operator. The single/double overloads don't use checked blocks, as their overflow behaviour isn't affected by the checked context.

Conclusion

One down, three to go! I suspect Min and Max will use even more cutting and pasting (with judiciously applied changes, of course). There are 22 overloads for each of those operators, due to the possibility of using an arbitrary type - but I may well be able to use the most generic form to implement all the numeric versions. I may measure the impact this has on performance before deciding for sure. Anyway, that's a topic for the next post...

Addendum

As has been pointed out in the comments, my original implementation used a float to accumulate values when summing a sequence of floats. This causes problems, as these two new tests demonstrate:

[Test]
public void NonOverflowOfComputableSumSingle()
{
    float[] source = { float.MaxValue, float.MaxValue,
                      -float.MaxValue, -float.MaxValue };
    // In a world where we summed using a float accumulator, the
    // result would be infinity.
    Assert.AreEqual(0f, source.Sum());
}

[Test]
public void AccumulatorAccuracyForSingle()
{
    // 20000000 and 20000004 are both exactly representable as
    // float values, but 20000001 is not. Therefore if we use
    // a float accumulator, we'll end up with 20000000. However,
    // if we use a double accumulator, we'll get the right value.
    float[] array = { 20000000f, 1f, 1f, 1f, 1f };
    Assert.AreEqual(20000004f, array.Sum());
}

The second of these tests is specific to floating point arithmetic - there's no equivalent in the integer domain. Hopefully the comment makes the test clear. We could still do better if we used the Kahan summation algorithm, but I haven't implemented that yet, and don't currently intend to. Worth noting as a potential follow-on project though.

Back to the first test though: this certainly can be represented in integers. If we try to sum { int.MaxValue, int.MaxValue, -int.MaxValue, -int.MaxValue } there are two options: we can overflow (throwing an exception) or we can return 0. If we use a long accumulator, we'll return 0. If we use an int accumulator, we'll overflow. I genuinely didn't know what the result for LINQ to Objects would be until I tried it - and found that it overflows. I've added a test to document this behaviour:

[Test]
public void OverflowOfComputableSumInt32()
{
    int[] source = { int.MaxValue, 1, -1, -int.MaxValue };
    // In a world where we summed using a long accumulator, the
    // result would be 0.
    Assert.Throws<OverflowException>(() => source.Sum());
}

Of course, I could have gone my own way and made Edulinq more capable than LINQ to Objects here, but in this case I've gone with the existing behaviour.

Posted by skeet | 8 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 27 - Reverse

Time for a change of pace after the deep dive into sorting. Reversing is pretty simple... which is not to say there's nothing to discuss, of course.

What is it?

Reverse only has a single, simple signature:

public static IEnumerable<TSource> Reverse<TSource>(
    this IEnumerable<TSource> source)

The behaviour is pretty simple to describe:

  • source cannot be null; this is validated eagerly.
  • The operator uses deferred execution: until you start reading from the result, it won't read anything from the input sequence
  • As soon as you start reading from the result sequence, the input sequence is read in its entirety
  • The result sequence contains all the elements of the input sequence, in the opposite order. (So the first element of the result sequence is the last element of the input sequence.)

The third point is the most interesting. It sounds like an obvious requirement just to get it to work at all - until you think of possible optimizations. Imagine if you implemented Reverse with an optimization for arrays: we know the array won't change size, and we can find out that size easily enough - so we could just use the indexer on each iteration, starting off with an index of "length - 1" and decrementing until we'd yielded every value.

LINQ to Objects doesn't behave this way - and that's observable because if you change the value of the array after you start iterating over the result sequence, you don't see those changes. Deferred execution means that you will see changes made to the array after the call to Reverse but before you start iterating over the results, however.

Note that the buffering nature of this operator means that you can't use it on infinite sequences - which makes sense when you think about it. What's the last element of an infinite sequence?

What are we going to test?

Most of the tests are pretty obvious, but I have one test to demonstrate how the timing of changes to the contents of the input sequence affect the result sequence:

[Test]
public void ArraysAreBuffered()
{
    // A sneaky implementation may try to optimize for the case where the collection
    // implements IList or (even more "reliable") is an array: it mustn't do this,
    // as otherwise the results can be tainted by side-effects within iteration
    int[] source = { 0, 1, 2, 3 };

    var query = source.Reverse();
    source[1] = 99; // This change *will* be seen due to deferred execution
    using (var iterator = query.GetEnumerator())
    {
        iterator.MoveNext();
        Assert.AreEqual(3, iterator.Current);

        source[2] = 100; // This change *won't* be seen               
        iterator.MoveNext();
        Assert.AreEqual(2, iterator.Current);

        iterator.MoveNext();
        Assert.AreEqual(99, iterator.Current);

        iterator.MoveNext();
        Assert.AreEqual(0, iterator.Current);
    }
}

If you can think of any potentially-surprising tests, I'd be happy to implement them - there wasn't much I could think of in terms of corner cases.

Let's implement it!

Eager validation of source combined with deferred execution suggests the normal implementation of splitting the operator into two methods - I won't bother showing the public part, as it only does exactly what you'd expect it to. However, to make up for the fact that Reverse is so simple, I'll present three implementations of the "Impl" method.

First, let's use a collection which performs the reversing for us automatically: a stack. The iterator returned by Stack<T> returns items in the order in which they would be seen by multiple calls to Pop - i.e. the reverse of the order in which they were added. This makes the implementation trivial:

private static IEnumerable<TSource> ReverseImpl<TSource>(IEnumerable<TSource> source)
{
    Stack<TSource> stack = new Stack<TSource>(source);
    foreach (TSource item in stack)
    {
        yield return item;
    }
}

Again, with "yield foreach" we could have done this in a single statement.

Next up, a linked list. In some ways, using a linked list is very natural - you never need to resize an array, or anything like that. On the other hand, we have an extra node object for every single element, which is a massive overhead. It's not what I'd choose to use for production in this case, but it's worth showing:

private static IEnumerable<TSource> ReverseImpl<TSource>(IEnumerable<TSource> source)
{
    LinkedList<TSource> list = new LinkedList<TSource>(source);
    LinkedListNode<TSource> node = list.Last; // Property, not method!
    while (node != null)
    {
        yield return node.Value;
        node = node.Previous;
    }
}

Finally, a more "close to the metal" approach using our existing ToBuffer method:

private static IEnumerable<TSource> ReverseImpl<TSource>(IEnumerable<TSource> source)
{
    int count;
    TSource[] array = source.ToBuffer(out count);
    for (int i = count - 1; i >= 0; i--)
    {
        yield return array[i];
    }
}

This is probably not significantly more efficient than the version using Stack<T> - I expect Stack<T> has a similar implementation to ToBuffer when it's constructed with an input sequence. However, as it's so easy to count down from the end of the array to the start, we don't really need to take advantage of any of the features of Stack - so we might as well just use the array directly.

Note that this relies on the fact that ToBuffer will create a copy of whatever it's given, including an array. That's okay though - we're relying on that all over the place :)

Conclusion

It's hard to see how this could really be optimized any further, other than by improving ToBuffer based on usage data. Overall, a lovely simple operator.

It's probably about time I tackled some of the arithmetic aggregation operators... so next time I'll probably implement Sum.

Posted by skeet | 17 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 26d - Fixing the key selectors, and yielding early

I feel I need a voice over. "Previously, on reimplementing LINQ to Objects..." Well, we'd got as far as a working implementation of OrderedEnumerable which didn't have terrible performance - unless you had an expensive key selector. Oh, and it didn't make use of the fact that we may only want the first few results.

Executing key selectors only once

Our first problem is to do with the key selectors. For various reasons (mentioned in part 26b) life is better if we execute the key selector once per input element. While we can do that with lazy evaluation, it makes more sense in my opinion to do it up-front. That means we need to separate out the key selector from the key comparer - in other words, we need to get rid of the handy ProjectionComparer we used to simplify the arguments to OrderBy/ThenBy/etc.

If we're going to keep the key selectors in a strongly typed way, that means our OrderedEnumerable (or at least some type involved in the whole business) needs to become generic in the key type. Let's bite the bullet and make it OrderedEnumerable. Now we have a slight problem right away in the fact that the "CreateOrderedEnumerable" method is generic, introducing a new type parameter TKey... so we shouldn't use TKey as the name of the new type parameter for OrderedEnumerable. We could rename the type parameter in the generic method implementation, but I'm becoming a big believer in leaving the signatures of methods alone when I implement an interface. For type parameters it's not too bad, but for normal parameters it can be awful if you mess around with the names - particularly for those using named arguments.

Thinking ahead, our single "key" type parameter in OrderedEnumerable could well end up being a composite key. After all, if we have OrderBy(...).ThenBy(...).ThenBy(...) we're going to have to have some way of representing the key formed by the three selectors. It makes sense to use a "nested" key type, where the key type of OrderedEnumerable is always the "composite key so far". Thus I named the type parameter TCompositeKey, and introduced an appropriate field. Here's the skeleton of the new class:

internal class OrderedEnumerable<TElement, TCompositeKey> : IOrderedEnumerable<TElement>
{
    private readonly IEnumerable<TElement> source;
    private readonly Func<TElement, TCompositeKey> compositeSelector;
    private readonly IComparer<TCompositeKey> compositeComparer;

    internal OrderedEnumerable(IEnumerable<TElement> source,
        Func<TElement, TCompositeKey> compositeSelector,
        IComparer<TCompositeKey> compositeComparer)
    {
        this.source = source;
        this.compositeSelector = compositeSelector;
        this.compositeComparer = compositeComparer;
    }

    // Interface implementations here
}

(I'm aware this is very "stream of consciousness" - I'm assuming that presenting the decisions in the order in which I addressed them is a good way of explaining the necessary changes. Apologies if the style doesn't work for you.)

ThenBy and ThenByDescending don't have to change at all - they were already just using the interface. OrderBy and OrderByDescending become a little simpler, as we don't need to build the projection comparer. Here's the new version of OrderBy:

public static IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    return new OrderedEnumerable<TSource, TKey>
        (source, keySelector, comparer ?? Comparer<TKey>.Default);
}

Lovely - we just call a constructor, basically.

So far, so good. Now what about the implementation of IOrderedEnumerable? We should expect this to get messy, because there are three types of key involved:

  • The current key type
  • The secondary key type
  • The composite key type

Currently we don't even have a type which can represent the composite key. We could use something like KeyValuePair<TKey, TValue>, but that doesn't really give the right impression. Instead, let's create our own simple type:

internal struct CompositeKey<TPrimary, TSecondary>
{
    private readonly TPrimary primary;
    private readonly TSecondary secondary;

    internal TPrimary Primary { get { return primary; } }
    internal TSecondary Secondary{ get { return secondary; } }

    internal CompositeKey(TPrimary primary, TSecondary secondary)
    {
        this.primary = primary;
        this.secondary = secondary;
    }
}

Now we can easily create a projection from two key selectors to a new one which selects a composite key. However, we'll need to do the same thing for a comparer. We could use the CompoundComparer class we created before, but that will end up with quite a bit of indirection. Instead, it would be nice to have a type to work directly with CompositeKey - something which knew it was dealing with comparers of different types, one for each part of the key.

We could create a completely separate top-level type for that... but specifying the type parameters again seems a bit daft when we can reuse them by simply creating a nested class within CompositeKey:

internal struct CompositeKey<TPrimary, TSecondary>
{
    // Other members as shown above

    internal sealed class Comparer : IComparer<CompositeKey<TPrimary, TSecondary>>
    {
        private readonly IComparer<TPrimary> primaryComparer;
        private readonly IComparer<TSecondary> secondaryComparer;

        internal Comparer(IComparer<TPrimary> primaryComparer,
                          IComparer<TSecondary> secondaryComparer)
        {
            this.primaryComparer = primaryComparer;
            this.secondaryComparer = secondaryComparer;
        }

        public int Compare(CompositeKey<TPrimary, TSecondary> x,
                           CompositeKey<TPrimary, TSecondary> y)
        {
            int primaryResult = primaryComparer.Compare(x.Primary, y.Primary);
            if (primaryResult != 0)
            {
                return primaryResult;
            }
            return secondaryComparer.Compare(x.Secondary, y.Secondary);
        }
    }
}

This may look a little odd to begin with, but the two types really are quite deeply connected.

Now that we can compose keys in terms of both selection and comparison, we can implement CreateOrderedEnumerable:

public IOrderedEnumerable<TElement> CreateOrderedEnumerable<TKey>(
    Func<TElement, TKey> keySelector,
    IComparer<TKey> comparer,
    bool descending)
{
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    comparer = comparer ?? Comparer<TKey>.Default;
    if (descending)
    {
        comparer = new ReverseComparer<TKey>(comparer);
    }

    // Copy to a local variable so we don't need to capture "this"
    Func<TElement, TCompositeKey> primarySelector = compositeSelector;
    Func<TElement, CompositeKey<TCompositeKey, TKey>> newKeySelector = 
        element => new CompositeKey<TCompositeKey, TKey>(primarySelector(element), keySelector(element));

    IComparer<CompositeKey<TCompositeKey, TKey>> newKeyComparer =
        new CompositeKey<TCompositeKey, TKey>.Comparer(compositeComparer, comparer);

    return new OrderedEnumerable<TElement, CompositeKey<TCompositeKey, TKey>>
        (source, newKeySelector, newKeyComparer);
}

I'm not going to pretend that the second half of the method is anything other than ghastly. I'm not sure I've ever written code which is so dense in type arguments. IComparer<CompositeKey<TCompositeKey, TKey>> is a particularly "fine" type. Ick.

However, it works - and once you've got your head round what each of the type parameters actually means at any one time, it's not really complicated code - it's just verbose and clunky.

The only bit which might require a bit of explanation is the primarySelector variable. I could certainly have just used compositeSelector within the lambda expression used to create the new key selector - it's not like it's going to change, after all. The memory benefits of not having a reference to "this" (where the intermediate OrderedEnumerable is likely to be eligible for GC collection immediately, in a typical OrderBy(...).ThenBy(...) call) are almost certainly not worth it. It just feels right to have both the primary and secondary key selectors in the same type, which is what will happen with the current code. They're both local variables, they'll be captured together, all will be well.

I hope you can see the parallel between the old code and the new code. Previously we composed a new (element-based) comparer based on the existing comparer, and a projection comparer from the method parameters. Now we're composing a new key selector and a new key comparer. It's all the same idea, just maintaining the split between key selection and key comparison.

Now let's sort...

So far, we haven't implemented GetEnumerator - and that's all. As soon as we've done that to our satisfaction, we're finished with ordering.

There are several approaches to how we could sort. Here are a few of them:

  • Project each element to its key, and create a KeyValuePair for each item. Merge sort in the existing way to achieve stability. This will involve copying a lot of data around - particularly if the element and key types end up being large value types.
  • Project each element to a { key, index } pair, and create another composite comparer which uses the index as a tie-breaker to achieve stability. This still involves copying keys around, but it means we could easily use a built-in sort (such as List<T>).
  • Project each element to a key, and separately create an array of indexes (0, 1, 2, 3...). Sort the indexes by accessing the relevant key at any point, using indexes as tie-breakers. This requires a more fiddly sort, as we need to keep indexing into the indexes array.
  • Build up "chunks" of sorted data as we read it in, keeping some number of chunks and merging them appropriate when we want to. We can then yield the results without ever performing a full sort, by effectively performing the "merge" operation of merge sort, just yielding values instead of copying them to temporary storage. (Obviously this is trivial with 2 chunks, but can be extended to more.)
  • Do something involving a self-balancing binary tree :)

I decided to pick the middle option, using quicksort as the sorting algorithm. This comes with the normal problems of possibly picking bad pivots, but it's usually a reasonable choice. I believe there are cunning ways of improving the worst-case performance, but I haven't implemented any of those.

Here's the non-quicksort part of the code, just to set the scene.

public IEnumerator<TElement> GetEnumerator()
{
    // First copy the elements into an array: don't bother with a list, as we
    // want to use arrays for all the swapping around.
    int count;
    TElement[] data = source.ToBuffer(out count);

    int[] indexes = new int[count];
    for (int i = 0; i < indexes.Length; i++)
    {
        indexes[i] = i;
    }

    TCompositeKey[] keys = new TCompositeKey[count];
    for (int i = 0; i < keys.Length; i++)
    {
        keys[i] = compositeSelector(data[i]);
    }

    QuickSort(indexes, keys, 0, count - 1);

    for (int i = 0; i < indexes.Length; i++)
    {
        yield return data[indexes[i]];
    }
}

I could certainly have combined the first two loops - I just liked the separation provided in this code. One tiny micro-optimization point to note is that for each loop I'm using the Length property of the array rather than "count" as the upper bound, as I believe that will reduce the amount of array boundary checking the JIT will generate. I very much doubt that it's relevant, admittedly :) I've left the code here as it is in source control - but looking at it now, I could certainly have used a foreach loop on the final yield part. We wouldn't be able to later, admittedly... but I'll come to that all in good time.

The actual quicksort part is reasonably standard except for the fact that I pass in both the arrays for both indexes and keys - usually there's just the one array which is being sorted. Here's the code for both the recursive call and the partition part:

private void QuickSort(int[] indexes, TCompositeKey[] keys, int left, int right)
{
    if (right > left)
    {
        int pivot = left + (right - left) / 2;
        int pivotPosition = Partition(indexes, keys, left, right, pivot);
        QuickSort(indexes, keys, left, pivotPosition - 1);
        QuickSort(indexes, keys, pivotPosition + 1, right);
    }
}

private int Partition(int[] indexes, TCompositeKey[] keys, int left, int right, int pivot)
{
    // Remember the current index (into the keys/elements arrays) of the pivot location
    int pivotIndex = indexes[pivot];
    TCompositeKey pivotKey = keys[pivotIndex];

    // Swap the pivot value to the end
    indexes[pivot] = indexes[right];
    indexes[right] = pivotIndex;
    int storeIndex = left;
    for (int i = left; i < right; i++)
    {
        int candidateIndex = indexes[i];
        TCompositeKey candidateKey = keys[candidateIndex];
        int comparison = compositeComparer.Compare(candidateKey, pivotKey);
        if (comparison < 0 || (comparison == 0 && candidateIndex < pivotIndex))
        {
            // Swap storeIndex with the current location
            indexes[i] = indexes[storeIndex];
            indexes[storeIndex] = candidateIndex;
            storeIndex++;
        }
    }
    // Move the pivot to its final place
    int tmp = indexes[storeIndex];
    indexes[storeIndex] = indexes[right];
    indexes[right] = tmp;
    return storeIndex;
}

It's interesting to observe how similar the quicksort and merge sort recursive parts are - both picking a midpoint, recursing on the left of it, recursing on the right of it, and performing some operation on the whole sublist. Of course the "some operation" is very different between partition and merge, and it occurs at a different time - but it's an interesting parallel nonetheless.

One significant difference between merge sort and quicksort is the use of the pivot. Once Partition has returned where the pivot element ended up, quicksort doesn't touch that element itself (we already know it will be in the right place). It recurses on the sublist entirely to the left of the pivot and the sublist entirely to the right of the pivot. Compare this with merge sort with recurses on two sublists which together comprise the whole list for that call.

The overloading of the word "index" here is unfortunate, but that is unfortunately life. Both sorts of "index" here really are indexes... you just need to keep an eye on which is which.

The final point to note is how we're using the indexes in the comparison, as a tie-break to keep stability. It's an ugly expression, but it does the job.

(As a small matter of language, I wasn't sure whether to use indexes or indices. I far prefer the former, so I used it. Having just checked in the dictionary, it appears both are correct. This reminds me of when I was writing C# in Depth - I could never decide between appendixes and appendices. Blech.)

Now, do you want to hear the biggest surprise I received last night? After I'd fixed up the compile-time errors to arrive at the code above, it worked first time. I'm not kidding. I'm not quite sure how I pulled that off (merge sort didn't take long either, but it did at least have a few tweaks to fix up) but it shocked the heck out of me. So, are we done? Well, not quite.

Yielding early

Just as a reminder, one of my aims was to be able to use iterator blocks to return some values to anyone iterating over the result stream without having to do all the sorting work. This means that in the case of calling OrderBy(...).Take(5) on a large collection, we can end up saving a lot of work... I hope!

This is currently fairly normal quicksort code, leaving the "dual arrays" aspect aside... but it's not quite amenable to early yielding. We're definitely computing the earliest results first, due to the order of the recursion - but we can't yield from the recursive method - iterator blocks just don't do that.

So, we'll have to fake the recursion. Fortunately, quicksort is only directly recursive - we don't need to worry about mutually recursive routines: A calling B which might call C or it might call back to A, etc. Instead, we can just keep a Stack<T> of "calls" to quicksort that we want to make, and execute the appropriate code within our GetEnumerator() method, so we can yield at the right point. Now in the original code, quicksort has four parameters, so you might expect our Stack<T> to have those four values within T too... but no! Two of those values are just the keys and indexes... and we already have those in two local variables. We only need to keep track of "right" and "left". Again, for the sake of clarity I decided to implement this using a custom struct - nested within OrderedEnumerable as there's no need for it to exist anywhere else:

private struct LeftRight
{
    internal int left, right;
    internal LeftRight(int left, int right)
    {
        this.left = left;
        this.right = right;
    }
}

Purists amongst you may curse at the use of internal fields rather than properties. I'm not bothered - this is a private class, and we're basically using this as a tuple. Heck, I would have used anonymous types if it weren't for two issues:

  • I wanted to use Stack<T>, and there's no way of creating one of those for an anonymous type (without introducing more generic methods to use type inference)
  • I wanted to use a struct - we'll end up creating a lot of these values, and there's simply no sense in them being individual objects on the heap. Anonymous types are always classes.

So, as a first step we can transform our code to use this "fake recursion" but still yield at the very end:

var stack = new Stack<LeftRight>();
stack.Push(new LeftRight(0, count - 1));
while (stack.Count > 0)
{
    LeftRight leftRight = stack.Pop();
    int left = leftRight.left;
    int right = leftRight.right;
    if (right > left)
    {
        int pivot = left + (right - left) / 2;
        int pivotPosition = Partition(indexes, keys, left, right, pivot);
        stack.Push(new LeftRight(pivotPosition + 1, right));
        stack.Push(new LeftRight(left, pivotPosition - 1));
    }
}

for (int i = 0; i < indexes.Length; i++) 

    yield return data[indexes[i]]; 
}

We initially push a value of (0, count - 1) to simulate the call to QuickSort(0, count - 1) which started it all before. The code within the loop is very similar to the original QuickSort method, with three changes:

  • We have to grab the next value of LeftRight from the stack, and then separate it into left and right values
  • Instead of calls to QuickSort, we have calls to stack.Push
  • We've reversed the order of the recursive calls: in order to sort the left sublist first, we have to push it onto the stack last.

Happy so far? We're getting very close now. All we need to do is work out when to yield. This is the bit which caused me the most headaches, until I worked out that the "if (right > left)" condition really meant "if we've got work to do"... and we're interested in the exact opposite scenario - when we don't have any work to do, as that means everything up to and including "right" is already sorted. There are two situations here: either right == left, i.e. we're sorting one element, or right == left - 1, which will occur if we picked a pivot which was the maximum or minimum value in the list at the previous recursive step.

It's taken me a little bit of thinking (and just running the code) to persuade me that we will always naturally reach a situation where we end up seeing right == count and right <= left, i.e. a place where we know we're completely done. But it's okay - it does happen.

It's not just a case of yielding the values between left and right though - because otherwise we'd never yield a pivot. Remember how I pointed out that quick sort missed out the pivot when specifying the sublists to recurse into? Well, that's relevant here. Fortunately, it's really easy to work out what to do. Knowing that everything up to and including "right" has been sorted means we just need to keep a cursor representing the next index to yield, and then just move that cursor up until it's positioned beyond "right". The code is probably easier to understand than the description:

int nextYield = 0;

var stack = new Stack<LeftRight>();
stack.Push(new LeftRight(0, count - 1));
while (stack.Count > 0)
{
    LeftRight leftRight = stack.Pop();
    int left = leftRight.left;
    int right = leftRight.right;
    if (right > left)
    {
        int pivot = left + (right - left) / 2;
        int pivotPosition = Partition(indexes, keys, left, right, pivot);
        // Push the right sublist first, so that we *pop* the
        // left sublist first
        stack.Push(new LeftRight(pivotPosition + 1, right));
        stack.Push(new LeftRight(left, pivotPosition - 1));
    }
    else
    {
        while (nextYield <= right)
        {
            yield return data[indexes[nextYield]];
            nextYield++;
        }
    }
}

Tada! It works (at least according to my tests).

I have tried optimizing this a little further, to deal with the case when right == left + 1, i.e. we're only sorting two elements. It feels like that ought to be cheaper to do explicitly than via pivoting and adding two pointless entries to the stack... but the code gets a lot more complicated (to the point where I had to fiddle significantly to get it working) and from what I've seen, it doesn't make much performance difference. Odd. If this were a production-quality library to be used in performance-critical situations I'd go further in the testing, but as it is, I'm happy to declare victory at this point.

Performance

So, how well does it perform? I've only performed crude tests, and they perplex me somewhat. I'm sure that last night, when I was running the "yield at the end" code, my tests were running twice as slowly in Edulinq as in LINQ to Objects. Fair enough - this is just a hobby, Microsoft have no doubt put a lot of performance testing effort into this. (That hasn't stopped them from messing up "descending" comparers, admittedly, as I found out last night to my amusement.) That was on my "meaty" laptop (which is 64-bit with a quad core i7). On my netbook this morning, the same Edulinq code seemed to be running slightly faster than LINQ to Objects. Odd.

This evening, having pulled the "early out" code from the source repository, the Edulinq implementation is running faster than the LINQ to Objects implementation even when the "early out" isn't actually doing much good. That's just plain weird. I blame my benchmarking methodology, which is far from rigorous. I've tweaked the parameters of my tests quite a bit, but I haven't tried all kinds of different key and element types, etc. The basic results are very roughly:

  • When evaluating the whole ordered list, Edulinq appears to run about 10% faster than LINQ to Objects
  • When evaluating only the top 5 of a large ordered list, Edulinq can be much faster. How much faster depends on the size of the list of course, and it still has to perform the initial complete partitioning step - but on 100,000 items it's regularly about 10x faster than LINQ to Objects.

That makes me happy :) Of course, the code is all open source, so if Microsoft wish to include the Edulinq implementation in .NET 5, they're quite at liberty to do so, as long as they abide by the terms of the licence. I'm not holding my breath ;)

More seriously, I fully expect there are a bunch of scenarios where my knocked-up-in-an-evening code performs slower than that in the framework. Maybe my approach takes a lot more memory. Maybe it has worse locality of reference in some scenarios. There are all kinds of possibilities here. Full performance analysis was never meant to be the goal of Edulinq. I'm doing this in the spirit of learning more about LINQ - but it's fun to try to optimize just a little bit. I'm going to delete the increasingly-inaccurately-named MergeSortTest project now - I may institute a few more benchmarks later on though. I'm also removing CompoundComparer and ProjectionComparer, which are no longer used. They'll live on in part 26a though...

Conclusions

Well that was fun, wasn't it? I'm pretty pleased with the result. The final code has some nasty generic complexity in it, but it's not too bad if you keep all the types clear in your mind.

None of the remaining operators will be nearly as complex as this, unless I choose to implement AsQueryable (which I wasn't planning on doing). On the other hand, as I've mentioned before, Max/Sum/etc have oodles of overloads. While I'll certainly implement all of them, I'm sure I'll only present the code for selected interesting overloads.

As a bit of light relief, I think I'll tackle Reverse. That's about as simple as it gets - although it could still present some interesting options.

Addendum

An earlier version of this post (and the merge sort implementation) had a flawed piece of code for choosing the pivot. Here's both the old and the new code:

// Old code
int pivot = (left + right) / 2;

// New code
int pivot = left + (right - left) / 2;

The difference is whether or not the code can overflow when left and right are very large. Josh Bloch wrote about it back in 2006. A colleague alerted me to this problem shortly after posting, but it's taken until now to correct it. (I fixed the source repository almost immediately, but deferred writing this addendum.) Why was I not too worried? Because .NET restricts each object to be less than 2GB in size, even in .NET 4.0, even on a 64-bit CLR. As we've created an array of integers, one per entry, that means we can only have just under (int.MaxValue / 4) elements. Within those limits, there's no problem in the original pivot code. However, it's still worth fixing of course - one never knows when the restriction will be lifted. The CLR team blogged about the issue back in 2005 (when the 64-bit CLR was new) - I haven't seen any mentions of plans to remove the limitation, but I would imagine it's discussed periodically.

One oddity about this is that the Array class itself has some API support for large arrays, such as the LongLength property. To be honest, I can't see large arrays ever being particularly pleasant to work with - what would they return for the normal Length property, for example, or their implementation of IList<T> etc? I suspect we may see support for larger objects before we see support for arrays with more than int.MaxValue elements, but that's a complete guess.

Posted by skeet | 13 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 26c - Optimizing OrderedEnumerable

Part 26b left us with a working implementation of the ordering operators, with two caveats:

  • The sort algorithm used was awful
  • We were performing the key selection on every comparison, instead of once to start with

Today's post is just going to fix the first bullet - although I'm pretty sure that fixing the second will require changing it again completely.

Choosing a sort algorithm

There are lots of sort algorithms available. In our case, we need the eventual algorithm to:

  • Work on arbitrary pair-based comparisons
  • Be stable
  • Go like the clappers :)
  • (Ideally) allow the first results to be yielded without performing all the sorting work, and without affecting the performance in cases where we do need all the results.

The final bullet it an interesting one to me: it's far from unheard of to want to get the "top 3" results from an ordered query. In LINQ to Objects we can't easily tell the Take operator about the OrderBy operator so that it could pass on the information, but we can potentially yield the first results before we've sorted everything. (In fact, we could add an extra interface specifically to enable this scenario, but it's not part of normal LINQ to Objects, and could introduce horrible performance effects with innocent-looking query changes.)

If we decide to implement sorting in terms of a naturally stable algorithm, that limits the choices significantly. I was rather interested in timsort, and may one day set about implementing it - but it looked far too complicated to introduce just for the sake of Edulinq.

The best bet seemed to be merge sort, which is reasonably easy to implement and has reasonable efficiency too. It requires extra memory and a fair amount of copying, but we can probably cope with that.

We don't have to use a stable sort, of course. We could easily regard our "key" as the user-specified key plus the original index, and use that index as a final tie-breaker when comparing elements. That gives a stable result while allowing us to use any sorting algorithm we want. This may well be the approach I take eventually - especially as quicksort would allow us to start yielding results early in a fairly simple fashion. For the moment though, I'll stick with merge sort.

Preparing for merge sort

Just looking from the algorithm for merge sort, it's obvious that there will be a good deal of shuffling data around. As we want to make the implementation as fast as possible, that means it makes sense to use arrays to store the data. We don't need dynamic space allocation (after we've read all the data in, anyway) or any of the other features associated with higher-level collections. I'm aware that arrays are considered (somewhat) harmful, but purely for the internals of an algorithm which does so much data access, I believe they're the most appropriate solution.

We don't even need our arrays to be the right size - assuming we need to read in all the data before we start processing it (which will be true for this implementation of merge sort, but not for some other algorithms I may consider in the future) it's fine to use an oversized array as temporary storage - it's never going to be seen by the users, after all.

We've already got code which reads in all the data into a possibly-oversized array though - in the optimized ToArray code. So my first step was to extract out that functionality into a new internal extension method. This has to return a buffer containing all the data and give us an indication of the size. In .NET 4 I could use Tuple to return both pieces of data, but we can also just use an out parameter - I've gone for the latter approach at the moment. Here's the ToBuffer extension method:

internal static TSource[] ToBuffer<TSource>(this IEnumerable<TSource> source, out int count)
{
    // Optimize for ICollection<T>
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null)
    {
        count = collection.Count;
        TSource[] tmp = new TSource[count];
        collection.CopyTo(tmp, 0);
        return tmp;
    }

    // We'll have to loop through, creating and copying arrays as we go
    TSource[] ret = new TSource[16];
    int tmpCount = 0;
    foreach (TSource item in source)
    {
        // Need to expand...
        if (tmpCount == ret.Length)
        {
            Array.Resize(ref ret, ret.Length * 2);
        }
        ret[tmpCount++] = item;
    }
    count = tmpCount;
    return ret;
}

Note that I've used a local variable to keep track of the count in the loop near the end, only copying it into the output variable just before returning. This is due to a possibly-unfounded performance concern: we don't know where the variable will actually "live" in storage - and I'd rather not cause some arbitrary page of heap memory to be required all the way through the loop. This is a gross case of micro-optimization without evidence, and I'm tempted to remove it... but I thought I'd at least share my thinking.

This is only an internal API, so I'm trusting callers not to pass me a null "source" reference. It's possible that it would be a useful operator to expose at some point, but not just now. (If it were public, I would definitely use a local variable in the loop - otherwise callers could get weird effects by passing in a variable which could be changed elsewhere - such as due to side-effects within the loop. That's a totally avoidable problem, simply by using a local variable. For an internal API, I just need to make sure that I don't do anything so silly.)

Now ToArray needs to be changed to call ToBuffer, which is straightforward:

public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    int count;
    TSource[] ret = source.ToBuffer(out count);
    // Now create another copy if we have to, in order to get an array of the
    // right size
    if (count != ret.Length)
    {
        Array.Resize(ref ret, count);
    }
    return ret;
}

then we can prepare our OrderedEnumerable.GetEnumerator method for merging:

public IEnumerator<TElement> GetEnumerator()
{
    // First copy the elements into an array: don't bother with a list, as we
    // want to use arrays for all the swapping around.
    int count;
    TElement[] data = source.ToBuffer(out count);
    TElement[] tmp = new TElement[count];
            
    MergeSort(data, tmp, 0, count - 1);
    for (int i = 0; i < count; i++)
    {
        yield return data[i];
    }
}

The "tmp" array is for use when merging - while there is an in-place merge sort, it's more complex than the version where the "merge" step merges two sorted lists into a combined sorted list in temporary storage, then copies it back into the original list.

The arguments of 0 and count - 1 indicate that we want to sort the whole list - the parameters to my MergeSort method take the "left" and "right" boundaries of the sublist to sort - both of which are inclusive. Most of the time I'm more used to using exclusive upper bounds, but all the algorithm descriptions I found used inclusive upper bounds - so it made it easier to stick with that than try to "fix" the algorithm to use exclusive upper bounds everywhere. I think it highly unlikely that I'd get it all right without any off-by-one errors :)

Now all we've got to do is write an appropriate MergeSort method, and we're done.

Implementing MergeSort

I won't go through the details of how a merge sort works - read the wikipedia article for a pretty good description. In brief though, the MergeSort method guarantees that it will leave the specified portion of the input data sorted. It does this by splitting that section in half, and recursively merge sorting each half. It then merges the two halves by walking along two cursors (one from the start of each subsection) finding the smallest element out of the two at each point, copying that element into the temporary array and advancing just that cursor. When it's finished, the temporary storage will contain the sorted section, and it's copied back to the "main" array. The recursion has to stop at some point, of course - and in my implementation it stops if the section has fewer than three elements.

Here's the MergeSort method itself first:

// Note: right is *inclusive*
private void MergeSort(TElement[] data, TElement[] tmp, int left, int right)
{
    if (right > left)
    {
        if (right == left + 1)
        {
            TElement leftElement = data[left];
            TElement rightElement = data[right];
            if (currentComparer.Compare(leftElement, rightElement) > 0)
            {
                data[left] = rightElement;
                data[right] = leftElement;
            }
        }
        else
        {
            int mid = left + (right - left) / 2;
            MergeSort(data, tmp, left, mid);
            MergeSort(data, tmp, mid + 1, right);
            Merge(data, tmp, left, mid + 1, right);
        }
    }
}

The test for "right > left" is part of a vanilla merge sort (if the section either has one element or none, we don't need to take any action), but I've optimized the common case of only two elements. All we need to do is swap the elements - and even then we only need to do so if they're currently in the wrong order. There's no point in setting up all the guff of the two cursors - or even have the slight overhead of a method call - for that situation.

Other than that one twist, this is a pretty standard merge sort. Now for the Merge method, which is slightly more complicated (although still reasonably straighforward):

private void Merge(TElement[] data, TElement[] tmp, int left, int mid, int right)
{
    int leftCursor = left;
    int rightCursor = mid;
    int tmpCursor = left;
    TElement leftElement = data[leftCursor];
    TElement rightElement = data[rightCursor];
    // By never merging empty lists, we know we'll always have valid starting points
    while (true)
    {
        // When equal, use the left element to achieve stability
        if (currentComparer.Compare(leftElement, rightElement) <= 0)
        {
            tmp[tmpCursor++] = leftElement;
            leftCursor++;
            if (leftCursor < mid)
            {
                leftElement = data[leftCursor];
            }
            else
            {
                // Only the right list is still active. Therefore tmpCursor must equal rightCursor,
                // so there's no point in copying the right list to tmp and back again. Just copy
                // the already-sorted bits back into data.
                Array.Copy(tmp, left, data, left, tmpCursor - left);
                return;
            }
        }
        else
        {
            tmp[tmpCursor++] = rightElement;
            rightCursor++;
            if (rightCursor <= right)
            {
                rightElement = data[rightCursor];
            }
            else
            {
                // Only the left list is still active. Therefore we can copy the remainder of
                // the left list directly to the appropriate place in data, and then copy the
                // appropriate portion of tmp back.
                Array.Copy(data, leftCursor, data, tmpCursor, mid - leftCursor);
                Array.Copy(tmp, left, data, left, tmpCursor - left);
                return;
            }
        }
    }
}

Here, "mid" is the exclusive upper bound of the left subsection, and the inclusive lower bound of the right subsection... whereas "right" is the inclusive upper bound of the right subsection. Again, it's possible that this is worth tidying up at some point to be more consistent, but it's not too bad.

This time there's a little bit more special-casing. We take the approach that whichever sequence runs out first (which we can detect as soon as the "currently advancing" cursor hits its boundary), we can optimize what still has to be copied. If the "left" sequence runs out first, then we know the remainder of the "right" sequence must already be in the correct place - so all we have to do is copy as far as we've written with tmpCursor back from the temporary array to the main array.

If the "right" sequence runs out first, then we can copy the rest of the "left" sequence directly into the right place (at the end of the section) and then again copy just what's needed from the temporary array back to the main array.

This is as fast as I've managed to get it so far (without delving into too many of the more complicated optimizations available) - and I'm reasonably pleased with it. I have no doubt that it could be improved significantly, but I didn't want to spend too much effort on it when I knew I'd be adapting everything for the key projection difficulty anyway.

Testing

I confess I don't know the best way to test sorting algorithms. I have two sets of tests here:

  • A new project (MergeSortTest) where I actually implemented the sort before integrating it into OrderedEnumerable
  • All my existing OrderBy (etc) tests

The new project also acts as a sort of benchmark - although it's pretty unscientific, and the key projection issue means the .NET implementation isn't really comparable with the Edulinq one at the moment. Still, it's a good indication of very roughly how well the implementation is doing. (It varies, interestingly enough... on my main laptop, it's about 80% slower than LINQ to Objects; on my netbook it's only about 5% slower. Odd, eh?) The new project sorts a range of sizes of input data, against a range of domain sizes (so with a small domain but a large size you're bound to get equal elements - this helps to verify stability). The values which get sorted are actually doubles, but we only sort based on the integer part - so if the input sequence is 1.3, 3.5, 6.3, 3.1 then we should get an output sequence of 1.3, 3.5, 3.1, 6.3 - the 3.5 and 3.1 are in that order due to stability, as they compare equal under the custom comparer. (I'm performing the "integer only" part using a custom comparer, but we could equally have used OrderBy(x => (int) x)).

Conclusion

One problem (temporarily) down, one to go. I'm afraid that the code in part 26d is likely to end up being pretty messy in terms of generics - and even then I'm likely to talk about rather more options than I actually get round to coding.

Still, our simplistic model of OrderedEnumerable has served us well for the time being. Hopefully it's proved more useful educationally this way - I suspect that if I'd dived into the final code right from the start, we'd all end up with a big headache.

Posted by skeet | 8 comment(s)
Filed under: , ,

Reimplementing LINQ to Objects: Part 26b - OrderBy{,Descending}/ThenBy{,Descending}

Last time we looked at IOrderedEnumerable<TElement> and I gave an implementation we could use in order to implement the public extension methods within LINQ. I'm still going to do that in this post, but it's worth mentioning something else that's coming up in another part (26d) - I'm going to revisit my OrderedEnumerable implementation.

There may be trouble ahead...

A comment on the previous post mentioned how my comparer executes the keySelector on each element every time it makes a comparison. I didn't think of that as a particularly awful problem, until I thought of this sample query to rank people's favourite colours:

var query = people.GroupBy(p => p.FavouriteColour)
                  .OrderByDescending(g => g.Count())
                  .Select(g => g.Key);

Eek. Now every time we compare two elements, we have to count everything in a group. Ironically, I believe that counting the items in a group is fast using the LINQ to Objects implementation, but not in mine - something I may fix later on. But with LINQ to Objects, this wouldn't cause a problem in the first place!

There are ways to make this use an efficient key selector, of course - a simple Select before the OrderByDescending call would do fine... but it would be nicer if it wasn't a problem in the first place. Basically we want to extract the keys for each element once, and then compare them repeatedly when we need to. This would also allow us to shuffle a sequence using code such as this:

Random rng = new Random(); // Or get it from elsewhere...
var shuffled = collection.OrderBy(x => rng.NextDouble());

I'm not advocating that way of shuffling, admittedly - but it would be nice if it didn't cause significant problems, which it currently would, as the key selector is non-deterministic.

The interesting thing is that when I've finished today's post, I believe the code will obey all the documented behaviour of LINQ to Objects: there's nothing in the documentation about how often the key selector will be called. That doesn't mean it's a good idea to ignore this problem though, which is why I'll revisit OrderedEnumerable later. However, that's going to complicate the code somewhat... so while we're still getting to grips with how everything hangs together, I'm going to stick to my inefficient implementation.

Meanwhile, back to the actual LINQ operators for the day...

What are they?

OrderBy, OrderByDescending, ThenBy and ThenByDescending all have very similar overloads:

public static IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)

public static IOrderedEnumerable<TSource> OrderByDescending<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static IOrderedEnumerable<TSource> OrderByDescending<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)

public static IOrderedEnumerable<TSource> ThenBy<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static IOrderedEnumerable<TSource> ThenBy<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)

public static IOrderedEnumerable<TSource> ThenByDescending<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector)

public static IOrderedEnumerable<TSource> ThenByDescending<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)

They're all extension methods, but ThenBy/ThenByDescending are extension methods on IOrderedEnumerable<T> instead of IEnumerable<T>.

We've already talked about what they do to some extent - each of them returns a sequence which is ordered according to the specified key. However, in terms of details:

  • The source and keySelector parameters can't be null, and are validated eagerly.
  • The comparer parameter (where provided) can be null, in which case the default comparer for the key type is used.
  • They use deferred execution - the input sequence isn't read until it has to be.
  • They read and buffer the entire input sequence when the result is iterated. Or rather, they buffer the original input sequence - as I mentioned last time, when a compound ordered sequence (source.OrderBy(...).ThenBy(...).ThenBy(...)) is evaluated, the final query will go straight to the source used for OrderBy, rather than sorting separately for each key.

What are we going to test?

I have tests for the following:

  • Deferred execution (using ThrowingEnumerable)
  • Argument validation
  • Ordering stability
  • Simple comparisons
  • Custom comparers
  • Null comparers
  • Ordering of null keys

In all of the tests which don't go bang, I'm using an anonymous type as the source, with integer "Value" and "Key" properties. I'm ordering using the key, and then selecting the value - like this:

[Test]
public void OrderingIsStable()
{
    var source = new[]
    {
        new { Value = 1, Key = 10 },
        new { Value = 2, Key = 11 },
        new { Value = 3, Key = 11 },
        new { Value = 4, Key = 10 },
    };
    var query = source.OrderBy(x => x.Key)
                      .Select(x => x.Value);
    query.AssertSequenceEqual(1, 4, 2, 3);
}

For ThenBy/ThenByDescending I have multiple key properties so I can test the interaction between the primary and secondary orderings. For custom key comparer tests, I have an AbsoluteValueComparer which simply compares the absolute values of the integers provided.

The "Value" property is always presented in ascending order (from 1) to make it easier to keep track of, and the "Key" properties are always significantly larger so we can't get confused between the two. I originally used strings for the keys in all tests, but then I found out that the default string comparer was culture-sensitive and didn't behave how I expected it to. (The default string equality comparer uses ordinal comparisons, which are rather less brittle...) I still use strings for the keys in nullity tests, but there I'm specifying the ordinal comparer.

I wouldn't claim the tests are exhaustive - by the time you've considered multiple orderings with possibly equal keys, different comparers etc the possibilities are overwhelming. I'm reasonably confident though (particularly after the tests found some embarrassing bugs in the implementation). I don't think they're hugely readable either - but I was very keen to keep the value separated from the key, rather than just ordering by "x => x" in tests. If anyone fancies cloning the repository and writing better tests, I'd be happy to merge them :)

What I deliberately don't have yet is a test for how many times the key selector is executed: I'll add one before post 26d, so I can prove we're doing the right thing eventually.

Let's implement them!

We've got two bits of implementation to do before we can run the tests:

  • The extension methods
  • The GetEnumerator() method of OrderedEnumerable

The extension methods are extremely easy. All of the overloads without comparers simply delegate to the ones with comparers (using Comparer<TKey>.Default) and the remaining methods look like this:

public static IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    return new OrderedEnumerable<TSource>(source,
        new ProjectionComparer<TSource, TKey>(keySelector, comparer));
}

public static IOrderedEnumerable<TSource> OrderByDescending<TSource, TKey>(
    this IEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    IComparer<TSource> sourceComparer = new ProjectionComparer<TSource, TKey>(keySelector, comparer);
    sourceComparer = new ReverseComparer<TSource>(sourceComparer);
    return new OrderedEnumerable<TSource>(source, sourceComparer);
}

public static IOrderedEnumerable<TSource> ThenBy<TSource, TKey>(
    this IOrderedEnumerable<TSource> source,
    Func<TSource, TKey> keySelector,
    IComparer<TKey> comparer)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (keySelector == null)
    {
        throw new ArgumentNullException("keySelector");
    }
    return source.CreateOrderedEnumerable(keySelector, comparer, false);
}

(To get ThenByDescending, just change the name of the method and change the last argument of CreateOrderedEnumerable to true.)

All very easy. I'm pretty sure I'm going to want to change the OrderedEnumerable constructor to accept the key selector and key comparer in the future (in 26d), which will make the above code even simpler. That can wait a bit though.

Now for the sorting part in OrderedEnumerable. Remember that we need a stable sort, so we can't just delegate to List<T>.Sort - at least, not without a bit of extra fiddling. (We could project to a type which contained the index, and add that onto the end of the comparer as a final tie-breaker.)

For the minute - and I swear it won't stay like this - here's the horribly inefficient (but easy to understand) implementation I've got:

public IEnumerator<TElement> GetEnumerator()
{
    // This is a truly sucky way of implementing it. It's the simplest I could think of to start with.
    // We'll come back to it!
    List<TElement> elements = source.ToList();
    while (elements.Count > 0)
    {
        TElement minElement = elements[0];
        int minIndex = 0;
        for (int i = 1; i < elements.Count; i++)
        {
            if (currentComparer.Compare(elements[i], minElement) < 0)
            {
                minElement = elements[i];
                minIndex = i;
            }
        }
        elements.RemoveAt(minIndex);
        yield return minElement;
    }
}

We simply copy the input to a list (which is something we may well do in the final implementation - we certainly need to suck it all in somehow) and then repeatedly find the minimum element (favouring earlier elements over later ones, in order to achieve stability), removing them as we go. It's an O(n2) approach, but hey - we're going for correctness first.

Conclusion

This morning, I was pretty confident this would be an easy and quick post to write. Since then, I've been found pain in the following items:

  • Calling key selectors only once per element is more important than it might sound at first blush
  • The default sort order for string isn't what I'd have guessed
  • My (committed!) extension methods were broken, because I hadn't edited them properly after a cut and paste
  • Writing tests for situations where there are lots of combinations is irritating

So far these have only extended my estimated number of posts for this group of operators to 4 (26a-26d) but who knows what the next few days will bring...

Posted by skeet | 10 comment(s)
Filed under: , ,
More Posts Next page »