Reimplementing LINQ to Objects: Part 41 - How query expressions work

Okay, first a quick plug. This won't be in as much detail as chapter 11 of C# in Depth. If you want more, buy a copy. (Until Feb 1st, there's 43% off it if you buy it from Manning with coupon code j2543.) Admittedly that chapter has to also explain all the basic operators rather than just query expression translations, but that's a fair chunk of it.

If you're already familiar with query expressions, don't expect to discover anything particularly insightful here. However, you might be interested in the cheat sheet at the end, in case you forget some of the syntax occasionally. (I know I do.)

What is this "query expression" of which you speak?

Query expressions are a little nugget of goodness hidden away in section 7.16 of the C# specification. Unlike some language features like generics and dynamic typing, query expressions keep themselves to themselves, and don't impinge on the rest of the spec. A query expression is a bit of C# which looks a bit like a mangled version of SQL. For example:

from person in people
where person.FirstName.StartsWith("J")
orderby person.Age
select person.LastName

It looks somewhat unlike the rest of C#, which is both a blessing and a curse. On the one hand, queries stand out so it's easy to see they're queries. On the other hand... they stand out rather than fitting in with the rest of your code. To be honest I haven't found this to be an issue, but it can take a little getting used to.

Every query expression can be rewritten as ordinary C# method calls, but the reverse isn't true. Query expressions only cover a subset of the standard query operators - and only a limited set of the overloads, at that. It's not unusual to see a query expression followed by a "normal" query operator call, for example:

var list = (from person in people
            where person.FirstName.StartsWith("J")
            orderby person.Age
            select person.LastName)
           .ToList();

So, that's very roughly what they look like. That's the sort of thing I'm dealing with in this post. Let's start dissecting them.

Compiler translations

First it's worth introducing the general principle of query expressions: they effectively get translated step by step into C# which eventually doesn't contain any query expressions. To stick with our first example, that ends up being translated into this code:

people.Where(person => person.FirstName.StartsWith("J"))
      .OrderBy(person => person.Age)
      .Select(person => person.LastName)

It's important to understand that the compiler hasn't done anything apart from systematic translation to get to this point. In particular, so far we haven't depended on what "people" is, nor "Where", "OrderBy" or "Select".

Can you tell what this code does yet? You can probably hazard a pretty good guess, but you can't tell. Is it going to call Edulinq.Enumerable.Select, or System.Linq.Enumerable.Select, or something entirely different? It depends on the context. Heck, "people" could be the name of a type which has a static Where method. Or maybe it could be a reference to a class which has an instance method called Where... the options are open.

Of course, they don't stay open for long: the compiler takes that expression and compiles it applying all the normal rules. It converts the lambda expression into either a delegate or an expression tree, tries to resolve Where, OrderBy and Select as normal, and life continues. (Don't worry if you're not sure about expression trees yet - I'll come to them in another post.)

The important point is that the query expression translations don't know about System.Linq. The spec barely mentions IEnumerable<T>, and certainly doesn't rely on it. The whole thing is pattern based. If you have an API which provides some or all of the operators used by the pattern, in an appropriate way, you can use query expressions with it. That's the secret sauce that allows you to use the same syntax for LINQ to Objects, LINQ to SQL, LINQ to Entities, Reactive Extensions, Parallel Extensions and more.
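To make the "pattern based" point concrete, here's a minimal sketch. Box<T> is a made-up type for illustration only: it doesn't implement IEnumerable<T>, and the file never imports System.Linq, yet the query expression compiles because suitable Where and Select methods exist.

```csharp
using System;

// Hypothetical type for illustration: not a sequence, and no System.Linq in sight.
public sealed class Box<T>
{
    private readonly T value;
    private readonly bool hasValue;

    public Box(T value) { this.value = value; hasValue = true; }
    private Box() { }

    public bool HasValue { get { return hasValue; } }
    public T Value { get { return value; } }

    // The compiler only needs methods with suitable names and signatures.
    public Box<T> Where(Func<T, bool> predicate)
    {
        return hasValue && predicate(value) ? this : new Box<T>();
    }

    public Box<TResult> Select<TResult>(Func<T, TResult> selector)
    {
        return hasValue ? new Box<TResult>(selector(value)) : new Box<TResult>();
    }
}

public static class PatternDemo
{
    public static Box<int> LengthIfStartsWithJ(string name)
    {
        // Translated to: new Box<string>(name).Where(n => ...).Select(n => ...)
        return from n in new Box<string>(name)
               where n.StartsWith("J")
               select n.Length;
    }

    public static void Main()
    {
        Console.WriteLine(LengthIfStartsWithJ("Jon").HasValue);   // True
        Console.WriteLine(LengthIfStartsWithJ("Tom").HasValue);   // False
    }
}
```

Remove the Where method from Box<T> and the query stops compiling, which is exactly what you'd expect if the compiler is just translating and then resolving method calls as normal.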

Range variables and the initial "from" clause

The first part of the query to look at is the first "from" clause, at the start of the query. It's worth mentioning upfront that this is handled somewhat differently to any later "from" clauses - I'll explain how they're translated later.

So we have an expression of the form:

from [type] identifier in expression

The "expression" part is just any expression. In most cases there isn't a type specified, in which case the translated version is simply the expression, but with the compiler remembering the identifier as a range variable. I'll do my best to explain what range variables are in a minute :)

If there is a type specified, that represents a call to Cast<type>(). So examples of the two translations so far are:

// Query (incomplete)
from x in people

// Translation (+ range variable of "x")
people


// Query (incomplete)
from Person x in people

// Translation (+ range variable of "x")
(people).Cast<Person>()

These aren't complete query expressions - queries have very precise rules about how they can start and end. They always start with a "from" clause like this, and always end either with a "group by" clause or a "select" clause.
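As a small runnable sketch of the typed form, here's a query over an ArrayList, which only implements the non-generic IEnumerable. It uses a select clause (covered in the next section) purely to make the query complete; the names and data are made up for the example.

```csharp
using System;
using System.Collections;
using System.Linq;

public static class CastDemo
{
    public static string[] UpperCaseNames(ArrayList names)
    {
        // There's no Select available on ArrayList itself. The explicit type
        // makes the compiler emit:
        //   names.Cast<string>().Select(name => name.ToUpperInvariant())
        var query = from string name in names
                    select name.ToUpperInvariant();
        return query.ToArray();
    }

    public static void Main()
    {
        var names = new ArrayList { "Jon", "Holly" };
        Console.WriteLine(string.Join(", ", UpperCaseNames(names)));   // JON, HOLLY
    }
}
```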

So what's the point of the range variable? Well, that's what gets used as the name of the lambda expression parameter used in all the later clauses. Let's add a select clause to create a complete expression and demonstrate how the variable could be used.

A "select" clause

A select clause is usually translated into a call to Select, using the "body" of the clause as the body of the lambda expression... and the range variable as the parameter. So to expand our previous query, we might have this translation:

// Query
from x in people
select x.Name

// Translation
people.Select(x => x.Name)

That's all that range variables are used for: to provide placeholders within lambda expressions, effectively. They're quite unlike normal variables in most senses. It only makes sense to talk about the "value" of a range variable within a particular clause at a particular point in time when the clause is executing, for one particular value. Their nearest conceptual neighbour is probably the iteration variable declared in a foreach statement, but even that's not really the same - particularly given the way iteration variables are captured, often to the surprise of developers.

The body part has to be a single expression - you can't use "statement lambdas" in query expressions. For example, there's no query expression which would translate to this:

// Can't express this in a query expression
people.Select(x => { 
                     Console.WriteLine("Got " + x);
                     return x.Name;
                   })

That's a perfectly valid C# expression; it's just that there's no way of expressing it directly as a query expression.

I mentioned that a select clause usually translates into a Select call. There are two cases where it doesn't:

  • If it's the sole clause after a secondary "from" clause, or a "group by", "join" or "join ... into" clause, the body is used in the translation of that clause
  • If it's an "identity" projection coming after another clause, it's removed entirely.

I'll deal with the first point when we reach the relevant clauses. The second point leads to these translations:

// Query
from x in people
where x.IsAdult
select x

// Translation: Select is removed
people.Where(x => x.IsAdult)


// Query
from x in people
select x

// Translation: Select is *not* removed
people.Select(x => x)

The point of including the "pointless" select in the second translation is to hide the original source sequence; it's assumed that there's no need to do this in the first translation as the "Where" call will already have protected the source sufficiently.
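Here's a quick sketch of what "hiding the source" means in practice, assuming LINQ to Objects (System.Linq) is the implementation in scope. The list contents are made up.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DegenerateDemo
{
    public static bool QueryIsSource(List<string> source)
    {
        // Translated to source.Select(x => x), so the caller gets a new
        // sequence object rather than a reference to the list itself.
        IEnumerable<string> query = from x in source
                                    select x;
        return object.ReferenceEquals(query, source);
    }

    public static void Main()
    {
        var people = new List<string> { "Jon", "Holly" };
        Console.WriteLine(QueryIsSource(people));   // False
    }
}
```

If the Select call were removed, a caller could cast the result back to List<string> and modify the original collection.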

The "where" clause

This one's really simple - especially as we've already seen it! A where clause always just translates into a Where call. Sample translation, this time with no funny business removing degenerate query expressions:

// Query
from x in people
where x.IsAdult
select x.Name

// Translation
people.Where(x => x.IsAdult)
      .Select(x => x.Name)

Note how the range variable is propagated through the query.

The "orderby" clause

Here's a secret: I can never remember offhand whether it's "orderby" or "order by" - it's confusing because it really is "group by", but "orderby" is actually just a single word. Of course, Visual Studio gives a pretty unsubtle hint in terms of colouring.

In the simplest form, an orderby clause might look like this:

// Query
from x in people
orderby x.Age
select x.Name

// Translation
people.OrderBy(x => x.Age)
      .Select(x => x.Name)

There are two things which can add complexity though:

  • You can order by multiple expressions, separating them by commas
  • Each expression can be ordered ascending implicitly, ascending explicitly or descending explicitly.

The first sort expression is always translated into OrderBy or OrderByDescending; subsequent ones always become ThenBy or ThenByDescending. It makes no difference whether you explicitly specify "ascending" or not - I've very rarely seen it in real queries. Here's an example putting it all together:

// Query
from x in people
orderby x.Age, x.FirstName descending, x.LastName ascending
select x.LastName

// Translation
people.OrderBy(x => x.Age)
      .ThenByDescending(x => x.FirstName)
      .ThenBy(x => x.LastName)
      .Select(x => x.LastName)

Top tip: don't use multiple "orderby" clauses consecutively. This query is almost certainly not what you want:

// Don't do this!
from x in people 
orderby x.Age
orderby x.FirstName
select x.LastName

That will end up sorting by FirstName and then Age, and doing so rather slowly as it has to sort twice.
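To see the difference, here's a sketch with three made-up people, run against LINQ to Objects. The Person type here is a cut-down stand-in, not anything from the earlier examples.

```csharp
using System;
using System.Linq;

// Hypothetical sample type
public sealed class Person
{
    public string FirstName { get; set; }
    public int Age { get; set; }
}

public static class OrderingDemo
{
    public static readonly Person[] People =
    {
        new Person { FirstName = "Zoe", Age = 20 },
        new Person { FirstName = "Adam", Age = 30 },
        new Person { FirstName = "Bob", Age = 20 }
    };

    public static string[] SingleClause()
    {
        // People.OrderBy(x => x.Age).ThenBy(x => x.FirstName).Select(...)
        return (from x in People
                orderby x.Age, x.FirstName
                select x.FirstName).ToArray();
    }

    public static string[] TwoClauses()
    {
        // People.OrderBy(x => x.Age).OrderBy(x => x.FirstName).Select(...)
        return (from x in People
                orderby x.Age
                orderby x.FirstName
                select x.FirstName).ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(", ", SingleClause()));   // Bob, Zoe, Adam
        Console.WriteLine(string.Join(", ", TwoClauses()));     // Adam, Bob, Zoe
    }
}
```

With two clauses, the second OrderBy re-sorts everything by FirstName; Age only survives as a tie-breaker because the LINQ to Objects sort is stable.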

The "group by" clause

Grouping is another alternative to "select" as the final clause in a query. There are two expressions involved: the element selector (what you want to get in each group) and the key selector (how you want the groups to be organized). Unsurprisingly, this uses the GroupBy operator. So you might have a query to group people in families by their last name, with each group containing the first names of the family members:

// Query expression
from x in people 
group x.FirstName by x.LastName

// Translation
people.GroupBy(x => x.LastName, x => x.FirstName)

If the element selector is trivial, it isn't specified as part of the translation:

// Query expression
from x in people 
group x by x.LastName

// Translation
people.GroupBy(x => x.LastName)
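For a runnable version of the first grouping query, here's a sketch using LINQ to Objects. The Person type and the names are invented for the example.

```csharp
using System;
using System.Linq;

// Hypothetical sample type
public sealed class Person
{
    public string FirstName { get; set; }
    public string LastName { get; set; }
}

public static class GroupingDemo
{
    public static string[] Families(Person[] people)
    {
        // people.GroupBy(x => x.LastName, x => x.FirstName)
        var groups = from x in people
                     group x.FirstName by x.LastName;

        // Each group has a Key (the last name) and is itself a sequence
        // of the selected elements (the first names).
        return groups.Select(g => g.Key + ": " + string.Join(", ", g)).ToArray();
    }

    public static void Main()
    {
        var people = new[]
        {
            new Person { FirstName = "Jon", LastName = "Skeet" },
            new Person { FirstName = "Anna", LastName = "Jones" },
            new Person { FirstName = "Holly", LastName = "Skeet" }
        };
        foreach (var line in Families(people))
        {
            Console.WriteLine(line);   // Skeet: Jon, Holly   then   Jones: Anna
        }
    }
}
```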

Query continuations

Both "select" and "group by" can be followed by "into identifier". This is known as a query continuation, and it's really simple. Its translation in the specification isn't in terms of a method call, but instead it transforms one query expression into another, effectively nesting one query as the source of another. I find that translation tricky to think about, personally... I prefer to think of it as using a temporary variable, like this:

// Original query
var query = from x in people
            select x.Name into y
            orderby y.Length
            select y[0];

// Query continuation translation
var tmp = from x in people
          select x.Name;

var query = from y in tmp
            orderby y.Length
            select y[0];

// Final translation into methods
var query = people.Select(x => x.Name)
                  .OrderBy(y => y.Length)
                  .Select(y => y[0]);

Obviously that final translation could have been expressed in terms of two statements as well... they'd be equivalent. This is why it's important that LINQ uses deferred execution - you can split up a query as much as you like, and it won't alter the execution flow. The query wouldn't actually execute when the value is assigned to "tmp" - it's just preparing the query for execution.
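A small sketch of that deferred execution, assuming LINQ to Objects: an item added to the source after both queries have been built still shows up in the results, because nothing runs until the final query is enumerated.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    public static int CountAfterLateAdd()
    {
        var people = new List<string> { "Jon" };

        // Nothing is executed by either of these assignments
        var tmp = from x in people
                  select x.Length;
        var query = from y in tmp
                    where y > 2
                    select y;

        people.Add("Holly");   // added after both queries were built

        // Count() finally pulls data through the whole pipeline
        return query.Count();
    }

    public static void Main()
    {
        Console.WriteLine(CountAfterLateAdd());   // 2
    }
}
```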

Transparent identifiers and the "let" clause

The rest of the query expression clauses all introduce an extra range variable in some form or other. This is the part of query expression translation which is hardest to understand, because it affects how any usage of the range variable in the query expression is translated.

We'll start with probably the simplest of the remaining clauses: the "let" clause. This simply introduces a new range variable based upon a projection. It's a bit like a "select", but after a "let" clause both the original range variable and the new one are in scope for the rest of the query. Let clauses are typically used to avoid redundant computations, or simply to make the code simpler to read. For example, suppose computing an employee's tax is a complicated operation, and we want to display a list of employees and the tax they pay, with the highest taxpayers first:

from x in employees
let tax = x.ComputeTax()
orderby tax descending
select x.LastName + ": " + tax

That's pretty readable, and we've managed to avoid computing the tax twice (once for sorting and once for display).

The problem is, both "x" and "tax" are in scope at the same time... so what are we going to pass to the Select method at the end? We need one entity to pass through our query, which knows the value of both "x" and "tax" at any point (after the "let" clause, obviously). This is precisely the point of a transparent identifier. You can think of the above query as being translated into this:

// Translation from "let" clause to another query expression
from x in employees
select new { x, tax = x.ComputeTax() } into z
orderby z.tax descending
select z.x.LastName + ": " + z.tax

// Final translated query
employees.Select(x => new { x, tax = x.ComputeTax() })
         .OrderByDescending(z => z.tax)
         .Select(z => z.x.LastName + ": " + z.tax)

Here "z" is the transparent identifier - which I've made somewhat more opaque by giving it a name. In the specification, the query translations are performed in terms of "*" - which clearly isn't a valid identifier, but which stands in for the transparent one.
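The following sketch runs both forms side by side against LINQ to Objects and counts the calls to ComputeTax. The Employee type, the flat "one fifth of salary" tax rule and the TaxCalls counter are all invented for the example.

```csharp
using System;
using System.Linq;

// Hypothetical employee type; the tax rule is made up.
public sealed class Employee
{
    public string LastName { get; set; }
    public int Salary { get; set; }
    public int TaxCalls { get; private set; }

    public int ComputeTax()
    {
        TaxCalls++;          // count calls so we can see how often this runs
        return Salary / 5;
    }
}

public static class LetDemo
{
    public static string[] WithLet(Employee[] employees)
    {
        return (from x in employees
                let tax = x.ComputeTax()
                orderby tax descending
                select x.LastName + ": " + tax).ToArray();
    }

    // The same query written by hand, with the transparent identifier named "z"
    public static string[] ByHand(Employee[] employees)
    {
        return employees.Select(x => new { x, tax = x.ComputeTax() })
                        .OrderByDescending(z => z.tax)
                        .Select(z => z.x.LastName + ": " + z.tax)
                        .ToArray();
    }

    public static void Main()
    {
        var employees = new[]
        {
            new Employee { LastName = "Jones", Salary = 1000 },
            new Employee { LastName = "Smith", Salary = 2000 }
        };
        Console.WriteLine(string.Join("; ", WithLet(employees)));   // Smith: 400; Jones: 200
        Console.WriteLine(string.Join("; ", ByHand(employees)));    // Smith: 400; Jones: 200
        Console.WriteLine(employees[0].TaxCalls);                   // 2 (once per query run)
    }
}
```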

The good news about transparent identifiers is that most of the time you don't need to think of them at all. They simply let you have multiple range variables in scope at the same time. I find myself only bothering to think about them explicitly when I'm trying to work out the full translation of a query expression which uses them. It's worth knowing about them to avoid being stumped by the concept of (say) a select clause being able to use multiple range variables, but that's all.

Now that we've got the basic concept, we can move onto the final few clauses.

Secondary "from" clauses

We've seen that the introductory "from" clause isn't actually translated into a method call, but any subsequent ones are. The syntax is still the same, but the translation uses SelectMany. In many cases this is used just like a cross-join (Cartesian product) but it's more flexible than that, as the "inner" sequence introduced by the secondary "from" clause can depend on the current value from the "outer" sequence. Here's an example of that, with the call to SelectMany in the translation:

// Query expression
from parent in adults
from child in parent.Children
where child.Gender == Gender.Male
select child.Name + " is a son of " + parent.Name

// Translation (using z for the transparent identifier)
adults.SelectMany(parent => parent.Children,
                  (parent, child) => new { parent, child })
      .Where(z => z.child.Gender == Gender.Male)
      .Select(z => z.child.Name + " is a son of " + z.parent.Name)

Again we can see the effect of the transparent identifier - an anonymous type is introduced to propagate the { parent, child } tuple through the rest of the query.

There's a special case, however - if "the rest of the query" is just a "select" clause, we don't need the anonymous type. We can just apply the projection directly in the SelectMany call. Here's a similar example, but this time without the "where" clause:

// Query expression
from parent in adults
from child in parent.Children
select child.Name + " is a child of " + parent.Name

// Translation (no transparent identifier needed)
adults.SelectMany(parent => parent.Children,
                  (parent, child) => child.Name + " is a child of " + parent.Name)

This same trick is used in GroupJoin and Join, but I won't go into the details there. It's simpler to just provide examples which use the shortcut, instead of including unnecessary extra clauses just to force the transparent identifier to appear in the translation.

Note that just like the introductory "from" clause, you can specify a type for the range variable, which forces a call to "Cast<>".
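As a runnable sketch of the transparent identifier version, here are the query expression and a hand-written SelectMany call producing the same results with LINQ to Objects. The Adult type, the names and the "two-letter name" filter are made up; the filter is only there to force the anonymous type to appear.

```csharp
using System;
using System.Linq;

// Hypothetical sample type
public sealed class Adult
{
    public string Name { get; set; }
    public string[] Children { get; set; }
}

public static class SelectManyDemo
{
    public static readonly Adult[] Adults =
    {
        new Adult { Name = "Ann", Children = new[] { "Al", "Bea" } },
        new Adult { Name = "Bob", Children = new[] { "Cy" } }
    };

    public static string[] ViaQueryExpression()
    {
        // The inner sequence (parent.Children) depends on the current parent
        return (from parent in Adults
                from child in parent.Children
                where child.Length == 2
                select child + " is a child of " + parent.Name).ToArray();
    }

    public static string[] ByHand()
    {
        // "z" plays the role of the transparent identifier
        return Adults.SelectMany(parent => parent.Children,
                                 (parent, child) => new { parent, child })
                     .Where(z => z.child.Length == 2)
                     .Select(z => z.child + " is a child of " + z.parent.Name)
                     .ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("; ", ViaQueryExpression()));
        // Al is a child of Ann; Cy is a child of Bob
        Console.WriteLine(ViaQueryExpression().SequenceEqual(ByHand()));   // True
    }
}
```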

Simple "join" clauses (no "into")

A "join" clause without an "into" part corresponds to a call to the Join method, which represents an inner equijoin. In some ways this is like an extra "from" clause with a "where" clause to provide the relevant filtering, but there's a significant difference: while the "from" clause (and SelectMany) allow you to project each element in the outer sequence to an inner sequence, in Join you merely provide the inner sequence directly, once. You also have to specify the two key selectors - one for the outer sequence, and one for the inner sequence. The general syntax is:

join identifier in inner-sequence on outer-key-selector equals inner-key-selector

The identifier names the extra range variable introduced. Here's an example including the translation:

// Query expression
from customer in customers
join order in orders on customer.Id equals order.CustomerId
select customer.Name + ": " + order.Price

// Translation
customers.Join(orders,
               customer => customer.Id,
               order => order.CustomerId,
               (customer, order) => customer.Name + ": " + order.Price)

Note how if you put the key selectors the wrong way round, it's highly unlikely that the result will compile - the lambda expression for the outer sequence doesn't "know about" the inner sequence element, and vice versa. The C# compiler is even nice enough to guess the probable cause, and suggest the fix.
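Here's a sketch of the same join running against LINQ to Objects with some made-up data. It also shows the "inner" part of "inner equijoin": a customer with no matching orders produces no results at all.

```csharp
using System;
using System.Linq;

public static class JoinDemo
{
    public static string[] OrderLines()
    {
        // Hypothetical sample data
        var customers = new[]
        {
            new { Id = 1, Name = "Ann" },
            new { Id = 2, Name = "Bob" }   // Bob has no orders
        };
        var orders = new[]
        {
            new { CustomerId = 1, Price = 10 },
            new { CustomerId = 1, Price = 25 }
        };

        // customers.Join(orders, customer => customer.Id, order => order.CustomerId, ...)
        var query = from customer in customers
                    join order in orders on customer.Id equals order.CustomerId
                    select customer.Name + ": " + order.Price;
        return query.ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("; ", OrderLines()));   // Ann: 10; Ann: 25
    }
}
```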

Group joins - "join ... into"

Group joins look exactly the same as inner joins, except they have an extra "into identifier" part at the end. Again, this introduces an extra range variable - but it's the identifier after the "into" which ends up in scope, not the one after "join"; that one is only used in the key selector. This is easier to see when we look at a sample translation:

// Query expression
from customer in customers
join order in orders on customer.Id equals order.CustomerId into customerOrders
select customer.Name + ": " + customerOrders.Count()

// Translation
customers.GroupJoin(orders,
                    customer => customer.Id,
                    order => order.CustomerId,
                    (customer, customerOrders) => customer.Name + ": " + customerOrders.Count())

If we had tried to refer to "order" in the select clause, the result would have been an error: it's not in scope any more. Note that this is not a query continuation, unlike "select ... into" and "group ... into". It introduces a new range variable, but all the previous range variables are still in scope.
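A sketch of the group join against LINQ to Objects, again with invented data. Unlike the inner join, every customer produces exactly one result, even when the group of matching orders is empty.

```csharp
using System;
using System.Linq;

public static class GroupJoinDemo
{
    public static string[] OrderCounts()
    {
        // Hypothetical sample data
        var customers = new[]
        {
            new { Id = 1, Name = "Ann" },
            new { Id = 2, Name = "Bob" }   // Bob has no orders
        };
        var orders = new[]
        {
            new { CustomerId = 1, Price = 10 },
            new { CustomerId = 1, Price = 25 }
        };

        // customers.GroupJoin(orders, customer => customer.Id, order => order.CustomerId, ...)
        var query = from customer in customers
                    join order in orders on customer.Id equals order.CustomerId into customerOrders
                    select customer.Name + ": " + customerOrders.Count();
        return query.ToArray();
    }

    public static void Main()
    {
        Console.WriteLine(string.Join("; ", OrderCounts()));   // Ann: 2; Bob: 0
    }
}
```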

That's it! That's all the translations that the C# compiler supports. VB's query expressions are rather richer - but I suspect that's at least partly because it's more painful to write the "dot notation" syntax in VB, as the lambda expression syntax isn't as nice as C#'s.

Translation cheat sheet

I thought it would be useful to produce a short table of the kinds of clauses supported in query expressions, with the translation used by the C# compiler. The translation is given assuming a single range variable named "x" is in scope. I haven't given the alternative options where transparent identifiers are introduced - this table isn't meant to be a replacement for all the information above! (Likewise this doesn't mention the optimizations for degenerate query expressions or "identity projection" groupings.)

| Query expression clause | Translation |
|---|---|
| First "from [type] x in sequence" | Just "sequence" or "sequence.Cast<type>()", but with the introduction of a range variable |
| Subsequent "from" clauses: "from [type] y in projection" | SelectMany(x => projection, (x, y) => new { x, y }) or SelectMany(x => projection.Cast<type>(), (x, y) => new { x, y }) |
| where predicate | Where(x => predicate) |
| select projection | Select(x => projection) |
| let y = projection | Select(x => new { x, y = projection }) |
| orderby o1, o2 ascending, o3 descending (Each ordering may have descending or ascending specified explicitly; the default is ascending) | OrderBy(x => o1).ThenBy(x => o2).ThenByDescending(x => o3) |
| group projection by key-selector | GroupBy(x => key-selector, x => projection) |
| join y in inner-sequence on outer-key-selector equals inner-key-selector | Join(inner-sequence, x => outer-key-selector, y => inner-key-selector, (x, y) => new { x, y }) |
| join y in inner-sequence on outer-key-selector equals inner-key-selector into z | GroupJoin(inner-sequence, x => outer-key-selector, y => inner-key-selector, (x, z) => new { x, z }) |
| query1 into y query2 | (Translation in terms of a new query expression) from y in (query1) query2 |

Conclusion

Hopefully that's made a certain amount of sense out of a fairly complicated topic. I find it's one of those "aha!" things - at some point it clicks, and then seems reasonably simple (aside from transparent identifiers, perhaps). Until that time, query expressions can be a bit magical.

As an aside, I have a sneaking suspicion that one of my first blog posts consisted of my initial impressions of LINQ, written in a plane on the way to the MVP conference in Seattle in September 2005. I would check, but I'm finishing this post in another plane, this time on the way to San Francisco. I think I'd have been somewhat surprised to be told in 2005 that I'd still be writing blog posts about LINQ over five years later. Mind you, I can think of any number of things which have happened in the intervening years which would have astonished me to about the same degree.

Next time: some more thoughts on optimization. Oh, and I'm likely to update my wishlist of extra operators as well, but within the existing post.

Published Fri, Jan 28 2011 6:22 by skeet

Comments

# re: Reimplementing LINQ to Objects: Part 41 - How query expressions work

Nicely done.  I learned a few things :)

Monday, January 31, 2011 4:27 AM by Mark

# re: Reimplementing LINQ to Objects: Part 41 - How query expressions work

It is a pity that there is no full join available in the framework. I sometimes need it to diff collections.

Monday, January 31, 2011 6:02 AM by tobi

# re: Reimplementing LINQ to Objects: Part 41 - How query expressions work

A small question. Under Query continuations you've written `The query would actually execute when the value is assigned to "tmp" - it's just preparing the query for execution'.

Is this a typo? Either way, I found it a little confusing - what exactly is the "value"?

Aside from that, thanks for an interesting review of this feature.

Monday, January 31, 2011 6:07 AM by Kobi

# re: Reimplementing LINQ to Objects: Part 41 - How query expressions work

@Kobi: Sorry, yes, that should have said "wouldn't actually execute" - fixed now.

Do you mean what's the value of having query continuations? I don't use them very often myself, but occasionally they can make a query simpler - if after the first half of a query you've got a bunch of range variables in scope, but you only really need one projection from them.

Monday, January 31, 2011 8:54 AM by skeet