Reimplementing LINQ to Objects: Part 33 - Cast and OfType

More design decisions around optimization today, but possibly less controversial ones...

What are they?

Cast and OfType are somewhat unusual LINQ operators. They are extension methods, but they work on the non-generic IEnumerable type instead of the generic IEnumerable<T> type:

public static IEnumerable<TResult> Cast<TResult>(this IEnumerable source)
        
public static IEnumerable<TResult> OfType<TResult>(this IEnumerable source)

It's worth mentioning what Cast and OfType are used for to start with. There are two main purposes:

  • Using a non-generic collection (such as the Rows collection of a DataTable, or an ArrayList) within a LINQ query (DataTable has the AsEnumerable extension method too)
  • Changing the type of a generic collection, usually to use a more specific type (e.g. you have a List<Person> but you're confident they're all actually Employee instances - or you only want to query against the Employee instances)

I can't say that I use either operator terribly often, but if you're starting off from a non-generic collection for whatever reason, these two are your only easy way to get "into" the LINQ world.

Here's a quick rundown of the behaviour they have in common:

  • The source parameter must not be null, and this is validated eagerly
  • They use deferred execution: the input sequence is not read until the output sequence is
  • They stream their data - you can use them on arbitrarily long sequences and the extra memory required will be constant (and small :)

Both operators effectively try to convert each element of the input sequence to the result type (TResult). When they're successful, the results are equivalent (ignoring optimizations, which I'll come to later). The operators differ in how they handle elements which aren't of the result type.

Cast simply tries to cast each element to the result type. If the cast fails, it will throw an InvalidCastException in the normal way. OfType, however, sees whether each element is a value of the result type first - and ignores it if it's not.

There's one important case to consider where Cast will successfully return a value and OfType will ignore it: null references (with a nullable return type). In normal code, you can cast a null reference to any nullable type (whether that's a reference type or a nullable value type). However, if you use the "is" C# operator with a null value, it will always return false. Cast and OfType follow the same rules, basically.
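To make the difference concrete, here's a small sketch run against the framework's own Cast and OfType. The class and method names (CastVersusOfTypeDemo and friends) are made up for illustration; only the behaviour described above is being exercised.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class CastVersusOfTypeDemo
{
    // Hypothetical helper: a non-generic source holding a string,
    // a boxed int and a null reference.
    public static ArrayList CreateMixedSource()
    {
        return new ArrayList { "text", 42, null };
    }

    // OfType skips both the boxed int and the null reference.
    public static List<string> OnlyStrings()
    {
        return CreateMixedSource().OfType<string>().ToList();
    }

    // Cast yields "text", then throws when it reaches the boxed int.
    public static bool CastThrowsOnBoxedInt()
    {
        try
        {
            CreateMixedSource().Cast<string>().ToList();
            return false;
        }
        catch (InvalidCastException)
        {
            return true;
        }
    }

    // With no mismatched elements, Cast keeps the null that OfType would drop.
    public static int CountAfterCastWithNull()
    {
        return new ArrayList { "text", null }.Cast<string>().Count();
    }
}
```

OnlyStrings() should contain just "text", CastThrowsOnBoxedInt() should report true, and CountAfterCastWithNull() should be 2 because the null reference survives the cast.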

It's worth noting that (as of .NET 3.5 SP1) Cast and OfType only perform reference and unboxing conversions. They won't convert a boxed int to a long, or execute user-defined conversions. Basically they follow the same rules as converting from object to a generic type parameter. (That's very convenient for the implementation!) In the original implementation of .NET 3.5, I believe some other conversions were supported (in particular, I believe that the boxed int to long conversion would have worked). I haven't even attempted to replicate the pre-SP1 behaviour. You can read more details in Ed Maurer's blog post from 2008.
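As a rough illustration of the "reference and unboxing conversions only" rule (post-SP1 behaviour), the sketch below boxes an int and asks for it back as an int and as a long. ConversionRulesDemo is a hypothetical name, not part of any library.

```csharp
using System;
using System.Collections;
using System.Linq;

public static class ConversionRulesDemo
{
    // Unboxing to the exact boxed type works.
    public static int UnboxAsInt()
    {
        IEnumerable boxed = new object[] { 10 };
        return boxed.Cast<int>().First();
    }

    // A boxed int can't be unboxed as a long, even though
    // an int-to-long conversion is fine in ordinary C# code.
    public static bool UnboxAsLongThrows()
    {
        IEnumerable boxed = new object[] { 10 };
        try
        {
            boxed.Cast<long>().First();
            return false;
        }
        catch (InvalidCastException)
        {
            return true;
        }
    }

    // OfType applies the same rule, but quietly skips the element instead.
    public static int CountOfLongs()
    {
        IEnumerable boxed = new object[] { 10 };
        return boxed.OfType<long>().Count();
    }
}
```

The source is deliberately an object[] rather than an int[], so that Cast has to go through the element-by-element path instead of returning the array itself.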

There's one final aspect to discuss: optimization. If "source" already implements IEnumerable<TResult>, the Cast operator just returns the parameter directly, within the original method call. (In other words, this behaviour isn't deferred.) Basically we know that every cast will succeed, so there's no harm in returning the input sequence. This means you shouldn't use Cast as an "isolation" call to protect your original data source, in the same way as we sometimes use Select with an identity projection. See Eric Lippert's blog post on degenerate queries for more about protecting the original source of a query.
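Here's a sketch of why that matters, using the framework's Cast and Select. IsolationDemo and its members are illustrative names only; the point is simply that the caller gets the original list back from Cast, but not from an identity projection.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class IsolationDemo
{
    // Cast hands back the very same List<string> instance...
    public static bool CastReturnsSameReference()
    {
        List<string> names = new List<string> { "a", "b" };
        IEnumerable<string> exposed = names.Cast<string>();
        return object.ReferenceEquals(names, exposed);
    }

    // ...so a caller can cast it back to List<string> and change the original.
    public static int CountAfterCallerMutation()
    {
        List<string> names = new List<string> { "a", "b" };
        IEnumerable<string> exposed = names.Cast<string>();
        ((List<string>) exposed).Add("c");
        return names.Count;
    }

    // An identity projection wraps the source in a new sequence object.
    public static bool SelectReturnsSameReference()
    {
        List<string> names = new List<string> { "a", "b" };
        IEnumerable<string> exposed = names.Select(x => x);
        return object.ReferenceEquals(names, exposed);
    }
}
```

CastReturnsSameReference() should be true and CountAfterCallerMutation() should be 3, whereas SelectReturnsSameReference() should be false.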

In the LINQ to Objects implementation, OfType never returns the source directly. It always uses an iterator. Most of the time, it's probably right to do so. Just because something implements IEnumerable<string> doesn't mean everything within it should be returned by OfType... because some elements may be null. The same is true of an IEnumerable<int?> - but not an IEnumerable<int>. For a non-nullable value type T, if source implements IEnumerable<T> then source.OfType<T>() will always contain the exact same sequence of elements as source. It does no more harm to return source from OfType() here than it does from Cast().
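The null-element point can be checked with a few lines against the framework operators. This is only a sketch; OfTypeNullDemo is a hypothetical name.

```csharp
using System.Collections;
using System.Linq;

public static class OfTypeNullDemo
{
    // The source really is an IEnumerable<string>, but OfType still
    // has to drop the null element, so it can't just return the source.
    public static int StringCountViaOfType()
    {
        IEnumerable source = new string[] { "a", null, "b" };
        return source.OfType<string>().Count();
    }

    // Cast can return the array as-is, null element included.
    public static int StringCountViaCast()
    {
        IEnumerable source = new string[] { "a", null, "b" };
        return source.Cast<string>().Count();
    }

    // The same applies to a sequence of nullable value types.
    public static int NullableCountViaOfType()
    {
        IEnumerable source = new int?[] { 1, null, 3 };
        return source.OfType<int?>().Count();
    }
}
```

The two OfType counts should be 2, while the Cast count should be 3.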

What are we going to test?

There are "obvious" tests for deferred execution and eager argument validation. Beyond that, I effectively have two types of test: ones which focus on whether the call returns the original argument, and ones which test the behaviour of iterating over the results (including whether or not an exception is thrown).

The iteration tests are generally not that interesting - in particular, they're similar to tests we've got everywhere else. The "identity" tests are more interesting, because they show some differences between conversions that are allowed by the CLR and those allowed by C#. It's obvious that an array of strings is going to be convertible to IEnumerable<string>, but a test like this might give you more pause for thought:

[Test]
public void OriginalSourceReturnedForInt32ArrayToUInt32SequenceConversion()
{
    IEnumerable enums = new int[10];
    Assert.AreSame(enums, enums.Cast<uint>());
}

That's trying to "cast" an int[] to an IEnumerable<uint>. If you try the same in normal C# code, it will fail - although if you cast it to "object" first (to distract the compiler, as it were) it's fine at both compile time and execution time:

int[] ints = new int[10];
// Fails at compile time with CS0030 if uncommented
// IEnumerable<uint> uints = (IEnumerable<uint>) ints;

// Compiles, and succeeds at execution time
IEnumerable<uint> uints = (IEnumerable<uint>)(object) ints;

We can have a bit more fun at the compiler's expense, and note its arrogance:

int[] ints = new int[10];
        
if (ints is IEnumerable<uint>)
{
    Console.WriteLine("This won't be printed");
}
if (((object) ints) is IEnumerable<uint>)
{
    Console.WriteLine("This will be printed");
}

This generates warning CS0184 for the first block ("The given expression is never of the provided (...) type"), and the compiler has the cheek to remove the block entirely... despite the fact that it would have worked if only it had been emitted as code.

Now, I'm not really trying to have a dig at the C# team here - the compiler is actually acting entirely reasonably within the rules of C#. It's just that the CLR has subtly different rules around conversions - so when the compiler makes a prediction about what would happen with a particular cast or "is" test, it can be wrong. I don't think this has ever bitten me as an issue, but it's quite fun to watch. As well as this signed/unsigned difference, there are similar conversions between arrays of enums and their underlying types.

There's another type of conversion which is interesting:

[Test]
public void OriginalSourceReturnedDueToGenericCovariance()
{
    IEnumerable strings = new List<string>();
    Assert.AreSame(strings, strings.Cast<object>());
}

This takes advantage of the generic variance introduced in .NET 4 - sort of. There is now a reference conversion from List<string> to IEnumerable<object> which wouldn't have worked in .NET 3.5. However, this isn't due to the fact that C# 4 now knows about variance; the compiler isn't verifying the conversion here, after all. Nor is it due to a new feature in CLRv4 - generic variance for interfaces and delegates has been present since generics were introduced in CLRv2. It's only due to the change in the IEnumerable<T> type, which has become IEnumerable<out T> in .NET 4. If you could make the same change to the standard library used in .NET 3.5, I believe the test above would pass. (It's possible that the precise CLR rules for variance changed between CLRv2 and CLRv4 - I don't think this variance was widely used before .NET 4, so the risk of it being a problematically-breaking change would have been slim.)

In addition to all these functional tests, I've included a couple of tests to show that the compiler uses Cast in query expressions if you give a range variable an explicit type. This works for both "from" and "join":

[Test]
public void CastWithFrom()
{
    IEnumerable strings = new[] { "first", "second", "third" };
    var query = from string x in strings
                select x;
    query.AssertSequenceEqual("first", "second", "third");
}

[Test]
public void CastWithJoin()
{
    var ints = Enumerable.Range(0, 10);
    IEnumerable strings = new[] { "first", "second", "third" };
    var query = from x in ints
                join string y in strings on x equals y.Length
                select x + ":" + y;
    query.AssertSequenceEqual("5:first", "5:third", "6:second");
}

Note how the compile-time type of "strings" is just IEnumerable in both cases. We couldn't use this in a query expression normally, because LINQ requires generic sequences - but by giving the range variables explicit types, the compiler has inserted a call to Cast which makes the rest of the translation work.
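For reference, here's roughly what the two queries above look like once the query expression translation in the C# specification has been applied by hand. QueryTranslationDemo is an illustrative name, and the lambda parameter names simply mirror the range variables.

```csharp
using System.Collections;
using System.Collections.Generic;
using System.Linq;

public static class QueryTranslationDemo
{
    // from string x in strings select x
    public static List<string> FromWithExplicitType()
    {
        IEnumerable strings = new[] { "first", "second", "third" };
        return strings.Cast<string>()
                      .Select(x => x)
                      .ToList();
    }

    // from x in ints
    // join string y in strings on x equals y.Length
    // select x + ":" + y
    public static List<string> JoinWithExplicitType()
    {
        IEnumerable<int> ints = Enumerable.Range(0, 10);
        IEnumerable strings = new[] { "first", "second", "third" };
        return ints.Join(strings.Cast<string>(),
                         x => x,
                         y => y.Length,
                         (x, y) => x + ":" + y)
                   .ToList();
    }
}
```

Both methods should produce the same sequences as the tests above: "first", "second", "third" for the first, and "5:first", "5:third", "6:second" for the second.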

Let's implement them!

The "eager argument validation, deferred sequence reading" mode of Cast and OfType means we'll use the familiar approach of a non-iterator-block public method which finally calls an iterator block if it gets that far. This time, however, the optimization occurs within the public method. Here's Cast, to start with:

public static IEnumerable<TResult> Cast<TResult>(this IEnumerable source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    IEnumerable<TResult> existingSequence = source as IEnumerable<TResult>;
    if (existingSequence != null)
    {
        return existingSequence;
    }
    return CastImpl<TResult>(source);
}

private static IEnumerable<TResult> CastImpl<TResult>(IEnumerable source)
{
    foreach (object item in source)
    {
        yield return (TResult) item;
    }
}

We're using the normal as/null-test to check whether we can just return the source directly, and in the loop we're casting. We could have made the iterator block very slightly shorter here, using the behaviour of foreach to our advantage:

foreach (TResult item in source)
{
    yield return item;
}

Yikes! Where's the cast gone? How can this possibly work? Well, the cast is still there - it's just been inserted automatically by the compiler. It's the invisible cast that was present in almost every foreach loop in C# 1. The fact that it is invisible is the reason I've chosen the previous version. The point of the method is to cast each element - so it's pretty important to make the cast as obvious as possible.
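The invisible cast is easy to observe outside LINQ too. In this sketch (HiddenCastDemo is a made-up name) there's no cast anywhere in the source code, yet iterating over a non-string element still fails with an InvalidCastException.

```csharp
using System;
using System.Collections;

public static class HiddenCastDemo
{
    // The loop variable's explicit type makes the compiler
    // emit a (string) cast for every element.
    public static int TotalLength(IEnumerable source)
    {
        int total = 0;
        foreach (string text in source)
        {
            total += text.Length;
        }
        return total;
    }

    // The boxed int can't be cast to string, so the loop throws.
    public static bool ThrowsOnNonString()
    {
        try
        {
            TotalLength(new ArrayList { "ab", 5 });
            return false;
        }
        catch (InvalidCastException)
        {
            return true;
        }
    }
}
```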

So that's Cast. Now for OfType. First let's look at the public entry point:

public static IEnumerable<TResult> OfType<TResult>(this IEnumerable source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (default(TResult) != null)
    {
        IEnumerable<TResult> existingSequence = source as IEnumerable<TResult>;
        if (existingSequence != null)
        {
            return existingSequence;
        }
    }
    return OfTypeImpl<TResult>(source);
}

This is almost the same as Cast, but with the additional test of "default(TResult) != null" before we check whether the input sequence is an IEnumerable<TResult>. That's a simple way of asking, "Is this a non-nullable value type?" I don't know for sure, but I'd hope that when the JIT compiler looks at this method, it can completely wipe out the test, either removing the body of the if statement completely for nullable value types and reference types, or executing the body unconditionally for non-nullable value types. It really doesn't matter if the JIT doesn't do this, but one day I may get up the courage to tackle this with cordbg and find out for sure... but not tonight.
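If the default(TResult) test looks too magical, it can be pulled out into a tiny helper and tried with a few type arguments. TypeChecks and IsNonNullableValueType are hypothetical names used only for this sketch.

```csharp
public static class TypeChecks
{
    // default(T) is null for reference types and for Nullable<T>,
    // and a non-null value for every other value type.
    public static bool IsNonNullableValueType<T>()
    {
        return default(T) != null;
    }
}
```

With this in place, IsNonNullableValueType<int>() is true, while IsNonNullableValueType<int?>() and IsNonNullableValueType<string>() are both false.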

Once we've decided we've got to iterate over the results ourselves, the iterator block method is quite simple:

private static IEnumerable<TResult> OfTypeImpl<TResult>(IEnumerable source)
{
    foreach (object item in source)
    {
        if (item is TResult)
        {
            yield return (TResult) item;
        }
    }
}

Note that we can't use the "as and check for null" test here, because we don't know that TResult is a nullable type. I was tempted to try to write two versions of this code - one for reference types and one for value types. (I've found before that using "as and check for null" is really slow for nullable value types. That may change, of course.) However, that would be quite tricky and I'm not convinced it would have much impact. I did a quick test yesterday testing whether an "object" was actually a "string", and the is+cast approach seemed just as good. I suspect that may be because string is a sealed class, however... testing for an interface or a non-sealed class may be more expensive. Either way, it would be premature to write a complicated optimization without testing first.
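For comparison, this is a sketch of what the "as and check for null" variant might look like if it were restricted to reference types. It isn't part of the implementation above: the name OfTypeForClasses and the "class" constraint are assumptions made purely so that the "as" operator compiles.

```csharp
using System.Collections;
using System.Collections.Generic;

public static class ReferenceTypeOfType
{
    // Hypothetical reference-type-only variant. The "class" constraint
    // is what allows "as" to be used with TResult.
    public static IEnumerable<TResult> OfTypeForClasses<TResult>(IEnumerable source)
        where TResult : class
    {
        foreach (object item in source)
        {
            TResult converted = item as TResult;
            if (converted != null)
            {
                yield return converted;
            }
        }
    }
}
```

Called with string as the type argument on a source containing "a", a boxed int, a null reference and "b", this should yield just "a" and "b". Whether it would actually be faster than is+cast is exactly the kind of thing that would need measuring first.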

Conclusion

It's not clear to me why Microsoft optimizes Cast but not OfType. There's a possibility that I've missed a reason why OfType shouldn't be optimized even for a sequence of non-nullable value type values - if you can think of one, please point it out in the comments. My immediate objection would be that it "reveals" the source of the query... but as we've seen, Cast already does that sometimes, so I don't think that theory holds.

Other than that decision, the rest of the implementation of these operators has been pretty plain sailing. It did give us a quick glimpse into the difference between the conversions that the CLR allows and the ones that the C# specification allows though, and that's always fun.

Next up - SequenceEqual.

Published Thu, Jan 13 2011 19:32 by skeet

Comments

# re: Reimplementing LINQ to Objects: Part 33 - Cast and OfType

A third use-case for Cast (and one that I've actually used) is to up-cast (e.g. to pass an IEnumerable<String> to a method expecting IEnumerable<object>). Useful before IEnumerable was made covariant!

Thursday, January 13, 2011 4:37 PM by Ben Lings