Reimplementing LINQ to Objects: Part 31 - ElementAt / ElementAtOrDefault

A nice easy pair of operators tonight. I should possibly have covered them at the same time as First/Last/Single and the OrDefault variants, but never mind...

What are they?

ElementAt and ElementAtOrDefault have a single overload each:

public static TSource ElementAt<TSource>(
    this IEnumerable<TSource> source,
    int index)

public static TSource ElementAtOrDefault<TSource>(
    this IEnumerable<TSource> source,
    int index)

Isn't that blissfully simple after the overload storm of the past few days?

The two operators work in very similar ways:

  • They use immediate execution.
  • The source parameter must not be null, and this is validated immediately.
  • They return the element at the specified zero-based index, if it's in the range 0 <= index < count.

The methods differ only in their handling of an index which falls outside that range. ElementAt will throw an ArgumentOutOfRangeException; ElementAtOrDefault will return the default value for TSource (e.g. 0, null, false). This is true even if index is negative. You might have expected some way to specify the default value to return if the index is out of bounds, but there isn't one. (This is consistent with FirstOrDefault() and so on, but not with Nullable<T>.GetValueOrDefault().)
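For example, given a three-element array, the two operators agree for any valid index and diverge only when the index is out of range:

```csharp
using System;
using System.Linq;

class ElementAtDemo
{
    static void Main()
    {
        int[] numbers = { 10, 20, 30 };

        // Valid index: both operators return the same element.
        Console.WriteLine(numbers.ElementAt(1));           // 20
        Console.WriteLine(numbers.ElementAtOrDefault(1));  // 20

        // Out of range (too big or negative): ElementAtOrDefault
        // returns default(TSource) - here 0 for int.
        Console.WriteLine(numbers.ElementAtOrDefault(5));  // 0
        Console.WriteLine(numbers.ElementAtOrDefault(-1)); // 0

        // ElementAt throws instead.
        try
        {
            numbers.ElementAt(5);
        }
        catch (ArgumentOutOfRangeException)
        {
            Console.WriteLine("ElementAt(5) threw");
        }
    }
}
```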

This behaviour leaves us some room for common code - for once I haven't used cut and paste for the implementation. Anyway, I'm getting ahead of myself.

What are we going to test?

As you can imagine, my tests for the two operators are identical except for the expected result in the case of the index being out of range. I've tested the following cases:

  • Null source
  • A negative index
  • An index which is too big on a NonEnumerableCollection
  • An index which is too big on a NonEnumerableList
  • An index which is too big on a lazy sequence (using Enumerable.Range)
  • A valid index in a NonEnumerableList
  • A valid index in a lazy sequence

The "non-enumerable" list and collection are to test that the optimizations we're going to perform are working. In fact, the NonEnumerableCollection test fails on LINQ to Objects - it's only optimized for IList<T>. You'll see what I mean in a minute... and why that might not be a bad thing.
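If you haven't seen these test doubles before, here's a sketch of the idea behind NonEnumerableList (the real class in the Edulinq source may differ in detail): it delegates everything to a plain List<T>, but throws if anyone tries to iterate over it. An operator optimized for IList<T> will use Count and the indexer, and never trip the exception.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Sketch of a NonEnumerableList test double: an IList<T> which bans
// iteration, so any test using it proves the code under test took the
// indexer-based fast path rather than calling GetEnumerator().
public sealed class NonEnumerableList<T> : IList<T>
{
    private readonly List<T> inner;

    public NonEnumerableList(params T[] items) { inner = new List<T>(items); }

    public T this[int index]
    {
        get { return inner[index]; }
        set { inner[index] = value; }
    }
    public int Count { get { return inner.Count; } }
    public bool IsReadOnly { get { return false; } }

    public IEnumerator<T> GetEnumerator()
    {
        throw new NotSupportedException("Iteration is banned in this test double");
    }
    IEnumerator IEnumerable.GetEnumerator() { return GetEnumerator(); }

    // The remaining members just delegate to the inner list.
    public void Add(T item) { inner.Add(item); }
    public void Clear() { inner.Clear(); }
    public bool Contains(T item) { return inner.Contains(item); }
    public void CopyTo(T[] array, int arrayIndex) { inner.CopyTo(array, arrayIndex); }
    public int IndexOf(T item) { return inner.IndexOf(item); }
    public void Insert(int index, T item) { inner.Insert(index, item); }
    public bool Remove(T item) { return inner.Remove(item); }
    public void RemoveAt(int index) { inner.RemoveAt(index); }
}
```

With an IList<T>-optimized ElementAt (including the framework's), new NonEnumerableList<int>(1, 2, 3).ElementAt(1) returns 2 without GetEnumerator ever being called.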

None of the tests are very interesting, to be honest.

Let's implement it!

As I mentioned earlier, I've used some common code for once (although I admit the first implementation used cut and paste). As the only difference between the two methods is the handling of a particular kind of failure, I've used the TryXXX pattern which exists elsewhere in the framework. There's a common method which tries to retrieve the right element as an out parameter, and indicates whether or not it succeeded via the return value. Not every kind of failure is just returned, of course - we want to throw an ArgumentNullException if source is null in either case.

That leaves our public methods looking quite straightforward:

public static TSource ElementAt<TSource>(
    this IEnumerable<TSource> source,
    int index)
{
    TSource ret;
    if (!TryElementAt(source, index, out ret))
    {
        throw new ArgumentOutOfRangeException("index");
    }
    return ret;
}

public static TSource ElementAtOrDefault<TSource>(
    this IEnumerable<TSource> source,
    int index)
{
    TSource ret;
    // We don't care about the return value - ret will be default(TSource) if it's false
    TryElementAt(source, index, out ret);
    return ret;
}

TryElementAt will only return false if the index is out of bounds, so the exception is always appropriate. However, there is a disadvantage to this approach: we can't easily indicate in the exception message whether index was too large or negative. We could have specified a message which included the value of index itself, of course. I think it's a minor matter either way, to be honest.

The main body of the code is in TryElementAt, obviously. It would actually be very simple - just looping and counting up to index, checking as we went - except there are two potential optimizations.

The most obvious - and most profitable - optimization is if the collection implements IList<T>. If it does, we can efficiently obtain the count using the ICollection<T>.Count property (don't forget that IList<T> extends ICollection<T>), check that it's not too big, and then use the indexer from IList<T> to get straight to the right element. Brilliant! That's a clear win.

The less clear optimization is if the collection implements ICollection<T> but not IList<T>, or if it only implements the nongeneric ICollection. In those cases we can still get at the count - but we can't then get directly to the right element. In other words, we can optimize the failure case (possibly hugely), but at a very slight cost - the cost of checking whether the sequence implements either interface - for the success case, where the check won't do us any good.

This is the sort of optimization which is impossible to judge without real data. How often are these operators called with an invalid index? How often does that happen on a collection which implements ICollection<T> but not IList<T> (or implements ICollection)? How large are those collections (so how long would it take to have found our error the normal way)? What's the cost of performing the type check? I don't have the answers to any of these questions. I don't even have strong suspicions. I know that Microsoft doesn't use the same optimization, but I don't know whether that was due to hard data or a gut feeling.

For the moment, I've kept all the optimizations. Here's the code:

private static bool TryElementAt<TSource>(
    IEnumerable<TSource> source,
    int index,
    out TSource element)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    element = default(TSource);
    if (index < 0)
    {
        return false;
    }
    ICollection<TSource> collection = source as ICollection<TSource>;
    if (collection != null)
    {
        int count = collection.Count;
        if (index >= count)
        {
            return false;
        }
        // If it's a list, we know we're okay now - just return directly...
        IList<TSource> list = source as IList<TSource>;
        if (list != null)
        {
            element = list[index];
            return true;
        }
    }

    ICollection nonGenericCollection = source as ICollection;
    if (nonGenericCollection != null)
    {
        int count = nonGenericCollection.Count;
        if (index >= count)
        {
            return false;
        }
    }
    // We don't need to fetch the current value each time - get to the right
    // place first.
    using (IEnumerator<TSource> iterator = source.GetEnumerator())
    {
        // Note use of -1 so that we start off by moving onto element 0.
        // Don't want to use i <= index in case index == int.MaxValue!
        for (int i = -1; i < index; i++)
        {
            if (!iterator.MoveNext())
            {
                return false;
            }
        }
        element = iterator.Current;
        return true;
    }
}

As you can see, the optimized cases actually form the bulk of the code - part of me thinks it would be worth removing the non-IList<T> optimizations just for clarity and brevity.

It's worth looking at the "slow" case where we actually iterate. The for loop looks odd, until you think that to get at element 0, you have to call MoveNext() once. We don't want to just add one to index or use a less-than-or-equal condition: both would fail when index is int.MaxValue. Incrementing index would overflow, either throwing an exception (in a checked context) or wrapping round to a negative value, so we'd never loop at all; a less-than-or-equal condition would loop forever, as every int is less than or equal to int.MaxValue.

Another way to look at it is that the loop counter ("i") is the "current index" within the iterator: the iterator starts before the first element, so it's reasonable to start at -1.

The reason I'm drawing attention to this is that I got all of this wrong first time... and was very grateful for unit tests to catch me out.

Conclusion

For me, the most interesting part of ElementAt is the decision about optimization. I'm sure I'm not the only one who optimizes without data at times - but it's a dangerous thing to do. The problem is that this isn't the normal micro-optimization quandary of "it's always a tiny bit better, but it's probably insignificant and makes the code harder to read". For the cases where this is faster, it could make an enormous difference - asking for element one million of a linked list which doesn't quite have enough elements could be very painful. But do failure cases need to be fast? How common are they? As you can tell, I'm dithering. I think it's at least worth thinking about what optimizations might make a difference - even if we later remove them.
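To make the linked-list scenario concrete: LinkedList<T> implements ICollection<T> but not IList<T>, so with the count-checking optimization above, an out-of-range request is rejected from Count alone instead of walking the list node by node. The observable results are identical either way; only the cost of the failure case differs. (This sketch calls the framework operators just to show the behaviour - the framework doesn't perform the count-only optimization, so it pays the iteration cost here.)

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class LinkedListDemo
{
    static void Main()
    {
        // LinkedList<T> implements ICollection<T> but not IList<T>.
        var list = new LinkedList<int>(Enumerable.Range(0, 1000));

        // A valid index still requires iterating - there's no indexer.
        Console.WriteLine(list.ElementAtOrDefault(999));     // 999

        // An index way past the end: with the ICollection<T> check this is
        // answered in O(1) from Count; without it, every node is visited
        // before MoveNext() finally returns false.
        Console.WriteLine(list.ElementAtOrDefault(1000000)); // 0
    }
}
```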

Next time, I think I'll tackle Contains - an operator which you might expect to be really fast on a HashSet<T>, but which has some interesting problems of its own...

Published Tue, Jan 11 2011 23:59 by skeet

Comments

# re: Reimplementing LINQ to Objects: Part 31 - ElementAt / ElementAtOrDefault

If source implements ICollection<TSource> but not IList<TSource>, both the ICollection<TSource> and the ICollection index validation will occur. You should put the non-generic ICollection validation in the else-branch of the generic ICollection<TSource> validation.

Wednesday, January 12, 2011 1:35 AM by Tommy Carlier

# re: Reimplementing LINQ to Objects: Part 31 - ElementAt / ElementAtOrDefault

@Tommy: Good call. I'll make that change. Although it's then becoming even more complicated for such a dubious optimization...

Wednesday, January 12, 2011 1:46 AM by skeet

# re: Reimplementing LINQ to Objects: Part 31 - ElementAt / ElementAtOrDefault

I was surprised to see argument checking in the private method instead of in the two public methods.

In various previous blogs of this series you have explicitly stated you do not want to see the private methods show up in the stacktrace ...

// Ryan

Wednesday, January 12, 2011 3:07 AM by Ryan Heath

# re: Reimplementing LINQ to Objects: Part 31 - ElementAt / ElementAtOrDefault

@Ryan: I've usually tried to avoid a not-obviously-related method signature from showing up in the stack trace. (Whether you count Select as obviously-related to the Min/Max/Average/Sum operations which take selectors is a matter of opinion, of course :)

If I *have* specifically claimed a preference against *any* private method name from showing up, it just shows inconsistency, to be honest. I think that's likely to be one of the things I mention in the "wrapping-up" posts of the series - there are various decisions which have no right/wrong answers, and which are sometimes so close to call in my mind that I'll vary day by day :(

Wednesday, January 12, 2011 3:52 AM by skeet

# re: Reimplementing LINQ to Objects: Part 31 - ElementAt / ElementAtOrDefault

I think a foreach would look a lot better here:

int currentIndex = 0;
foreach (T item in source) {
    if (currentIndex == index) {
        element = item;
        return true;
    }
    currentIndex++;
}
return false;

The overhead of actually fetching the item (i.e. accessing Current) should almost always be negligible.

Wednesday, January 12, 2011 9:30 AM by configurator

# re: Reimplementing LINQ to Objects: Part 31 - ElementAt / ElementAtOrDefault

@configurator: I personally prefer my version. It makes it clearer that it's not interested in the value until the end - the primary focus of the loop is the *index*, not the *value*.

If you were to describe how to get to the 10th item in a list, would you talk about it in terms of iterating over each item and maintaining a separate counter, or would you talk about moving forward 10 (or 11 in this case) times?

It's a personal view of course, but I think the emphasis on just "how many times we've moved on" is more appropriate here.

Wednesday, January 12, 2011 10:26 AM by skeet

# re: Reimplementing LINQ to Objects: Part 31 - ElementAt / ElementAtOrDefault

I guess I just think a bit differently because iterating over them and maintaining a separate counter seems right to me... If you need the fifth aisle in an unnumbered supermarket you go over the aisles and count them separately. That's also how you find the window that's fourth from the top on the third row of a building, after asking a friend "which one's yours?"

I think some of my different thinking is influence from reading SICP at a very young age. It made me love linq but I think I overuse it a bit :)

Thursday, January 13, 2011 5:35 PM by configurator

# re: Reimplementing LINQ to Objects: Part 31 - ElementAt / ElementAtOrDefault

@configurator: But do you really "go over the aisles"? It's not like you *examine the contents* of the aisles... you just count every time you go *past* an aisle, and stop when you've reached the right count. You're not saying: "Meat and dairy - that's aisle 1. Frozen food - that's aisle 2." You're more saying, "I've gone past 4 aisles now - it's time to look at the aisle I'm in, because it should be the right one."

At least, that's what I do :) The *counting* is the important bit, not the contents of the aisles I'm going *past*.

Friday, January 14, 2011 1:05 AM by skeet

# re: Reimplementing LINQ to Objects: Part 31 - ElementAt / ElementAtOrDefault

I think a countdown might be equally valid: you don't really care about how many items you've visited, just how many more are left until you've reached the target.  E.g.:

for (int countdown = index; countdown >= 0; --countdown) {
    if (!iterator.MoveNext()) {
        return false;
    }
}
element = iterator.Current;
return true;

Saturday, January 15, 2011 7:31 PM by Emperor XLII

# re: Reimplementing LINQ to Objects: Part 31 - ElementAt / ElementAtOrDefault

@Emperor: Yes, a countdown would be fine too.

Sunday, January 16, 2011 2:06 AM by skeet

# re: Reimplementing LINQ to Objects: Part 31 - ElementAt / ElementAtOrDefault

Sorry, my point (which I inadvertently left out of the original comment :) was that the countdown might be more obviously correct compared to the uncommon initial value/comparison, and does not require special consideration for `int.MaxValue`.  On the other hand, decrementing `for` loops are also not terribly common, and might deserve its own explanatory comment.  Tradeoffs either way.

Sunday, January 16, 2011 7:35 AM by Emperor XLII