Reimplementing LINQ to Objects: Part 29 - Min/Max

The second and third AOOOD operators today... if I'm brave enough to tackle Average tomorrow, I'll have done them all. More surprises here today, this time in terms of documentation...

What are they?

Min and Max are both extension methods with 22 overloads each. Min looks like this:

public static int Min(this IEnumerable<int> source)

public static int Min<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int> selector)

public static int? Min(this IEnumerable<int?> source)

public static int? Min<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int?> selector)

// Repeat the above four overloads for long, float, double and decimal,
// then add two more generic ones:

public static TSource Min<TSource>(this IEnumerable<TSource> source)

public static TResult Min<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector
)

(Max is exactly the same as Min; just replace the name.)

The more obvious aspects of the behaviour are as follows:

  • source and selector mustn't be null
  • All overloads use immediate execution
  • The minimum or maximum value within the sequence is returned
  • If a selector is present, it is applied to each value within source, and the maximum of the projected values is returned. (Note how the return type of these methods is TResult, not TSource.)

Some less obvious aspects - in all cases referring to the result type (as the source type is somewhat incidental when a selector is present; it doesn't affect the behaviour):

  • The type's IComparable<T> implementation is used when available, otherwise IComparable is used. An ArgumentException is thrown if values can't be compared. Fortunately, this is exactly the behaviour of Comparer<T>.Default.
  • For any nullable type (whether it's a reference type or a nullable value type), nulls within the sequence are ignored, and an empty sequence (or one which contains only null values) will cause a null value to be returned. If there are any non-null values in the sequence, the return value will be non-null. (Note that this is different from Sum, which will return the non-null zero value for empty sequences over nullable types.)
  • For any non-nullable value type, an empty sequence will cause InvalidOperationException to be thrown.

The first point is particularly interesting when you consider the double and float types, and their "NaN" (not-a-number) values. For example, Math.Max regards NaN as greater than positive infinity, but Enumerable.Max regards positive infinity as being the greater of the two. Math.Min and Enumerable.Min agree, however, that NaN is less than negative infinity. (It would actually make sense to me for NaN to be treated as the numeric equivalent of null here, but that would be strange in other ways...) Basically, NaN behaves oddly in all kinds of ways. I believe that IEEE-754-2008 actually specifies behaviour with NaNs which encourages the results we're getting here, but I haven't verified that yet. (I can't find a free version of the standard online, which is troubling in itself. Ah well.)

The behaviour of the nullable and non-nullable types is well documented for the type-specific overloads using int, Nullable<int> etc. However, the generic overloads (the ones using TSource) are poorly documented:

  • InvalidOperationException isn't in the list of possibly-thrown arguments for any of the overloads
  • The methods using selectors from TSource to TResult don't mention the possibility of nullity at all
  • The methods without selectors describe the behaviour of null values for reference types, but don't mention the possibility of empty sequences for non-nullable value types, or consider nullable value types at all.

(I should point out that ArgumentException isn't actually mentioned either for the case where values are incomparable, but that feels like a slightly less important offence for some reason. Possibly just because it didn't trip me up.)

If I remember, I'll open a Connect issue against this hole in the documentation when I find time. Unlike the optimizations and set ordering (where it's reasonably forgivable to deliberately omit implementation details from the contract) you simply can't predict the behaviour in a useful way from the documentation here. And yes, I'm going on about this because it bit me. I had to resort to writing tests and running them against LINQ to Objects to see if they were correct or not. (They were incorrect in various places.)

If you look at the behaviour of the non-generic methods, the generic ones are entirely consistent of course.

There are a couple of things which you might consider "missing" in terms of Max and Min:

  • The ability to find out the minimum/maximum value of a sequence by a projection. For example, consider a sequence of people. We may wish to find the youngest person in the sequence, in which case we'd like to be able to write something like:
    var oldest = people.MaxBy(person => person.Age);
    We can find the maximum age itself easily enough - but then we'd need a second pass to find the first person with that age. I've addressed this in MoreLINQ with the MaxBy and MinBy operators. The System.Interactive assembly in Reactive Extensions has the same methods too.
  • The ability to specify a custom IComparer<T> implementation, as we can in most of the operators using IEqualityComparer<T>. For example, we can't find the "maximum" string in a sequence, using a case-insensitive ordinal comparison.

Still, at least that means there's less to test...

What are we going to test?

I decided I really couldn't find the energy to replicate all the tests for every type involved here. Instead, I have a bunch of tests for int and Nullable<int>, a few tests exploring the oddness of doubles, and a bunch of tests around the generic methods. In particular, I know that I've implemented decimal, float etc by calling the same methods that the int overloads use.

The tests cover:

  • Argument validation
  • Empty sequences
  • Sequences of null values where applicable
  • Projections of the above
  • Generic tests for nullable and non-nullable value types, and reference types (with empty sequences etc)
  • Incomparable values

Let's implement them!

Okay, let's start off with the simplest detail: the order of implementation:

  • Max(int)
  • Max(generic)
  • Cut and paste Max implementations for other numeric types (replace the type name, basically)
  • Cut and paste the entirety of Max to Min:
    • Replace "Max" with "Min" everywhere
    • Replace " < " with " > " everywhere (only 4 occurrences; basically the results of calling Compare or ComparerTo and comparing with 0)

Just as with Sum, I could have used templating - but I don't think it would actually have saved me significant time.

This time, I thought I'd use Select internally for the overloads with selectors (unlike my approach for Sum which used identity projections). There's no particular reason for this - I just thought it would be interesting to try both approaches. Overall, I think I prefer this one, but I haven't done any benchmarking to find out the relative performance penalties.

Each set of numeric overloads calls into a single pair of generic "implementation" methods. These aren't the public general-purpose ones: they require that the types in use implement IComparable<T>, and I've added a "struct" constraint just for kicks. This is just one approach. Other options:

  • I could have implemented the code separately for each numeric type. That may well be faster than calling IComparable<T>.Compare (at least for most types) as the IL would have contained the appropriate operator directly. However, it would have meant more code and explicitly dealing with the headache of NaNs for double/float. If I ever write benchmarks, I'll investigate the difference that this can make.
  • I could have used the public generic overloads, which eventually call into Comparer<T>.Default. Again, the penalty for this (if any) is unknown to me at this point. Can the JIT inline deeply enough to make this as fast as a "native" implementation? I wouldn't like to guess without tests.

I've separated out the nullable implementations from the non-nullable ones, as the behaviour differs significantly between the two.

Here's the public code for int:

public static int Max(this IEnumerable<int> source)
{
    return PrimitiveMax(source);
}

public static int Max<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int> selector)
{
    // Select will validate the arguments
    return PrimitiveMax(source.Select(selector));
}

public static int? Max(this IEnumerable<int?> source)
{
    return NullablePrimitiveMax(source);
}

public static int? Max<TSource>(
    this IEnumerable<TSource> source,
    Func<TSource, int?> selector)
{
    // Select will validate the arguments
    return NullablePrimitiveMax(source.Select(selector));
}

All the methods consider argument validation to be somebody else's problem - either Select or the generic method we're calling to find the maximum value. Part of me thinks this is lazy; part of me likes it in terms of not repeating code. All of me would prefer the ability to specify non-nullable parameters declaratively...

Here are the "primitive" methods called into above:

// These are uses by all the overloads which use a known numeric type.
// The term "primitive" isn't truly accurate here as decimal is not a primitive
// type, but it captures the aim reasonably well.
// The constraint of being a value type isn't really required, because we don't rely on
// it within the method and only code which already knows it's a comparable value type
// will call these methods anyway.
        
private static T PrimitiveMax<T>(IEnumerable<T> source) where T : struct, IComparable<T>
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    using (IEnumerator<T> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        T max = iterator.Current;
        while (iterator.MoveNext())
        {
            T item = iterator.Current;
            if (max.CompareTo(item) < 0)
            {
                max = item;
            }
        }
        return max;
    }
}

private static T? NullablePrimitiveMax<T>(IEnumerable<T?> source) where T : struct, IComparable<T>
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    T? max = null;
    foreach (T? item in source)
    {
        if (item != null &&
            (max == null || max.Value.CompareTo(item.Value) < 0))
        {
            max = item;
        }
    }
    return max;
}

The first method is interesting in terms of its approach to throwing an exception if the first element isn't present, and using that as an initial candidate otherwise.

The second method needs to consider nullity twice on each iteration:

  • Is the item from the sequence null? If so, we can ignore it.
  • Is our "current maximum" null? If so, we can replace it with the item from the sequence without performing a comparison.

Now there's one case which is ambiguous here: when both values are null. At that point we can choose to replace our "current maximum" with the item, or not... it doesn't matter as the values are the same anyway. It is important that we don't try to perform a comparison unless both values are non-null though... the short-circuiting && and || operators keep us safe here.

Having implemented the code above, all the interesting work lies in the generic forms. Here we don't have different public methods to determine which kind of behaviour we'll use: but I wrote two private methods instead, and just delegated to the right one from the public one. This seemed cleaner than putting the code all in one method:

public static TSource Max<TSource>(
    this IEnumerable<TSource> source)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    // This condition will be true for reference types and nullable value types, and false for
    // non-nullable value types.
    return default(TSource) == null ? NullableGenericMax(source) : NonNullableGenericMax(source);
}

public static TResult Max<TSource, TResult>(
    this IEnumerable<TSource> source,
    Func<TSource, TResult> selector)
{
    return Max(source.Select(selector));
}

/// <summary>
/// Implements the generic behaviour for non-nullable value types.
/// </summary>
/// <remarks>
/// Empty sequences will cause an InvalidOperationException to be thrown.
/// Note that there's no *compile-time* validation in the caller that the type
/// is a non-nullable value type, hence the lack of a constraint on T.
/// </remarks>
private static T NonNullableGenericMax<T>(IEnumerable<T> source)
{
    Comparer<T> comparer = Comparer<T>.Default;

    using (IEnumerator<T> iterator = source.GetEnumerator())
    {
        if (!iterator.MoveNext())
        {
            throw new InvalidOperationException("Sequence was empty");
        }
        T max = iterator.Current;
        while (iterator.MoveNext())
        {
            T item = iterator.Current;
            if (comparer.Compare(max, item) < 0)
            {
                max = item;
            }
        }
        return max;
    }
}

/// <summary>
/// Implements the generic behaviour for nullable types - both reference types and nullable
/// value types.
/// </summary>
/// <remarks>
/// Empty sequences and sequences comprising only of null values will cause the null value
/// to be returned. Any sequence containing non-null values will return a non-null value.
/// </remarks>
private static T NullableGenericMax<T>(IEnumerable<T> source)
{
    Comparer<T> comparer = Comparer<T>.Default;

    T max = default(T);
    foreach (T item in source)
    {
        if (item != null &&
            (max == null || comparer.Compare(max, item) < 0))
        {
            max = item;
        }
    }
    return max;
}

As you can tell, there's a significant similarity between the "PrimitiveMax" and "NonNullableGenericMax" methods, and likewise between "NullablePrimitiveMax" and "NullableGenericMax". This should come as no surprise. Fundamentally the difference is just between using an IComparable<T> implementation, and using Comparer<T>.Default. (The argument validation occurs in a different place too, as we'll be going through a public entry point for the non-primitive code.)

Once I'd discovered the correct behaviour, this was reasonably simple. Of course, the above code wasn't my first implementation, where I'd completely forgotten about null values, and hadn't thought about how the nullability of the source type might affect the behaviour of empty sequences...

Conclusion

If you're ever in a company which rewards you for checking in lots of lines of code, offer to implement Sum/Min/Max. This weekend I've checked in about 2,500 lines of code in (split between production and test) and none of it's been terribly hard. Of course, if you're ever in such a company you should also consider looking for another job. (Have I mentioned that Google's hiring? Email me if you're interested. I'm serious.)

As you can tell, I was slightly irritated by the lack of clarity around the documentation in some places - but I find it interesting that even a simple-sounding function like "find the maximum value from a sequence" should need the kind of documentation that's missing here. I'm not saying it's a failure of the design - more just musing how a complete specification is almost always going to be longer than you might think at first glance. And if you think I was diligent here, think again: I didn't bother specifying which maximum or minimum value would be returned if there were two. For example, if a sequence consists of references to two equal but distinct strings, which reference should be returned? I have neither stated what my implementation (or the LINQ to Objects implementation) will do, nor tested for it.

Next up is Average - a single method with a mere 20 overloads. There are various corner cases to consider... but that's a post for another day.

Published Sun, Jan 9 2011 22:03 by skeet
Filed under: , ,

Comments

# re: Reimplementing LINQ to Objects: Part 29 - Min/Max

You are writing and documenting brilliant code faster than I can read it. Do you have you a specific goal like to create some GLinq (Google Linq) implementation with algorithms that MS has never dreamed of ;-). I did never think that you can write a sort algorithm and yield already sorted elements before the complete sort is done.

I did compare the EduLinq OrderBy method and the "pure" Array.Sort which uses QuickSort to check how much overhead we get when we completely sort the array.

***** Edulinq.Tests.OrderByTest.Perf_ArraySort

Test Sort Integer Array[10] did run 10.000,00 times in 0,07s Frequency 142857

Test QuickSort Integer Array[10] did run 10.000,00 times in 0,00s Frequency 5000000

Test Sort Integer Array[100] did run 10.000,00 times in 0,36s Frequency 27548

Test QuickSort Integer Array[100] did run 10.000,00 times in 0,03s Frequency 384615

Test Sort Integer Array[1.000] did run 10.000,00 times in 4,06s Frequency 2462

Test QuickSort Integer Array[1.000] did run 10.000,00 times in 0,24s Frequency 42194

Test Sort Integer Array[10.000] did run 10.000,00 times in 47,53s Frequency 210

Test QuickSort Integer Array[10.000] did run 10.000,00 times in 2,74s Frequency 3656

The LINQ implementation gives

Test Sort Integer Array[10] did run 10.000,00 times in 0,06s Frequency 175439

Test QuickSort Integer Array[10] did run 10.000,00 times in 0,00s Frequency 5000000

Test Sort Integer Array[100] did run 10.000,00 times in 0,16s Frequency 62500

Test QuickSort Integer Array[100] did run 10.000,00 times in 0,03s Frequency 370370

Test Sort Integer Array[1.000] did run 10.000,00 times in 1,81s Frequency 5537

Test QuickSort Integer Array[1.000] did run 10.000,00 times in 0,24s Frequency 41841

Test Sort Integer Array[10.000] did run 10.000,00 times in 22,36s Frequency 447

Test QuickSort Integer Array[10.000] did run 10.000,00 times in 2,80s Frequency 3574

All those nice abstractions do cost a factor 10 (LINQ) to 20 (EduLinq) vs a plain Array sort. This comparison is not really fair since we compare deferred execution (LINQ) vs eager execution (imperative sort). When we want first results fast at the cost of increased execution time LINQ is the way to go.

The test was executed on a Intel Dual Core 6600 2.4GHz with Windows 7, NET 4.0, Debug build (Release does differ only few %).

Sunday, January 09, 2011 6:26 PM by Alois Kraus

# re: Reimplementing LINQ to Objects: Part 29 - Min/Max

@Alois: I'm glad you're enjoying the series. There are no aims for the project beyond educational ones. It's really just for fun, and to try to help people understand LINQ to Objects.

Monday, January 10, 2011 1:04 AM by skeet

# re: Reimplementing LINQ to Objects: Part 29 - Min/Max

See you should have used the heapsort? Then Min would be OrderBy(x => x).First() and Max would be OrderByDescending(x => x).First()...

Of course you'd have to call First() for non-nullable values and FirstOrDefault() for nullable values, and you'd have to make sure the comparer handles NaN and infinities as expected, but that's pretty much it I think.

Monday, January 10, 2011 8:05 AM by configurator

# re: Reimplementing LINQ to Objects: Part 29 - Min/Max

@Jon, there are a few mistakes in MinTest.cs : you copy pasted some code from MaxTest.cs and forgot to replace Max with Min ;)

Monday, January 10, 2011 4:51 PM by Thomas Levesque

# re: Reimplementing LINQ to Objects: Part 29 - Min/Max

@Thomas: Thanks. Those must have been the tests I wrote after I discovered the oddnesses with the generic sequences. Fixed now.

Monday, January 10, 2011 5:02 PM by skeet

# re: Reimplementing LINQ to Objects: Part 29 - Min/Max

It's not really correct to talk about NaN being equal, less or greater than anything - it's not a number after all [1]. It should be defining whether a method should *propagate* NaNs: both Math.Min and Max seem to be propagating here, but Enumerable is just ignoring it and giving nonsense results.

[1] IEEE-754 defines all comparisons involving at least one NaN to return false. This confusingly means:

   Both `NaN == NaN` and `NaN != NaN` are false (which is why you get a warning for `if (x == NaN)`).

   Both `a < NaN` and `a > NaN` are false (probably the source of the Linq to Objects results).

Tuesday, January 11, 2011 6:00 AM by Simon Buchan

# re: Reimplementing LINQ to Objects: Part 29 - Min/Max

@Simon: Yes, I know about the natural comparisons - but IComparable doesn't really work that way. Propagation does sound like a more sensible concept... it is slightly odd that Enumerable.Max doesn't propagate NaN...

Tuesday, January 11, 2011 12:29 PM by skeet