Reimplementing LINQ to Objects: Part 28 - Sum

Okay, I've bitten the bullet. The first of the four Aggregation Operators Of Overload Doom (AOOOD) that I've implemented is Sum. It was far from difficult to implement - just tedious.

What is it?

Sum has 20 overloads - a set of 4 for each of the types that it covers (int, long, float, double, decimal). Here are the overloads for int:

public static int Sum(this IEnumerable<int> source)

public static int? Sum(this IEnumerable<int?> source)

public static int Sum<T>(
    this IEnumerable<T> source,
    Func<T, int> selector)

public static int? Sum<T>(
    this IEnumerable<T> source,
    Func<T, int?> selector)

As you can see, there are basically two variations:

  • A source of the numeric type itself, or a source of an arbitrary type with a projection to the numeric type
  • The numeric type can be nullable or non-nullable

The behaviour is as follows:

  • All overloads use immediate execution: it will immediately iterate over the source sequence to compute the sum, which is obviously the return value.
  • source and selector must both be non-null
  • Where there's a selector, the operator is equivalent to source.Select(selector).Sum() - or you can think of the versions without a selector as using an identity selector
  • Where the numeric type is nullable, null values are ignored
  • The sum of an empty sequence is 0 (even for nullable numeric types)

The last point is interesting - because the overloads with nullable numeric types never return a null value. Initially I missed the fact that the return type even was nullable. I think it's somewhat misleading to be nullable but never null - you might have at least expected that the return value for an empty sequence (or one consisting only of null values) would be null.

For int, long and decimal, overflow within the sum will throw OverflowException. For single and double, the result will be positive or negative infinity. If the sequence contains a "not-a-number" value (NaN), the result will be NaN too.

What are we going to test?

A lot!

In total, I have 123 tests across the five types. The tests are mostly the same for each type, with the exception of overflow behaviour and not-a-number behaviour for single and double. Each overload is tested reasonably thoroughly:

  • Argument validation
  • Summing a simple sequence
  • Summing an empty sequence
  • (Nullable) Summing a simple sequence containing null values
  • (Nullable) Summing a sequence containing only null values
  • Positive overflow (either to an exception or infinity)
  • Negative overflow (only one test per type, rather than for each overload)
  • (Single/Double) Sequences containing NaN values
  • Projections resulting in all of the above

Most of this was done using cut and paste, leading to a 916-line source file. On Twitter, followers have suggested a couple of alternatives - templating (possibly for both the tests and the implementation), or using more advanced features of unit test frameworks. There's nothing wrong with these suggestions - but I'm always concerned about the balance within an elegant-but-complex solution to repetition. If it takes longer to get to the "neat" solution, and then each individual test is harder to read, is it worth it? It certainly makes it easier to add one test which is then applicable over several types, or to modify all "copies" of an existing test - but equally it makes it harder to make variations (such as overflow) fit within the pattern. I have no quarrel with the idea of using more advanced techniques here, but I've stuck to a primitive approach for the moment.

Let's implement it!

Again, I'll only demonstrate the "int" implementations - but talk about single/double later on. There are plenty of ways I could have implemented this:

  • Delegate everything to the simplest overload using "Where" for the nullable sequences and "Select" for the projection sequences
  • Delegate everything to the most complex overload using identity projections
  • Implement each method independently
  • Somewhere in-between :)

In the end I've implemented each non-projecting overload by delegating to the corresponding projection-based one with an identity projection. I've implemented the non-nullable and nullable versions separately though. Here's the complete implementation:

public static int Sum(this IEnumerable<int> source)
{
    return Sum(source, x => x);
}

public static int? Sum(this IEnumerable<int?> source)
{
    return Sum(source, x => x);
}

public static int Sum<T>(
    this IEnumerable<T> source,
    Func<T, int> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    checked
    {
        int sum = 0;
        foreach (T item in source)
        {
            sum += selector(item);
        }
        return sum;
    }
}

public static int? Sum<T>(
    this IEnumerable<T> source,
    Func<T, int?> selector)
{
    if (source == null)
    {
        throw new ArgumentNullException("source");
    }
    if (selector == null)
    {
        throw new ArgumentNullException("selector");
    }
    checked
    {
        int sum = 0;
        foreach (T item in source)
        {
            sum += selector(item).GetValueOrDefault();
        }
        return sum;
    }
}

Note the use of Nullable<T>.GetValueOrDefault() to "ignore" null values - it felt easier to add zero than to use an "if" block here. I suspect it's also more efficient, as there's no need for any conditionality here: I'd expect the implementation of GetValueOrDefault() to just return the underlying "value" field within the Nullable<T>, without performing the check for HasValue which the Value property normally would.

Of course, if I were really bothered by performance I'd implement each operation separately, instead of using the identity projection.

Note the use of the "checked" block to make sure that overflow is handled appropriately. As I've mentioned before, it would quite possibly be a good idea to turn overflow checking on for the whole assembly, but here I feel it's worth making it explicit to show that we consider overflow as an important part of the behaviour of this operator. The single/double overloads don't use checked blocks, as their overflow behaviour isn't affected by the checked context.

Conclusion

One down, three to go! I suspect Min and Max will use even more cutting and pasting (with judiciously applied changes, of course). There are 22 overloads for each of those operators, due to the possibility of using an arbitrary type - but I may well be able to use the most generic form to implement all the numeric versions. I may measure the impact this has on performance before deciding for sure. Anyway, that's a topic for the next post...

Addendum

As has been pointed out in the comments, my original implementation used a float to accumulate values when summing a sequence of floats. This causes problems, as these two new tests demonstrate:

[Test]
public void NonOverflowOfComputableSumSingle()
{
    float[] source = { float.MaxValue, float.MaxValue,
                      -float.MaxValue, -float.MaxValue };
    // In a world where we summed using a float accumulator, the
    // result would be infinity.
    Assert.AreEqual(0f, source.Sum());
}

[Test]
public void AccumulatorAccuracyForSingle()
{
    // 20000000 and 20000004 are both exactly representable as
    // float values, but 20000001 is not. Therefore if we use
    // a float accumulator, we'll end up with 20000000. However,
    // if we use a double accumulator, we'll get the right value.
    float[] array = { 20000000f, 1f, 1f, 1f, 1f };
    Assert.AreEqual(20000004f, array.Sum());
}

The second of these tests is specific to floating point arithmetic - there's no equivalent in the integer domain. Hopefully the comment makes the test clear. We could still do better if we used the Kahan summation algorithm, but I haven't implemented that yet, and don't currently intend to. Worth noting as a potential follow-on project though.

Back to the first test though: this certainly can be represented in integers. If we try to sum { int.MaxValue, int.MaxValue, -int.MaxValue, -int.MaxValue } there are two options: we can overflow (throwing an exception) or we can return 0. If we use a long accumulator, we'll return 0. If we use an int accumulator, we'll overflow. I genuinely didn't know what the result for LINQ to Objects would be until I tried it - and found that it overflows. I've added a test to document this behaviour:

[Test]
public void OverflowOfComputableSumInt32()
{
    int[] source = { int.MaxValue, 1, -1, -int.MaxValue };
    // In a world where we summed using a long accumulator, the
    // result would be 0.
    Assert.Throws<OverflowException>(() => source.Sum());
}

Of course, I could have gone my own way and made Edulinq more capable than LINQ to Objects here, but in this case I've gone with the existing behaviour.

Published Sat, Jan 8 2011 15:15 by skeet
Filed under: , ,

Comments

# re: Shortcut for Nan

In your 'behaviour' section, you've mentioned that "If the sequence contains a "not-a-number" value (NaN), the result will be NaN too."

Would it not be useful to include a "short-cut return" for the "Single+Double" implementation

e.g.

Single sum = 0;

foreach (T item in source)

{

 sum += selector(item)

   .GetValueOrDefault();

 if (Single.IsNaN(current))

   break;

}

return sum;

---

I might also have suggest shortcutting for Infinity, then I noticed that Single.NegativeInfinity + Single.PositiveInfinity gives Single.Nan, so it probably isn't worth

Saturday, January 08, 2011 10:17 AM by Steven Wilmot

# re: Reimplementing LINQ to Objects: Part 28 - Sum

You were very careful to use checked arithmetic earlier... why not here?

Saturday, January 08, 2011 1:03 PM by Star

# re: Reimplementing LINQ to Objects: Part 28 - Sum

Now if you just had a "Num a" equivalent in C#/the CLR to write "sum :: (Num a) => IE<a> -> a" instead of n overloads... ;)

Saturday, January 08, 2011 1:08 PM by Novox

# re: Reimplementing LINQ to Objects: Part 28 - Sum

@Star: Where do you believe I'm not being careful to use checked arithmetic?

Saturday, January 08, 2011 1:11 PM by skeet

# re: Reimplementing LINQ to Objects: Part 28 - Sum

I may have my nullable arithmetic all wrong, but wouldn't sum += selector(item) return the right result anyway?

Saturday, January 08, 2011 2:57 PM by configurator

# re: Reimplementing LINQ to Objects: Part 28 - Sum

@configurator: Nope... 1 + null == null.

Saturday, January 08, 2011 3:08 PM by skeet

# re: Reimplementing LINQ to Objects: Part 28 - Sum

When summing floats, do you accumulate in a double? Do you use an accurate summing algorithm[1] to limit loss of precision? FWIW I believe Linq to Objects does the former but not the latter based on tests like:

Enumerable.Repeat(Enumerable.Range(1,4).Select(i=>(float)i),10000000).SelectMany(x=>x).Sum()

(Try this with a float accumulator and you will end up with something like 6.710886E+07 instead of 1E+08)

and

const double eps = 2.220446e-16;

Enumerable.Concat(new[]{100.0}, Enumerable.Repeat(eps, 10000)).Sum()

(The resulting value is exactly 100.0, when a more accurate result could be represented in a double.)

That said, I can't imagine anyone using Linq to objects for any serious floating point arithmetic given the lack of documentation, so this may be a pointless diversion.

[1] - en.wikipedia.org/.../Kahan_summation_algorithm

Monday, January 10, 2011 7:58 AM by Weeble

# re: Reimplementing LINQ to Objects: Part 28 - Sum

@Weeble: Good questions. Currently IIRC, I use the result type as the accumulator type as well. Sounds like I should fix that.

Note that the same could be said for int as well, in a simpler way: I could potentially sum as a long, and then cast back to int at the end, so that int.MaxValue + 1 + (-1) would be okay.

Monday, January 10, 2011 8:05 AM by skeet