Reimplementing LINQ to Objects: Part 13 - Aggregate
EDIT: I've had to edit this quite a bit now that a second bug was discovered... basically my implementation of the first overload was completely broken :(
Last night's tweet asking for suggestions around which operator to implement next resulted in a win for Aggregate, so here we go.
What is it?
Aggregate has three overloads, effectively allow two defaults:
public static TSource Aggregate<TSource>(
this IEnumerable<TSource> source,
Func<TSource, TSource, TSource> func)
public static TAccumulate Aggregate<TSource, TAccumulate>(
this IEnumerable<TSource> source,
TAccumulate seed,
Func<TAccumulate, TSource, TAccumulate> func)
public static TResult Aggregate<TSource, TAccumulate, TResult>(
this IEnumerable<TSource> source,
TAccumulate seed,
Func<TAccumulate, TSource, TAccumulate> func,
Func<TAccumulate, TResult> resultSelector)
Aggregate is an extension method using immediate execution, returning a single result. The generalised behaviour is as follows:
- Start off with a seed. For the first overload, this defaults to the first value of the input sequence. The seed is used as the first accumulator value.
- For each item in the list, apply the aggregation function, which takes the current accumulator value and the newly found item, and returns a new accumulator value.
- Once the sequence has been exhausted, optionally apply a final projection to obtain a result. If no projection has been specified, we can imagine that the identity function has been provided.
The signatures make all of this look a bit more complicated because of the various type parameters involved. You can consider all the overloads as dealing with three different types, even though the first two actually have fewer type parameters:
- TSource is the element type of the sequence, always.
- TAccumulate is the type of the accumulator - and thus the seed. For the first overload where no seed is provided, TAccumulate is effectively the same as TSource.
- TResult is the return type when there's a final projection involved. For the first two overloads, TResult is effectively the same as TAccumulate (again, think of a default "identity projection" as being used when nothing else is specified)
In the first overload, which uses the first input element as the seed, an InvalidOperationException is thrown if the input sequence is empty.
What are we going to test?
Obviously the argument validation is reasonably simple to test - source, func and resultSelector can't be null. But there are two different approaches to testing the "success" cases.
We could work out exactly when each delegate should be called and with what values - effectively mock every step of the iteration. This would be a bit of a pain, but a very robust way of proceeding.
The alternative approach is just to take some sample data and aggregation function, work out what the result should be, and assert that result. If the result is sufficiently unlikely to be achieved by chance, this is probably good enough - and it's a lot simpler to implement. Here's a sample from the most complicated test, where we have a seed and a final projection:
[Test]
public void SeededAggregationWithResultSelector()
{
int[] source = { 1, 4, 5 };
int seed = 5;
Func<int, int, int> func = (current, value) => current * 2 + value;
Func<int, string> resultSelector = result => result.ToInvariantString();
Assert.AreEqual("57", source.Aggregate(seed, func, resultSelector));
}
Now admittedly I'm not testing this to the absolute full - I'm using the same types for TSource and TAccumulate - but frankly it gives me plenty of confidence that the implementation is correct.
EDIT: My result selector now calls ToInvariantString. It used to just call ToString, but as I've now been persuaded that there are some cultures where that wouldn't give us the right results, I've implemented an extension method which effectively means that x.ToInvariantString() is equivalent to x.ToString(CultureInfo.InvariantCulture) - so we don't need to worry about cultures with different numeric representations etc.
Just for the sake of completeness (I've convinced myself to improve the code while writing this blog post), here's an example which sums integers, but results in a long - so it copes with a result which is bigger than Int32.MaxValue. I haven't bothered with a final projection though:
[Test]
public void DifferentSourceAndAccumulatorTypes()
{
int largeValue = 2000000000;
int[] source = { largeValue, largeValue, largeValue };
long sum = source.Aggregate(0L, (acc, value) => acc + value);
Assert.AreEqual(6000000000L, sum);
Assert.IsTrue(sum > int.MaxValue);
}
Since I first wrote this post, I've also added tests for empty sequences (where the first overload should throw an exception) and a test which relies on the first overload using the first element of the sequence as the seed, rather than the default value of the input sequence's element type.
Okay, enough about the testing... what about the real code?
Let's implement it!
I'm still feeling my way around when it's a good idea to implement one method by using another, but at the moment my gut feeling is that it's okay to do so when:
- You're implementing one operator by reusing another overload of the same operator; in other words, no unexpected operators will end up in the stack trace of callers
- There are no significant performance penalties for doing so
- The observed behaviour is exactly the same - including argument validation
- The code ends up being simpler to understand (obviously)
Contrary to an earlier version of this post, the first overload can't be implemented in terms of the second or third ones, because of its behaviour regarding the seed and empty sequences.
public static TSource Aggregate<TSource>(
this IEnumerable<TSource> source,
Func<TSource, TSource, TSource> func)
{
if (source == null)
{
throw new ArgumentNullException("source");
}
if (func == null)
{
throw new ArgumentNullException("func");
}
using (IEnumerator<TSource> iterator = source.GetEnumerator())
{
if (!iterator.MoveNext())
{
throw new InvalidOperationException("Source sequence was empty");
}
TSource current = iterator.Current;
while (iterator.MoveNext())
{
current = func(current, iterator.Current);
}
return current;
}
}
It still makes sense to share an implementation for the second and third overloads though. There's a choice around whether to implement the second operator in terms of the third (giving it an identity projection) or to implement the third operator in terms of the second (by just calling the second overload and then applying a projection). Obviously applying an unnecessary identity projection has a performance penalty in itself - but it's a tiny penalty. So which is more readable? I'm in two minds about this. I like code where various methods call one other "central" method where all the real work occurs (suggesting implementing the second overload using the third) but equally I suspect I really think about aggregation in terms of getting the final value of the accumulator... with just a twist in the third overload, of an extra projection. I guess it depends on whether you think of the final projection as part of the general form or an "extra" step.
For the moment, I've gone with the "keep all logic in one place" approach:
public static TAccumulate Aggregate<TSource, TAccumulate>(
this IEnumerable<TSource> source,
TAccumulate seed,
Func<TAccumulate, TSource, TAccumulate> func)
{
return source.Aggregate(seed, func, x => x);
}
public static TResult Aggregate<TSource, TAccumulate, TResult>(
this IEnumerable<TSource> source,
TAccumulate seed,
Func<TAccumulate, TSource, TAccumulate> func,
Func<TAccumulate, TResult> resultSelector)
{
if (source == null)
{
throw new ArgumentNullException("source");
}
if (func == null)
{
throw new ArgumentNullException("func");
}
if (resultSelector == null)
{
throw new ArgumentNullException("resultSelector");
}
TAccumulate current = seed;
foreach (TSource item in source)
{
current = func(current, item);
}
return resultSelector(current);
}
The bulk of the "real work" method is argument validation - the actual iteration is almost painfully simple.
Conclusion
The moral of today's story is to read the documentation carefully - sometimes there's unexpected behaviour to implement. I still don't really know why this difference in behaviour exists... it feels to me as if the first overload really should behave like the second one, just with a default initial seed. EDIT: it seems that you need to read it really carefully. You know, every word of it. Otherwise you could make an embarrassing goof in a public blog post. <sigh>
The second moral should really be about the use of Aggregate - it's a very generalized operator, and you can implement any number of other operators (Sum, Max, Min, Average etc) using it. In some ways it's the scalar equivalent of SelectMany, just in terms of its diversity. Maybe I'll show some later operators implemented using Aggregate...
Next up, there have been requests for some of the set-based operators - Distinct, Union, etc - so I'll probably look at those soon.