A short case study in LINQ efficiency
I've been thinking a bit about how I'd use LINQ in real life (leaving DLinq and XLinq alone for the moment). One of the examples I came up with is a fairly common one - trying to find the element in a collection which has the maximum value for a certain property. Note that quite often I don't just need to know the maximum value of the property itself - I need to know which element had that value. Now, it's not at all hard to implement that in "normal" code, but using LINQ could potentially make the intention clearer. So, I tried to work out what the appropriate LINQ expression should look like.
I've come up with three ways of expressing what I'm interested in in LINQ. For these examples, I've created a type NameAndSize which has (unsurprisingly) properties Name (a string) and Size (an int). For testing purposes, I've created a list of these items, with a variable list storing a reference to the list. All samples assume that the list is non-empty.
Sort, and take first element
(from item in list orderby item.Size descending select item).First();
|
This orders by size descending, and takes the first element. The obvious disadvantage of this is that before finding the first element (which is the only one we care about) we have to sort all the elements - nasty! Assuming a reasonable sort, this operation is likely to be O(n log (n))
Subselect in where clause
(from item in list where item.Size==list.Max(x=>x.Size) select item).First();
|
This goes through the list, finding every element whose size is equal to the maximum one, and then takes the first of those elements. Unfortunately, the comparison calculates the maximum size on every iteration This makes it an O(n^2) operation.
Two selects
int maxSize = list.Max(x=>x.Size); NameAndSize max = (from item in list where item.Size==maxSize select item).First();
|
This is similar to the previous version, but solves the problem of the repeated calculation of the maximum size by doing it before anything else. This makes the whole operation O(n), but it's still somewhat dissatisfying, as we're having to iterate through the list twice.
The non-LINQ way
NameAndSize max = list[0]; foreach (NameAndSize item in list) { if (item.Size > max.Size) { max = item; } }
|
This keeps a reference to the "maximum element so far". It only iterates through the list once, and is still O(n).
Benchmarks
Now, I've written a little benchmark which runs all of these except the "embedded select" version which was just too slow to run sensibly by the time I'd made the list large enough to get sensible results for the other versions. Here are the results using a list of a million elements, averaged over five runs:
Sorting: 437ms
Two queries: 109ms
Non-LINQ: 38ms
After tripling the size of the list, the results were:
Sorting: 1437ms
Two queries: 324ms
Non-LINQ: 117ms
These results show the complexities being roughly as predicted above, and in particular show that it's definitely cheaper to only iterate through the collection once than to iterate through it twice.
Now, this query is a fairly simple one, conceptually - it would be a shame if LINQ couldn't cope with it efficiently. I suspect it could be solved by giving the Max operator another parameter, which specified what should be selected, as well as what should be used for comparisons. Then I could just use list.Max(item => item.Size, item=>item). At that stage, the only loss in efficiency would be through invoking the delegates, which is a second-order problem (and one which is inherent in LINQ). Fortunately, the way LINQ works makes this really easy to try out - just write an extension class:
static class Extensions
{
public static V Max<T,V>(this IEnumerable<T> source,
Func<T,int> comparisonMap,
Func<T,V> selectMap)
{
int maxValue=0;
V maxElement=default(V);
bool gotAny = false;
using (IEnumerator<T> enumerator = source.GetEnumerator())
{
while (enumerator.MoveNext())
{
T sourceValue = enumerator.Current;
int value = comparisonMap(sourceValue);
if (!gotAny || value > maxValue)
{
maxValue = value;
maxElement = selectMap(sourceValue);
gotAny = true;
}
}
}
if (!gotAny)
{
throw new EmptySequenceException();
}
return maxElement;
}
} |
This gave results of 57ms and 169ms for the two list sizes used earlier - not quite as fast as the non-LINQ way, but much better than any of the others - and by far the simplest to express, too.
Lessons learned
- You really need to think about the complexity of LINQ queries, and know where they will be executed (I suspect that DLinq would have coped with the "subselect" version admirably).
- There's still some more work to be done on the standard query operators to find efficient solutions to common use cases.
- Even if the standard query operators don't quite do what you want, it can be worthwhile to implement your own - and it's not that hard to do so!