Logging enumeration flow
I'm currently reading Pro LINQ: Language Integrated Query in C# 2008 by Joe Rattz and yesterday I came across a claim about Enumerable.Intersect which didn't quite ring true. I consulted MSDN and the documentation is exactly the same as the book. Here's what it says:
When the object returned by this method is enumerated, Intersect enumerates first, collecting all distinct elements of that sequence. It then enumerates second, marking those elements that occur in both sequences. Finally, the marked elements are yielded in the order in which they were collected.
(first is the first parameter, the one which the method appears to be called on when using it as an extension method. second is the second parameter - the other sequence involved.)
This seems to be needlessly restrictive. In particular, it doesn't allow you to work with an infinite sequence on either side. It also means loading the whole of both sequences into memory at the same time. Given the way that Join works, I was surprised to see this. So I thought I'd test it. This raised the question of how you trace the flow of a sequence - how do you know when data is being pulled from it? The obvious answer is to create a new sequence which fetches from the old one, logging as it goes. Fortunately this is really easy to implement:
using System;
using System.Collections.Generic;
public static class Extensions
{
public static IEnumerable<T> WithLogging<T>(this IEnumerable<T> source,
string name)
{
foreach (T element in source)
{
Console.WriteLine("{0}: {1}", name, element);
yield return element;
}
}
}
We keep a name for the sequence so we can easily trace which sequence is being pulled from at what point. Now let's apply this logging to a call to Intersect:
using System;
using System.Linq;
using System.Collections.Generic;
class Test
{
static void Main()
{
var first = Enumerable.Range(1, 5).WithLogging("first");
var second = Enumerable.Range(3, 5).WithLogging("second");
foreach (int i in first.Intersect(second))
{
Console.WriteLine("Intersect: {0}", i);
}
}
}
As you can see, we're intersecting the numbers 1-5 with the numbers 3-7 - the intersection should clearly be 3-5. We'll see a line of output each time data is pulled from either first or second, and also when the result of Intersect yields a value. Given the documentation and the book, one would expect to see this output:
// Theoretical output. It doesn't really do this
first: 1
first: 2
first: 3
first: 4
first: 5
second: 3
second: 4
second: 5
second: 6
second: 7
Intersect: 3
Intersect: 4
Intersect: 5
Fortunately, it actually works exactly how I'd expect: the second sequence is evaluated fully, then the first is evaluated in a streaming fashion, with results being yielded as they're found. (This means that, if you're sufficiently careful with the result, e.g. by calling Take with a suitably small value, the first sequence can be infinite.) Here's the actual output demonstrating that:
// Actual output.
second: 3
second: 4
second: 5
second: 6
second: 7
first: 1
first: 2
first: 3
Intersect: 3
first: 4
Intersect: 4
first: 5
Intersect: 5
Initial Conclusion
There are two interesting points here, to my mind. The first is demonstrating that the documentation for Intersect is wrong - the real code is more sensible than the docs. That's not as important as seeing how easy it is to log the flow of sequence data - as simple as adding a single extension method and calling it. (You could do it with a Select projection which writes the data and then yields the value of course, but I think this is neater.)
I'm hoping to finish reading Joe's book this week, and write the review over the weekend, by the way.
Update (Sept. 11th 2008)
Frederik Siekmann replied to this post with a thrilling alternative implementation to stream the intersection, which takes alternating elements from the two sequences involved. It's a bit more memory hungry (with three sets of elements to remember instead of just one) but it means that we can deal with two infinite streams, if we're careful. Here's a complete example:
using System;
using System.Collections.Generic;
using System.Linq;
public static class Extensions
{
public static IEnumerable<T> AlternateIntersect<T>(this IEnumerable<T> first, IEnumerable<T> second)
{
var intersection = new HashSet<T>();
var firstSet = new HashSet<T>();
var secondSet = new HashSet<T>();
using (IEnumerator<T> firstEnumerator = first.GetEnumerator())
{
using (IEnumerator<T> secondEnumerator = second.GetEnumerator())
{
bool firstHasValues = firstEnumerator.MoveNext();
bool secondHasValues = secondEnumerator.MoveNext();
while (firstHasValues && secondHasValues)
{
T currentFirst = firstEnumerator.Current;
T currentSecond = secondEnumerator.Current;
if (!intersection.Contains(currentFirst) &&
secondSet.Contains(currentFirst))
{
intersection.Add(currentFirst);
yield return currentFirst;
}
firstSet.Add(currentFirst);
if (!intersection.Contains(currentSecond) &&
firstSet.Contains(currentSecond))
{
intersection.Add(currentSecond);
yield return currentSecond;
}
secondSet.Add(currentSecond);
firstHasValues = firstEnumerator.MoveNext();
secondHasValues = secondEnumerator.MoveNext();
}
if (firstHasValues)
{
do
{
T currentFirst = firstEnumerator.Current;
if (!intersection.Contains(currentFirst) &&
secondSet.Contains(currentFirst))
{
intersection.Add(currentFirst);
yield return currentFirst;
}
} while (firstEnumerator.MoveNext());
}
if (secondHasValues)
{
do
{
T currentSecond = secondEnumerator.Current;
if (!intersection.Contains(currentSecond) &&
firstSet.Contains(currentSecond))
{
intersection.Add(currentSecond);
yield return currentSecond;
}
} while (secondEnumerator.MoveNext());
}
}
}
}
public static IEnumerable<T> WithLogging<T>(this IEnumerable<T> source, string name)
{
foreach (T element in source)
{
Console.WriteLine(string.Format("{0}: {1}", name, element));
yield return element;
}
}
}
class Test
{
static void Main()
{
var positiveIntegers = Enumerable.Range(0, int.MaxValue);
var multiplesOfTwo = positiveIntegers.Where(x => (x%2) == 0)
.WithLogging("Twos");
var multiplesOfThree = positiveIntegers.Where(x => (x%3) == 0)
.WithLogging("Threes");
foreach (int x in multiplesOfTwo.AlternateIntersect(multiplesOfThree).Take(10))
{
Console.WriteLine ("AlternateIntersect: {0}", x);
}
}
}
Most of the code is the alternating intersection - the test at the end just shows intersection of the sequence 2, 4, 6, 8... with 3, 6, 9, 12... The output shows elements being taken from both sequences, and yielded when a match is found. It's important that we limit the output in some way - in the above code we call Take(10) but anything which prevents the loop from just executing until we run out of memory is fine.
Twos: 0
Threes: 0
AlternateIntersect: 0
Twos: 2
Threes: 3
Twos: 4
Threes: 6
Twos: 6
Threes: 9
AlternateIntersect: 6
Twos: 8
Threes: 12
Twos: 10
Threes: 15
Twos: 12
Threes: 18
AlternateIntersect: 12
Twos: 14
Threes: 21
Twos: 16
Threes: 24
Twos: 18
Threes: 27
AlternateIntersect: 18
(etc)
That's really neat. Quite how often it'll be useful is a different matter, but I find this kind of thing fascinating to consider. Thanks Frederik!