Reimplementing LINQ to Objects: Part 24 - ToArray
This really is a simple one. So simple I might even find the energy to implement ToDictionary as well tonight... we'll see. (EDIT: Oops. It became slightly less simple in the end, as I came up with the third implementation. Oh well.)
What is it?
ToArray is basically the equivalent of ToList, but producing an array instead of a list. It has a single signature:
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
Just to recap:
- It's another extension method, which is useful when we want to use type inference.
- It uses immediate execution - nothing is deferred, and the entire sequence is read before the method completes.
- It performs simple argument validation: source can't be null
- It creates an independent but shallow copy of the collection (so any changes to the objects referenced by source will be visible in the result, but not changes to source itself - and vice versa)
- It's optimized for ICollection<T>, although in a slightly different way to ToList.
What are we going to test?
Exactly the same as we did for ToList, basically - with one bit of care required. In our test for the source and result being independent of each other, we can't just create a new variable of type List<T> and call ToArray on it - because that would call the implementation in List<T> itself. I reckoned the easiest way of making this clear was to call the method directly as a normal static method, instead of using it as an extension method:
[Test]
public void ResultIsIndependentOfSource()
{
List<string> source = new List<string> { "xyz", "abc" };
string[] result = Enumerable.ToArray(source);
result.AssertSequenceEqual("xyz", "abc");
source[0] = "xxx";
Assert.AreEqual("xyz", result[0]);
result[1] = "yyy";
Assert.AreEqual("abc", source[1]);
}
One other interesting facet of the testing is that we can only partially test the optimization for ICollection<T>. We can make sure that it's optimized as far as not using the iterator (as per the previous test), but there's more optimization coming, and I haven't worked out how to test for that yet. You'll see what I mean when we come to implement it. Speaking of which...
Let's implement it!
We can make all our tests pass with a cheat: whatever the input type, convert it to a List<T> and then call ToArray on the result. Heck, if we call ToList to do the initial conversion, that will even do the argument validation for us and use the right parameter name in the exception:
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
return source.ToList().ToArray();
}
Now I'm not averse to the general approach taken here - but there's actually a bit more optimization we can do.
Remember how ICollection<T> exposes Count and CopyTo, which the List<T> constructor uses when the input implements ICollection<T>? Well, that means that building a List<T> is relatively cheap for a collection - but calling ToArray on the list will still mean copying all the data out again (as List<T>.ToArray can't just return its own internal array - it has to create a copy). We can use exactly the same members ourselves, and avoid one level of copying:
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
if (source == null)
{
throw new ArgumentNullException("source");
}
ICollection<TSource> collection = source as ICollection<TSource>;
if (collection != null)
{
TSource[] ret = new TSource[collection.Count];
collection.CopyTo(ret, 0);
return ret;
}
return new List<TSource>(source).ToArray();
}
That's pretty good now - except it still involves a copy from the List<T> into a new array every time. That's almost always appropriate, in fact... because unless the resulting list happened to expand its array to exactly the right size, we'd need to make a copy anyway. After all, we can't return an array that's too big. However, we can optimize for the "just the right size" case if we basically implement List<T>'s array expansion ourselves, leading to this:
public static TSource[] ToArray<TSource>(this IEnumerable<TSource> source)
{
if (source == null)
{
throw new ArgumentNullException("source");
}
ICollection<TSource> collection = source as ICollection<TSource>;
if (collection != null)
{
TSource[] tmp = new TSource[collection.Count];
collection.CopyTo(tmp, 0);
return tmp;
}
TSource[] ret = new TSource[16];
int count = 0;
foreach (TSource item in source)
{
if (count == ret.Length)
{
Array.Resize(ref ret, ret.Length * 2);
}
ret[count++] = item;
}
if (count != ret.Length)
{
Array.Resize(ref ret, count);
}
return ret;
}
Is this level of optimization worth it? Probably not. I picked the starting size of 16 out of thin air (or dimly recalled initial counts for some collection or other - possibly Java's ArrayList<T>). Maybe we should triple the capacity rather than double it, just for laughs. It's all guesswork, really. The middle implementation feels like a more appropriate one to me, but with an obvious optimization just itching to be implemented, I thought I might as well provide the code. It's reasonably obviously correct, but it's just a bit longwinded for the marginal benefit over our second attempt.
It's even more annoying that I can't think of a way to test this easily - I could benchmark it of course, but that's not the same as unit testing it... I can't easily prove that we're optimizing either the ICollection<T> or "correct guess at size" cases.
Conclusion
It's always interesting to see what else springs to mind when I'm writing up the operator as opposed to just implementing it in Visual Studio. I'd got as far as the second implementation but not the third when I started this post.
It's possible that in the end, the 4-overload ToDictionary operator will actually end up being simpler than ToArray. Who'd have thought?