<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://msmvps.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>Jon Skeet: Coding Blog : Google, Benchmarking</title><link>http://msmvps.com/blogs/jon_skeet/archive/tags/Google/Benchmarking/default.aspx</link><description>Tags: Google, Benchmarking</description><dc:language>en</dc:language><generator>CommunityServer 2008.5 SP2 (Build: 40407.4157)</generator><item><title>There's a hole in my abstraction, dear Liza, dear Liza</title><link>http://msmvps.com/blogs/jon_skeet/archive/2010/07/29/there-s-a-hole-in-my-abstraction-dear-liza-dear-liza.aspx</link><pubDate>Thu, 29 Jul 2010 20:41:06 GMT</pubDate><guid isPermaLink="false">d67277c4-116b-43f1-b688-e9ef184ea916:1774950</guid><dc:creator>skeet</dc:creator><slash:comments>16</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://msmvps.com/blogs/jon_skeet/rsscomments.aspx?PostID=1774950</wfw:commentRss><wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://msmvps.com/blogs/jon_skeet/commentapi.aspx?PostID=1774950</wfw:comment><comments>http://msmvps.com/blogs/jon_skeet/archive/2010/07/29/there-s-a-hole-in-my-abstraction-dear-liza-dear-liza.aspx#comments</comments><description>&lt;p&gt;I had an interesting day at work today. I thought my code had broken... but it turns out it was just a strange corner case which made it work very slowly. Usually when something interesting happens in my code it&amp;#39;s quite hard to blog about it, because of all the confidentiality issues involved. In this case, it&amp;#39;s extremely easy to reproduce the oddity in an entirely vanilla manner. All we need is the Java collections API.&lt;/p&gt;  &lt;p&gt;I have a set - a &lt;code&gt;HashSet&lt;/code&gt;, in fact. I want to remove some items from it... and many of the items may well not exist. In fact, in our test case, &lt;em&gt;none&lt;/em&gt; of the items in the &amp;quot;removals&amp;quot; collection will be in the original set. This sounds - and indeed &lt;em&gt;is&lt;/em&gt; - extremely easy to code. After all, we&amp;#39;ve got &lt;a href="http://download.oracle.com/javase/6/docs/api/java/util/Set.html#removeAll(java.util.Collection)"&gt;&lt;code&gt;Set&amp;lt;T&amp;gt;.removeAll&lt;/code&gt;&lt;/a&gt; to help us, right?&lt;/p&gt;  &lt;p&gt;Let&amp;#39;s make this concrete, and look at a little test. We specify the size of the &amp;quot;source&amp;quot; set and the size of the &amp;quot;removals&amp;quot; collection on the command line, and build both of them. The source set contains only non-negative integers; the removals set contains only negative integers. We measure how long it takes to remove all the elements using &lt;code&gt;System.currentTimeMillis()&lt;/code&gt;, which isn&amp;#39;t the world most accurate stopwatch but is more than adequate in this case, as you&amp;#39;ll see. Here&amp;#39;s the code:&lt;/p&gt;  &lt;div class="code"&gt;&lt;span class="Namespace"&gt;import&lt;/span&gt; java.util.*;     &lt;br /&gt;    &lt;br /&gt;&lt;span class="Modifier"&gt;public&lt;/span&gt;&amp;#160;&lt;span class="Type"&gt;class&lt;/span&gt; Test     &lt;br /&gt;{     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160; &lt;span class="Modifier"&gt;public&lt;/span&gt;&amp;#160;&lt;span class="Modifier"&gt;static&lt;/span&gt;&amp;#160;&lt;span class="Type"&gt;void&lt;/span&gt; main(String[] args)     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160; {     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; &lt;span class="Type"&gt;int&lt;/span&gt; sourceSize = Integer.parseInt(args[0]);     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; &lt;span class="Type"&gt;int&lt;/span&gt; removalsSize = Integer.parseInt(args[1]);     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; Set&amp;lt;Integer&amp;gt; source = &lt;span class="Keyword"&gt;new&lt;/span&gt; HashSet&amp;lt;Integer&amp;gt;();     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; Collection&amp;lt;Integer&amp;gt; removals = &lt;span class="Keyword"&gt;new&lt;/span&gt; ArrayList&amp;lt;Integer&amp;gt;();     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; &lt;span class="Statement"&gt;for&lt;/span&gt; (&lt;span class="Type"&gt;int&lt;/span&gt; i = 0; i &amp;lt; sourceSize; i++)     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; {     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; source.add(i);     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; }     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; &lt;span class="Statement"&gt;for&lt;/span&gt; (&lt;span class="Type"&gt;int&lt;/span&gt; i = 1; i &amp;lt;= removalsSize; i++)     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; {     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; removals.add(-i);     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; }     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; &lt;span class="Type"&gt;long&lt;/span&gt; start = System.currentTimeMillis();     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; source.removeAll(removals);     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; &lt;span class="Type"&gt;long&lt;/span&gt; end = System.currentTimeMillis();     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; System.out.println(&lt;span class="String"&gt;&amp;quot;Time taken: &amp;quot;&lt;/span&gt; + (end - start) + &lt;span class="String"&gt;&amp;quot;ms&amp;quot;&lt;/span&gt;);     &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160; }     &lt;br /&gt;} &lt;/div&gt;  &lt;p&gt;Let&amp;#39;s start off by giving it an easy job: a source set of 100 items, and 100 to remove: &lt;/p&gt;  &lt;div class="output"&gt;c:\Users\Jon\Test&amp;gt;java Test 100 100&lt;br /&gt;Time taken: 1ms &lt;/div&gt;  &lt;p&gt;Okay, so we hadn&amp;#39;t expected it to be slow... clearly we can ramp things up a bit. How about a source of one &lt;i&gt;million&lt;/i&gt; items&lt;sup&gt;1&lt;/sup&gt; and 300,000 items to remove?&lt;/p&gt;  &lt;div class="output"&gt;c:\Users\Jon\Test&amp;gt;java Test 1000000 300000&lt;br /&gt;Time taken: 38ms &lt;/div&gt;  &lt;p&gt;Hmm. That still seems pretty speedy. Now I feel I&amp;#39;ve been a little bit cruel, asking it to do all that removing. Let&amp;#39;s make it a bit easier - 300,000 source items and 300,000 removals:&lt;/p&gt;  &lt;div class="output"&gt;c:\Users\Jon\Test&amp;gt;java Test 300000 300000&lt;br /&gt;Time taken: 178131ms &lt;/div&gt;  &lt;p&gt;Excuse me? Nearly three &lt;i&gt;minutes&lt;/i&gt;? Yikes! Surely it ought to be easier to remove items from a &lt;i&gt;smaller&lt;/i&gt; collection than the one we managed in 38ms? Well, it does all make sense, eventually. &lt;code&gt;HashSet&amp;lt;T&amp;gt;&lt;/code&gt; extends &lt;code&gt;AbstractSet&amp;lt;T&amp;gt;&lt;/code&gt;, which includes this snippet in its &lt;a href="http://download.oracle.com/javase/6/docs/api/java/util/AbstractSet.html#removeAll(java.util.Collection)"&gt;documentation for the &lt;code&gt;removeAll&lt;/code&gt; method&lt;/a&gt;:&lt;/p&gt;  &lt;blockquote&gt;   &lt;p&gt;This implementation determines which is the smaller of this set and the specified collection, by invoking the &lt;tt&gt;size&lt;/tt&gt; method on each. If this set has fewer elements, then the implementation iterates over this set, checking each element returned by the iterator in turn to see if it is contained in the specified collection. If it is so contained, it is removed from this set with the iterator&amp;#39;s &lt;tt&gt;remove&lt;/tt&gt; method. If the specified collection has fewer elements, then the implementation iterates over the specified collection, removing from this set each element returned by the iterator, using this set&amp;#39;s &lt;tt&gt;remove&lt;/tt&gt; method.&lt;/p&gt; &lt;/blockquote&gt;  &lt;p&gt;Now that &lt;em&gt;sounds&lt;/em&gt; reasonable on the surface of it - iterate through the smaller collection, check for the presence in the bigger collection. However, this is where the abstraction is leaky. Just because we &lt;em&gt;can&lt;/em&gt; ask for the presence of an item in a large collection doesn&amp;#39;t mean it&amp;#39;s going to be fast. In our case, the collections are the same size - but checking for the presence of an item in the &lt;code&gt;HashSet&lt;/code&gt; is O(1) whereas checking in the &lt;code&gt;ArrayList&lt;/code&gt; is O(N)... whereas the cost of &lt;em&gt;iterating&lt;/em&gt; is going to be the same for each collection. Basically by choosing to iterate over the HashSet and check for presence in the &lt;code&gt;ArrayList&lt;/code&gt;, we&amp;#39;ve got an O(M * N) solution overall instead of an O(N) solution. Ouch. The &lt;code&gt;removeAll&lt;/code&gt; method is making an &amp;quot;optimization&amp;quot; based on assumptions which just aren&amp;#39;t valid in this case.&lt;/p&gt;  &lt;h2&gt;Then fix it, dear Henry, dear Henry, dear Henry&lt;/h2&gt;  &lt;p&gt;There are two simple ways of fixing the problem. The first is to simply change the type of the collection we&amp;#39;re removing from. Simply changing &lt;code&gt;ArrayList&amp;lt;Integer&amp;gt;&lt;/code&gt; to &lt;code&gt;HashSet&amp;lt;Integer&amp;gt;&lt;/code&gt; gets us back down to the 34ms range. We don&amp;#39;t even need to change the declared type of &lt;code&gt;removals&lt;/code&gt;.&lt;/p&gt;  &lt;p&gt;The second approach is to change the API we use: if we &lt;i&gt;know&lt;/i&gt; we want to iterate over &lt;code&gt;removals&lt;/code&gt; and perform the lookup in &lt;code&gt;source&lt;/code&gt;, that&amp;#39;s easy to do:&lt;/p&gt;  &lt;div class="code"&gt;&lt;span class="Statement"&gt;for&lt;/span&gt; (Integer value : removals)    &lt;br /&gt;{    &lt;br /&gt;&amp;#160;&amp;#160;&amp;#160; source.remove(value);    &lt;br /&gt;} &lt;/div&gt;  &lt;p&gt;In fact, on my machine that performs &lt;i&gt;slightly better&lt;/i&gt; than &lt;code&gt;removeAll&lt;/code&gt; - it doesn&amp;#39;t need to check the return value of &lt;code&gt;remove&lt;/code&gt; on each iteration, which &lt;code&gt;removeAll&lt;/code&gt; does in order to return whether or not &lt;i&gt;any&lt;/i&gt; items were removed. The above runs in about 28ms. (I&amp;#39;ve tested it with rather larger datasets, and it really &lt;em&gt;is&lt;/em&gt; faster than the dual-hash-set approach.)&lt;/p&gt;  &lt;p&gt;However, &lt;em&gt;both&lt;/em&gt; of these approaches require comments in the source code to explain why we&amp;#39;re not using the most obvious code (a list and &lt;code&gt;removeAll&lt;/code&gt;). I can&amp;#39;t complain about the documentation here - it says exactly what it will do. It&amp;#39;s just not obvious that you need to worry about it, until you run into such a problem.&lt;/p&gt;  &lt;p&gt;So what &lt;em&gt;should&lt;/em&gt; the implementation do? Arguably, it really needs to know what&amp;#39;s cheap in each of the collections it&amp;#39;s dealing with. The idea of probing for performance characteristics before you decide on a strategy is completely anathema to clean abstraction we like to consider with frameworks like Java collections... but maybe in this case it would be a good idea.&lt;/p&gt;  &lt;hr /&gt;  &lt;p&gt;&lt;sup&gt;1&lt;/sup&gt; Please perform Dr Evil impression while reading this. I&amp;#39;m watching you through your webcam, and I&amp;#39;ll be disappointed if I don&amp;#39;t see you put your little finger to your mouth.&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://msmvps.com/aggbug.aspx?PostID=1774950" width="1" height="1"&gt;</description><category domain="http://msmvps.com/blogs/jon_skeet/archive/tags/Java/default.aspx">Java</category><category domain="http://msmvps.com/blogs/jon_skeet/archive/tags/Google/default.aspx">Google</category><category domain="http://msmvps.com/blogs/jon_skeet/archive/tags/Benchmarking/default.aspx">Benchmarking</category><category domain="http://msmvps.com/blogs/jon_skeet/archive/tags/Evil+Code/default.aspx">Evil Code</category></item></channel></rss>