<?xml version="1.0" encoding="UTF-8" ?>
<?xml-stylesheet type="text/xsl" href="http://msmvps.com/utility/FeedStylesheets/rss.xsl" media="screen"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:wfw="http://wellformedweb.org/CommentAPI/"><channel><title>the blog =&gt; anything goes : Data Mining</title><link>http://msmvps.com/blogs/senthil/archive/tags/Data+Mining/default.aspx</link><description>Tags: Data Mining</description><dc:language>en</dc:language><generator>CommunityServer 2008.5 SP2 (Build: 40407.4157)</generator><item><title>Yowza! My computer is seeing patterns!</title><link>http://msmvps.com/blogs/senthil/archive/2009/04/09/yowza-my-computer-is-seeing-patterns.aspx</link><pubDate>Thu, 09 Apr 2009 20:08:24 GMT</pubDate><guid isPermaLink="false">d67277c4-116b-43f1-b688-e9ef184ea916:1686343</guid><dc:creator>Senthil</dc:creator><slash:comments>3</slash:comments><wfw:commentRss xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://msmvps.com/blogs/senthil/rsscomments.aspx?PostID=1686343</wfw:commentRss><wfw:comment xmlns:wfw="http://wellformedweb.org/CommentAPI/">http://msmvps.com/blogs/senthil/commentapi.aspx?PostID=1686343</wfw:comment><comments>http://msmvps.com/blogs/senthil/archive/2009/04/09/yowza-my-computer-is-seeing-patterns.aspx#comments</comments><description>&lt;p&gt;Ever since I happened to stumble upon &lt;a href="http://www-users.cs.umn.edu/~kumar/dmbook/index.php" target="_blank"&gt;this book&lt;/a&gt; on Data Mining, I&amp;#39;ve been hooked. So much so that I&amp;#39;ve been brushing up on statistics and probability distributions just to follow along the book. &lt;/p&gt; &lt;p&gt;I&amp;#39;m currently reading the chapter on &lt;a href="http://databases.about.com/od/datamining/g/classification.htm" target="_blank"&gt;classification&lt;/a&gt; using &lt;a href="http://en.wikipedia.org/wiki/Decision_Trees" target="_blank"&gt;decision trees&lt;/a&gt;, and what appeals to me is the how the technique reduces huge sets of data (records) into simple models that can both describe the data and can predict the class of previously unseen records. If all that goes right above your head, here&amp;#39;s a simple example.&lt;/p&gt; &lt;blockquote&gt; &lt;table cellspacing="10"&gt;  &lt;tr&gt; &lt;td&gt;&lt;strong&gt;P1&lt;/strong&gt;&lt;/td&gt; &lt;td&gt;&lt;strong&gt;P2&lt;/strong&gt;&lt;/td&gt; &lt;td&gt;&lt;strong&gt;Class (Result)&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;True&lt;/td&gt; &lt;td&gt;True&lt;/td&gt; &lt;td&gt;True&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;False&lt;/td&gt; &lt;td&gt;False&lt;/td&gt; &lt;td&gt;False&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;True&lt;/td&gt; &lt;td&gt;False&lt;/td&gt; &lt;td&gt;False&lt;/td&gt;&lt;/tr&gt; &lt;tr&gt; &lt;td&gt;False&lt;/td&gt; &lt;td&gt;True&lt;/td&gt; &lt;td&gt;False&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt; &lt;p&gt;Given the above tabular values, can the computer figure out that it is the truth table for P1 ^ P2 (with ^ representing &lt;a href="http://en.wikipedia.org/wiki/Logical_conjunction" target="_blank"&gt;logical conjunction&lt;/a&gt;, not XOR) ? If it can, it has &lt;/p&gt; &lt;p&gt;1. A model that describes the data, i.e., the data is the conjunction of the two boolean variables.&lt;/p&gt; &lt;p&gt;2. An evaluator that can calculate the result, given P1 and P2, i.e., predict the class. Since all combination of values are already available, there are no unseen records in this case.&lt;/p&gt;&lt;/blockquote&gt; &lt;p&gt;A decision tree for the above table looks like this visually :-&lt;/p&gt; &lt;p&gt;&lt;a href="http://msmvps.com/cfs-file.ashx/__key/CommunityServer.Blogs.Components.WeblogFiles/senthil/image_5F00_4.png"&gt;&lt;img height="202" alt="image" src="http://msmvps.com/cfs-file.ashx/__key/CommunityServer.Blogs.Components.WeblogFiles/senthil/image_5F00_thumb.png" width="213" border="0" /&gt;&lt;/a&gt; &lt;/p&gt; &lt;p&gt;with green edges and red edges representing true and false values for the attribute represented by the node, respectively. As you can see, the non-leaf nodes each represent an attribute, with one edge for each possible value of that attribute. The leaf nodes represent the result (class). Since we are dealing with binary attributes, the decision tree turns out to be a binary tree. Note that you can find out the result of { P1 = True, P2 = True } by simply navigating the tree and reading the value of the leaf node ( green edge from P1 to P2, green edge from P2 to T, and then read the value, T = true).&lt;/p&gt; &lt;p&gt;This was so interesting that I actually wrote a generic decision tree builder in C# (&lt;a href="http://cid-1db38b528599dd02.skydrive.live.com/self.aspx/.Public" target="_blank"&gt;download&lt;/a&gt;). It uses &lt;a href="http://www.google.co.in/search?q=Hunt%27s+Algorithm&amp;amp;ie=utf-8&amp;amp;oe=utf-8&amp;amp;aq=t&amp;amp;rls=org.mozilla:en-US:official&amp;amp;client=firefox-a" target="_blank"&gt;Hunt&amp;#39;s Algorithm&lt;/a&gt; to build the decision tree and uses Entropy as the measure to select attributes. At the moment, it can work only on records with nominal attributes (attributes like enums whose values are limited to a finite set). After training, it returns a decision tree, which can then be exported to XML or can be used to predict classes for new records. &lt;/p&gt; &lt;p&gt;Enough of the theory, let&amp;#39;s see it in action. Here&amp;#39;s how code that uses my tree builder looks.&lt;/p&gt;&lt;pre class="code"&gt;&lt;span style="color:blue;"&gt;class &lt;/span&gt;&lt;span style="color:#2b91af;"&gt;TruthTableEntry &lt;/span&gt;: &lt;span style="color:#2b91af;"&gt;Record&lt;/span&gt;&amp;lt;&lt;span style="color:blue;"&gt;bool&lt;/span&gt;&amp;gt;
{
    &lt;span style="color:blue;"&gt;public bool &lt;/span&gt;P1 { &lt;span style="color:blue;"&gt;get&lt;/span&gt;; &lt;span style="color:blue;"&gt;set&lt;/span&gt;; }
    &lt;span style="color:blue;"&gt;public bool &lt;/span&gt;P2 { &lt;span style="color:blue;"&gt;get&lt;/span&gt;; &lt;span style="color:blue;"&gt;set&lt;/span&gt;; }

    &lt;span style="color:blue;"&gt;public override bool &lt;/span&gt;Equals(&lt;span style="color:blue;"&gt;object &lt;/span&gt;obj)
    {
        &lt;span style="color:blue;"&gt;var &lt;/span&gt;other = obj &lt;span style="color:blue;"&gt;as &lt;/span&gt;&lt;span style="color:#2b91af;"&gt;TruthTableEntry&lt;/span&gt;;
        &lt;span style="color:blue;"&gt;if &lt;/span&gt;(other == &lt;span style="color:blue;"&gt;null&lt;/span&gt;)
            &lt;span style="color:blue;"&gt;return false&lt;/span&gt;;

        &lt;span style="color:blue;"&gt;return &lt;/span&gt;other.P1 == &lt;span style="color:blue;"&gt;this&lt;/span&gt;.P1 &amp;amp;&amp;amp; other.P2 == &lt;span style="color:blue;"&gt;this&lt;/span&gt;.P2;
    }

    &lt;span style="color:blue;"&gt;public override int &lt;/span&gt;GetHashCode()
    {
        &lt;span style="color:blue;"&gt;return &lt;/span&gt;P1.GetHashCode() ^ P2.GetHashCode();
    }
}&lt;/pre&gt;&lt;pre class="code"&gt;&amp;nbsp;&lt;/pre&gt;&lt;pre class="code"&gt;&lt;span style="color:blue;"&gt;static void &lt;/span&gt;Main(&lt;span style="color:blue;"&gt;string&lt;/span&gt;[] args)
{
&lt;span style="color:#2b91af;"&gt;   TruthTableEntry&lt;/span&gt;[] table = 
   {
       &lt;span style="color:blue;"&gt;new &lt;/span&gt;&lt;span style="color:#2b91af;"&gt;TruthTableEntry&lt;/span&gt;() { P1 = &lt;span style="color:blue;"&gt;true&lt;/span&gt;, P2 = &lt;span style="color:blue;"&gt;true&lt;/span&gt;, Class = &lt;span style="color:blue;"&gt;true &lt;/span&gt;},
       &lt;span style="color:blue;"&gt;new &lt;/span&gt;&lt;span style="color:#2b91af;"&gt;TruthTableEntry&lt;/span&gt;() { P1 = &lt;span style="color:blue;"&gt;true&lt;/span&gt;, P2 = &lt;span style="color:blue;"&gt;false&lt;/span&gt;, Class = &lt;span style="color:blue;"&gt;false&lt;/span&gt;},
       &lt;span style="color:blue;"&gt;new &lt;/span&gt;&lt;span style="color:#2b91af;"&gt;TruthTableEntry&lt;/span&gt;() { P1 = &lt;span style="color:blue;"&gt;false&lt;/span&gt;, P2 = &lt;span style="color:blue;"&gt;true&lt;/span&gt;, Class = &lt;span style="color:blue;"&gt;false&lt;/span&gt;},
       &lt;span style="color:blue;"&gt;new &lt;/span&gt;&lt;span style="color:#2b91af;"&gt;TruthTableEntry&lt;/span&gt;() { P1 = &lt;span style="color:blue;"&gt;false&lt;/span&gt;, P2 = &lt;span style="color:blue;"&gt;false&lt;/span&gt;, Class = &lt;span style="color:blue;"&gt;false &lt;/span&gt;}
   };

   &lt;span style="color:blue;"&gt;var &lt;/span&gt;tree = &lt;span style="color:blue;"&gt;new &lt;/span&gt;&lt;span style="color:#2b91af;"&gt;TreeBuilder&lt;/span&gt;&amp;lt;&lt;span style="color:#2b91af;"&gt;TruthTableEntry&lt;/span&gt;, &lt;span style="color:blue;"&gt;bool&lt;/span&gt;&amp;gt;().Train(table);
   &lt;span style="color:blue;"&gt;string &lt;/span&gt;predicate = FormPredicate(tree.RootNode, &lt;span style="color:#a31515;"&gt;&amp;quot;&amp;quot;&lt;/span&gt;);
   &lt;span style="color:#2b91af;"&gt;Console&lt;/span&gt;.WriteLine(predicate);
}&lt;/pre&gt;
&lt;p&gt;&lt;/p&gt;&lt;a href="http://11011.net/software/vspaste"&gt;&lt;/a&gt;&lt;a href="http://11011.net/software/vspaste"&gt;&lt;/a&gt;
&lt;p&gt;The TruthTableEntry class represents a row in the truth table above, and an array of those records are given to the TreeBuilder instance&amp;#39;s Train method. The decision tree returned by it is then translated into a predicate by the FormPredicate method. Here&amp;#39;s how the tree looks in XML if you call Save on the tree.&lt;/p&gt;&lt;pre class="code"&gt;&lt;span style="color:blue;"&gt;&amp;lt;?&lt;/span&gt;&lt;span style="color:#a31515;"&gt;xml &lt;/span&gt;&lt;span style="color:red;"&gt;version&lt;/span&gt;&lt;span style="color:blue;"&gt;=&lt;/span&gt;&amp;quot;&lt;span style="color:blue;"&gt;1.0&lt;/span&gt;&amp;quot; &lt;span style="color:red;"&gt;encoding&lt;/span&gt;&lt;span style="color:blue;"&gt;=&lt;/span&gt;&amp;quot;&lt;span style="color:blue;"&gt;utf-8&lt;/span&gt;&amp;quot;&lt;span style="color:blue;"&gt;?&amp;gt;
&amp;lt;&lt;/span&gt;&lt;span style="color:#a31515;"&gt;Node &lt;/span&gt;&lt;span style="color:red;"&gt;ParentConditionValue&lt;/span&gt;&lt;span style="color:blue;"&gt;=&lt;/span&gt;&amp;quot;&amp;quot; &lt;span style="color:red;"&gt;SplitAttribute&lt;/span&gt;&lt;span style="color:blue;"&gt;=&lt;/span&gt;&amp;quot;&lt;span style="color:blue;"&gt;P1&lt;/span&gt;&amp;quot;&lt;span style="color:blue;"&gt;&amp;gt;
  &amp;lt;&lt;/span&gt;&lt;span style="color:#a31515;"&gt;Node &lt;/span&gt;&lt;span style="color:red;"&gt;ParentConditionValue&lt;/span&gt;&lt;span style="color:blue;"&gt;=&lt;/span&gt;&amp;quot;&lt;span style="color:blue;"&gt;True&lt;/span&gt;&amp;quot; &lt;span style="color:red;"&gt;SplitAttribute&lt;/span&gt;&lt;span style="color:blue;"&gt;=&lt;/span&gt;&amp;quot;&lt;span style="color:blue;"&gt;P2&lt;/span&gt;&amp;quot;&lt;span style="color:blue;"&gt;&amp;gt;
    &amp;lt;&lt;/span&gt;&lt;span style="color:#a31515;"&gt;Node &lt;/span&gt;&lt;span style="color:red;"&gt;ParentConditionValue&lt;/span&gt;&lt;span style="color:blue;"&gt;=&lt;/span&gt;&amp;quot;&lt;span style="color:blue;"&gt;True&lt;/span&gt;&amp;quot; &lt;span style="color:red;"&gt;Class&lt;/span&gt;&lt;span style="color:blue;"&gt;=&lt;/span&gt;&amp;quot;&lt;span style="color:blue;"&gt;True&lt;/span&gt;&amp;quot; &lt;span style="color:blue;"&gt;/&amp;gt;
    &amp;lt;&lt;/span&gt;&lt;span style="color:#a31515;"&gt;Node &lt;/span&gt;&lt;span style="color:red;"&gt;ParentConditionValue&lt;/span&gt;&lt;span style="color:blue;"&gt;=&lt;/span&gt;&amp;quot;&lt;span style="color:blue;"&gt;False&lt;/span&gt;&amp;quot; &lt;span style="color:red;"&gt;Class&lt;/span&gt;&lt;span style="color:blue;"&gt;=&lt;/span&gt;&amp;quot;&lt;span style="color:blue;"&gt;False&lt;/span&gt;&amp;quot; &lt;span style="color:blue;"&gt;/&amp;gt;
  &amp;lt;/&lt;/span&gt;&lt;span style="color:#a31515;"&gt;Node&lt;/span&gt;&lt;span style="color:blue;"&gt;&amp;gt;
  &amp;lt;&lt;/span&gt;&lt;span style="color:#a31515;"&gt;Node &lt;/span&gt;&lt;span style="color:red;"&gt;ParentConditionValue&lt;/span&gt;&lt;span style="color:blue;"&gt;=&lt;/span&gt;&amp;quot;&lt;span style="color:blue;"&gt;False&lt;/span&gt;&amp;quot; &lt;span style="color:red;"&gt;Class&lt;/span&gt;&lt;span style="color:blue;"&gt;=&lt;/span&gt;&amp;quot;&lt;span style="color:blue;"&gt;False&lt;/span&gt;&amp;quot; &lt;span style="color:blue;"&gt;/&amp;gt;
&amp;lt;/&lt;/span&gt;&lt;span style="color:#a31515;"&gt;Node&lt;/span&gt;&lt;span style="color:blue;"&gt;&amp;gt;&lt;/span&gt;&lt;/pre&gt;
&lt;p&gt;which exactly matches what we expected.&lt;/p&gt;
&lt;p&gt;&lt;span style="color:blue;"&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span style="color:blue;"&gt;&lt;/span&gt;&lt;/p&gt;&lt;pre class="code"&gt;&lt;a href="http://msmvps.com/cfs-file.ashx/__key/CommunityServer.Blogs.Components.WeblogFiles/senthil/image_5F00_10.png"&gt;&lt;img height="202" alt="image" src="http://msmvps.com/cfs-file.ashx/__key/CommunityServer.Blogs.Components.WeblogFiles/senthil/image_5F00_thumb_5F00_1.png" width="213" border="0" /&gt;&lt;/a&gt; &lt;/pre&gt;
&lt;p&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;The FormPredicate method attempts to transform the tree into a predicate clause, as expressed mathematically in predicate logic. For the above tree, it returns&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;font face="Courier New"&gt;(P1 ^ P2) v (~P1 ^ False)&lt;/font&gt;&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;which when reduced through the laws of predicate logic&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;font face="Courier New"&gt;(P1 ^ P2) v (~P1 ^ False) &lt;/font&gt;&lt;/p&gt;
&lt;p&gt;&lt;font face="Courier New"&gt;= &lt;font face="Courier New"&gt;(P1 ^ P2) v False (since False ^ P == False)&lt;/font&gt;&lt;/font&gt;&lt;/p&gt;
&lt;p&gt;&lt;font face="Courier New"&gt;= (P1 ^ P2)&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; (since False v P == P)&lt;/font&gt;&lt;/p&gt;&lt;/blockquote&gt;
&lt;p&gt;And there you have it, the computer figured it out correctly. Yowza :)&lt;/p&gt;&lt;div style="clear:both;"&gt;&lt;/div&gt;&lt;img src="http://msmvps.com/aggbug.aspx?PostID=1686343" width="1" height="1"&gt;</description><category domain="http://msmvps.com/blogs/senthil/archive/tags/software+tools/default.aspx">software tools</category><category domain="http://msmvps.com/blogs/senthil/archive/tags/software/default.aspx">software</category><category domain="http://msmvps.com/blogs/senthil/archive/tags/C_2300_+3.0/default.aspx">C# 3.0</category><category domain="http://msmvps.com/blogs/senthil/archive/tags/Data+Mining/default.aspx">Data Mining</category></item></channel></rss>