Tuesday, April 17, 2012

Postpruning - First Whack


Zoo dataset A-B, before and after post-pruning.
It's funny I'm working on pruning for decision trees while earlier I was out in the yard pruning real trees.

It would be nice to eliminate the construction of these redundant nodes in the first place, but that seems to mean comparing the counts of the dataset and sub-datasets to see if the test is redundant which in Linq could be expensive.

It's a good thing I implemented edges because in previous versions of this function I was trying to remove and add nodes which threw an exception because you can't modify collections while your iterating over them.  I got around this by filling a list of replacement structures which was iterated over after the main routine was finished.  What was really a deal breaker is that this changed the order of the branches.  I finally realized I don't need to modify any collections I can just modify the nodes and edges.

Yay edges!

        private DtBranch Postprune(DtBranch branch)
        {
            DtBranch bottom = branch;
            //
            if (branch.HasAllLeafChildren)
                return branch;
            else if (branch.HasSingleChild)
            {
                DtBranch child = branch.Children.Single() as DtBranch;
                bottom = Postprune(child);
                bottom.Edge = branch.Edge;
                bottom.Edge.To = bottom;
            }
            else
                foreach (DtNode child in branch.Children)
                    if(child.Kind == DtElementKind.Branch)
                        Postprune(child as DtBranch);
            //
            return bottom;
        }

Saturday, April 14, 2012

Decision Tree Workshop up on Google Code


Decision Tree Workshop is now available at http://code.google.com/p/decision-tree-workshop/.

It's a demo translated from python to C# based upon the article Building Decision Trees in Python
 by Christopher Roach.  I highly recommend reading it if you are unfamiliar with decision trees.

I decided to use Linq for the ID3 code because it's basically a bunch of database operations, and it makes for more understandable code in my opinion.  Besides it's just a demo, I'm not worried about speed at this point.

If the code seems strange it's because I'd like to experiment building the tree in parallel in the future.  So that is why I'm implementing worker classes instead of just using recursion.  Ideally these workers would be able to farm out some work to the GPU.  Is there Linq to GPU?  Hmmm, it appears so.  I think I will look at this Brahma thing.

I've driven myself nuts trying to validate it.  When you  translate a program you don't even really understand, things tend to get lost in translation and then you have to go back and compare the two sets of code over and over until things work.  However, it's that same process that helps to actually understand the algorithms underlying the code.  Well, I'm happy to say I'm satisfied enough with it now or I wouldn't even be writing this post.

Figure 1.  The Purchase dataset, target attribute = purchase?
Figure 1 is an example of a good tree, ideally you want the tree to be pyramid shaped which means that at run-time it requires the least amount of tests to get from the root to a result.

Figure 2.  The Iris dataset, target = class.
Figure 2 is an example of a bad tree, because there are too many child nodes under most of the nodes.  This is because the Iris dataset's attributes are all numeric except for the class attribute.  The solution is to create split points based upon the distribution of the attribute.  This is what C4.5 does.

Figure 3 looks especially depressing because at run-time it will perform a bunch of unnecessary tests.  What's happening is that the records in these branches have the same values for the attribute node that is being constructed, and then finally when it runs out of attributes to split on it reaches the target attribute and branches to multiple results.  The solution here is to use post-pruning like C4.5 does and remove the redundant nodes.

Well that about sums it up.  Next phase will be to clean up ID3 some more and then implement C4.5.
Figure 3.  The Zoo dataset, target = animal.