Decision Tree Workshop is now available at
http://code.google.com/p/decision-tree-workshop/.
It's a demo translated from Python to C#, based upon the article Building Decision Trees in Python by Christopher Roach. I highly recommend reading it if you are unfamiliar with decision trees.
I decided to use Linq for the ID3 code because it's basically a bunch of database operations, and in my opinion it makes for more understandable code. Besides, it's just a demo; I'm not worried about speed at this point.
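To show what I mean by database operations, here's a minimal sketch (not the actual code from the repository) of ID3's entropy and information gain measures written as Linq queries, assuming records are represented as attribute/value dictionaries:

using System;
using System.Collections.Generic;
using System.Linq;

static class Id3Measures
{
    // Entropy of the target attribute over a set of records:
    // H(S) = -sum(p * log2(p)) over each target value's proportion p.
    public static double Entropy(IEnumerable<IDictionary<string, string>> records, string target)
    {
        var rows = records.ToList();
        return rows
            .GroupBy(r => r[target])                       // group rows by target value
            .Select(g => (double)g.Count() / rows.Count)   // proportion of each value
            .Sum(p => -p * Math.Log(p, 2));                // -p * log2(p), summed
    }

    // Information gain of splitting on an attribute:
    // Gain(S, A) = H(S) - sum(|Sv|/|S| * H(Sv)) over each value v of A.
    public static double Gain(IEnumerable<IDictionary<string, string>> records,
                              string attribute, string target)
    {
        var rows = records.ToList();
        double remainder = rows
            .GroupBy(r => r[attribute])
            .Sum(g => (double)g.Count() / rows.Count * Entropy(g, target));
        return Entropy(rows, target) - remainder;
    }
}

Grouping by an attribute and taking proportions maps directly onto the GROUP BY / COUNT style of query you'd write against a database table, which is why Linq fits so naturally here.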
If the code seems strange, it's because I'd like to experiment with building the tree in parallel in the future. That is why I'm implementing worker classes instead of just using recursion. Ideally these workers would be able to farm out some work to the GPU. Is there Linq to GPU? Hmmm, it appears so. I think I will look at this Brahma thing.
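For the curious, here's a rough sketch of the worker-queue shape I'm describing. It reuses the Id3Measures sketch above, and the Node, BuildJob, and TreeBuilder names are made up for illustration rather than taken from the repository:

using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

class Node
{
    public string Attribute;                       // attribute tested at this node
    public Dictionary<string, Node> Children = new Dictionary<string, Node>();
    public string Label;                           // set when this is a leaf
}

class BuildJob
{
    public Node Target;                            // node this job will fill in
    public List<IDictionary<string, string>> Rows; // records routed to this node
    public List<string> Attributes;                // attributes still available
}

class TreeBuilder
{
    readonly ConcurrentQueue<BuildJob> _jobs = new ConcurrentQueue<BuildJob>();

    public Node Build(List<IDictionary<string, string>> rows,
                      List<string> attributes, string target)
    {
        var root = new Node();
        _jobs.Enqueue(new BuildJob { Target = root, Rows = rows, Attributes = attributes });

        // A single worker loop; several of these could run side by side.
        while (_jobs.TryDequeue(out var job))
        {
            // If all rows agree on the target, or no attributes remain,
            // make a leaf with the majority target value.
            if (job.Rows.Select(r => r[target]).Distinct().Count() == 1 ||
                job.Attributes.Count == 0)
            {
                job.Target.Label = job.Rows
                    .GroupBy(r => r[target])
                    .OrderByDescending(g => g.Count())
                    .First().Key;
                continue;
            }

            // Pick the attribute with the highest information gain.
            string best = job.Attributes
                .OrderByDescending(a => Id3Measures.Gain(job.Rows, a, target))
                .First();
            job.Target.Attribute = best;

            // One child job per value of the chosen attribute.
            foreach (var g in job.Rows.GroupBy(r => r[best]))
            {
                var child = new Node();
                job.Target.Children[g.Key] = child;
                _jobs.Enqueue(new BuildJob
                {
                    Target = child,
                    Rows = g.ToList(),
                    Attributes = job.Attributes.Where(a => a != best).ToList()
                });
            }
        }
        return root;
    }
}

Because each job only touches its own node and its own subset of rows, several workers could drain the queue concurrently without stepping on each other, which is the whole point of avoiding recursion.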
I've driven myself nuts trying to validate it. When you translate a program you don't even really understand, things tend to get lost in translation, and then you have to go back and compare the two sets of code over and over until things work. However, it's that same process that helps you actually understand the algorithms underlying the code. Well, I'm happy to say I'm satisfied enough with it now, or I wouldn't even be writing this post.
Figure 1. The Purchase dataset, target attribute = purchase?
Figure 1 is an example of a good tree. Ideally you want the tree to be pyramid shaped, which means that at run time it requires the fewest tests to get from the root to a result.
Figure 2. The Iris dataset, target = class.
Figure 2 is an example of a bad tree because there are too many child nodes under most of the nodes. This happens because the Iris dataset's attributes are all numeric except for the class attribute. The solution is to create split points based upon the distribution of the attribute, which is what C4.5 does.
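Here's a hedged sketch of that fix, assuming the usual C4.5 approach of sorting the values and testing a candidate threshold at each midpoint between adjacent distinct values; the NumericSplits name and tuple row shape are just for illustration:

using System;
using System.Collections.Generic;
using System.Linq;

static class NumericSplits
{
    // Returns the threshold t that best splits rows into value <= t / value > t.
    public static double BestThreshold(List<(double Value, string Target)> rows)
    {
        var sorted = rows.OrderBy(r => r.Value).ToList();
        double bestGain = double.NegativeInfinity, bestT = double.NaN;

        for (int i = 1; i < sorted.Count; i++)
        {
            // Only consider a cut between two distinct values.
            if (sorted[i].Value == sorted[i - 1].Value) continue;
            double t = (sorted[i].Value + sorted[i - 1].Value) / 2.0;

            var left = sorted.Take(i).Select(r => r.Target);
            var right = sorted.Skip(i).Select(r => r.Target);
            double gain = Entropy(sorted.Select(r => r.Target))
                        - (double)i / sorted.Count * Entropy(left)
                        - (double)(sorted.Count - i) / sorted.Count * Entropy(right);

            if (gain > bestGain) { bestGain = gain; bestT = t; }
        }
        return bestT;
    }

    static double Entropy(IEnumerable<string> labels)
    {
        var list = labels.ToList();
        return list.GroupBy(l => l)
                   .Select(g => (double)g.Count() / list.Count)
                   .Sum(p => -p * Math.Log(p, 2));
    }
}

The payoff is that a numeric attribute always produces exactly two branches instead of one branch per distinct value, which is what keeps the Iris tree from fanning out.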
Figure 3 looks especially depressing because at run time it will perform a bunch of unnecessary tests. What's happening is that the records in these branches all share the same value for the attribute being split on, so the algorithm keeps adding test nodes until it runs out of attributes, finally reaches the target attribute, and branches out to the results. The solution here is to use post-pruning like C4.5 does and remove the redundant nodes.
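As a taste of what that could look like, here's a simplified pruning pass over the Node type from the sketch above. Real C4.5 pruning is error-based; this one only collapses the trivially redundant case where every branch below a node ends in the same answer:

using System.Linq;

static class Pruner
{
    public static void Prune(Node node)
    {
        if (node.Label != null) return;            // already a leaf

        foreach (var child in node.Children.Values)
            Prune(child);                          // prune bottom-up

        // If all children are leaves carrying the same label,
        // the test at this node is redundant: replace it with that leaf.
        var labels = node.Children.Values
            .Select(c => c.Label)
            .Distinct()
            .ToList();
        if (labels.Count == 1 && labels[0] != null)
        {
            node.Label = labels[0];
            node.Attribute = null;
            node.Children.Clear();
        }
    }
}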
Well, that about sums it up. The next phase will be to clean up ID3 some more and then implement C4.5.
Figure 3. The Zoo dataset, target = animal.