It's a demo translated from python to C# based upon the article Building Decision Trees in Python
by Christopher Roach. I highly recommend reading it if you are unfamiliar with decision trees.
I decided to use Linq for the ID3 code because it's basically a bunch of database operations, and it makes for more understandable code in my opinion. Besides it's just a demo, I'm not worried about speed at this point.
If the code seems strange it's because I'd like to experiment building the tree in parallel in the future. So that is why I'm implementing worker classes instead of just using recursion. Ideally these workers would be able to farm out some work to the GPU. Is there Linq to GPU? Hmmm, it appears so. I think I will look at this Brahma thing.
I've driven myself nuts trying to validate it. When you translate a program you don't even really understand, things tend to get lost in translation and then you have to go back and compare the two sets of code over and over until things work. However, it's that same process that helps to actually understand the algorithms underlying the code. Well, I'm happy to say I'm satisfied enough with it now or I wouldn't even be writing this post.
Figure 1. The Purchase dataset, target attribute = purchase? |
Figure 2. The Iris dataset, target = class. |
Figure 3 looks especially depressing because at run-time it will perform a bunch of unnecessary tests. What's happening is that the records in these branches have the same values for the attribute node that is being constructed, and then finally when it runs out of attributes to split on it reaches the target attribute and branches to multiple results. The solution here is to use post-pruning like C4.5 does and remove the redundant nodes.
Well that about sums it up. Next phase will be to clean up ID3 some more and then implement C4.5.
Figure 3. The Zoo dataset, target = animal. |
No comments:
Post a Comment