I'm studing C4.5 algorithm and trying to make it in java by myself, but in the part of pruning, I don't understand what it compute in book of C4.5.
In this book,it said:"When N training cases are covered by a leaf, E of them incorrectly....For a given confidence level CF, ..., this upper limit is here written UCF(E,N)(default confidence level is 25%)....a leaf covering N training cases with a predicted error rate of UCF(E,N) would give rise to predicted N*UCF(E,N) errors."
author show a example, in this subtree:
- education cost == n: democrat(6)
- education cost == y: democrat(9)
- education cost == u: republican(1)
For the first leaf, N=6, E=0, U25%(0,6)=0.206, for the remaining leaves, U25%(0,9)=0.143 and U25%(0,1)=0.75, so the number of predicted errors for this subtree is 60.206+90.143+1*0.75=3.273.
If the subtree were replace by a democrat, the predicted errors is 16U25%(1,16)=160.157=2.512, it is less than 3.273, so it should pruned to a leaf.
But I don't know how it computed U25%(0,6)=0.206, U25%(0,9)=0.143 and so on, and inceased the number of errors observed at each leaf by 0.5, where were put 0.5 into? I have search on internet, there are many different formula so I don't know which one is correct.