I have used the oneR algorithm of the FSelector package to find the attribute with the lowest error rate. My class attribute takes the values yes and no, and the values of the other attributes are also yes and no.
The result of the OneR algorithm is:
Ranking No. 1
Attribute name: OR1
             Attribute = 0   Attribute = 1
Class = 0        25243              0
Class = 1         1459             18
Error rate: 1459 (0 + 1459)
Ranking No. 2
Attribute name: OR2
             Attribute = 0   Attribute = 1
Class = 0        25243              0
Class = 1         1460             17
Error rate: 1460 (0 + 1460)
However, if I use the correlation function on the same data frame, the best attributes have a lower error rate than the attributes I get from the oneR function.
Attribute name: CO4
             Attribute = 0   Attribute = 1
Class = 0        25204             39
Class = 1         1348            129
Error rate: 1387 (39 + 1348)
Can anybody tell me why the OneR algorithm does not show the CO4 attribute as the best attribute (based on the error rate)?
Which criteria does the OneR algorithm use?
--- Addition to better understand my question ---
The complete data set is too big to show here. I have constructed a new data pool that shows the same effect:
DELAYED   OR1   CO4   ..
   1       1     1
   0       0     0
   0       0     1
   1       0     1
   0       0     0
   1       0     1
   0       0     0
   1       0     1
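To make the example reproducible, the table above can be typed in as a data frame. This is only a sketch: it assumes the columns are coded as numeric 0/1 (the original columns may have been factors, given the later `as.numeric` calls) and it includes only the three columns shown, not the further columns hinted at by "..".

```r
# Toy data copied row by row from the table above.
# Assumption: values are stored as numeric 0/1, not as factors.
datapool_stackoverflow <- data.frame(
  DELAYED = c(1, 0, 0, 1, 0, 1, 0, 1),
  OR1     = c(1, 0, 0, 0, 0, 0, 0, 0),
  CO4     = c(1, 0, 1, 1, 0, 1, 0, 1)
)

str(datapool_stackoverflow)
```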
The code to show the error rate for a single attribute:
print(table(datapool_stackoverflow$DELAYED, datapool_stackoverflow$OR1))
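The manual error count read off such a table can also be computed directly. The sketch below assumes the usual one-rule convention that each attribute value predicts its majority class, so the errors are all cells outside the column maxima. The helper name `one_rule_errors` is made up for this illustration and is not part of FSelector.

```r
# Toy data from the question (assumed numeric 0/1 coding).
datapool_stackoverflow <- data.frame(
  DELAYED = c(1, 0, 0, 1, 0, 1, 0, 1),
  OR1     = c(1, 0, 0, 0, 0, 0, 0, 0),
  CO4     = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Hypothetical helper: number of misclassified rows when every attribute
# value predicts the class that occurs most often for that value.
one_rule_errors <- function(class, attribute) {
  tab <- table(class, attribute)       # rows: class, columns: attribute value
  sum(tab) - sum(apply(tab, 2, max))   # rows outside the per-column majority
}

errors_or1 <- one_rule_errors(datapool_stackoverflow$DELAYED,
                              datapool_stackoverflow$OR1)
errors_co4 <- one_rule_errors(datapool_stackoverflow$DELAYED,
                              datapool_stackoverflow$CO4)
print(c(OR1 = errors_or1, CO4 = errors_co4))
```

On the toy data this gives 3 errors for OR1 and 1 error for CO4, the same numbers as the manual calculation.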
The code for the OneR function:
library(FSelector)
oneR_stackoverflow <- oneR(DELAYED~., datapool_stackoverflow)
subset_stackoverflow <- cutoff.k(oneR_stackoverflow, 2)
print(subset_stackoverflow)
The code for the correlation:
cor(as.numeric(datapool_stackoverflow$DELAYED), as.numeric(datapool_stackoverflow$OR1))
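To compare all attributes at once, the same `cor` call can be applied to every predictor column. This is a sketch on the toy data, assuming numeric 0/1 coding; `predictors` and `correlations` are names introduced only for this example.

```r
# Toy data from the question (assumed numeric 0/1 coding).
datapool_stackoverflow <- data.frame(
  DELAYED = c(1, 0, 0, 1, 0, 1, 0, 1),
  OR1     = c(1, 0, 0, 0, 0, 0, 0, 0),
  CO4     = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Every column except the class attribute is treated as a predictor.
predictors <- setdiff(names(datapool_stackoverflow), "DELAYED")

# Pearson correlation between the class and each predictor.
correlations <- sapply(predictors, function(name) {
  cor(as.numeric(datapool_stackoverflow$DELAYED),
      as.numeric(datapool_stackoverflow[[name]]))
})

print(round(correlations, 3))
```

The values should come out close to the correlations reported for OR1 and CO4 in the results.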
In this case the results are:
Attribute name: OR1
             Attribute = 0   Attribute = 1
Class = 0          4               0
Class = 1          3               1
Manually calculated error rate: 3 (0 + 3)
Attribute name: CO4
             Attribute = 0   Attribute = 1
Class = 0          3               1
Class = 1          0               4
Manually calculated error rate: 1 (1 + 0)
Correlation: attribute OR1: 0.377, attribute CO4: 0.77
OneR: "OR1", "CO4"
Why does the OneR function report the OR1 attribute as the best attribute for classification?
No, CO4 should be chosen; choosing the other attribute is wrong. Compare with the result of the OneR package (available on CRAN).
You can find more information about the OneR package here: https://github.com/vonjd/OneR
(full disclosure: I am the author of this package)
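The one-rule idea can be checked by hand on the toy data from the question. The following is a minimal base R sketch, not the implementation of either the FSelector or the OneR package: for each attribute it maps every attribute value to its majority class and then keeps the attribute whose rule has the highest accuracy on the training data. The helper `build_rule` and the numeric 0/1 coding are assumptions made for this illustration; ties are broken by taking the first candidate.

```r
# Toy data from the question (assumed numeric 0/1 coding).
datapool_stackoverflow <- data.frame(
  DELAYED = c(1, 0, 0, 1, 0, 1, 0, 1),
  OR1     = c(1, 0, 0, 0, 0, 0, 0, 0),
  CO4     = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Hypothetical helper: build the one-attribute rule and score it.
build_rule <- function(data, target, attribute) {
  tab <- table(data[[attribute]], data[[target]])  # rows: attribute value, cols: class
  rule <- colnames(tab)[apply(tab, 1, which.max)]  # majority class per attribute value
  names(rule) <- rownames(tab)
  predicted <- rule[as.character(data[[attribute]])]
  list(attribute = attribute,
       rule = rule,
       accuracy = mean(predicted == as.character(data[[target]])))
}

predictors <- setdiff(names(datapool_stackoverflow), "DELAYED")
rules <- lapply(predictors, function(a) {
  build_rule(datapool_stackoverflow, "DELAYED", a)
})
accuracies <- sapply(rules, function(r) r$accuracy)
names(accuracies) <- predictors

# Keep the attribute whose rule classifies the most rows correctly.
best <- rules[[which.max(accuracies)]]
print(accuracies)
print(best$attribute)
print(best$rule)
```

With this data the rule for OR1 gets 5 of 8 rows right and the rule for CO4 gets 7 of 8, so the sketch selects CO4.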