I used the OneR algorithm from the FSelector package to find the attribute with the lowest error rate. My class attribute takes the values yes and no, and the other attributes also take the values yes and no.
The result of the OneR algorithm is:
Rank 1
Attribute: OR1
             Attribute = 0   Attribute = 1
Class = 0            25243               0
Class = 1             1459              18
Errors: 1459 (0 + 1459)
Rank 2
Attribute: OR2
             Attribute = 0   Attribute = 1
Class = 0            25243               0
Class = 1             1460              17
Errors: 1460 (0 + 1460)
However, if I use the correlation function on the same data frame, the best attributes have a lower error rate than the attributes I get from the oneR function.
Attribute: CO4
             Attribute = 0   Attribute = 1
Class = 0            25204              39
Class = 1             1348             129
Errors: 1387 (39 + 1348)
Can anybody tell me why the OneR algorithm does not show CO4 as the best attribute (based on the error rate)?
Which criteria does the OneR algorithm use?
--- Addition to better understand my question ---
The complete data set is too big to show here, so I have constructed a small data set that shows the same effect:
DELAYED   OR1   CO4   ..
      1     1     1
      0     0     0
      0     0     1
      1     0     1
      0     0     0
      1     0     1
      0     0     0
      1     0     1
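For reproducibility, the small data set above can be built in R roughly as follows. This is a minimal sketch that assumes only the three columns shown and stores them as plain numeric 0/1 vectors; the real data may be typed differently.

```r
# Toy data from the table above; numeric 0/1 columns are an assumption.
datapool_stackoverflow <- data.frame(
  DELAYED = c(1, 0, 0, 1, 0, 1, 0, 1),
  OR1     = c(1, 0, 0, 0, 0, 0, 0, 0),
  CO4     = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Cross-tabulate class against attribute:
# rows are the class (DELAYED), columns are the attribute value.
tab_or1 <- table(datapool_stackoverflow$DELAYED, datapool_stackoverflow$OR1)
tab_co4 <- table(datapool_stackoverflow$DELAYED, datapool_stackoverflow$CO4)
print(tab_or1)
print(tab_co4)
```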
The code to show the error rate for a single attribute:
print(table(datapool_stackoverflow$DELAYED, datapool_stackoverflow$OR1))
The code for the oneR function:
library(FSelector)
oneR_stackoverflow <- oneR(DELAYED~., datapool_stackoverflow)
subset_stackoverflow <- cutoff.k(oneR_stackoverflow, 2)
print(subset_stackoverflow)
The code for the correlation:
cor(as.numeric(datapool_stackoverflow$DELAYED), as.numeric(datapool_stackoverflow$OR1))
In this case the results are:
OR1:
             Attribute = 0   Attribute = 1
Class = 0                4               0
Class = 1                3               1
Manually calculated errors: 3 (0 + 3)
CO4:
             Attribute = 0   Attribute = 1
Class = 0                3               1
Class = 1                0               4
Manually calculated errors: 1 (1 + 0)
Correlation with DELAYED: OR1: 0.377, CO4: 0.77
OneR: "OR1", "CO4"
Why does the oneR function return OR1 as the best attribute for classification?
You haven't given the types of your data, but I'm assuming that you have numerical values. FSelector discretizes these values before using them in oneR, and it seems that bad things happen there (which may be a bug in RWeka's Discretize function). However, since you only have 0-1 values, you probably want factor variables rather than numeric data anyway. Then everything works fine for me: with factor columns, CO4 gets a much higher importance than OR1, as it should.
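To check the expected ranking by hand, the 1R error count can be computed in base R without FSelector: for each attribute value, predict the majority class and count the rows that disagree. The following is a minimal sketch on the toy data from the question, with the columns stored as factors. The helper name one_r_errors is hypothetical and not part of FSelector, and the sketch does not reproduce FSelector's internal discretization or scoring.

```r
# Toy data from the question, stored as factors instead of numeric values.
datapool_stackoverflow <- data.frame(
  DELAYED = factor(c(1, 0, 0, 1, 0, 1, 0, 1)),
  OR1     = factor(c(1, 0, 0, 0, 0, 0, 0, 0)),
  CO4     = factor(c(1, 0, 1, 1, 0, 1, 0, 1))
)

# Hypothetical helper (not an FSelector function): number of rows that a
# one-attribute rule misclassifies when each attribute value predicts its
# majority class.
one_r_errors <- function(attribute, target) {
  counts <- table(attribute, target)         # rows: attribute values
  sum(counts) - sum(apply(counts, 1, max))   # rows outside the majority class
}

errors <- sapply(
  datapool_stackoverflow[c("OR1", "CO4")],
  one_r_errors,
  target = datapool_stackoverflow$DELAYED
)
print(sort(errors))  # fewer errors means a better single-attribute rule
```

On this data the sketch gives 3 errors for OR1 and 1 for CO4, which matches the manually calculated counts in the question. With FSelector installed, the same factor-typed data frame can be passed to the oneR call shown in the question.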