Procedure of the OneR algorithm in R

Question


667 Views Asked by Tom Maier At 20 September 2015 at 09:34

I have used the OneR algorithm from the FSelector package to find the attribute with the lowest error rate. My class attribute takes the values yes and no, and the values of the other attributes are also yes and no.

The result of the OneR algorithm is:

Rank 1

Attribute name: OR1

         Attribute value
Class          0        1
    0      25243        0
    1       1459       18

Error rate: 1459 (0 + 1459)

Rank 2

Attribute name: OR2

         Attribute value
Class          0        1
    0      25243        0
    1       1460       17

Error rate: 1460 (0 + 1460)

However, if I use the correlation function on the same data frame, the best attributes have a lower error rate than the attributes I get from the oneR function.

Attribute name: CO4

         Attribute value
Class          0        1
    0      25204       39
    1       1348      129

Error rate: 1387 (39 + 1348)
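The error figures in the matrices above can be checked with a few lines of base R. This is only a sketch of the textbook OneR criterion (each attribute value predicts its majority class, everything else in that column is an error); the helper name `oner_errors` is made up for illustration and is not part of FSelector.

```r
# Count misclassifications from a class-by-attribute table:
# each attribute value (column) predicts its majority class, and every
# other instance in that column counts as an error.
# `oner_errors` is a hypothetical helper, not an FSelector function.
oner_errors <- function(tab) {
  sum(apply(tab, 2, function(col) sum(col) - max(col)))
}

# Counts copied from the question
# (rows: class 0/1, columns: attribute value 0/1; matrix() fills by column)
dn  <- list(class = c("0", "1"), value = c("0", "1"))
or1 <- matrix(c(25243, 1459, 0, 18),   nrow = 2, dimnames = dn)
or2 <- matrix(c(25243, 1460, 0, 17),   nrow = 2, dimnames = dn)
co4 <- matrix(c(25204, 1348, 39, 129), nrow = 2, dimnames = dn)

errors <- c(OR1 = oner_errors(or1),
            OR2 = oner_errors(or2),
            CO4 = oner_errors(co4))
print(errors)

# The same numbers as a fraction of all instances
print(errors / sum(or1))
```

Under this criterion CO4 has the fewest misclassified instances of the three, which is what the question is about.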

Can anybody tell me why the OneR algorithm does not show CO4 as the best attribute (based on the error rate)?

Which criteria does the OneR algorithm use?

--- Addition to better understand my question ---

The complete data set is too big to show. I have constructed a new data pool that shows the same effect:

DELAYED  OR1  CO4  ..
      1    1    1
      0    0    0
      0    0    1
      1    0    1
      0    0    0
      1    0    1
      0    0    0
      1    0    1
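For anyone who wants to reproduce this, the small data pool above could be rebuilt roughly as follows. The sketch assumes plain numeric 0/1 columns, since the question does not state the column types, and it uses only base R.

```r
# Toy data copied from the table in the question.
# Assumption: numeric 0/1 columns (the original column types are not given).
datapool_stackoverflow <- data.frame(
  DELAYED = c(1, 0, 0, 1, 0, 1, 0, 1),
  OR1     = c(1, 0, 0, 0, 0, 0, 0, 0),
  CO4     = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Class-by-attribute tables (rows: DELAYED, columns: attribute value)
print(table(DELAYED = datapool_stackoverflow$DELAYED,
            OR1     = datapool_stackoverflow$OR1))
print(table(DELAYED = datapool_stackoverflow$DELAYED,
            CO4     = datapool_stackoverflow$CO4))

# Pearson correlation of each attribute with the class
cors <- sapply(datapool_stackoverflow[c("OR1", "CO4")],
               function(col) cor(datapool_stackoverflow$DELAYED, col))
print(round(cors, 3))
```

If the columns were factors instead, `as.numeric()` would return the level codes (1 and 2) rather than 0 and 1; the Pearson correlation is unaffected by that shift.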

The code to show the error rate for a single attribute:

print(table(datapool_stackoverflow$DELAYED, datapool_stackoverflow$OR1))

The code for the OneR function:

library(FSelector)

oneR_stackoverflow <- oneR(DELAYED~., datapool_stackoverflow)

subset_stackoverflow <- cutoff.k(oneR_stackoverflow, 2)

print(subset_stackoverflow)

The code for the correlation:

cor(as.numeric(datapool_stackoverflow$DELAYED), as.numeric(datapool_stackoverflow$OR1))

In this case the results are:

Attribute OR1:

         Attribute value
Class          0        1
    0          4        0
    1          3        1

Manually calculated error rate: 3 (0 + 3)

Attribute CO4:

         Attribute value
Class          0        1
    0          3        1
    1          0        4

Error rate: 1 (1 + 0)

Correlation: OR1: 0.377, CO4: 0.77

OneR: "OR1", "CO4"

Why does the OneR function return OR1 as the best attribute for classification?

Original Q&A

There are 3 best solutions below

**Lars Kotthoff** · Answer 1 · 2015-09-20T16:29:55.787000

You haven't given the types of your data, but I'm assuming that you have numerical values. FSelector discretizes these values before using them in oneR and it seems that bad things happen there (which may be a bug in RWeka's Discretize function). However, you probably want factor variables anyway and not numeric data as you have only 0-1 values. Then everything works fine for me:

> df = data.frame(delayed=factor(c(1,0,0,1,0,1,0,1)), or1 = factor(c(1,0,0,0,0,0,0,0)), co4 = factor(c(1,0,1,1,0,1,0,1)))
> library(FSelector)
> oneR(delayed~., df)
    attr_importance
or1       0.2000000
co4       0.4285714

As you can see, co4 now has a much higher importance than or1, as it should have.
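If the data already sits in a data frame with numeric 0/1 columns, the conversion to factors suggested above might look like the following base R sketch. The data frame name `df_num` is only illustrative.

```r
# Numeric 0/1 data, as assumed in this answer (names are illustrative)
df_num <- data.frame(
  delayed = c(1, 0, 0, 1, 0, 1, 0, 1),
  or1     = c(1, 0, 0, 0, 0, 0, 0, 0),
  co4     = c(1, 0, 1, 1, 0, 1, 0, 1)
)

# Convert every column to a factor; assigning into df_fac[] keeps the
# data frame structure and the column names
df_fac <- df_num
df_fac[] <- lapply(df_num, factor)

str(df_fac)
```

The resulting `df_fac` has the same shape as the `df` built by hand above, so it could be passed to `oneR(delayed~., df_fac)` in the same way.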

**Tom Maier** · Answer 2 · 2015-09-21T09:54:19.840000

OK, I have the solution. For each attribute, the algorithm sums the error rates of its values, where the error rate of a value is its number of misclassified instances divided by the number of instances that have this value.

In this example:

Attribute OR1: 3/7 + 0/1 = 3/7

Attribute CO4: 0/3 + 1/5 = 0.2
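The arithmetic above can be reproduced in base R. This is a sketch of the calculation as described in this answer, not FSelector's actual source; the helper names are made up. The "mirrored" step (max + min - error) is purely an assumption, included only because it happens to reproduce the `attr_importance` values 0.2 and 0.4285714 printed in Answer 1. For comparison, the sketch also computes the overall error rate (all misclassified instances divided by all instances), which is the criterion the question expected.

```r
# Sum over attribute values of (errors for that value / instances with that value).
# Hypothetical helper mirroring the calculation in this answer.
per_value_error_sum <- function(tab) {
  sum(apply(tab, 2, function(col) (sum(col) - max(col)) / sum(col)))
}

# Overall error rate: all misclassified instances / all instances
overall_error_rate <- function(tab) {
  sum(apply(tab, 2, function(col) sum(col) - max(col))) / sum(tab)
}

# Toy data from the question
delayed <- c(1, 0, 0, 1, 0, 1, 0, 1)
or1     <- c(1, 0, 0, 0, 0, 0, 0, 0)
co4     <- c(1, 0, 1, 1, 0, 1, 0, 1)
toy <- list(OR1 = table(delayed, or1), CO4 = table(delayed, co4))

toy_sum     <- sapply(toy, per_value_error_sum)   # 3/7 and 1/5
toy_overall <- sapply(toy, overall_error_rate)    # 3/8 and 1/8
# Assumption: flip the scale so that a smaller error gives a larger score
toy_mirrored <- max(toy_sum) + min(toy_sum) - toy_sum
print(rbind(toy_sum, toy_mirrored, toy_overall))

# The large matrices from the question (rows: class, columns: attribute value)
big <- list(OR1 = matrix(c(25243, 1459, 0, 18),   nrow = 2),
            CO4 = matrix(c(25204, 1348, 39, 129), nrow = 2))
big_sum     <- sapply(big, per_value_error_sum)
big_overall <- sapply(big, overall_error_rate)
print(rbind(big_sum, big_overall))
```

On the large matrices the two criteria disagree: the per-value sum is smaller for OR1, while the overall error rate is smaller for CO4. A small, impure group (the 168 instances with CO4 = 1) weighs as much in the per-value sum as the very large group, which would explain a ranking like the one in the question if this is indeed what the function computes.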

**vonjd** · Answer 3 · 2016-06-15T19:09:53.033000

No, CO4 should be chosen; choosing the other attribute is wrong. See what the OneR package (available on CRAN) gives:

> library(OneR)
> DELAYED <- c(1, 0, 0, 1, 0, 1, 0, 1)
> OR1 <- c(1, rep(0, 7))
> CO4 <- c(1, 0, 1, 1, 0, 1, 0, 1)
> 
> data <- data.frame(DELAYED, OR1, CO4)
> 
> model <- OneR(formula = DELAYED ~., data = data, verbose = T)

    Attribute Accuracy
1 * CO4       87.5%   
2   OR1       62.5%   
---
Chosen attribute due to accuracy
and ties method (if applicable): '*'

> summary(model)

Rules:
If CO4 = 0 then DELAYED = 0
If CO4 = 1 then DELAYED = 1

Accuracy:
7 of 8 instances classified correctly (87.5%)

Contingency table:
       CO4
DELAYED   0   1 Sum
    0   * 3   1   4
    1     0 * 4   4
    Sum   3   5   8
---
Maximum in each column: '*'

Pearson's Chi-squared test:
X-squared = 2.1333, df = 1, p-value = 0.1441

> 
> model_2 <- OneR(formula = DELAYED ~ OR1, data = data)
> summary(model_2)

Rules:
If OR1 = 0 then DELAYED = 0
If OR1 = 1 then DELAYED = 1

Accuracy:
5 of 8 instances classified correctly (62.5%)

Contingency table:
       OR1
DELAYED   0   1 Sum
    0   * 4   0   4
    1     3 * 1   4
    Sum   7   1   8
---
Maximum in each column: '*'

Pearson's Chi-squared test:
X-squared = 0, df = 1, p-value = 1

You can find more information about the OneR package here: https://github.com/vonjd/OneR

(full disclosure: I am the author of this package)
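The rules and accuracies printed above can also be derived by hand in base R, without any package. The following is a minimal sketch of the one-rule idea, not the OneR package's implementation; `one_rule` is a hypothetical helper, and ties are simply resolved by `which.max`, which picks the first class.

```r
# For each attribute value, predict the most frequent class and report
# the share of instances that this rule set classifies correctly.
# `one_rule` is a hypothetical helper for illustration only.
one_rule <- function(x, y) {
  tab <- table(class = y, value = x)
  predicted <- rownames(tab)[apply(tab, 2, which.max)]  # majority class per value
  names(predicted) <- colnames(tab)
  accuracy <- sum(apply(tab, 2, max)) / length(y)
  list(rules = predicted, accuracy = accuracy)
}

DELAYED <- c(1, 0, 0, 1, 0, 1, 0, 1)
OR1     <- c(1, rep(0, 7))
CO4     <- c(1, 0, 1, 1, 0, 1, 0, 1)

res <- list(OR1 = one_rule(OR1, DELAYED),
            CO4 = one_rule(CO4, DELAYED))

# Accuracy per attribute; the attribute with the highest accuracy is chosen
acc <- sapply(res, function(r) r$accuracy)
print(acc)
print(names(acc)[which.max(acc)])

# Rules for CO4: names are attribute values, entries are predicted classes
print(res$CO4$rules)
```

The accuracies agree with the 62.5% for OR1 and 87.5% for CO4 in the package output above.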
