RWeka J48 classification issue in R with the MovieLens data set


I want to classify the demographic data in the MovieLens users table, but the result from J48 is strange. When I classify the same data with C5.0 everything is fine, but I have to use this algorithm (J48).

The structure of my data is as follows:

$ user_id   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ age       : Factor w/ 7 levels "1","18","25",..: 1 7 3 5 3 6 4 3 3 4 ...
 $ occupation: Factor w/ 21 levels "0","1","2","3",..: 11 17 16 8 21 10 2 13 18 2 ...
 $ gender    : Factor w/ 2 levels "F","M": 1 2 2 2 2 1 2 2 2 1 ...
 $ Class     : Factor w/ 4 levels "1","2","3","4": 2 2 2 2 3 2 2 2 2 4 ...

and the head of the data is:

head(data)
  user_id age occupation gender Class
1       1   1         10      F     2
2       2  56         16      M     2
3       3  25         15      M     2
4       4  45          7      M     2
5       5  25         20      M     3
6       6  50          9      F     2

All columns except user_id are nominal and should be factors in R.

Code for classification:

library(RWeka)
fit <- J48(data$Class~., data=data[,-c(1)], control = Weka_control(C=0.25))
currentUserClass = predict(fit,data[,-c(1)])
table(currentUserClass , data$Class)

The resulting confusion table is wrong:

currentUserClass    1    2    3    4
               1    0    0    0    0
               2  216 3630 1549  645
               3    0    0    0    0
               4    0    0    0    0

When I fit my model with C5.0, the result is as below, which is what I expect from both algorithms:

predictions    1    2    3    4
          1  216    0    0    0
          2    0 3630    0    0
          3    0    0 1549    0
          4    0    0    0  645

More attempts

  1. I changed the structure of my data, converting each factor column into separate indicator columns, and nothing changed
  2. I changed the C control value; the result gets a little better at C=0.75, but it is still badly wrong

Even after normalizing and restructuring the data, nothing improved:

> head(data)
  user_id       age1      age18      age25      age35      age45      age50
1       1  5.1188737 -0.4726289 -0.7289391 -0.4960755 -0.3164894 -0.2990841
2       2 -0.1953231 -0.4726289 -0.7289391 -0.4960755 -0.3164894 -0.2990841
3       3 -0.1953231 -0.4726289  1.3716296 -0.4960755 -0.3164894 -0.2990841
4       4 -0.1953231 -0.4726289 -0.7289391 -0.4960755  3.1591400 -0.2990841
5       5 -0.1953231 -0.4726289  1.3716296 -0.4960755 -0.3164894 -0.2990841
6       6 -0.1953231 -0.4726289 -0.7289391 -0.4960755 -0.3164894  3.3429880
       age56 occupation1 occupation2 occupation3 occupation4 occupation5
1 -0.2590882  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
2  3.8590505  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
3 -0.2590882  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
4 -0.2590882  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
5 -0.2590882  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
6 -0.2590882  -0.3094756  -0.2150398  -0.1717035  -0.3790765  -0.1374418
  occupation6 occupation7 occupation8 occupation9 occupation10 occupation11
1  -0.2016306  -0.3558574 -0.05312294  -0.1243576    5.4744311   -0.1477163
2  -0.2016306  -0.3558574 -0.05312294  -0.1243576   -0.1826371   -0.1477163
3  -0.2016306  -0.3558574 -0.05312294  -0.1243576   -0.1826371   -0.1477163
4  -0.2016306   2.8096490 -0.05312294  -0.1243576   -0.1826371   -0.1477163
5  -0.2016306  -0.3558574 -0.05312294  -0.1243576   -0.1826371   -0.1477163
6  -0.2016306  -0.3558574 -0.05312294   8.0399919   -0.1826371   -0.1477163
  occupation12 occupation13 occupation14 occupation15 occupation16 occupation17
1   -0.2619865   -0.1551514   -0.2293967   -0.1562667   -0.2038431   -0.3010506
2   -0.2619865   -0.1551514   -0.2293967   -0.1562667    4.9049217   -0.3010506
3   -0.2619865   -0.1551514   -0.2293967    6.3982549   -0.2038431   -0.3010506
4   -0.2619865   -0.1551514   -0.2293967   -0.1562667   -0.2038431   -0.3010506
5   -0.2619865   -0.1551514   -0.2293967   -0.1562667   -0.2038431   -0.3010506
6   -0.2619865   -0.1551514   -0.2293967   -0.1562667   -0.2038431   -0.3010506
  occupation18 occupation19 occupation20    genderM Class
1   -0.1082744   -0.1098287   -0.2208735 -1.5917949     2
2   -0.1082744   -0.1098287   -0.2208735  0.6281176     2
3   -0.1082744   -0.1098287   -0.2208735  0.6281176     2
4   -0.1082744   -0.1098287   -0.2208735  0.6281176     2
5   -0.1082744   -0.1098287    4.5267283  0.6281176     3
6   -0.1082744   -0.1098287   -0.2208735 -1.5917949     2
> fit <- J48(data$Class~., data=data, control = Weka_control(C=0.25))
> currentUserClass = predict(fit,data)
> table(currentUserClass , data$Class)

currentUserClass    1    2    3    4
               1    7    1    2    2
               2  201 3601 1470  617
               3    8   28   75   14
               4    0    0    2   12

There is 1 answer below


J48 is Weka's implementation of the C4.5 decision tree algorithm. C5.0 and C4.5 are different algorithms, so their results may differ. That said, the parameters of J48 can be modified through Weka_control (as you have done in your code above), and that may be enough for your needs.

To start, your tree is likely a single leaf that predicts class 2 for every row. You can check this by printing the fitted tree. The code below does so with the "mtcars" dataset (a dataset built into R).

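library(RWeka)

dat <- mtcars
dat$carb <- factor(dat$carb)        # J48 needs a factor as the class variable
model1 <- J48(carb ~ ., data = dat) # default settings: pruned, at least 2 instances per leaf
model1                              # printing the model shows the tree and its size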

However, if the tree is rebuilt with a smaller minimum number of instances per leaf (M) and with pruning turned off (U), the tree will be larger.

# M = 1: allow leaves with a single instance; U = TRUE: build an unpruned tree
model2 <- J48(carb ~ ., data = dat, control = Weka_control(M = 1, U = TRUE))
model2
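For the table posted in the question, the single-leaf reading can also be cross-checked without refitting anything. The sketch below uses only base R and the counts copied from the first J48 confusion table above; the variable names are just for illustration. If only one predicted level is ever used, the accuracy is exactly the share of the largest class.

```r
# Confusion table copied from the question (rows = predicted, columns = actual)
conf <- matrix(c(  0,    0,    0,   0,
                 216, 3630, 1549, 645,
                   0,    0,    0,   0,
                   0,    0,    0,   0),
               nrow = 4, byrow = TRUE,
               dimnames = list(predicted = as.character(1:4),
                               actual    = as.character(1:4)))

# Which class labels does the model ever predict?
predicted_levels <- rownames(conf)[rowSums(conf) > 0]

# Training accuracy versus the share of the most frequent class
accuracy       <- sum(diag(conf)) / sum(conf)
majority_share <- max(colSums(conf)) / sum(conf)

predicted_levels
round(c(accuracy = accuracy, majority_share = majority_share), 3)
```

Both numbers come out at about 0.601 (3630 of 6040 rows), which is what a tree consisting of one leaf labelled "2" would produce.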

The following can be used to check the valid parameters of J48:

WOW(J48)

You should change the default parameters of J48 to fit your particular need. I recommend comparing the parameters used in your C5.0 to the default parameters of J48 and making modifications where necessary.
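When comparing settings, keep in mind that a confusion table computed on the training data rewards trees that simply memorize the rows, so a larger unpruned tree will almost always look better there. A minimal sketch of a fairer comparison, assuming RWeka's evaluate_Weka_classifier with its numFolds and seed arguments, is shown below. It reuses mtcars only as a stand-in for your data; the dataset is far too small for the resulting numbers to mean much.

```r
library(RWeka)

dat <- mtcars
dat$carb <- factor(dat$carb)

# Same two configurations as above: default (pruned) and unpruned with M = 1
pruned   <- J48(carb ~ ., data = dat)
unpruned <- J48(carb ~ ., data = dat, control = Weka_control(M = 1, U = TRUE))

# 10-fold cross-validation instead of scoring on the training rows
cv_pruned   <- evaluate_Weka_classifier(pruned,   numFolds = 10, seed = 1)
cv_unpruned <- evaluate_Weka_classifier(unpruned, numFolds = 10, seed = 1)

# Percentage of correctly classified instances under cross-validation
cv_pruned$details["pctCorrect"]
cv_unpruned$details["pctCorrect"]
```

On your own data you would replace dat and carb with your data frame and Class, and try the same with a few values of C and M before settling on one.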