I want to classify the demographic data in the MovieLens users table, but the result from J48 is strange. Classifying the same data with C5.0 worked fine, but I have to use this algorithm (J48).
The structure of my data is as follows:
$ user_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ age : Factor w/ 7 levels "1","18","25",..: 1 7 3 5 3 6 4 3 3 4 ...
$ occupation: Factor w/ 21 levels "0","1","2","3",..: 11 17 16 8 21 10 2 13 18 2 ...
$ gender : Factor w/ 2 levels "F","M": 1 2 2 2 2 1 2 2 2 1 ...
$ Class : Factor w/ 4 levels "1","2","3","4": 2 2 2 2 3 2 2 2 2 4 ...
The head of the data is:
head(data)
  user_id age occupation gender Class
1       1   1         10      F     2
2       2  56         16      M     2
3       3  25         15      M     2
4       4  45          7      M     2
5       5  25         20      M     3
6       6  50          9      F     2
All columns except user_id are nominal and should be factors in R.
Code for classification:
library(RWeka)
fit <- J48(data$Class~., data=data[,-c(1)], control = Weka_control(C=0.25))
currentUserClass = predict(fit,data[,-c(1)])
table(currentUserClass , data$Class)
The resulting table, which is wrong, is:
currentUserClass   1    2    3   4
               1   0    0    0   0
               2 216 3630 1549 645
               3   0    0    0   0
               4   0    0    0   0
When I fit the model with C5.0, I get the result below, which is what I expect from both algorithms:
predictions   1    2    3   4
          1 216    0    0   0
          2   0 3630    0   0
          3   0    0 1549   0
          4   0    0    0 645
More attempts
- I changed the structure of my data by converting the factor columns to separate columns, and nothing changed.
- I changed the value of the C control parameter. The result gets a little better at C=0.75, but it is still totally wrong.
Even after normalizing and changing the data, nothing improved:
> head(data)
user_id age1 age18 age25 age35 age45 age50
1 1 5.1188737 -0.4726289 -0.7289391 -0.4960755 -0.3164894 -0.2990841
2 2 -0.1953231 -0.4726289 -0.7289391 -0.4960755 -0.3164894 -0.2990841
3 3 -0.1953231 -0.4726289 1.3716296 -0.4960755 -0.3164894 -0.2990841
4 4 -0.1953231 -0.4726289 -0.7289391 -0.4960755 3.1591400 -0.2990841
5 5 -0.1953231 -0.4726289 1.3716296 -0.4960755 -0.3164894 -0.2990841
6 6 -0.1953231 -0.4726289 -0.7289391 -0.4960755 -0.3164894 3.3429880
age56 occupation1 occupation2 occupation3 occupation4 occupation5
1 -0.2590882 -0.3094756 -0.2150398 -0.1717035 -0.3790765 -0.1374418
2 3.8590505 -0.3094756 -0.2150398 -0.1717035 -0.3790765 -0.1374418
3 -0.2590882 -0.3094756 -0.2150398 -0.1717035 -0.3790765 -0.1374418
4 -0.2590882 -0.3094756 -0.2150398 -0.1717035 -0.3790765 -0.1374418
5 -0.2590882 -0.3094756 -0.2150398 -0.1717035 -0.3790765 -0.1374418
6 -0.2590882 -0.3094756 -0.2150398 -0.1717035 -0.3790765 -0.1374418
occupation6 occupation7 occupation8 occupation9 occupation10 occupation11
1 -0.2016306 -0.3558574 -0.05312294 -0.1243576 5.4744311 -0.1477163
2 -0.2016306 -0.3558574 -0.05312294 -0.1243576 -0.1826371 -0.1477163
3 -0.2016306 -0.3558574 -0.05312294 -0.1243576 -0.1826371 -0.1477163
4 -0.2016306 2.8096490 -0.05312294 -0.1243576 -0.1826371 -0.1477163
5 -0.2016306 -0.3558574 -0.05312294 -0.1243576 -0.1826371 -0.1477163
6 -0.2016306 -0.3558574 -0.05312294 8.0399919 -0.1826371 -0.1477163
occupation12 occupation13 occupation14 occupation15 occupation16 occupation17
1 -0.2619865 -0.1551514 -0.2293967 -0.1562667 -0.2038431 -0.3010506
2 -0.2619865 -0.1551514 -0.2293967 -0.1562667 4.9049217 -0.3010506
3 -0.2619865 -0.1551514 -0.2293967 6.3982549 -0.2038431 -0.3010506
4 -0.2619865 -0.1551514 -0.2293967 -0.1562667 -0.2038431 -0.3010506
5 -0.2619865 -0.1551514 -0.2293967 -0.1562667 -0.2038431 -0.3010506
6 -0.2619865 -0.1551514 -0.2293967 -0.1562667 -0.2038431 -0.3010506
occupation18 occupation19 occupation20 genderM Class
1 -0.1082744 -0.1098287 -0.2208735 -1.5917949 2
2 -0.1082744 -0.1098287 -0.2208735 0.6281176 2
3 -0.1082744 -0.1098287 -0.2208735 0.6281176 2
4 -0.1082744 -0.1098287 -0.2208735 0.6281176 2
5 -0.1082744 -0.1098287 4.5267283 0.6281176 3
6 -0.1082744 -0.1098287 -0.2208735 -1.5917949 2
> fit <- J48(data$Class~., data=data, control = Weka_control(C=0.25))
> currentUserClass = predict(fit,data)
> table(currentUserClass , data$Class)
currentUserClass   1    2    3   4
               1   7    1    2   2
               2 201 3601 1470 617
               3   8   28   75  14
               4   0    0    2  12
J48 implements the C4.5 decision tree algorithm, and C5.0 and C4.5 can perform differently on the same data. That said, the parameters of J48 within Weka can be modified (as you have shown in your code above), which may be enough to meet your needs.
To start, your tree is likely a single leaf predicting class 2, which would explain why every user is assigned to that class. This can be checked by printing the decision tree. The code below does so with the "mtcars" dataset (a dataset built into R).
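A minimal sketch of that check. Using carb as the class column is an assumption made only for illustration; any factor column would do. Printing the fitted object shows the tree together with its number of leaves and its size.

```r
library(RWeka)

# mtcars ships with R. J48 needs a nominal class, so the chosen
# column (carb, an illustrative choice) is converted to a factor.
cars <- mtcars
cars$carb <- factor(cars$carb)

# Default J48 settings: pruning on, C = 0.25, M = 2.
fit <- J48(carb ~ ., data = cars)

# Prints the tree, followed by "Number of Leaves" and "Size of the tree".
print(fit)
```

If the printed tree for your own data consists of one line such as ": 2 (...)", the model is a single leaf and will predict class 2 for every row.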
However, if the tree is rebuilt with a smaller number of minimum objects in a leaf and no pruning, the tree will be larger.
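A sketch of that rebuild, again on mtcars with carb as an assumed class column. In Weka_control, U = TRUE turns pruning off and M = 1 lowers the minimum number of instances per leaf from J48's default of 2.

```r
library(RWeka)

cars <- mtcars
cars$carb <- factor(cars$carb)  # illustrative class column

# Default settings for comparison.
fit_default <- J48(carb ~ ., data = cars)

# Unpruned tree (U = TRUE) with at least 1 instance per leaf (M = 1).
fit_large <- J48(carb ~ ., data = cars,
                 control = Weka_control(U = TRUE, M = 1))

# Compare the "Size of the tree" lines in the two printouts.
print(fit_default)
print(fit_large)
```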
The following can be used to check the valid parameters of J48:
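RWeka's WOW() ("Weka Option Wizard") lists the options a Weka learner accepts, which are the names that can be passed through Weka_control().

```r
library(RWeka)

# Lists the J48 options (for example -U, -C, -M) with a short
# description of each.
opts <- WOW("J48")
print(opts)
```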
You should change the default parameters of J48 to fit your particular needs. I recommend comparing the parameters used in your C5.0 run with the default parameters of J48 and making modifications where necessary.
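As a rough sketch of how such a modification might look on data shaped like yours: the data frame toy below is simulated and purely hypothetical (random values, not MovieLens), so the resulting table says nothing about your real data. Note that the formula uses Class ~ . rather than data$Class ~ ., because with the latter the Class column is also picked up by the dot as a predictor.

```r
library(RWeka)

# Simulated stand-in for the users table (hypothetical values only).
set.seed(1)
n <- 200
toy <- data.frame(
  age        = factor(sample(c(1, 18, 25, 35, 45, 50, 56), n, replace = TRUE)),
  occupation = factor(sample(0:20, n, replace = TRUE)),
  gender     = factor(sample(c("F", "M"), n, replace = TRUE)),
  Class      = factor(sample(1:4, n, replace = TRUE,
                             prob = c(0.04, 0.60, 0.25, 0.11)))
)

# Unpruned tree with a minimum of 1 instance per leaf.
fit <- J48(Class ~ ., data = toy,
           control = Weka_control(U = TRUE, M = 1))

# Confusion table on the training data.
pred <- predict(fit, toy)
print(table(pred, toy$Class))
```

Keep in mind that a table computed on the training data only shows how closely the tree fits those rows; an unpruned tree will usually look better here than it would on held-out data.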