I am new to Julia and I am trying to fit a simple classification tree.
Package imports and environment activation:
using Pkg
Pkg.activate(".")
using CSV
using DataFrames
using Random
using Downloads
using ARFFFiles
using ScientificTypes
using DataFramesMeta
using DynamicPipe
using MLJ
using MLJDecisionTreeInterface
Data:
titanic_reader = CSV.File("/home/andrea/dev/julia/titanic.csv"; header = 1);
titanic = DataFrame(titanic_reader);
# remove missing values
titanic = dropmissing(titanic);
titanic = @transform(titanic,
:class=categorical(:class),
:sex=categorical(:sex),
:survived=categorical(:survived)
);
Check the data:
first(titanic, 3)
3×4 DataFrame
Row │ class sex age survived
│ Cat… Cat… Float64 Cat…
─────┼──────────────────────────────────
1 │ 3 male 22.0 N
2 │ 1 female 38.0 Y
3 │ 3 female 26.0 Y
Check the data schema:
schema(titanic)
┌──────────┬───────────────┬───────────────────────────────────┐
│ names │ scitypes │ types │
├──────────┼───────────────┼───────────────────────────────────┤
│ class │ Multiclass{3} │ CategoricalValue{Int64, UInt32} │
│ sex │ Multiclass{2} │ CategoricalValue{String7, UInt32} │
│ age │ Continuous │ Float64 │
│ survived │ Multiclass{2} │ CategoricalValue{String1, UInt32} │
└──────────┴───────────────┴───────────────────────────────────┘
The schema seems OK to me.
Prepare the data for modelling:
# target and features
y, X = unpack(titanic, ==(:survived), rng = 123);
# partition into training & test sets
(X_trn, X_tst), (y_trn, y_tst) = partition((X, y), 0.75, multi=true, rng=123);
Fit the model:
# model
mod = @load DecisionTreeClassifier pkg = "DecisionTree";
fm = mod();
fm_mach = machine(fm, X_trn, y_trn);
and here is the problem:
Warning: The number and/or types of data arguments do not match what the specified model
│ supports. Suppress this type check by specifying `scitype_check_level=0`.
│
│ Run `@doc DecisionTree.DecisionTreeClassifier` to learn more about your model's requirements.
│
│ Commonly, but non exclusively, supervised models are constructed using the syntax
│ `machine(model, X, y)` or `machine(model, X, y, w)` while most other models are
│ constructed with `machine(model, X)`. Here `X` are features, `y` a target, and `w`
│ sample or class weights.
│
│ In general, data in `machine(model, data...)` is expected to satisfy
│
│ scitype(data) <: MLJ.fit_data_scitype(model)
│
│ In the present case:
│
│ scitype(data) = Tuple{Table{Union{AbstractVector{Continuous}, AbstractVector{Multiclass{3}}, AbstractVector{Multiclass{2}}}}, AbstractVector{Multiclass{2}}}
│
│ fit_data_scitype(model) = Tuple{Table{<:Union{AbstractVector{<:Continuous}, AbstractVector{<:Count}, AbstractVector{<:OrderedFactor}}}, AbstractVector{<:Finite}}
└ @ MLJBase ~/.julia/packages/MLJBase/eCnWm/src/machines.jl:231
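For reference, the failing condition named in the warning can be checked by hand before constructing the machine. A minimal sketch, reusing the X_trn, y_trn and fm variables defined above:

```julia
using MLJ

# Compare the scitype of the supplied data with what the model accepts;
# here it evaluates to false, which is exactly what triggers the warning.
scitype((X_trn, y_trn)) <: MLJ.fit_data_scitype(fm)
```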
When I then fit the model:
fit!(fm_mach)
I get an error:
[ Info: It seems an upstream node in a learning network is providing data of incompatible scitype. See above.
ERROR: ArgumentError: Unordered CategoricalValue objects cannot be tested for order using <. Use isless instead, or call the ordered! function on the parent array to change this
Stacktrace:
I am almost sure the error depends on the data type specification, but I cannot work out the solution.
I can replicate your issue by loading the Titanic dataset via MLJ's OpenML interface and then cleaning it a bit to get exactly the same dataset you are using.
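The ordering error itself can be reproduced without MLJ at all: by default categorical creates unordered levels, and comparing such values with < throws exactly the ArgumentError you saw. A minimal sketch using only CategoricalArrays:

```julia
using CategoricalArrays

v = categorical([3, 1, 2])   # levels are unordered by default
isordered(v)                 # false
# v[1] < v[2]                # throws ArgumentError: "Unordered CategoricalValue
#                            # objects cannot be tested for order using <"
ordered!(v, true)            # mark the parent array's levels as ordered
v[2] < v[1]                  # true: 1 < 3 under the sorted level order
```

This is what the tree's split search runs into when it tries to threshold an unordered categorical feature.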
Now the issue is that the DecisionTreeClassifier model from the DecisionTree package is very efficient (fast!), but it requires ordered data only. In this case you could perhaps coerce class to an ordered field. An alternative is to use the DecisionTreeClassifier model from BetaML which, at the cost of being a bit slower, can handle any kind of input, including missing values (so there is no need to drop them or to restrict yourself to those few fields; the original Titanic dataset has many more).
Note that there is a nice tutorial exactly on fitting the Titanic dataset with decision trees and MLJ here: https://forem.julialang.org/mlj/julia-boards-the-titanic-1ne8
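A minimal sketch of the coercion route, assuming the titanic DataFrame prepared above. Note that the fit_data_scitype in the warning also excludes Multiclass feature vectors, so sex (which has no natural order) would need coercing, or one-hot encoding, as well:

```julia
using MLJ   # coerce and the scientific types are re-exported by MLJ

# Passenger class has a natural order (1 < 2 < 3), so OrderedFactor is
# meaningful; :sex is coerced too only to satisfy the scitype requirement.
titanic = coerce(titanic, :class => OrderedFactor, :sex => OrderedFactor)

y, X = unpack(titanic, ==(:survived), rng = 123)
fm_mach = machine(mod(), X, y)   # the scitype warning should now be gone
fit!(fm_mach)

# Or, to keep :sex unordered, swap in BetaML's tree, which accepts
# Multiclass (and Missing) inputs:
ModBeta = @load DecisionTreeClassifier pkg = "BetaML"
```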
classas an ordered field. An alternative is to use theDecisionTreeClassifiermodel fromBetaML, that at the cost of being a bit slower can use any kind of input, including the Missing ones (so no need to drop them or use only that few fields - the originaltitandataset has many more fields):Note that there is a nice tutorial exactly on fitting the Titan database with Decision Tree and MLJ here: https://forem.julialang.org/mlj/julia-boards-the-titanic-1ne8 .