Based on the official distributed model training example (https://github.com/rapidsai/cuml/blob/branch-0.18/notebooks/random_forest_mnmg_demo.ipynb), I used the Iris dataset to train a random forest model on a multi-GPU Dask cluster (one scheduler node, three worker nodes), but the model does not train properly: the accuracy is close to chance for three classes, and an exception is reported when the script exits. The output is as follows:
CuML accuracy: 0.36666666666666664
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
File "cuml/ensemble/randomforestclassifier.pyx", line 334, in cuml.ensemble.randomforestclassifier.RandomForestClassifier.__del__
File "cuml/ensemble/randomforestclassifier.pyx", line 350, in cuml.ensemble.randomforestclassifier.RandomForestClassifier._reset_forest_data
AttributeError: 'NoneType' object has no attribute 'free_treelite_model'
Process finished with exit code 0
I created my environment with the following conda command:
conda create -n rapids-0.18 -c rapidsai -c nvidia -c conda-forge \
-c defaults rapids-blazing=0.18 python=3.8 cudatoolkit=10.2
The code I use for the RAPIDS RandomForestClassifier is:
import pandas as pd
import cudf
import cuml
from cuml import train_test_split
from cuml.metrics import accuracy_score
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import dask_cudf
from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF
# Connect to the running Dask scheduler
c = Client('node0:8786')
# Query the client for all connected workers
workers = c.has_what().keys()
n_workers = len(workers)
n_streams = 8 # Performance optimization
# Random Forest building parameters
max_depth = 12
n_bins = 16
n_trees = 1000
# Read data
pdf = pd.read_csv('/data/iris.csv', header=0, delimiter=',') # Get complete CSV
cdf = cudf.from_pandas(pdf) # Get cuda dataframe
features = cdf.iloc[:, [0, 1, 2, 3]].astype('float32') # Get data columns
labels = cdf.iloc[:, 4].astype('category').cat.codes.astype('int32') # Get label column
# Split train and test data
X_train, X_test, y_train, y_test = train_test_split(features, labels, train_size=0.8, shuffle=True)
# Distribute data to worker GPUs
n_partitions = n_workers
X_train_dask = dask_cudf.from_cudf(X_train, npartitions=n_partitions)
X_test_dask = dask_cudf.from_cudf(X_test, npartitions=n_partitions)
y_train_dask = dask_cudf.from_cudf(y_train, npartitions=n_partitions)
# Train the distributed cuML model
cuml_model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees, n_bins=n_bins, n_streams=n_streams)
cuml_model.fit(X_train_dask, y_train_dask)
wait(cuml_model.rfs) # Allow asynchronous training tasks to finish
# Predict and check accuracy
cuml_y_pred = cuml_model.predict(X_test_dask).compute().to_array()
print("CuML accuracy: ", accuracy_score(y_test.to_array(), cuml_y_pred))
The results are the same when I use a LocalCUDACluster instead.
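One thing that might explain an accuracy this close to 1/3 is row misalignment rather than a badly trained model. The sketch below is pure Python and does not use cuML at all. It assumes (this is an assumption, not something confirmed above) that dask_cudf.from_cudf sorts rows by index, as dask's from_pandas does by default, so that X_test_dask ends up in index order while y_test keeps its shuffled order. The "perfect model" and the stride-based shuffle are hypothetical stand-ins for illustration only.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the two label sequences agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 150 rows in three balanced classes, like Iris: row i has label i // 50.
labels_by_index = {i: i // 50 for i in range(150)}

# Stand-in for a shuffled 20% test split: 30 distinct row indices in a
# scrambled order (37 is coprime with 150, so no index repeats).
test_index = [(i * 37) % 150 for i in range(30)]
y_test = [labels_by_index[i] for i in test_index]

# Hypothetical perfect model: it returns the true label of each row it sees.
# Case 1 (assumed from_cudf behaviour): rows arrive sorted by index.
pred_sorted = [labels_by_index[i] for i in sorted(test_index)]
# Case 2: rows arrive in the same order as y_test.
pred_aligned = [labels_by_index[i] for i in test_index]

print("sorted features vs shuffled labels:", accuracy(y_test, pred_sorted))
print("aligned features and labels:      ", accuracy(y_test, pred_aligned))
```

If this is what is happening in the real script, calling reset_index(drop=True) on the four outputs of train_test_split before from_cudf, or comparing the predictions against labels that went through the same dask_cudf path, would be one way to test the hypothesis.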
Can you point out my mistake and show me the correct code? Also, if I want to evaluate the individual decision trees of the trained random forest model, how can I access them?
Thank you.