cuML: Random Forest Model Can't Be Trained on a Multi-GPU Dask Cluster


Based on the official distributed model training example (https://github.com/rapidsai/cuml/blob/branch-0.18/notebooks/random_forest_mnmg_demo.ipynb), I used the Iris dataset to train a random forest model on a multi-GPU Dask cluster (one scheduler node, three worker nodes), but the model does not train properly: the accuracy is no better than random guessing, and an exception is raised when the model is destroyed. The output is as follows:

CuML accuracy:   0.36666666666666664
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "cuml/ensemble/randomforestclassifier.pyx", line 334, in cuml.ensemble.randomforestclassifier.RandomForestClassifier.__del__
  File "cuml/ensemble/randomforestclassifier.pyx", line 350, in cuml.ensemble.randomforestclassifier.RandomForestClassifier._reset_forest_data
AttributeError: 'NoneType' object has no attribute 'free_treelite_model'

Process finished with exit code 0

I created my environment with the following conda command:

conda create -n rapids-0.18 -c rapidsai -c nvidia -c conda-forge \
    -c defaults rapids-blazing=0.18 python=3.8 cudatoolkit=10.2

The code I use for the RAPIDS RandomForestClassifier is:

import pandas as pd
import cudf
import cuml
from cuml import train_test_split
from cuml.metrics import accuracy_score
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
import dask_cudf
from cuml.dask.ensemble import RandomForestClassifier as cumlDaskRF

# Connect to the running Dask scheduler
c = Client('node0:8786')

# Query the client for all connected workers
workers = c.has_what().keys()
n_workers = len(workers)
n_streams = 8  # Performance optimization

# Random Forest building parameters
max_depth = 12
n_bins = 16
n_trees = 1000

# Read data
pdf = pd.read_csv('/data/iris.csv', header=0, delimiter=',')  # Read the complete CSV
cdf = cudf.from_pandas(pdf) # Get cuda dataframe
features = cdf.iloc[:, [0, 1, 2, 3]].astype('float32') # Get data columns
labels = cdf.iloc[:, 4].astype('category').cat.codes.astype('int32') # Get label column

# Split train and test data
X_train, X_test, y_train, y_test = train_test_split(features, labels, train_size=0.8, shuffle=True)

# Distribute data to worker GPUs
n_partitions = n_workers
X_train_dask = dask_cudf.from_cudf(X_train, npartitions=n_partitions)
X_test_dask = dask_cudf.from_cudf(X_test, npartitions=n_partitions)
y_train_dask = dask_cudf.from_cudf(y_train, npartitions=n_partitions)

# Train the distributed cuML model
cuml_model = cumlDaskRF(max_depth=max_depth, n_estimators=n_trees, n_bins=n_bins, n_streams=n_streams)
cuml_model.fit(X_train_dask, y_train_dask)

wait(cuml_model.rfs)  # Allow asynchronous training tasks to finish

# Predict and check accuracy
cuml_y_pred = cuml_model.predict(X_test_dask).compute().to_array()
print("CuML accuracy:  ", accuracy_score(y_test.to_array(), cuml_y_pred))

The results are the same when I use a LocalCUDACluster instead.

Can you point out my mistake and show me the correct code? Also, if I want to evaluate the individual decision trees in the trained random forest model, how can I retrieve those trained trees?

Thank you.
