I am trying to train an ML model using Dask. I am training on my local machine, which has a single GPU with 24 GiB of memory.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client, LocalCluster
import dask.dataframe as dd
import pandas as pd
import numpy as np
import os
import xgboost as xgb

np.random.seed(42)

# Placeholders for this post; my real script sets these for the actual dataset.
NUM_FEATURES = 100
FILENAME = "data.csv"
TARGET = "target"

def get_columns(filename):
    # Read a handful of rows with pandas just to recover the feature column names.
    return pd.read_csv(filename, nrows=10).iloc[:, :NUM_FEATURES].columns

def get_data(filename, target):
    import dask_cudf
    X = dask_cudf.read_csv(filename)
    # X = dd.read_csv(filename, assume_missing=True)
    y = X[[target]]
    X = X.iloc[:, :NUM_FEATURES]
    return X, y

def main(client: Client) -> None:
    X, y = get_data(FILENAME, TARGET)
    model = xgb.dask.DaskXGBRegressor(
        tree_method="gpu_hist",
        objective="reg:squarederror",
        seed=42,
        max_depth=5,
        eta=0.01,
        n_estimators=10,
    )
    model.client = client
    model.fit(X, y, eval_set=[(X, y)])
    print("Saving the model..")
    model.get_booster().save_model("xgboost.model")
    print("Doing model importance..")
    columns = get_columns(FILENAME)
    pd.Series(model.feature_importances_, index=columns).sort_values(ascending=False).to_pickle("~/yolo.pkl")

if __name__ == "__main__":
    # glibc env var: encourage the allocator to return freed memory to the OS
    os.environ["MALLOC_TRIM_THRESHOLD_"] = "65536"
    with LocalCUDACluster(device_memory_limit="15 GiB", rmm_pool_size="20 GiB") as cluster:
        # with LocalCluster() as cluster:
        with Client(cluster) as client:
            print(client)
            main(client)
It fails with the following error:
MemoryError: std::bad_alloc: out_of_memory: RMM failure at:/workspace/.conda-bld/work/include/rmm/mr/device/pool_memory_resource.hpp:192: Maximum pool size exceeded
Basically, the GPU runs out of memory when I call model.fit. Training works on a CSV with 64,100 rows but fails on one with 128,198 rows (2x the rows). Neither file is large, so I assume I am doing something wrong.
I have tried fiddling with:
- LocalCUDACluster: device_memory_limit and rmm_pool_size
- dask_cudf.read_csv: chunksize

Nothing has worked (a sketch of the variations I tried is below).
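Roughly the kinds of changes I tried; the specific sizes here are illustrative, not my exact runs:

from dask_cuda import LocalCUDACluster
import dask_cudf

# Varying the cluster memory settings (sizes illustrative)
cluster = LocalCUDACluster(device_memory_limit="10 GiB", rmm_pool_size="22 GiB")

# Varying the partition size when reading the CSV
X = dask_cudf.read_csv("data.csv", chunksize="128 MiB")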
I have been stuck on this all day so any help would be much appreciated.
You cannot train an XGBoost model whose data plus training state grow larger than the available GPU memory. The "Maximum pool size exceeded" error means the RMM pool you capped at 20 GiB filled up: gpu_hist needs working memory on top of the loaded data, so doubling the input rows can push peak usage past that cap even though the CSV itself is small. You can scale out with dask_xgboost, but you need to ensure that the total GPU memory across all workers is sufficient.
Here is a great blog on this by Coiled: https://coiled.io/blog/dask-xgboost-python-example/
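A minimal sketch of what scaling out looks like, assuming a machine where a second GPU is available; the device indices, pool size, file path, and column name are illustrative placeholders:

import dask_cudf
import xgboost as xgb
from dask_cuda import LocalCUDACluster
from dask.distributed import Client

if __name__ == "__main__":
    # One Dask worker per visible GPU; each worker gets its own RMM pool,
    # so the usable total is the sum across GPUs.
    with LocalCUDACluster(CUDA_VISIBLE_DEVICES="0,1",  # two GPUs (illustrative)
                          rmm_pool_size="20 GiB") as cluster:
        with Client(cluster) as client:
            X = dask_cudf.read_csv("data.csv")   # placeholder path
            y = X["target"]                      # placeholder label column
            X = X.drop(columns=["target"])
            model = xgb.dask.DaskXGBRegressor(
                tree_method="gpu_hist",
                objective="reg:squarederror",
                max_depth=5,
                eta=0.01,
                n_estimators=10,
            )
            model.client = client
            # Partitions of X and y are distributed across the workers,
            # so each GPU only has to hold its share of the data.
            model.fit(X, y)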