Out of memory error with Dask and cudf loop

I am using Dask and Rapidsai to run an xgboost model on a large (6.9GB) dataset. The hardware is 4x 2080 TIs with 11 GB of memory each. The raw dataset has a few dozen target columns that have been one-hot encoded, so I am trying to run a loop that keeps one target column at a time, drops the rest, runs the model, and repeats.

Here's some pseudocode to show the process:

with LocalCUDACluster(n_workers=4) as cluster:
    with Client(cluster) as client:
        raw_data = dask_cudf.read_csv(MyFile)
        # drop all target columns but one, then build DaskDMatrix
        # objects for training and testing (placeholder helper)
        d_train, d_test = make_dmatrices(client, raw_data, target, targets)
        # custom grid-search function that loops over xgb.dask.train
        # and returns a dict of optimal params (placeholder helper)
        model = gridsearch(client, d_train, d_test)
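
For reference, the first helper is conceptually along these lines (a sketch with placeholder names and an arbitrary 80/20 split, not my exact code):

import xgboost as xgb

def make_dmatrices(client, raw_data, target, all_targets):
    # keep the feature columns plus the one target we care about
    df = raw_data.drop(columns=[c for c in all_targets if c != target])
    # placeholder train/test split
    train_df, test_df = df.random_split([0.8, 0.2], random_state=42)
    d_train = xgb.dask.DaskDMatrix(client, train_df.drop(columns=[target]), train_df[target])
    d_test = xgb.dask.DaskDMatrix(client, test_df.drop(columns=[target]), test_df[target])
    return d_train, d_test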

OK, so if I run that on any individual target column, it works fine: it pushes up against available memory, but doesn't break. If I try to do it with a loop after the cluster/client assignment:

targets = ['target1', 'target2', ...]
with LocalCUDACluster(n_workers=4) as cluster:
    with Client(cluster) as client:
        for target in targets:
            ...  # same code as above

it breaks on the third run through the loop, running out of memory. If, instead, I assign the cluster/client within the loop:

targets = ['target1', 'target2', ...]
for target in targets:
    with LocalCUDACluster(n_workers=4) as cluster:
        with Client(cluster) as client:
            ...  # same code as above

it works fine.

So I'm guessing that some of the objects created in each loop iteration are persisting, right? I tried setting d_train and d_test to empty lists at the end of each loop to clear out the memory, but that didn't work. It also doesn't matter whether I read in the raw data before the loop or within it, and other changes along those lines haven't helped either.
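
For concreteness, the end-of-iteration cleanup I tried was roughly this (variable names match the pseudocode above):

# at the bottom of each loop iteration
d_train, d_test = [], []   # rebind so the old DaskDMatrix objects lose their references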

The only thing that seems to work is reinitializing the cluster and client on every pass through the loop.

Why is that? Is there a more efficient way of handling memory in a loop over Dask partitions? I've read that Dask is inefficient in loops anyway, so maybe I shouldn't be doing things this way at all.

Any insight into memory management with dask or rapids would be greatly appreciated.

1 Answer

As a first step, you can select which columns of data you want to read into dask_cudf. That might help with your memory issues as you iterate through the targets and build your training data:

targets = ['target1', 'target2', ...]
with LocalCUDACluster(n_workers=4) as cluster:
    with Client(cluster) as client:
        for target in targets:
            # usecols takes a list; add whatever feature columns you need as well
            raw_data = dask_cudf.read_csv(MyFile, usecols=[target])
            ...  # keep going with your ETL

Here are the cuDF read_csv docs for reference; the API is similar between cudf and dask_cudf, as it is with dask: https://docs.rapids.ai/api/cudf/stable/api.html#cudf.io.csv.read_csv

Pro tip: if the column data is small enough, you might be able to get away with one GPU and just use cuDF, which will be more performant since there is less overhead from the scheduler.
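
A minimal sketch of that single-GPU path, assuming an XGBoost build with GPU support; feature_cols and the params are placeholders you'd fill in:

import cudf
import xgboost as xgb

feature_cols = ["feat1", "feat2"]                 # placeholder feature names
gdf = cudf.read_csv(MyFile, usecols=feature_cols + [target])

dtrain = xgb.DMatrix(gdf[feature_cols], label=gdf[target])   # DMatrix accepts cuDF objects
params = {"tree_method": "gpu_hist"}              # GPU training; newer XGBoost uses "hist" + device="cuda"
booster = xgb.train(params, dtrain, num_boost_round=100)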

Another thing you can do is pickle each xgboost model, save it to disk, and then delete or overwrite the in-memory model, since trained models take up space on your GPU. You can read a model back in when you're done with this part of your ETL.
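
A rough sketch of that pattern, reusing d_train, d_test, and target from the loop above (params being whatever your grid search settled on); xgb.dask.train returns a dict whose 'booster' entry is the trained model, and the gc nudge at the end is just a suggestion:

import gc
import pickle
import xgboost as xgb

output = xgb.dask.train(client, params, d_train, num_boost_round=100)
with open(f"model_{target}.pkl", "wb") as f:
    pickle.dump(output["booster"], f)             # read back later with pickle.load
del output, d_train, d_test                       # drop the in-memory references
client.run(gc.collect)                            # ask the workers to garbage collect too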

If you're interested in Hyper Parameter Optimization (HPO), you should check out our HPO notebooks and utilities, like https://github.com/rapidsai/cloud-ml-examples/blob/main/dask/kubernetes/Dask_cuML_Exploration.ipynb

Also, you may not need all the with statements, but I don't know what you're doing outside of this snippet, so for expediency in answering, I am leaving them in :).