I have a script running on an EC2 instance that reads vector embeddings from S3 and dumps them into a list variable; from there, it creates a dask dataframe that is later used in a Dask KMeans clustering call further down the code.
Here's the line that creates the dask df:
embeddings_dd = dask_cudf.from_cudf(cudf.from_pandas(pd.DataFrame(embeddings_list)), npartitions=600)
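For context, the read step that fills embeddings_list before that line is roughly this (a simplified sketch, not the real script; the bucket name, embedding_file_keys, and the .npy format are stand-ins for how the files are actually stored):

import io
import boto3
import numpy as np

# simplified sketch: every file is pulled to the client and accumulated in one Python list,
# which the dask_cudf.from_cudf line above then turns into the dask dataframe
s3 = boto3.client("s3")
embeddings_list = []
for key in embedding_file_keys:  # one entry per embeddings file in the bucket
    obj = s3.get_object(Bucket="my-bucket", Key=key)
    embeddings_list.extend(np.load(io.BytesIO(obj["Body"].read())).tolist())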
There are 1339 files in total to be processed. If I read only 100 of them and put an input('Wait..') after that line, GPU memory usage sits at ~2.3 GB. Extrapolating, all 1339 files should need around 31 GB (2.3 GB × 1339/100). I'm using an AWS instance with 4 GPUs and 96 GB of total GPU memory, so capacity itself should not be a problem.
However, if I try to load all the files, I get a CUDA OOM error. Dask seems to be failing (probably my fault) to split the dataframe partitions across the GPUs, and I'm not sure why. Curiously, all four GPUs, including their memory, are used during processing (the KMeans fit), so the problem really seems to be in the reading part only.
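A diagnostic along these lines should show where the partitions actually end up (a sketch, using the Kmeans_client created further down; has_what lists which keys each worker holds):

# Sketch: persist the dataframe and see which workers actually hold its partitions
from dask.distributed import wait

embeddings_dd = embeddings_dd.persist()
wait(embeddings_dd)

# if one worker owns (nearly) all the keys, the data was never spread across the GPUs
print(Kmeans_client.has_what())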
Not sure if it's helpful, but the fit code looks like this:
clusterer = KMeansDask(
    init="k-means||",
    n_clusters=n_clusters_iter,
    random_state=0,
)
clusterer.fit(embeddings_dd)
where KMeansDask is imported via from cuml.dask.cluster.kmeans import KMeans as KMeansDask.
I'm using the following lines to create the Dask cluster before the read part:
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
Kmeans_GPU_cluster = LocalCUDACluster()
Kmeans_client = Client(Kmeans_GPU_cluster)
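I'm not passing any memory-related options there; LocalCUDACluster does expose rmm_pool_size and device_memory_limit, so maybe something like the sketch below is what's missing (values are just examples, I haven't confirmed they help):

# Sketch only: example values, not something I've verified fixes the problem
Kmeans_GPU_cluster = LocalCUDACluster(
    rmm_pool_size="20GB",         # pre-allocate an RMM pool on each worker's GPU
    device_memory_limit="20GB",   # per-GPU threshold at which workers start spilling to host
)
Kmeans_client = Client(Kmeans_GPU_cluster)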
How do I make Dask use the memory available on all GPUs when loading the data into GPU memory?
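My guess is that I should build the dataframe from per-file readers, so each worker loads its own chunk instead of everything going through a single pandas/cudf DataFrame on the client. Something like the sketch below, where load_embedding_file is a hypothetical helper that assumes the same .npy layout as the read sketch above. Is that the right direction, or is there a better way?

import io
import boto3
import numpy as np
import pandas as pd
import cudf
import dask
import dask.dataframe as dd

@dask.delayed
def load_embedding_file(key):
    # hypothetical helper; assumes .npy files of float vectors and the same bucket as above
    obj = boto3.client("s3").get_object(Bucket="my-bucket", Key=key)
    return cudf.from_pandas(pd.DataFrame(np.load(io.BytesIO(obj["Body"].read()))))

# one delayed cudf.DataFrame per file -> one partition per file, materialized on the workers
parts = [load_embedding_file(key) for key in embedding_file_keys]  # the 1339 S3 keys
embeddings_dd = dd.from_delayed(parts)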