I have a workflow that benefits greatly from GPU acceleration, but each task has relatively low memory requirements (2-4 GB). I'm using a combination of dask.dataframe, dask.distributed.Client, and dask_cuda.LocalCUDACluster. The process would benefit from more CUDA workers, so I want to split the physical GPUs (Nvidia RTX A6000, V100, A100) into multiple virtual/logical GPUs to increase the number of workers in my dask_cuda LocalCUDACluster. My initial thought was to pass logical_gpus created in TensorFlow to the LocalCUDACluster, but I don't seem to be able to pass them into the cluster; the sketch just below shows roughly what I tried.
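For reference, this is roughly the TensorFlow logical-GPU setup I was hoping to reuse (a minimal sketch assuming TF 2.x; splitting the first GPU into four ~4 GB logical devices is just an illustrative choice for my 2-4 GB tasks):

import tensorflow as tf

# Split the first physical GPU into four logical GPUs of ~4 GB each
# (illustrative numbers for my 2-4 GB tasks).
gpus = tf.config.list_physical_devices('GPU')
tf.config.set_logical_device_configuration(
    gpus[0],
    [tf.config.LogicalDeviceConfiguration(memory_limit=4096) for _ in range(4)],
)
logical_gpus = tf.config.list_logical_devices('GPU')
# But there is no obvious way to hand these logical_gpus to LocalCUDACluster.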
I'm working in a docker environment, and I'd like to keep the splitting inside Python. This workflow will ideally scale from a local workstation to multi-node MPI jobs, but I'm not sure this is possible, and I'm open to any suggestions.
Here is a simplified example of the workflow:
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from dask_cuda.initialize import initialize
import pandas as pd
import dask.dataframe as dd
import time
# fake function
def my_gpu_sim(x):
"""
GPU simulation which is independent of any others (calls a c++ program in real-world, which saves a
file.)
"""
...
return None
# fake data creation
dic = {'random':['apple' for i in range(40)], 'main':[i for i in range(40)]}
df = pd.DataFrame.from_dict(dic)
ddf = dd.from_pandas(df, npartitions=4)
# Configurations
protocol = "ucx"
enable_tcp_over_ucx = True
enable_nvlink = True
enable_infiniband = False
initialize(
    create_cuda_context=True,
    enable_tcp_over_ucx=enable_tcp_over_ucx,
    enable_infiniband=enable_infiniband,
    enable_nvlink=enable_nvlink,
)
cluster = LocalCUDACluster(
    local_directory="/tmp/USERNAME",
    protocol=protocol,
    enable_tcp_over_ucx=enable_tcp_over_ucx,
    enable_infiniband=enable_infiniband,
    enable_nvlink=enable_nvlink,
    rmm_pool_size="35GB",
)
client = Client(cluster)
# Simulation
ddf.map_partitions(lambda df: df.apply(lambda x: my_gpu_sim(x.main), axis=1)).compute(scheduler=client)
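For clarity, the effect I'm after would look something like the sketch below. This is purely hypothetical: CUDA_VISIBLE_DEVICES is a real LocalCUDACluster argument, but I don't know whether repeating a device index like this is actually supported; it is only meant to show the "several workers per physical GPU" layout I want.

# Hypothetical: two Dask-CUDA workers per physical GPU on a 2-GPU machine.
# I don't know whether LocalCUDACluster accepts repeated device indices;
# this only illustrates the worker layout I'm after.
cluster = LocalCUDACluster(
    CUDA_VISIBLE_DEVICES="0,0,1,1",
    rmm_pool_size="4GB",  # per-worker pool sized for the 2-4 GB tasks
)
client = Client(cluster)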