I want to train YOLOv7 with Ray, but I cannot distribute the dataset to the worker nodes.
I run Ray in Docker using the rayproject/ray-ml image. I created the head container with this command:
docker run --rm --shm-size=5.01gb --gpus all -it --network host --name ray-head rayproject/ray-ml:latest-py39-gpu
Then I started the Ray head node inside that container with this command:
ray start --head --port=6379
I started a second container on another host machine (the worker node) with this command:
docker run --rm --shm-size=5.01gb --gpus all -it --network host --name ray-worker rayproject/ray-ml:latest-py39-gpu
I joined that node to the Ray cluster with this command:
ray start --address='<ip>:6379'
I copied the whole yolov7 folder and the dataset into the ray-head container, and only the yolov7 folder into the ray-worker container. I then made a few changes to the train.py script so that it runs remotely with Ray. To put the dataset into the Ray object store, I added:
dataloader = ray.put(dataloader)
dataset = ray.put(dataset)
# ...
testloader = ray.put(testloader)
But I got this error:
raise NotImplementedError("{} cannot be pickled", self.__class__.__name__)
NotImplementedError: ('{} cannot be pickled', '_MultiProcessingDataLoaderIter')
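As far as I understand, ray.put has to serialize (pickle) whatever it is given, so my guess is that the loader holds a live multiprocessing iterator that refuses to be pickled. Below is a minimal sketch of that failure mode using only the standard library. LiveIterator, Loader, can_pickle and the file names are hypothetical stand-ins I made up for illustration; they are not the real PyTorch or YOLOv7 classes.

```python
import pickle


class LiveIterator:
    """Hypothetical stand-in for a loader iterator that owns worker processes."""

    def __init__(self, items):
        self._it = iter(items)

    def __next__(self):
        return next(self._it)

    def __getstate__(self):
        # Same shape as the error in the traceback above.
        raise NotImplementedError("{} cannot be pickled", self.__class__.__name__)


class Loader:
    """Hypothetical loader that keeps a live iterator as an attribute."""

    def __init__(self, items):
        self.items = list(items)
        self.iterator = LiveIterator(self.items)


def can_pickle(obj):
    """Return True if obj survives a pickle round trip."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except NotImplementedError:
        return False


image_paths = ["img0.jpg", "img1.jpg"]  # hypothetical file names
loader = Loader(image_paths)

loader_ok = can_pickle(loader)      # the loader drags the live iterator along
paths_ok = can_pickle(image_paths)  # plain data serializes without trouble
print(loader_ok, paths_ok)
```

If this guess is right, it may be the plain data behind the loader (file lists, arrays) that should go into the object store, with the loader itself rebuilt on each worker, but I have not confirmed that this is the recommended approach.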
On a single machine (head node only) training works fine, because the dataset is present there. It also works when I copy the dataset to the other nodes, but I would rather use the Ray object store to complete training faster.
How can I set up the Ray cluster and implement distributed training for YOLOv7?