How to implement a Ray cluster structure and distributed training in YOLOv7 with Ray?


I want to train YOLOv7 with Ray, but I cannot distribute the dataset to the nodes.

I use Ray with a Docker image, rayproject/ray-ml. I created a Docker container with this command:

docker run --rm --shm-size=5.01gb --gpus all -it --network host --name ray-head rayproject/ray-ml:latest-py39-gpu

Then I started a Ray cluster with this command:

ray start --head --port=6379

On another host (node) machine, I started a Docker container with this command:

docker run --rm --shm-size=5.01gb --gpus all -it --network host --name ray-worker rayproject/ray-ml:latest-py39-gpu

I started Ray on that node with this command:

ray start --address='<ip>:6379'

I copied the whole yolov7 folder and the dataset to the ray-head container, and only the yolov7 folder to the ray-worker container. I made a few changes to the train.py script so that it runs remotely with Ray. To put the dataset into the Ray object store, I added this:

dataloader = ray.put(dataloader)
dataset = ray.put(dataset)

//...

testloader = ray.put(testloader)

But I got this error:

raise NotImplementedError("{} cannot be pickled", self.__class__.__name__)
NotImplementedError: ('{} cannot be pickled', '_MultiProcessingDataLoaderIter')
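
The traceback indicates that pickling reached a live `_MultiProcessingDataLoaderIter`, which refuses to be serialized, while `ray.put` was serializing the loader. The sketch below reproduces that pattern with plain `pickle`, without Ray or PyTorch. `LiveIterator`, `Loader`, and `can_pickle` are hypothetical stand-ins for illustration, not YOLOv7 or PyTorch classes, and the assumption that the loader keeps its iterator as an attribute is mine.

```python
import pickle


class LiveIterator:
    """Hypothetical stand-in for an iterator that owns worker processes."""

    def __getstate__(self):
        # Refuse pickling, mirroring the message format in the traceback.
        raise NotImplementedError("{} cannot be pickled", self.__class__.__name__)


class Loader:
    """Hypothetical loader that keeps a live iterator as an attribute."""

    def __init__(self, samples):
        self.samples = samples
        self.iterator = LiveIterator()


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (NotImplementedError, pickle.PicklingError, TypeError):
        return False


# Hypothetical sample list: (file name, class id) pairs.
samples = [("img0.jpg", 0), ("img1.jpg", 1)]

print(can_pickle(Loader(samples)))  # False: the live iterator is pickled too
print(can_pickle(samples))          # True: plain data serializes fine
```

If this is what happens in train.py, the plain data is what can go into the object store; the loader would have to be built again in each worker process.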

When I tried on a single machine (only the head node), it worked fine because the dataset is there. When I copied the dataset to the other nodes, it also worked, but I want to use the Ray object store to complete training faster.
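
One possible direction, sketched under assumptions: put the raw samples (not the `DataLoader`) into the object store, pass the reference to remote tasks, and build the loader inside each task. The names `shard`, `run_on_cluster`, `train_shard`, and the sample list are hypothetical, and `num_gpus=1` assumes each node exposes at least one GPU to Ray. The sketch only distributes data; it does not synchronize gradients between workers.

```python
def shard(items, num_shards, rank):
    """Round-robin split: every num_shards-th item, starting at rank."""
    return items[rank::num_shards]


def run_on_cluster(samples, world_size=2):
    """Store samples once in the object store and fan out one task per shard."""
    import ray  # imported lazily so shard() can be used without Ray installed

    # Assumes `ray start` has already been run on the head and worker nodes.
    ray.init(address="auto")

    # ray.put serializes the plain sample list; workers fetch it on demand.
    samples_ref = ray.put(samples)

    @ray.remote(num_gpus=1)  # assumption: one GPU per task
    def train_shard(all_samples, rank, size):
        # Ray resolves the ObjectRef argument to the actual list here.
        my_samples = shard(all_samples, size, rank)
        # Build the Dataset and DataLoader here, inside the worker process,
        # then run the training loop on my_samples.
        return len(my_samples)

    futures = [
        train_shard.remote(samples_ref, rank, world_size)
        for rank in range(world_size)
    ]
    return ray.get(futures)
```

Calling `run_on_cluster(samples)` from a driver script on the head node would return the number of samples each task received. For real data-parallel training with gradient synchronization, Ray Train's `TorchTrainer` is probably the more suitable entry point than hand-written tasks.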

How can I implement a Ray cluster structure and distributed training for YOLOv7?
