PyTorch DDP (with Join Context Manager) consuming more power for uneven data distribution


I am using a 2-node distributed setup (each node has a single GPU) to train a neural network. I use PyTorch Distributed Data Parallel (DDP) with the Join context manager, and I measure power consumption while varying how the data is distributed across the two nodes. I notice higher power consumption on the node that holds the smaller share of the data. For example, when node 1 trains on 20% of the dataset and node 2 trains on the remaining 80%, node 1 draws more power once it has finished training its part. I understand how the Join context manager works, and it is intuitive why node 1 consumes more power, but the documentation says nothing about power consumption. Is this specific to the PyTorch implementation, or is this how any synchronous training behaves (in any framework - PyTorch, TensorFlow, etc.)?
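For reference, my setup is roughly equivalent to the minimal sketch below (the model, batch counts, and launch details are illustrative placeholders rather than my actual ones); the point is only that each rank iterates over a differently sized shard inside the Join context manager:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.algorithms.join import Join
from torch.nn.parallel import DistributedDataParallel as DDP


def run():
    dist.init_process_group("nccl")  # NCCL backend, same as my setup
    rank = dist.get_rank()
    torch.cuda.set_device(0)  # one GPU per node

    model = DDP(nn.Linear(10, 1).cuda(), device_ids=[0])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    # Uneven shards: e.g. rank 0 gets 20 batches, rank 1 gets 80.
    num_batches = 20 if rank == 0 else 80
    data = [torch.randn(32, 10).cuda() for _ in range(num_batches)]

    # Join lets the rank that runs out of data early keep "shadowing" the
    # collective communications (allreduces) issued by the rank that is
    # still training, so DDP's backward pass does not hang.
    with Join([model]):
        for x in data:
            loss = model(x).sum()
            loss.backward()
            opt.step()
            opt.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    run()  # e.g. torchrun --nnodes=2 --nproc_per_node=1 this_script.py
```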

I ran a round of experiments with the training dataset distributed unevenly. First I tried a 10%-90% split between node 1 and node 2, then 20%-80%, 30%-70%, ..., up to 90%-10%. Every time, the node with less data consumed more power. Specifically, the extra consumption appears after that node finishes training its shard and joins early. I used nvidia-smi to check GPU power consumption, and GPU communication happens over NCCL.
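I sample power on each node roughly like this (a simplified sketch; my actual logging script and sampling interval may differ), using nvidia-smi's CSV query mode:

```python
# Simple per-node GPU power logger built on nvidia-smi's CSV query mode.
import subprocess
import time

with open("power_log.csv", "a") as log:
    while True:
        out = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=timestamp,index,power.draw,utilization.gpu",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        log.write(out.stdout)  # one CSV line per GPU per sample
        log.flush()
        time.sleep(1)  # sample once per second
```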

I followed this resource from the PyTorch documentation: https://pytorch.org/tutorials/advanced/generic_join.html#:~:text=The%20context%20manager%20allows%20the,shadowed%20are%20specified%20by%20hooks.
