Hugging Face Trainer instantly shuts down Ubuntu VM in vCenter: no warning, no logs, no errors

I have been troubleshooting this issue for over a week, because the problem leaves no trace of an error in any log. I'm asking this question to see if anyone else has experienced this.

No matter which notebook I use, or which modules I install, upgrade, or uninstall, running the Hugging Face Trainer causes the VM to shut down instantly.

I suspect it is GPU-related, since the same code runs on the CPU with no problems.

I have made the devices visible (0,1). I have also tried with wandb enabled and disabled, and set report_to="none".
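For reference, the settings described above could be applied roughly as follows. This is a minimal sketch, not the exact notebook code: the helper name `apply_gpu_settings` is hypothetical, and it assumes the devices are exposed through the `CUDA_VISIBLE_DEVICES` environment variable and that wandb is switched off through `WANDB_MODE`.

```python
import os


def apply_gpu_settings(visible_devices="0,1", use_wandb=False):
    """Set the environment for a training run and return Trainer-related kwargs.

    Hypothetical helper for illustration only.
    """
    # CUDA_VISIBLE_DEVICES must be set before CUDA is initialized
    # (in practice, before the first torch.cuda call) to take effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices

    if not use_wandb:
        # Tell the wandb client itself to do nothing.
        os.environ["WANDB_MODE"] = "disabled"

    # report_to="none" stops the Trainer from attaching any logging integration.
    return {"report_to": "wandb" if use_wandb else "none"}
```

The returned dictionary is meant to be unpacked into the training arguments, for example `TrainingArguments(output_dir="out", **apply_gpu_settings())`, where `output_dir="out"` is only a placeholder.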

Is cuda available? True
Cuda torch version? 12.1
Is cuDNN version: 8902
cuDNN enabled?  True
Device count? 1
Current device? 0
Device name?  NVIDIA A30
tensor([[0.4543, 0.0545, 0.9293],
        [0.7722, 0.6535, 0.1276],
        [0.9957, 0.5621, 0.1621],
        [0.3164, 0.2845, 0.6874],
        [0.5489, 0.7582, 0.7139]])
import torch

# Select the GPU if available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)
print()

# Additional info when using CUDA
if device.type == 'cuda':
    print(torch.cuda.get_device_name(0))
    print('Memory Usage:')
    print('Allocated:', round(torch.cuda.memory_allocated(0)/1024**3,1), 'GB')
    print('Cached:   ', round(torch.cuda.memory_reserved(0)/1024**3,1), 'GB')
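Since the shutdown leaves nothing in the logs, one way to narrow it down is to put load on the GPU without the Trainer and write a breadcrumb to disk before and after every step. The sketch below is only an illustration: the function names, the log file name, and the matrix size are assumptions. Each line is flushed and fsynced so it should survive a hard power-off, and the last line in the file shows how far the run got.

```python
import os
import time


def breadcrumb(path, message):
    """Append one timestamped line and force it to disk immediately."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {message}\n")
        f.flush()
        os.fsync(f.fileno())


def gpu_smoke_test(log_path="gpu_breadcrumbs.log", steps=5, size=1024):
    """Run a few matrix multiplications, logging around each step.

    Returns the device type used ("cuda" or "cpu"), or None if torch is missing.
    """
    try:
        import torch
    except ImportError:
        breadcrumb(log_path, "torch not installed; nothing to test")
        return None

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    breadcrumb(log_path, f"start on {device}")

    x = torch.rand(size, size, device=device)
    for step in range(steps):
        breadcrumb(log_path, f"step {step} begin")
        y = x @ x
        if device.type == "cuda":
            # Wait for the kernel to finish so "done" really means done.
            torch.cuda.synchronize()
        breadcrumb(log_path, f"step {step} done")

    breadcrumb(log_path, "finished")
    return device.type
```

If the VM also dies during `gpu_smoke_test()`, the Trainer itself is probably not the cause; raising `size` and `steps` increases the load on the GPU.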

Has anyone experienced this before?

There is 1 best solution below

This is likely one of two things:

  1. A driver issue
  2. A VM setup issue

I'm leaning towards #2: your GPU passthrough setting is likely incorrect. Check out the link below, as it may help with your problem.

https://mathiashueber.com/windows-virtual-machine-gpu-passthrough-ubuntu/
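To help tell the two cases apart from inside the guest, a rough sketch like the one below can collect what the driver reports and look for NVIDIA messages in the kernel log of the previous boot. The helper names are hypothetical. It assumes `nvidia-smi` and `journalctl` are on the PATH and that the journal is persistent; otherwise the previous boot's log is not available and the function simply returns nothing for it.

```python
import shutil
import subprocess


def run_if_available(cmd):
    """Return the command's stdout, or None if the binary is missing or it fails."""
    if shutil.which(cmd[0]) is None:
        return None
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return result.stdout if result.returncode == 0 else None


def filter_gpu_lines(text):
    """Keep kernel-log lines that mention the NVIDIA driver (NVRM) or an Xid error."""
    return [line for line in text.splitlines() if "NVRM" in line or "Xid" in line]


def collect_report():
    """Gather driver info and GPU-related kernel messages from the previous boot."""
    smi = run_if_available([
        "nvidia-smi",
        "--query-gpu=name,driver_version,memory.total",
        "--format=csv,noheader",
    ])
    # -k: kernel messages only, -b -1: the boot before the current one.
    prev_boot = run_if_available(["journalctl", "-k", "-b", "-1", "--no-pager"])
    return {
        "nvidia_smi": smi,
        "previous_boot_gpu_lines": filter_gpu_lines(prev_boot) if prev_boot else [],
    }
```

Calling `collect_report()` right after the VM comes back up is the intended use. If the guest log is empty at the moment of the shutdown, that points towards the hypervisor side rather than the driver inside the VM.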