OCI runtime create failed: container_linux.go:349: starting container process caused on SageMaker


I am trying to run a model (a Python script) in script mode on AWS SageMaker. I use the TensorFlow estimator to invoke the script from a notebook, as shown below:

from sagemaker.tensorflow import TensorFlow

tf_estimator = TensorFlow(
                         entry_point='train.py', 
                         role=role,
                         train_instance_count=1,
                         train_instance_type='local_gpu',
                         framework_version='1.12',
                         py_version='py3',
                         script_mode=True,
                         hyperparameters={'epochs': 10})

tf_estimator.fit({'training': training_path_input, 'validation': validation_path_input})

I get the error shown below:

>     Creating tmpvq65nmup_algo-1-wipol_1 ... 
>     Creating tmpvq65nmup_algo-1-wipol_1 ... error
>     ERROR: for tmpvq65nmup_algo-1-wipol_1  Cannot start service algo-1-wipol: OCI runtime create failed: container_linux.go:349:
> starting container process caused "process_linux.go:449: container
> init caused \"process_linux.go:432: running prestart hook 1 caused
> \\\"error running hook: exit status 1, stdout: , stderr:
> nvidia-container-cli: initialization error: nvml error: driver not
> loaded\\\\n\\\"\"": unknown

I would like to know how this can be fixed.


There are 1 best solutions below


Hi, could you provide more information about the notebook instance you are using (the instance type) and which kernel you ran the notebook example with?

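The issue seems to be that the NVIDIA driver is not installed (or not loaded) on the host: the log ends with "nvidia-container-cli: initialization error: nvml error: driver not loaded". With train_instance_type='local_gpu', the training container runs on the notebook instance itself, so that instance needs a GPU and a working NVIDIA driver. On a CPU-only notebook instance, use 'local' instead.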
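One way to avoid hard-coding the instance type is to check for a working driver before building the estimator. The snippet below is only a rough sketch: the helper name pick_local_instance_type is made up for illustration, and it assumes that a successful nvidia-smi call means the driver is loaded and usable by Docker, which may not hold on every setup.

```python
import shutil
import subprocess


def pick_local_instance_type():
    """Return 'local_gpu' if nvidia-smi runs successfully, else 'local'.

    Hypothetical helper, not part of the SageMaker SDK.
    """
    # nvidia-smi ships with the NVIDIA driver; if it is missing,
    # assume there is no usable GPU on this machine.
    if shutil.which("nvidia-smi") is None:
        return "local"
    try:
        result = subprocess.run(
            ["nvidia-smi"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return "local"
    # A non-zero exit code usually means the driver is not loaded.
    return "local_gpu" if result.returncode == 0 else "local"


instance_type = pick_local_instance_type()
print("Using instance type:", instance_type)
```

The result can then be passed as train_instance_type=instance_type in the TensorFlow estimator from the question. If it comes back as 'local' on an instance you expected to have a GPU, the driver itself is probably what needs fixing.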