Training DeepSpeech on the Common Voice dataset gives an error on GPU


I'm trying to train a DeepSpeech model on the Common Voice dataset as described in the documentation, but it fails with the following error:

I0421 11:34:32.779112 140581195995008 utils.py:157] NumExpr defaulting to 2 threads.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
    self._extend_graph()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by {{node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}}with these attrs: [dropout=0, seed=4568, num_params=8, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", seed2=247]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

     [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/DeepSpeech/DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/content/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/content/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
    train()
  File "/content/DeepSpeech/training/deepspeech_training/train.py", line 529, in train
    load_or_init_graph_for_training(session)
  File "/content/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 137, in load_or_init_graph_for_training
    _load_or_init_impl(session, methods, allow_drop_layers=True)
  File "/content/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 112, in _load_or_init_impl
    return _initialize_all_variables(session)
  File "/content/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 88, in _initialize_all_variables
    session.run(v.initializer)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [dropout=0, seed=4568, num_params=8, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", seed2=247]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

     [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]

My local machine spec is as follows:

Python 3.7; CUDA 10.1; cuDNN 7.6.5; tensorflow-gpu 1.15.2; GPU: GTX 1050 Ti

I'm also installing the following packages and libraries to prepare the environment:

!apt-add-repository universe
!apt-get install sox libsox-fmt-mp3 cmake libblkid-dev e2fslibs-dev libboost-all-dev libaudit-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev
!python3.7 -m pip install sox
!python3.7 -m pip install deepspeech-gpu
!python3.7 -m pip install tensorflow-gpu==1.15.2
!python3.7 -m pip install numpy==1.19.5
!python3.7 -m pip install progressbar2
!python3.7 -m pip install progressbar
!python3.7 -m pip install progressbar33
!python3.7 -m pip install ds_ctcdecoder==0.10.0-alpha.3
!python3.7 -m pip install pyogg==0.6.14a1
!python3.7 -m pip install deepspeech
!git clone --branch v0.9.3 https://github.com/mozilla/DeepSpeech
!python3.7 -m pip install --upgrade --force-reinstall -e ./DeepSpeech/
!git clone https://github.com/kpu/kenlm.git
!mkdir -p build
!cmake kenlm
!make -j 4
!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-checkpoint.tar.gz
!curl -LO "https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/native_client.amd64.cuda.linux.tar.xz"
!mkdir native_client
!tar xvf native_client.amd64.cuda.linux.tar.xz -C native_client

I'm having the same problem both on my local machine and on google colab vm.

EDIT: I also downgraded my CUDA and cuDNN versions to 10.0 and 7.5.6, respectively, but the error still occurs.

3 Answers

Accepted answer

I have fixed the problem. It was caused by the TensorFlow version: as I mentioned above, I was using TF 1.15.2, when I should have been using TF 1.15.4 instead.
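As a sanity check after reinstalling, you can confirm that the installed version is exactly the one that worked. The helper below is only an illustrative sketch (it is not part of DeepSpeech); it compares dotted version strings numerically, assuming the 1.15.4 requirement stated above:

```python
# Sketch: verify the installed tensorflow-gpu version matches the one the
# fix above calls for (1.15.4). The helpers here are illustrative only.

REQUIRED = "1.15.4"

def version_tuple(v):
    """Turn a dotted version string like '1.15.2' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def matches_required(installed, required=REQUIRED):
    """Return True only when the installed version is exactly the required one."""
    return version_tuple(installed) == version_tuple(required)

# The version originally installed in the question fails the check:
print(matches_required("1.15.2"))  # False
print(matches_required("1.15.4"))  # True
```

In practice you would feed it the output of `python3.7 -c "import tensorflow as tf; print(tf.__version__)"`.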

Answer 2

I've seen a similar error posted on the DeepSpeech Discourse, and the issue there was the CUDA installation.

What is the value of your $LD_LIBRARY_PATH environment variable?

You can find this by doing:

$ echo $LD_LIBRARY_PATH
/usr/lib/x86_64-linux-gnu:/usr/local/cuda/bin:/usr/local/cuda/lib64:/usr/local/cuda-11.2/targets/x86_64-linux/lib

My suspicion here is that CUDA is not able to find the right libraries.
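One way to act on this suspicion is to scan each directory on LD_LIBRARY_PATH for the shared libraries that tensorflow-gpu 1.15 loads at startup. The script below is a sketch, not an official diagnostic; the library name patterns (libcudart.so*, libcudnn.so*) are assumptions based on the usual SONAMEs and may need adjusting for your CUDA version:

```python
import glob
import os

def find_cuda_libs(ld_library_path):
    """Map each required library pattern to the paths where it was found."""
    dirs = [d for d in ld_library_path.split(os.pathsep) if d]
    patterns = ["libcudart.so*", "libcudnn.so*"]
    found = {}
    for pattern in patterns:
        hits = []
        for d in dirs:
            hits.extend(glob.glob(os.path.join(d, pattern)))
        found[pattern] = hits
    return found

# Report which CUDA/cuDNN libraries are visible on the loader path.
for pattern, hits in find_cuda_libs(os.environ.get("LD_LIBRARY_PATH", "")).items():
    status = "found" if hits else "MISSING"
    print(f"{pattern}: {status} {hits}")
```

If either pattern comes back MISSING, TensorFlow will silently fall back to CPU-only kernels, which matches the error above.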

Answer 3

Thank you for the additional information, Soroush.

The LD_LIBRARY_PATH looks good, and I am going to assume that the libraries are actually in those paths.

Next, I want to ensure that the code is executing on the GPU itself.

There are a number of reasons why code may not be executing on the GPU. You mentioned that your environment is set up as per the DeepSpeech PlayBook, which means that it's using Docker. Is that correct? If so, is the Docker container spawned with the --gpus all parameter?

The next thing to check is whether nvtop reports GPU activity from DeepSpeech. When the DeepSpeech.py script is running, it should produce a high compute load that is observable in nvtop. If you are not seeing this, the code is probably not executing on the GPU, and that would explain the No OpKernel error.
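The traceback in the question actually encodes this diagnosis: its "Registered devices" line lists only CPU and XLA_CPU, meaning TensorFlow never registered a GPU device at all. A small illustrative helper (not part of DeepSpeech or TensorFlow) can pull that device list out of such an error message:

```python
import re

def registered_devices(error_text):
    """Extract the device list from a TensorFlow 'No OpKernel' error message."""
    match = re.search(r"Registered devices:\s*\[([^\]]*)\]", error_text)
    if not match:
        return []
    return [d.strip() for d in match.group(1).split(",")]

msg = "Registered devices: [CPU, XLA_CPU]"
devices = registered_devices(msg)
print(devices)            # ['CPU', 'XLA_CPU']
print("GPU" in devices)   # False -> TensorFlow never saw the GPU
```

On a working GPU setup the same line would include GPU (and usually XLA_GPU), so this check distinguishes "kernel not compiled in" from "GPU not visible".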