I am trying to fine-tune a language model using the Hugging Face libraries, following their guide (with a different model and different data, but I don't think that is the crucial point). I am doing this in a Jupyter notebook inside VSCode. My OS is Ubuntu 20.04.
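For context, my notebook roughly follows the structure of the guide (the model and dataset below are the guide's placeholders, not my actual ones):

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Placeholders from the guide; my actual model and data differ.
dataset = load_dataset("yelp_review_full")
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

tokenized = dataset.map(tokenize, batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=5
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="test_trainer"),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
)
trainer.train()  # <- this is where the kernel crashes
```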
Apparently, my laptop does not have an Nvidia GPU: running

```
sudo lspci -v | less
```

reveals that my VGA controller is an Intel TigerLake-LP GT2 (Iris Xe Graphics).
When I run the last line, `trainer.train()`, the kernel crashes. With the batch size reduced to 1 it works, but it is exceedingly slow (more than 1000 hours expected), and it seems to run on the CPU (also from Intel). Is there any way to make use of my Intel GPU to make it faster?
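From what I have read so far, Intel ships an extension, `intel_extension_for_pytorch`, that is supposed to add an `xpu` device to PyTorch. I have not gotten this working myself, so the following is only a sketch of what I imagine the check and device placement would look like:

```python
import torch
import intel_extension_for_pytorch as ipex  # not verified on my machine

# After the import, a supported Intel GPU should show up as an "xpu" device.
print(torch.xpu.is_available())

# Presumably the model would then be moved to that device before training:
model = model.to("xpu")
model = ipex.optimize(model)
```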
Right now I am using Google Colab notebooks and it works ok for me, but I am still interested to know if there is a way to make it work on my own GPU.
---

**Old version of the question for reference (partly solved already)**
I am trying to fine-tune a language model using the Huggingface libraries, following their guide (with another model and different data, but I don't think this is the crucial point). I am doing this on a Jupyter notebook inside VSCode. My OS is Ubuntu 20.04.
When importing the `evaluate` library, the following message appears:

```
2024-02-02 17:18:24.904495: I tensorflow/tsl/cuda/cudart_stub.cc:28] Could not find cuda drivers on your machine, GPU will not be used.
2024-02-02 17:18:24.934699: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-02-02 17:18:25.505165: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
/home/sg/.local/lib/python3.8/site-packages/torch/cuda/__init__.py:628: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
```
Later on, I can still run `metric = evaluate.load('accuracy')`, so the import seems to have worked anyway.
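For completeness, I then use the metric in a `compute_metrics` function, essentially as in the guide:

```python
import numpy as np
import evaluate

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    # The Trainer passes (logits, labels); accuracy needs class predictions.
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
```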
Everything goes on fine until `trainer.train()`, where the kernel crashes. The Jupyter log lists the message quoted above as the reason for the crash.
Apparently, my laptop does not have an Nvidia GPU: running

```
sudo lspci -v | less
```

reveals a lot of Intel devices, one Realtek device, and one KIOXIA device.
After having read a lot today, I get the impression that the `evaluate` library requires TensorFlow, TensorFlow requires CUDA, and CUDA requires an Nvidia GPU. Is my impression correct? Can I make it work anyway, and if so, how? (I know I could probably use a Google Colab notebook, but I would still like to do it locally if possible.) One answer to this related question pointed out the existence of OpenCL - would that be of any use in my situation?
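For what it is worth, a quick check along these lines should show whether either framework sees a GPU (on my machine I would expect `False` and an empty list):

```python
import torch
import tensorflow as tf

print(torch.cuda.is_available())               # False without an Nvidia GPU
print(tf.config.list_physical_devices("GPU"))  # [] without a usable GPU
```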
Trying to solve this problem was essentially today's work for me. Most relevantly:
- I updated tensorflow and a bunch of other packages to their most recent versions.
- I followed the instructions here to install the CUDA toolkit. I also adapted the `PATH` and `LD_LIBRARY_PATH` variables.
- I installed nvidia-tensorrt and also adapted `LD_LIBRARY_PATH`, but I am slightly unsure if I did it right: I appended the path `/home/sg/.local/lib/python3.8/site-packages/tensorrt`, which is where I found the `libnvinfer` file (see the sketch after this list).
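Concretely, this is roughly what I appended to my shell configuration; the CUDA paths assume the default install location from the instructions, and the TensorRT line is my own guess based on where I found `libnvinfer`:

```bash
# Added to ~/.bashrc after installing the CUDA toolkit and nvidia-tensorrt
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/home/sg/.local/lib/python3.8/site-packages/tensorrt:$LD_LIBRARY_PATH
```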