Loading "llama-2" 8 bit quantized version onto the GPU


Based on this repo (GitHub Link), I am trying to build a system that answers users' queries.

I was able to run the model on a CPU with a response time of ~60 s. Now I want to improve the response time, so I am trying to load the model onto a GPU.

System specs

  • Processor - Intel(R) Xeon(R) Gold 6238R CPU @ 2.20 GHz, 2195 MHz, 2 core(s), 2 logical processor(s), with 24 GB RAM
  • GPU - NVIDIA A40-12Q with 12 GB VRAM

So here are my queries:

  1. How do I load Llama 2 (or any model) onto the GPU?
  2. Can we improve the response time by loading the model onto a GPU?
  3. How can I improve the answer quality?
  4. How can I make the model answer only questions related to the documents?

The CODE

from langchain.llms import CTransformers
from dotenv import find_dotenv, load_dotenv
import box
import yaml
from accelerate import Accelerator
import torch
from torch import cuda
from ctransformers import AutoModelForCausalLM

# Check if GPU is available and set device accordingly
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using Device: {device} in llm.py file")



# Load environment variables from .env file
load_dotenv(find_dotenv())

# Import config vars
with open('config/config.yml', 'r', encoding='utf8') as ymlfile:
    cfg = box.Box(yaml.safe_load(ymlfile))

accelerator = Accelerator()


def build_llm():

    config = {'max_new_tokens': cfg.MAX_NEW_TOKENS,
              'temperature': cfg.TEMPERATURE,
              'gpu_layers': 150}

    llm = CTransformers(model=cfg.MODEL_BIN_PATH,
                        model_type=cfg.MODEL_TYPE,
                        config=config)

    llm, config = accelerator.prepare(llm, config)
    return llm

This is the part that loads the model, but while querying, the CPU utilization shoots up to 100% and the GPU utilization stays at around 2%.
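
For reference, this is roughly what I understand a direct ctransformers load with GPU offload should look like (a sketch only; the model path and gpu_layers value are placeholders, and it assumes a CUDA-enabled build of ctransformers):

# Minimal sketch: load the 8-bit GGML binary directly with ctransformers and
# offload layers to the GPU. The model path and gpu_layers value are
# placeholders; a CUDA-enabled ctransformers build is assumed.
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "models/llama-2-7b-chat.ggmlv3.q8_0.bin",  # placeholder path to the quantized file
    model_type="llama",
    gpu_layers=50,          # number of transformer layers offloaded to the GPU
    max_new_tokens=256,
    temperature=0.1,
)

print(llm("What is retrieval-augmented generation?"))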


2 Answers

Answer from Karl:
  1. How do I load Llama 2 (or any model) onto the GPU?

Since you're using accelerate, the best way to do this is to check the accelerate docs. Note that standard Llama 2 is too large for your GPU, so you may need to use a quantized version.
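
As a sketch of one possible route (not necessarily the repo's approach): the transformers stack can load a Llama 2 checkpoint in 8-bit via bitsandbytes, with accelerate placing the weights through device_map="auto". The meta-llama model ID below is an assumption (it is a gated repo); substitute whatever checkpoint you actually use.

# Sketch: 8-bit quantized load of Llama 2 with transformers + bitsandbytes.
# device_map="auto" lets accelerate place the weights on the available GPU.
# The model ID is an assumption (gated meta-llama checkpoint); swap in the
# checkpoint you actually have access to.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

inputs = tokenizer("What is retrieval-augmented generation?", return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))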

  2. Can we improve the response time by loading the model onto a GPU?

It depends. Most speed gains from GPU inference come from batch inference. If you're running inference on a single item at a time, you might not see major speed improvements. Single-item inference tends to be bottlenecked by memory transfers rather than FLOPs, which is why codebases like llama.cpp get good performance on laptops.
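
To illustrate the batching point, a sketch that reuses the model and tokenizer from the snippet above (again an assumption, not the repo's code): several prompts are padded into one batch and generated in a single call, which is where a GPU tends to pay off.

# Sketch of batch inference, reusing `model` and `tokenizer` from the snippet
# above. Several prompts are padded to the same length and generated in one
# call; this is the regime where GPU throughput shows up.
prompts = [
    "Summarise the refund policy.",
    "What is the warranty period?",
    "How do I contact support?",
]

tokenizer.pad_token = tokenizer.eos_token   # Llama 2 ships without a pad token
tokenizer.padding_side = "left"             # left-pad for decoder-only generation
batch = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

outputs = model.generate(**batch, max_new_tokens=64)
for text in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(text)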

  3. How can I improve the answer quality?

You can try improving the prompt you give the model, or curate a dataset of proper question/answer pairs for fine-tuning.
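
On the prompting side, one common pattern is a retrieval-style template that passes the relevant document text in as context. A sketch with langchain (the wording and variable names are only an example, not what the repo prescribes):

# Sketch of a retrieval-style prompt template. The template wording and the
# variable names are examples; the idea is to hand the retrieved document text
# to the model as `context` and tell it to refuse anything outside it.
from langchain.prompts import PromptTemplate

qa_template = """Use only the following context to answer the question.
If the answer is not contained in the context, say "I don't know."

Context:
{context}

Question: {question}
Answer:"""

qa_prompt = PromptTemplate(template=qa_template,
                           input_variables=["context", "question"])

print(qa_prompt.format(context="...retrieved document text...",
                       question="What does the warranty cover?"))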

  4. How can I make the model answer only questions related to the documents?

This is an open research question.

Answer from Vigya2115 Python Paradox:

When I got in contact with the people working on this, they informed me that "ctransformers[cuda]" is used to load the model onto the GPU. However, the CUDA version used by ctransformers is 11.7.1 (link), while I am using CUDA 11.4.2, which does not support this; that is what is preventing me from loading the model onto the GPU.
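
For completeness, once a compatible CUDA toolkit is in place, the GPU path should look roughly like this (a sketch only; the package extra and the gpu_layers option are what the ctransformers documentation describes, while the model path and layer count are placeholders):

# pip install ctransformers[cuda]
#
# Sketch of the GPU-offload configuration once a compatible CUDA version is
# installed. The model path and gpu_layers value are placeholders; gpu_layers
# would need tuning to fit the 12 GB of VRAM.
from langchain.llms import CTransformers

config = {
    "max_new_tokens": 256,
    "temperature": 0.1,
    "gpu_layers": 50,
}

llm = CTransformers(model="models/llama-2-7b-chat.ggmlv3.q8_0.bin",  # placeholder path
                    model_type="llama",
                    config=config)

print(llm("What is retrieval-augmented generation?"))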

Thanks,

Will keep you posted on this issue.