llama-cpp-python on GPU: Delay between prompt submission and first token generation with longer prompts

I've been building a RAG pipeline using the llama-cpp-python OpenAI-compatible server functionality and have been working my way up from running on just a laptop to running it on a dedicated workstation VM with access to an Nvidia A100. After the most recent transition to the machine with the A100 I was expecting (naively?) this RAG pipeline to be blazing fast, but I've been surprised to find that this is not currently the case.

What I'm experiencing is a seemingly linear relationship between the length of my prompt and the time it takes to get back the first response tokens (with streaming enabled):

  • a few sentences --> very short time to first response tokens
  • a few paragraphs (~2600 tokens) --> around 1 minute to first response tokens

But once the tokens start streaming, the generation speed is perfectly acceptable.
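
A minimal sketch of how the time to first token can be measured against the OpenAI-compatible endpoint (assuming the server defaults of http://localhost:8000/v1 and the openai>=1.x Python client; the model name and API key below are just placeholders):

    import time
    from openai import OpenAI

    # Point the OpenAI client at the local llama-cpp-python server
    # (default host/port; any api_key string works against a local server).
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    prompt = "..."  # swap in a few sentences vs. a ~2600-token prompt to compare

    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="llama-2-13b-chat",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # first chunk that actually carries text
            print(f"time to first token: {time.perf_counter() - start:.1f} s")
            break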

The culprit for the initial delay seems to be the first run of the self.eval(tokens) method.
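
To rule out server overhead, the same model can be loaded directly with the llama_cpp.Llama class; with verbose=True, llama.cpp prints a timing summary after each call, and the "prompt eval time" line is where this delay shows up. A sketch reusing the model path and flags from the server command further down:

    from llama_cpp import Llama

    # Same GGUF and settings as the server invocation below; verbose=True makes
    # llama.cpp print its timing summary (load / prompt eval / eval) per call.
    llm = Llama(
        model_path=r"D:\LLM_Work\cache\TheBloke\llama-2-13b-chat.Q5_K_M.gguf",
        n_gpu_layers=-1,
        n_ctx=3900,
        verbose=True,
    )

    out = llm("...a ~2600-token prompt here...", max_tokens=16)
    # The "prompt eval time = ... ms / N tokens" line is the cost of that first
    # self.eval(tokens) pass over the whole prompt; if it accounts for nearly all
    # of the wall-clock time, the delay is prompt evaluation, not generation.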

I'm very new to LLMs and GPUs, so I'm trying to understand:

  1. Why does this first run of self.eval(tokens) take so long for longer prompts?
  2. Is there anything I can do to reduce this delay?
    • Have I misconfigured something so that this eval step runs on the CPU instead of the GPU? Or is this just the way it is, with no way to improve it on my current setup?

If there is nothing to improve in my current setup, is there any reason to believe that other tools for running Llama 2, such as Hugging Face's Text Generation Inference or vLLM, would somehow be faster?

Other useful details:

  • Nvidia A100 GPU
  • I'm fairly certain that the GPU is being used to the llama-cpp-python server's fullest abilities, given the debugging output:
    llm_load_tensors: using CUDA for GPU acceleration
    llm_load_tensors: mem required  =  107.56 MiB
    llm_load_tensors: offloading 40 repeating layers to GPU
    llm_load_tensors: offloading non-repeating layers to GPU
    llm_load_tensors: offloaded 41/41 layers to GPU
    llm_load_tensors: VRAM used: 8694.21 MiB
    
  • call to start the server:
    python -m llama_cpp.server --model D:\LLM_Work\cache\TheBloke\llama-2-13b-chat.Q5_K_M.gguf --n_gpu_layers -1 --n_ctx 3900 --cache False
    