llama-cpp-python on GPU: Delay between prompt submission and first token generation with longer prompts

I've been building a RAG pipeline using the llama-cpp-python OpenAI-compatible server functionality and have been working my way up from running on just a laptop to running it on a dedicated workstation VM with access to an Nvidia A100. After the most recent transition to the machine with the A100 I was expecting (naively?) this RAG pipeline to be blazing fast, but I've been surprised to find that this is not currently the case.

What I'm experiencing is a seemingly linear relationship between the length of my prompt and the time it takes to get back the first response tokens (with streaming enabled):

  • a few sentences --> very short time to first response tokens
  • a few paragraphs (~2600 tokens) --> around 1 minute to first response tokens

But once the tokens start streaming, the generation speed is perfectly acceptable.
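
A minimal sketch of how the time to first token can be measured against the OpenAI-compatible endpoint (assuming the server defaults of http://localhost:8000/v1 and the openai>=1.x Python client; the model name and API key below are just placeholders):

    import time
    from openai import OpenAI

    # Point the OpenAI client at the local llama-cpp-python server
    # (default host/port; any api_key string works against a local server).
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

    prompt = "..."  # swap in a few sentences vs. a ~2600-token prompt to compare

    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="llama-2-13b-chat",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # first chunk that actually carries text
            print(f"time to first token: {time.perf_counter() - start:.1f} s")
            break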

The culprit for the initial delay seems to be the first run of the self.eval(tokens) method.
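
To rule out server overhead, the same model can be loaded directly with the llama_cpp.Llama class; with verbose=True, llama.cpp prints a timing summary after each call, and the "prompt eval time" line is where this delay shows up. A sketch reusing the model path and flags from the server command further down:

    from llama_cpp import Llama

    # Same GGUF and settings as the server invocation below; verbose=True makes
    # llama.cpp print its timing summary (load / prompt eval / eval) per call.
    llm = Llama(
        model_path=r"D:\LLM_Work\cache\TheBloke\llama-2-13b-chat.Q5_K_M.gguf",
        n_gpu_layers=-1,
        n_ctx=3900,
        verbose=True,
    )

    out = llm("...a ~2600-token prompt here...", max_tokens=16)
    # The "prompt eval time = ... ms / N tokens" line is the cost of that first
    # self.eval(tokens) pass over the whole prompt; if it accounts for nearly all
    # of the wall-clock time, the delay is prompt evaluation, not generation.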

I'm very new to LLMs and GPUs, so I'm trying to understand:

  1. Why does this first run of self.eval(tokens) take so long for longer prompts?
  2. Is there anything I can do to reduce this delay?
    • Have I misconfigured something so that this eval step runs on the CPU instead of the GPU? Or is this just the way it is, with no way to improve it on my current setup?

If there is nothing to improve in my current setup, is there any reason to believe that other tools for running Llama 2, such as Hugging Face's Text Generation Inference or vLLM, would somehow be faster?

Other useful details:

  • Nvidia A100 GPU
  • I'm fairly certain that the GPU is being used to the llama-cpp-python server's fullest abilities, given the debugging output:
    llm_load_tensors: using CUDA for GPU acceleration
    llm_load_tensors: mem required  =  107.56 MiB
    llm_load_tensors: offloading 40 repeating layers to GPU
    llm_load_tensors: offloading non-repeating layers to GPU
    llm_load_tensors: offloaded 41/41 layers to GPU
    llm_load_tensors: VRAM used: 8694.21 MiB
    
  • call to start the server:
    python -m llama_cpp.server --model D:\LLM_Work\cache\TheBloke\llama-2-13b-chat.Q5_K_M.gguf --n_gpu_layers -1 --n_ctx 3900 --cache False
    