If you use the ChatGPT web app, it types out its answers token by token. If you use it through the API, you get the whole answer at once.
My assumption was that they stream token-by-token answers in the web app for UX reasons (easier reading, maybe, or a sneaky way to limit the number of prompts a user sends by making them wait longer for each answer).
Today I downloaded the llama.cpp app and played around with models from Hugging Face.
What made me wonder was that the llama.cpp CLI was also printing the answers token by token. While it is typing, it uses ~70% of my CPU. The moment it stops typing, the CPU usage drops to 0%. If the output is long, the CPU stays at ~70% for longer.
It looks like the answer tokens are actually pulled from the model one by one, and the more tokens you want, the longer it takes to generate them.
However, my initial understanding was that a model always returns answers of the same length (just zero-padded when a shorter text makes more sense). I also assumed that the model's response time is invariant to the length of the prompt and of the generated output.
What am I missing? How does it really work?
The behavior you're observing is related to how language models like GPT (Generative Pre-trained Transformer) generate text, and it's not just for user-experience reasons. These models generate text token by token: a token can be a word, part of a word, or even punctuation. The model predicts the next token based on all the previous ones, and this process repeats until a complete response has been generated.
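Here is a minimal sketch of that loop using Hugging Face transformers. The model name "gpt2" is just a small, convenient example; the logic is the same for Llama-family models:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works here; gpt2 is just small enough to run quickly on CPU.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                                            # at most 20 new tokens
        logits = model(input_ids).logits                           # one full forward pass
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)    # greedy: most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)        # append it and go again
        if next_id.item() == tokenizer.eos_token_id:               # model decided it is done
            break

print(tokenizer.decode(input_ids[0]))
```

Every pass through that loop runs the whole network once, which is why the streaming you see is not cosmetic: the next token genuinely does not exist until it has been computed.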
As for the CPU usage you're observing, it is due to the active computation involved in generating each token. The model calculates the probability distribution over the next token, and it has to do this once per generated token, which is computationally intensive. Once generation is complete, the CPU usage drops to zero because the model is no longer performing these heavy calculations; that is also why longer outputs keep your CPU busy for longer.
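You can watch the per-token cost directly with the Python bindings for llama.cpp (llama-cpp-python); this is just a rough sketch, and the model path is a placeholder for whichever GGUF file you downloaded:

```python
import time
from llama_cpp import Llama

llm = Llama(model_path="models/your-model.gguf")   # placeholder path

start = time.time()
# stream=True yields roughly one chunk per generated token; each chunk only
# arrives after its own forward pass through the network.
for chunk in llm("Explain what a token is, in one sentence.",
                 max_tokens=64, stream=True):
    piece = chunk["choices"][0]["text"]
    print(f"{time.time() - start:6.2f}s {piece!r}")
```

The timestamps tick upward steadily, token by token, mirroring the sustained CPU load you saw.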
It's also not true that language models always return answers of the same length. Most models, including GPT and LLaMA, generate output until a stopping criterion is met, such as reaching a maximum token limit or producing a special end-of-sequence (EOS) token. This means the length of the output can vary from response to response.
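For example, reusing the gpt2 model and tokenizer from the sketch above, the built-in generate method will produce sequences of different lengths depending on the limits you set:

```python
# max_new_tokens is a hard cap; generation can also stop earlier if the
# model emits its EOS token.
short = model.generate(input_ids, max_new_tokens=5,
                       pad_token_id=tokenizer.eos_token_id)
longer = model.generate(input_ids, max_new_tokens=100,
                        pad_token_id=tokenizer.eos_token_id)
print(short.shape[-1], longer.shape[-1])   # the two sequences have different lengths
```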
Padding (with a special pad token rather than literal zeros of text) is relevant when training or batching inputs for neural networks, where every input in a batch needs to be the same length. During inference on a single prompt, like the text generation you're doing, padding is generally not used in that way at all.
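If you're curious where padding actually shows up, it's when several prompts of different lengths are batched together. A small illustration with the same gpt2 tokenizer:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token   # gpt2 has no dedicated pad token

batch = tokenizer(["Hello", "A much longer example sentence"],
                  padding=True, return_tensors="pt")
print(batch.input_ids)        # the shorter prompt is padded up to the longer one's length
print(batch.attention_mask)   # 0s tell the model to ignore the padded positions
```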