Huggingface Streaming Inference without TGI

455 Views Asked by Muhammad Fhadli At 02 November 2023 at 04:36

I found this tutorial for using TGI (Text Generation Inference) with the Docker image at Text Generation Inference.

However, I'm having trouble using a GPU in a Docker container, so I was wondering if there is another way to stream the model's output. I have tried TextStreamer, but it can only print the result to standard output. In my case, I want to send the streamed output to the frontend, similar to how ChatGPT works.


There are 2 best solutions below

fucalost On 02 November 2023 at 09:37

You should probably proceed with TGI.

To use a GPU within a Docker container, do the following:

1. Install the NVIDIA Container Toolkit.
2. Configure Docker to use the NVIDIA runtime:

sudo nvidia-ctk runtime configure --runtime=docker

3. Run your container like so:

docker run --runtime=nvidia --gpus all -it <YOUR_IMAGE_TAG>
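
Once the container is running, it can help to confirm that the GPU is actually visible from inside it before loading a model. The following is a minimal sketch, not part of the original answer: the helper name `gpu_visible` is hypothetical, and it assumes the NVIDIA runtime makes `nvidia-smi` available on the container's PATH, which may not hold for every image or configuration.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is on PATH and lists at least one GPU.

    Hypothetical helper for illustration; it assumes nvidia-smi is exposed
    inside the container by the NVIDIA runtime.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # No nvidia-smi binary: the GPU is most likely not exposed.
        return False
    try:
        result = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # `nvidia-smi -L` prints one line per device, each starting with "GPU".
    return result.returncode == 0 and any(
        line.startswith("GPU ") for line in result.stdout.splitlines()
    )


if __name__ == "__main__":
    print("GPU visible:", gpu_visible())
```

If this reports that no GPU is visible, the runtime configuration step above is the first thing to re-check.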

Muhammad Fhadli On 12 November 2023 at 06:44

I have found the answer. We can do this in transformers with TextIteratorStreamer:

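from threading import Thread
from transformers import TextIteratorStreamer

# Assumes `model`, `tokenizer`, `prompt_template`, and `stop_criteria` are already defined.
inputs = tokenizer(prompt_template, return_tensors="pt").input_ids.cuda()
# skip_prompt=True keeps the prompt itself out of the streamed text.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
generation_kwargs = {
    "inputs": inputs,
    "streamer": streamer,
    "max_new_tokens": 512,
    "stopping_criteria": stop_criteria,
    "temperature": 0.7,
}
# Run generate() in a background thread so this thread can consume the streamer.
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
# This loop belongs inside a generator function (for example, a streaming response handler).
for new_text in streamer:
    yield new_text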
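
To get those chunks to a frontend, one common option is to wrap each one as a Server-Sent Events message and return the generator from a streaming HTTP response. The sketch below illustrates only the pattern (a producer thread feeding a blocking iterator) and runs without a model. `QueueStreamer`, `fake_generate`, and `sse_stream` are hypothetical stand-ins rather than transformers APIs, and the `[DONE]` end marker is an arbitrary convention chosen for this example.

```python
import json
import queue
import threading
from typing import Callable, Iterator

_END = object()  # sentinel that marks the end of generation


class QueueStreamer:
    """Hypothetical stand-in for a streamer: a worker calls put(), a consumer iterates."""

    def __init__(self) -> None:
        self._chunks: queue.Queue = queue.Queue()

    def put(self, text: str) -> None:
        self._chunks.put(text)

    def end(self) -> None:
        self._chunks.put(_END)

    def __iter__(self) -> Iterator[str]:
        while True:
            item = self._chunks.get()  # blocks until the producer supplies a chunk
            if item is _END:
                return
            yield item


def fake_generate(prompt: str, streamer: QueueStreamer) -> None:
    """Stand-in for model.generate: echoes the prompt back one word at a time."""
    try:
        for word in prompt.split():
            streamer.put(word + " ")
    finally:
        streamer.end()  # always signal completion so the consumer cannot hang


def sse_stream(
    prompt: str,
    generate: Callable[[str, QueueStreamer], None] = fake_generate,
) -> Iterator[str]:
    """Yield each generated chunk as a Server-Sent Events message."""
    streamer = QueueStreamer()
    worker = threading.Thread(target=generate, args=(prompt, streamer))
    worker.start()
    for chunk in streamer:
        # JSON-encode the chunk so embedded newlines cannot break SSE framing.
        yield f"data: {json.dumps(chunk)}\n\n"
    worker.join()
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    for event in sse_stream("streamed tokens reach the browser"):
        print(event, end="")
```

In a real service, the loop over `QueueStreamer` would be replaced by the loop over `TextIteratorStreamer` shown above, and the browser would read the events with `EventSource` or a streaming `fetch`.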
