I'm not sure what I'm missing (maybe it's because my setup is a bit unconventional), but I want to stream the output of a LangChain create_json_chat_agent through an AgentExecutor. My model is TheBloke/Mistral-7B-Instruct-v0.2-GPTQ, loaded with AutoGPTQ from Hugging Face. My code:

import torch
from threading import Thread

from auto_gptq import AutoGPTQForCausalLM
from transformers import (
    AutoTokenizer,
    GenerationConfig,
    TextIteratorStreamer,
    pipeline,
)
from langchain.agents import AgentExecutor, create_json_chat_agent
from langchain_community.llms import HuggingFacePipeline

DEVICE = "cuda:0"  # not shown in the original snippet

model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    device_map="auto",
    use_safetensors=True,
    trust_remote_code=True,
    device=DEVICE,
)

generation_config = GenerationConfig.from_pretrained(model_name_or_path)
streamer = TextIteratorStreamer(
    tokenizer, timeout=40.0, skip_prompt=True, skip_special_tokens=True
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=15000,
    return_full_text=True,
    temperature=0.1,
    do_sample=True,
    torch_dtype=torch.bfloat16,
    repetition_penalty=1.15,
    #num_return_sequences=1,
    generation_config=generation_config,
    # batch_size=4,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    streamer=streamer
)
llm = HuggingFacePipeline(
    pipeline=pipe,
    model_kwargs={"temperature": 0.8},
)
agent = create_json_chat_agent(llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    max_iterations=2,
    early_stopping_method="generate",
    handle_parsing_errors=True,
    verbose=False,
)

Then I tried this:

def run():
    # Run the agent in a background thread so the main thread can
    # consume tokens from the streamer as they arrive.
    thread = Thread(
        target=agent_executor.invoke,
        args=({"input": "explain the concept of AI"},),
    )
    thread.start()
    for new_text in streamer:
        yield new_text
    thread.join()

for text in run():
    print(text, end="", flush=True)
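For reference, the consumer loop in run() mirrors the producer/consumer shape that TextIteratorStreamer implements internally: the generation thread pushes decoded text chunks onto a queue, and the main thread iterates until it sees an end-of-stream sentinel. A minimal stdlib-only sketch of that pattern (no model involved, fake_generate is a stand-in for the generation call):

```python
from queue import Queue
from threading import Thread

SENTINEL = None  # marks end of generation


def fake_generate(q: Queue) -> None:
    # Stands in for model.generate(); pushes chunks as they are "produced".
    for chunk in ["Hello", ", ", "world", "!"]:
        q.put(chunk)
    q.put(SENTINEL)


def stream():
    q: Queue = Queue()
    thread = Thread(target=fake_generate, args=(q,))
    thread.start()
    while (chunk := q.get()) is not SENTINEL:
        yield chunk
    thread.join()


print("".join(stream()))  # -> Hello, world!
```

If this pattern works but the real streamer stays silent, the chunks are never reaching the queue, i.e. the pipeline's generate call is not being invoked with the streamer attached.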

It doesn't stream. What am I missing?

Comment on proposed duplicate

All the questions and answers seem to focus on OpenAI LLMs, which have a different implementation.
