I don't know what I'm missing; maybe it's because the setup is a bit unconventional, but I want to stream the output of a LangChain create_json_chat_agent agent through AgentExecutor. My model is TheBloke/Mistral-7B-Instruct-v0.2-GPTQ, loaded with AutoGPTQ and Hugging Face Transformers. My code:
import torch
from threading import Thread
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer, GenerationConfig, TextIteratorStreamer, pipeline
from langchain.agents import AgentExecutor, create_json_chat_agent
from langchain_community.llms import HuggingFacePipeline

DEVICE = "cuda:0"  # adjust to your hardware

model_name_or_path = "TheBloke/Mistral-7B-Instruct-v0.2-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    device_map="auto",
    use_safetensors=True,
    trust_remote_code=True,
    device=DEVICE,
)
generation_config = GenerationConfig.from_pretrained(model_name_or_path)
streamer = TextIteratorStreamer(
    tokenizer, timeout=40.0, skip_prompt=True, skip_special_tokens=True
)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=15000,
    return_full_text=True,
    temperature=0.1,
    do_sample=True,
    torch_dtype=torch.bfloat16,
    repetition_penalty=1.15,
    # num_return_sequences=1,
    generation_config=generation_config,
    # batch_size=4,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
    streamer=streamer,
)
llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={"temperature": 0.8})

# tools and prompt are defined elsewhere in my code
agent = create_json_chat_agent(llm, tools, prompt)
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    max_iterations=2,
    early_stopping_method="generate",
    handle_parsing_errors=True,
    verbose=False,
)
Then I tried this:
def run():
    thread = Thread(
        target=agent_executor.invoke,
        args=({"input": "explain the concept of AI"},),
    )
    thread.start()
    for new_text in streamer:
        yield new_text
    thread.join()

for text in run():
    print(text, end="", flush=True)
It doesn't stream. What am I missing?
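To narrow it down, this is the kind of minimal check I'd run first, bypassing the agent entirely (a sketch reusing the pipe and streamer objects from above; the [INST] prompt format is just my assumption for Mistral-Instruct):

# Sanity check: drive the pipeline directly in a background thread and
# drain the streamer. If tokens arrive one by one here, the streamer
# wiring is fine and the problem is in how AgentExecutor calls the LLM.
def check_streamer():
    prompt = "[INST] Explain the concept of AI. [/INST]"
    thread = Thread(target=pipe, args=(prompt,))
    thread.start()
    for new_text in streamer:
        print(new_text, end="", flush=True)
    thread.join()

check_streamer()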
Comment on the proposed duplicate:
All of the questions and answers there seem to focus on OpenAI LLMs, which have a different streaming implementation.
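For example, those answers generally rely on something like the snippet below (sketched from memory), which uses the OpenAI chat model's built-in streaming support; as far as I can tell, HuggingFacePipeline does not expose an equivalent streaming flag, which is why I went through TextIteratorStreamer in the first place:

# What the duplicate's answers assume: an OpenAI chat model with native
# token streaming delivered through a callback handler.
from langchain_openai import ChatOpenAI
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm_openai = ChatOpenAI(
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()],
    temperature=0,
)
llm_openai.invoke("explain the concept of AI")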