I'm trying to hook up TTS to an open-source LLM behind a local API that streams the responses to my questions, but it's proving very hard and I can find nothing on the subject.
Here's the code:
import threading          # not used yet in this version
from queue import Queue   # not used yet in this version
import time

import pyttsx3
from openai import OpenAI

tts_engine = pyttsx3.init()

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

history = [
    {"role": "system", "content": "Vous êtes un assistant intelligent appelé Bob. Vous fournissez toujours des réponses rapides et précises, à la fois justes et utiles et toujours en langue française."},
    {"role": "user", "content": "Bonjour, présentez-vous à quelqu'un qui ouvre ce programme pour la première fois. Soyez concis."},
]

while True:
    user_input = input("> ")
    history.append({"role": "user", "content": user_input})

    start_time = time.time()  # start of the API request

    completion = client.chat.completions.create(
        model="local-model",
        messages=history,
        temperature=0.8,
        stream=True,
    )

    # Print each streamed token as it arrives and accumulate the full reply.
    new_message = {"role": "assistant", "content": ""}
    for chunk in completion:
        if chunk.choices[0].delta.content:
            generated_text = chunk.choices[0].delta.content
            print(generated_text, end="", flush=True)
            new_message["content"] += generated_text
    history.append(new_message)

    end_time = time.time()
    response_time = end_time - start_time
    print("\nAPI response time:", response_time, "seconds")
So I tried several things, like a loop that watches for each new word as it's being generated, but I don't think that's a good approach and it doesn't work. Roughly, I was aiming for something like the producer/consumer sketch below.
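Here is a minimal sketch of what I mean: a worker thread pulls completed sentences off a queue and speaks them with pyttsx3 while the main loop keeps streaming. The punctuation-based sentence splitting and the tts_worker/speak_stream names are just my own invention, not anything from the docs:

    import queue
    import threading

    import pyttsx3

    speech_queue = queue.Queue()

    def tts_worker():
        # One engine per thread; sharing a pyttsx3 engine across threads is risky.
        engine = pyttsx3.init()
        while True:
            sentence = speech_queue.get()
            if sentence is None:  # sentinel: shut the worker down
                break
            engine.say(sentence)
            engine.runAndWait()
            speech_queue.task_done()

    threading.Thread(target=tts_worker, daemon=True).start()

    def speak_stream(tokens):
        """Buffer streamed tokens and enqueue whole sentences for the worker."""
        buffer = ""
        for token in tokens:
            print(token, end="", flush=True)
            buffer += token
            # Crude sentence-boundary heuristic, nothing official.
            if buffer.rstrip().endswith((".", "!", "?")):
                speech_queue.put(buffer.strip())
                buffer = ""
        if buffer.strip():
            speech_queue.put(buffer.strip())  # flush whatever is left

    # Stand-in tokens for testing; the real call would pass the streamed deltas:
    speak_stream(["Bonjour, ", "je suis Bob. ", "Comment puis-je vous aider ?"])
    speech_queue.join()  # block until everything queued has been spoken

In the real program, speak_stream would consume a generator over chunk.choices[0].delta.content from the streaming completion, so the audio would only ever lag the printed text by about one sentence.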
Any suggestions?