I will be using an LLM (like GPT) to generate an answer, which is then converted to speech that I want to send to the browser using aiortc. Since the LLM takes time to produce its complete output, instead of waiting for it to finish I read the partial answer as soon as it appears and, every few words, generate an mp3 file for those words and stream it. So not all the mp3 files are available up front; I need to keep adding them as they arrive (say every 4-5 words) from the LLM.
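To illustrate, the "every 4-5 words" chunking I have in mind looks roughly like this (the helper name and token format are made up for the example, not part of my actual code):

```python
def chunk_words(token_stream, chunk_size=5):
    """Accumulate streamed LLM text fragments and yield chunks of
    roughly `chunk_size` words, suitable for incremental TTS."""
    buffer = []
    for fragment in token_stream:
        buffer.extend(fragment.split())
        # Emit a chunk as soon as enough words have accumulated
        while len(buffer) >= chunk_size:
            yield " ".join(buffer[:chunk_size])
            buffer = buffer[chunk_size:]
    # Flush whatever is left when the stream ends
    if buffer:
        yield " ".join(buffer)
```

Each yielded chunk would be passed to the TTS engine to produce one mp3, which then gets queued onto the track below.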
I wrote a custom MediaStreamTrack to achieve this, and tested it with two files, a.mp3 and b.mp3.
I ran into two issues:

1. The last few hundred milliseconds of a.mp3 sound stretched.
2. The first two seconds (or so) of b.mp3 are silent, and then it plays.

Clearly, the frames need to be stitched together better for this to work. I am definitely missing something here; it would be great if someone could point me in the right direction.
import asyncio
import os

from aiortc import MediaStreamTrack
from aiortc.contrib.media import MediaPlayer
from aiortc.mediastreams import MediaStreamError
from av.frame import Frame

# ROOT is defined elsewhere in my app as the directory holding the mp3 files


class CombinedAudioTrack(MediaStreamTrack):
    """
    An audio track that reads sequentially from multiple mp3 files,
    switching to the next queued file when the current one is exhausted.
    """

    kind = "audio"

    def __init__(self) -> None:
        super().__init__()
        # Keep state per instance (class-level attributes would be
        # shared across all CombinedAudioTrack instances)
        self.currentMediaPlayer: MediaStreamTrack = None
        self.queue: asyncio.Queue = asyncio.Queue()
        self._stop: bool = False

    def addNewMP3File(self, mp3File: str, last: bool = False) -> None:
        self.queue.put_nowait(mp3File)
        if last:
            self._stop = True

    async def getNextMediaStreamTrack(self) -> None:
        mp3File = await self.queue.get()
        self.currentMediaPlayer = MediaPlayer(os.path.join(ROOT, mp3File)).audio

    async def recv(self) -> Frame:
        try:
            # Should only happen on the first call to recv()
            if not self.currentMediaPlayer:
                await self.getNextMediaStreamTrack()
            return await self.currentMediaPlayer.recv()
        except MediaStreamError:
            # The current file is exhausted; move on to the next one
            if self._stop:
                raise
            await self.getNextMediaStreamTrack()
            return await self.currentMediaPlayer.recv()