I am creating a video chat app (WebRTC API) with a separate React frontend and Node.js backend. I want to take the audio track of the MediaStream, run it through Azure's Speech Translation API, and display the real-time translation as subtitles on the client side of the app.
I am confused about whether I should send the MediaStream to the backend over sockets or just translate it on the frontend. Sending it to the Node.js backend looks difficult, because I can't send a raw MediaStream over the network in real time.
Is it possible to just npm install the Azure Speech SDK on the frontend and do the translation entirely on the client side? If not, how best do I achieve my goal?
I am aware of the issue of exposing keys on the frontend and am working on a workaround for that.
I have tried researching how to convert a MediaStream to a ReadableStream so it can be transmitted over the network, but I haven't found anything worthwhile yet. Other suggestions involved converting the MediaStream to an ArrayBuffer of PCM chunks, but I suspect this is overkill and somewhat beyond my current skill and understanding.
The default input format is WAV. If the input audio is compressed (e.g., in MP3 format), you need to convert it to the WAV format and decode the audio buffers. The Speech SDK for JavaScript supports WAV files with a sampling rate of 16 kHz or 8 kHz, 16-bit, and mono PCM.
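As a minimal sketch of declaring that format (assuming the `microsoft-cognitiveservices-speech-sdk` npm package), a push stream can be created with the expected 16 kHz, 16-bit, mono PCM layout and handed to the recognizer:

```js
const sdk = require("microsoft-cognitiveservices-speech-sdk");

// Declare the raw format the SDK expects: 16 kHz sample rate, 16-bit samples, 1 channel (mono)
const format = sdk.AudioStreamFormat.getWaveFormatPCM(16000, 16, 1);

// A push stream lets you feed PCM chunks in as they arrive
const pushStream = sdk.AudioInputStream.createPushStream(format);
const audioConfig = sdk.AudioConfig.fromStreamInput(pushStream);
```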
Use sockets to stream the audio to your backend and keep the Speech key server-side; that is the more secure and scalable approach for real-time translation in a production environment.
Refer to the Azure documentation on how to translate speech and adapt the code samples provided there.
Backend (Node.js):
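A minimal sketch of the backend, assuming Socket.IO for transport and the `microsoft-cognitiveservices-speech-sdk` package; the event names (`audio-chunk`, `subtitle`), languages, and environment variables are placeholders you'd adapt:

```js
// server.js — Node.js backend
const { Server } = require("socket.io");
const sdk = require("microsoft-cognitiveservices-speech-sdk");

const io = new Server(3001, { cors: { origin: "*" } });

io.on("connection", (socket) => {
  // One translation pipeline per connected client
  const translationConfig = sdk.SpeechTranslationConfig.fromSubscription(
    process.env.SPEECH_KEY,
    process.env.SPEECH_REGION
  );
  translationConfig.speechRecognitionLanguage = "en-US"; // source language (assumption)
  translationConfig.addTargetLanguage("fr");             // target language (assumption)

  // Push stream fed with the raw 16 kHz / 16-bit / mono PCM chunks the client sends
  const pushStream = sdk.AudioInputStream.createPushStream(
    sdk.AudioStreamFormat.getWaveFormatPCM(16000, 16, 1)
  );
  const audioConfig = sdk.AudioConfig.fromStreamInput(pushStream);
  const recognizer = new sdk.TranslationRecognizer(translationConfig, audioConfig);

  // Partial results arrive continuously; forward them to the client as subtitles
  recognizer.recognizing = (_s, e) => {
    socket.emit("subtitle", e.result.translations.get("fr"));
  };
  recognizer.startContinuousRecognitionAsync();

  socket.on("audio-chunk", (chunk) => {
    // Socket.IO delivers binary as a Node Buffer; the SDK push stream expects an ArrayBuffer
    const arrayBuffer = chunk.buffer.slice(
      chunk.byteOffset,
      chunk.byteOffset + chunk.byteLength
    );
    pushStream.write(arrayBuffer);
  });

  socket.on("disconnect", () => {
    pushStream.close();
    recognizer.stopContinuousRecognitionAsync(() => recognizer.close());
  });
});
```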
Frontend (React):
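A sketch of the React side, assuming `socket.io-client`: it resamples the microphone audio to 16 kHz via the AudioContext, converts each buffer to 16-bit PCM, streams it to the backend, and listens for the hypothetical `subtitle` event (ScriptProcessorNode is deprecated but keeps the example short; an AudioWorklet is the modern replacement):

```jsx
// Subtitles.jsx — React frontend
import { useEffect, useState } from "react";
import { io } from "socket.io-client";

const socket = io("http://localhost:3001"); // backend URL (assumption)

export default function Subtitles({ mediaStream }) {
  const [subtitle, setSubtitle] = useState("");

  useEffect(() => {
    if (!mediaStream) return;

    // Resample the mic audio to 16 kHz so it matches what the backend push stream expects
    const audioContext = new AudioContext({ sampleRate: 16000 });
    const source = audioContext.createMediaStreamSource(mediaStream);
    const processor = audioContext.createScriptProcessor(4096, 1, 1);

    processor.onaudioprocess = (event) => {
      const float32 = event.inputBuffer.getChannelData(0);
      // Convert 32-bit float samples (-1..1) to 16-bit signed PCM
      const int16 = new Int16Array(float32.length);
      for (let i = 0; i < float32.length; i++) {
        const s = Math.max(-1, Math.min(1, float32[i]));
        int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
      }
      socket.emit("audio-chunk", int16.buffer);
    };

    source.connect(processor);
    processor.connect(audioContext.destination);

    // Translated partial results come back and are rendered as subtitles
    socket.on("subtitle", setSubtitle);

    return () => {
      processor.disconnect();
      source.disconnect();
      audioContext.close();
      socket.off("subtitle", setSubtitle);
    };
  }, [mediaStream]);

  return <div className="subtitles">{subtitle}</div>;
}
```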
Output:
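The output depends on the spoken audio; as a purely hypothetical illustration, the partial results forwarded to the client might look like this in the browser console as someone speaks:

```text
subtitle: "bonjour"
subtitle: "bonjour comment"
subtitle: "bonjour comment allez-vous"
```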