How to use the Azure Speech API to process a MediaStream when the frontend (React) and backend (Node.js) are separate


I am creating a video chat app (WebRTC API) with a separate React frontend and Node.js backend. I want to take the audio part of the MediaStream, run it through Azure's Speech Translation API to obtain a real-time translation, and display it as subtitles on the client side of the app.

I am confused about whether I should send the MediaStream to the backend using sockets or just translate it on the frontend. Sending it to the Node.js backend would be difficult because I can't directly send a raw MediaStream over the network in real time.

Is it possible to just npm install the Azure Speech SDK on the frontend and do the translation on the client side? If not, how best do I achieve my goal?

I am aware of the issue of keys being exposed on the frontend and am trying a workaround for that.

I have tried researching how to convert a MediaStream to a ReadableStream to make transmission over the network possible, but I haven't found anything worthwhile yet. Some other suggestions involved converting the MediaStream to an ArrayBuffer of PCM chunks, but I suspect this is overkill and somewhat outside my skill and understanding level.


There is 1 best solution below

Sampath

The default input format is WAV. If the input audio is compressed (for example, MP3), you need to convert it to WAV and decode the audio buffers. The Speech SDK for JavaScript supports WAV files with a 16 kHz or 8 kHz sampling rate, 16-bit samples, and mono PCM.
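The backend code below feeds the recognizer through a push stream, and you can declare this PCM format explicitly when you create the stream so the SDK doesn't have to guess it. A small sketch, assuming the microsoft-cognitiveservices-speech-sdk package:

const sdk = require('microsoft-cognitiveservices-speech-sdk');

// Declare the raw PCM format of the incoming audio:
// 16 kHz sampling rate, 16-bit samples, 1 (mono) channel
const format = sdk.AudioStreamFormat.getWaveFormatPCM(16000, 16, 1);
const pushStream = sdk.AudioInputStream.createPushStream(format);
const audioConfig = sdk.AudioConfig.fromStreamInput(pushStream);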

  • Send the audio to the backend over sockets: keeping the Speech key on the server is more secure, and the socket approach scales better for real-time translation in a production environment.

  • Refer to the documentation on how to translate speech and use the code provided there.

Backend (Node.js):


// Assumes: npm install express socket.io microsoft-cognitiveservices-speech-sdk
const express = require('express');
const http = require('http');
const { Server } = require('socket.io');
const sdk = require('microsoft-cognitiveservices-speech-sdk');

const app = express();
const server = http.createServer(app);
const io = new Server(server, { cors: { origin: '*' } });

// The Speech key and region stay on the backend
const speechTranslationConfig = sdk.SpeechTranslationConfig.fromSubscription(
  process.env.SPEECH_KEY,
  process.env.SPEECH_REGION
);
speechTranslationConfig.speechRecognitionLanguage = 'en-US';
speechTranslationConfig.addTargetLanguage('YOUR_TARGET_LANGUAGE_CODE');

io.on('connection', (socket) => {
  socket.on('audioData', (audioData) => {
    translateAndBroadcast(socket, audioData);
  });

  socket.on('disconnect', () => {
    console.log('User disconnected');
  });
});

async function translateAndBroadcast(socket, audioData) {
  // Translate audio data using the Azure Speech Translation API
  const translationResult = await translateAudio(audioData);

  // Broadcast translated results to all connected clients
  io.emit('translationResult', translationResult);
}

async function translateAudio(audioData) {
  return new Promise((resolve, reject) => {
    // audioData is expected to be an ArrayBuffer of 16 kHz, 16-bit, mono PCM;
    // convert it on the client (or here) if it arrives in another format
    const pushStream = sdk.AudioInputStream.createPushStream();
    pushStream.write(audioData);
    pushStream.close();

    const audioConfig = sdk.AudioConfig.fromStreamInput(pushStream);
    const translationRecognizer = new sdk.TranslationRecognizer(speechTranslationConfig, audioConfig);

    translationRecognizer.recognizeOnceAsync(
      (result) => {
        translationRecognizer.close();
        if (result.reason === sdk.ResultReason.TranslatedSpeech) {
          const translation = result.translations.get('YOUR_TARGET_LANGUAGE_CODE');
          resolve(translation);
        } else {
          reject(`Recognition failed: ${result.reason}`);
        }
      },
      (err) => {
        translationRecognizer.close();
        reject(`Error: ${err}`);
      }
    );
  });
}

server.listen(3001, () => console.log('Server listening on port 3001'));
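recognizeOnceAsync translates one utterance per call, so for live subtitles a continuous-recognition variant is probably a better fit: keep one push stream and one recognizer per socket, write every incoming chunk into the stream, and emit results from the recognizer's events. A rough sketch along those lines, reusing the speechTranslationConfig above (the 'audioData' chunks are assumed to be ArrayBuffers of 16 kHz, 16-bit, mono PCM, and this connection handler would replace the one shown earlier):

io.on('connection', (socket) => {
  // One long-lived push stream and recognizer per connected client
  const pushStream = sdk.AudioInputStream.createPushStream();
  const audioConfig = sdk.AudioConfig.fromStreamInput(pushStream);
  const recognizer = new sdk.TranslationRecognizer(speechTranslationConfig, audioConfig);

  // Partial (in-progress) translations, useful for live subtitles
  recognizer.recognizing = (_sender, event) => {
    io.emit('translationResult', event.result.translations.get('YOUR_TARGET_LANGUAGE_CODE'));
  };

  // Final translation for each completed utterance
  recognizer.recognized = (_sender, event) => {
    if (event.result.reason === sdk.ResultReason.TranslatedSpeech) {
      io.emit('translationResult', event.result.translations.get('YOUR_TARGET_LANGUAGE_CODE'));
    }
  };

  recognizer.startContinuousRecognitionAsync();

  // Feed every incoming PCM chunk into the recognizer's stream
  socket.on('audioData', (chunk) => pushStream.write(chunk));

  socket.on('disconnect', () => {
    pushStream.close();
    recognizer.stopContinuousRecognitionAsync(() => recognizer.close());
  });
});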

Frontend (React):



// Assumes: npm install socket.io-client
import React, { useState, useEffect } from 'react';
import { io } from 'socket.io-client';

const socket = io('http://localhost:3001');

function App() {
  const [translation, setTranslation] = useState('');

  useEffect(() => {
    // Handle incoming translated results from the backend
    socket.on('translationResult', (result) => {
      setTranslation(result);
    });

    // Cleanup on component unmount
    return () => {
      socket.off('translationResult');
      socket.disconnect();
    };
  }, []);

  // Capture audio from the WebRTC video chat and send it to the backend
  // (see the capture sketch below for one way to produce the PCM chunks)
  const sendAudioDataToBackend = (audioData) => {
    socket.emit('audioData', audioData);
  };

  return (
    <div>
      <p>{translation}</p>
    </div>
  );
}

export default App;
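As for the sendAudioDataToBackend side, one way to produce those chunks is to tap the MediaStream with the Web Audio API and convert each buffer to 16-bit PCM before emitting it. A rough sketch, assuming a 16 kHz AudioContext so the format matches what the backend recognizer expects (ScriptProcessorNode is deprecated in favour of AudioWorklet, but it keeps the example short):

// Turn the audio track of a MediaStream into 16 kHz, 16-bit mono PCM chunks
// and forward each chunk to the backend over the socket.
function streamAudioToBackend(mediaStream) {
  const audioContext = new AudioContext({ sampleRate: 16000 });
  const source = audioContext.createMediaStreamSource(mediaStream);

  // 4096-sample buffers, 1 input channel, 1 output channel
  const processor = audioContext.createScriptProcessor(4096, 1, 1);

  processor.onaudioprocess = (event) => {
    const float32Samples = event.inputBuffer.getChannelData(0);

    // Convert Float32 [-1, 1] samples to 16-bit signed PCM
    const pcm = new Int16Array(float32Samples.length);
    for (let i = 0; i < float32Samples.length; i++) {
      const s = Math.max(-1, Math.min(1, float32Samples[i]));
      pcm[i] = s < 0 ? s * 0x8000 : s * 0x7FFF;
    }

    socket.emit('audioData', pcm.buffer);
  };

  source.connect(processor);
  processor.connect(audioContext.destination);
}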

