Azure Speech SDK with FastAPI WebSocket not working inside Docker and callbacks not sending bytes back over the WebSocket


I am trying to build a real-time speech recognizer with the Azure Speech SDK and FastAPI over a WebSocket. I send a base64-encoded binary string as input; the Azure session starts, recognizes text, and prints it from the connected event handlers. However, I want to send the recognized text back over the WebSocket, so I registered a callback. The print inside the callback works, but the send does not.

Please let me know what the issue is if someone is able to help.

async def process_stream(stream, data, speech_recognizer, websocket):

    def recognized_callback(evt):
        recognized_text = evt.result.text
        print("I am in websocket callback : " + str(recognized_text) + " " + str(websocket))

        async def send_data():
            await websocket.send_text(recognized_text)

        asyncio.gather(send_data())


    # The number of bytes to push per buffer
    n_bytes = 4096
    data = io.BytesIO(data)

    speech_recognizer.recognized.connect(recognized_callback)

    # Start pushing data until all data has been read from the file
    try:
        speech_recognizer.start_continuous_recognition()
        while True:

            frames = data.read(n_bytes // 2)
            print('read {} bytes'.format(len(frames)))
            if not frames:
                speech_recognizer.stop_continuous_recognition()
                break
            stream.write(frames)

            await asyncio.sleep(0.03)
    finally:
        stream.close()
@app.websocket("/asr/en")
async def root(websocket: WebSocket):
    await websocket.accept()
    audio_format = speechsdk.audio.AudioStreamFormat(
        channels=1,
        samples_per_second=16000,
        bits_per_sample=16
    )
    stream = speechsdk.audio.PushAudioInputStream(audio_format)

    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                                   audio_config=speechsdk.audio.AudioConfig(stream=stream))

    try:
        while True:
            # Receive audio data from the client

            data = await websocket.receive_bytes()
            break  # only the first binary message is processed

        await process_stream(stream, data, speech_recognizer, websocket)

    except Exception as e:
        logger.exception(f"An error occurred: {e}")

LOGS:

INFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [62395]
INFO: Started server process [62529]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: ('127.0.0.1', 53888) - "WebSocket /asr/en" [accepted]
INFO: connection open
SESSION STARTED: SessionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d)
RECOGNIZING: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=4011e2273ad742aa9e2df99eb3e8a854, text="thank you for", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=8bc91a19b5c2433d9fcc4dbe5125ea9c, text="thank you for contact", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=8ad3f7d5a8dc40e09fdab8dbaa13fc89, text="thank you for contacting", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=49d386798e3d4ebb8172a9b681d929fc, text="thank you for contacting us", reason=ResultReason.RecognizingSpeech))
RECOGNIZED: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=f1d16bcf07da4b4ea2345e38b35c0300, text="Thank you for contacting us.", reason=ResultReason.RecognizedSpeech))
/Users/parikshit.mukherjee/PycharmProjects/pythonProject/./main.py:37: RuntimeWarning: coroutine 'root.<locals>.send_text_async' was never awaited
  send_text_async(evt.result.text)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
RECOGNIZING: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=a16037fd2bec4790be32ec31b7430126, text="lands", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=508fa0fc387c43779c548dcc966133f7, text="lands had", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=8de79e5f5ac54b919cb8987dde425bae, text="yan's incorrectly", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=bfa9673b38be481d9a07263df3643794, text="lands had currently busy", reason=ResultReason.RecognizingSpeech))
RECOGNIZED: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=382841509bba4ae68d5507cd594f2a9d, text="Yan's incorrectly busy.", reason=ResultReason.RecognizedSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=5c05c9b20acc4d679a547bef5c97b473, text="how", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=c3b7c690ff61450ca2cb02f9a17d70e5, text="how pain", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=b1ee047ae4c143b78f17b75e6d306dca, text="how pain is", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=f396e7705c894f63a485378d0490d7f1, text="how pain is very", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=b502f32f7a1d47508a9af27b47abe24f, text="how pain is very important", reason=ResultReason.RecognizingSpeech))
RECOGNIZING: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=618e549690534b7ca20e07d087de202f, text="how pain is very important to us", reason=ResultReason.RecognizingSpeech))
RECOGNIZED: SpeechRecognitionEventArgs(session_id=02f165702334418a8635a40c4c16ea1d, result=SpeechRecognitionResult(result_id=8af7a291290e45329b38bd172a1ddf65, text="How pain is very important to us.", reason=ResultReason.RecognizedSpeech))
/Users/parikshit.mukherjee/PycharmProjects/pythonProject/venv/lib/python3.9/site-packages/azure/cognitiveservices/speech/speech.py:652: RuntimeWarning: coroutine 'root.<locals>.session_stopped_cb' was never awaited
  cb(payload)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
INFO: connection closed
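
My reading of the "coroutine ... was never awaited" warnings is that the Speech SDK invokes the callback on its own worker thread, where no event loop is running, so asyncio.gather never actually schedules the coroutine. If that is right, the coroutine has to be handed over to the loop thread-safely, along these lines (a sketch only; the loop is captured while the async handler is running):

async def process_stream(stream, data, speech_recognizer, websocket):
    # Capture the loop on the async side; the SDK callback runs on another thread
    loop = asyncio.get_running_loop()

    def recognized_callback(evt):
        recognized_text = evt.result.text
        # Thread-safe hand-off from the SDK's worker thread to the event loop
        asyncio.run_coroutine_threadsafe(
            websocket.send_text(recognized_text), loop)

    # ... rest unchanged ...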

Answer by Dasari Kamali:

I tried the FastAPI WebSocket code below to convert speech to text, and then ran it inside Docker with a Dockerfile.

Code :

from fastapi import FastAPI, WebSocket, HTTPException
import base64
import os
import azure.cognitiveservices.speech as speechsdk

app = FastAPI()

speech_key = "<speech_key>"
service_region = "<speech_region>"

# Despite the name, this performs single-shot recognition on a WAV file
def speech_recognize_continuous_from_stream(audio_data):
    speech_config = speechsdk.SpeechConfig(subscription=speech_key, region=service_region)
    audio_config = speechsdk.audio.AudioConfig(filename=audio_data)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

    result = speech_recognizer.recognize_once()
    return result.text if result.reason == speechsdk.ResultReason.RecognizedSpeech else ""

@app.post("/convert")
async def convert_audio(audioBase64: dict):
    try:
        audio_bytes = base64.b64decode(audioBase64['audioBase64'])
        with open("temp.wav", "wb") as audio_file:
            audio_file.write(audio_bytes)
        transcription_result = speech_recognize_continuous_from_stream("temp.wav")
        os.unlink("temp.wav")
        return {"text": transcription_result}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    try:
        await websocket.accept()
        print("WebSocket connection accepted")
        while True:
            data = await websocket.receive_text()
            print("Received data from client:", data[:50])  
            text = "Speech recognition failed"
            await websocket.send_text(text)
    except Exception as e:
        print("WebSocket error:", e)

Output :

The code ran successfully.

I received the text output for base64 input data in the following form:

{
    "audioBase64": "your_base64_audio_data_here"
}
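
For reference, the /convert endpoint can be exercised with a small client like this (a sketch using the requests library; sample.wav is an illustrative path):

import base64
import requests

# Encode a local WAV file and post it as the JSON payload the endpoint expects
with open("sample.wav", "rb") as f:
    payload = {"audioBase64": base64.b64encode(f.read()).decode("ascii")}

resp = requests.post("http://localhost:8000/convert", json=payload)
print(resp.json())  # e.g. {"text": "Thank you for contacting us."}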


Next, I added the Dockerfile below to the code:

Dockerfile :

FROM python:3.10

WORKDIR /app
COPY main.py .
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

I built the image, ran the container, and checked its logs with the following commands:

docker build -t my_fastapi_app .
docker run -d -p 8000:8000 my_fastapi_app
docker logs <CONTAINER_ID>
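
To exercise the /ws endpoint from outside the container, a minimal client using the websockets package can be used (a sketch; the URL assumes the port mapping from the docker run command above):

import asyncio
import websockets

async def main():
    # Port 8000 is published by the docker run command above
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        await ws.send("hello")   # the sample endpoint expects text frames
        print(await ws.recv())   # prints the server's reply

asyncio.run(main())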
