Converting a real-time MP3 audio stream to 8000/mulaw in Python

I'm working with an API that streams real-time audio in MP3 format (44.1 kHz, 16-bit), and I need to convert this stream to 8 kHz μ-law (8000/mulaw). I've tried several solutions, but all have run into issues due to the structure of the MP3 data.

My current approach is to decode and process each chunk of audio as it arrives, using PyDub and Python's audioop module. However, I often encounter errors that seem to arise from trying to decode a chunk of data that doesn't contain a complete MP3 frame.

Here's a simplified version of my current code:

from pydub import AudioSegment
from pydub.exceptions import CouldntDecodeError
import audioop
import io

class StreamConverter:
    def __init__(self):
        self.state = None  # resampler state carried between audioop.ratecv calls
        self.buffer = b''  # MP3 bytes received but not yet decoded

    def convert_chunk(self, chunk):
        # Add the chunk to the buffer
        self.buffer += chunk

        # Try to decode the buffer
        try:
            audio = AudioSegment.from_mp3(io.BytesIO(self.buffer))
        except CouldntDecodeError:
            return None

        # If decoding was successful, empty the buffer
        self.buffer = b''

        # Ensure audio is mono
        if audio.channels != 1:
            audio = audio.set_channels(1)

        # Get audio data as bytes
        raw_audio = audio.raw_data

        # Sample rate conversion
        chunk_8khz, self.state = audioop.ratecv(raw_audio, audio.sample_width, audio.channels, audio.frame_rate, 8000, self.state)

        # μ-law conversion
        chunk_ulaw = audioop.lin2ulaw(chunk_8khz, audio.sample_width)

        return chunk_ulaw

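# This is then used as follows:
converter = StreamConverter()
for chunk in audio_stream:
    if chunk is not None:
        ulaw_chunk = converter.convert_chunk(chunk)
        # ulaw_chunk is None when the buffer could not be decoded yet;
        # otherwise send ulaw_chunk to the Twilio API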

I believe my issue stems from the fact that MP3 data is structured in frames, and I can't reliably decode the audio if a chunk doesn't contain a complete frame. Also, a frame could potentially be split between two chunks, so I can't decode them independently.
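To make the frame problem concrete, here is a minimal sketch of a splitter that buffers incoming bytes and only hands back complete frames. It assumes MPEG-1 Layer III audio (which is what a 44.1 kHz MP3 normally is), and the names `frame_length` and `FrameSplitter` are hypothetical helpers, not part of PyDub or any library. It does not verify that the next header follows each frame and it does not handle ID3 tags, so treat it as an illustration rather than a robust parser.

```python
# Lookup tables for MPEG-1 Layer III headers (None marks free/reserved values).
MPEG1_L3_BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96,
                          112, 128, 160, 192, 224, 256, 320, None]
MPEG1_SAMPLE_RATES = [44100, 48000, 32000, None]


def frame_length(header):
    """Return the frame size in bytes for a 4-byte MPEG-1 Layer III header,
    or None if the bytes do not look like such a header."""
    if len(header) < 4 or header[0] != 0xFF or (header[1] & 0xE0) != 0xE0:
        return None  # no 11-bit sync word
    version = (header[1] >> 3) & 0x03
    layer = (header[1] >> 1) & 0x03
    if version != 0b11 or layer != 0b01:
        return None  # only MPEG-1 Layer III is handled in this sketch
    bitrate = MPEG1_L3_BITRATES_KBPS[(header[2] >> 4) & 0x0F]
    sample_rate = MPEG1_SAMPLE_RATES[(header[2] >> 2) & 0x03]
    if bitrate is None or sample_rate is None:
        return None
    padding = (header[2] >> 1) & 0x01
    return 144 * bitrate * 1000 // sample_rate + padding


class FrameSplitter:
    """Accumulates stream chunks and returns only complete frames."""

    def __init__(self):
        self.buffer = bytearray()

    def feed(self, chunk):
        """Add a chunk; return a list of complete frames found so far.
        Any trailing partial frame stays in the buffer for the next call."""
        self.buffer += chunk
        frames = []
        pos = 0
        while pos + 4 <= len(self.buffer):
            length = frame_length(self.buffer[pos:pos + 4])
            if length is None:
                pos += 1  # not a header here; slide forward to resynchronise
                continue
            if pos + length > len(self.buffer):
                break  # frame is incomplete; wait for more data
            frames.append(bytes(self.buffer[pos:pos + length]))
            pos += length
        del self.buffer[:pos]
        return frames
```

Note that even whole frames are not fully independent: Layer III frames can borrow bits from earlier frames (the bit reservoir), so decoding each batch of frames in isolation may still lose a little audio at the boundaries.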

Does anyone have any ideas on how I can handle this? Is there a way to process an MP3 stream in real-time while converting to 8000/mulaw, possibly using a different library or approach?

There is 1 solution below

Strategy 1:

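You could use librosa (https://librosa.org/) to decode the MP3 data. Its load() function can read MP3 data from a file-like object and returns a NumPy array of floating-point samples together with the sample rate. You can then resample that array and convert it to 16-bit PCM for the μ-law step. As with PyDub, a chunk that ends in the middle of an MP3 frame may not decode cleanly, so you may still need to buffer. Here's some sample code: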

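import io
import audioop
import librosa
import numpy as np

def convert_chunk(chunk):
    # librosa.load returns (samples, sample_rate); samples are floats in [-1.0, 1.0]
    audio, sample_rate = librosa.load(io.BytesIO(chunk), sr=None, mono=True)
    # Resample to 8 kHz (orig_sr and target_sr are keyword arguments)
    chunk_8khz = librosa.resample(audio, orig_sr=sample_rate, target_sr=8000)
    # Convert float samples to 16-bit signed PCM bytes, which audioop expects
    pcm16 = (np.clip(chunk_8khz, -1.0, 1.0) * 32767).astype(np.int16).tobytes()
    # μ-law conversion (2 = bytes per 16-bit sample)
    chunk_ulaw = audioop.lin2ulaw(pcm16, 2)
    return chunk_ulaw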

This decodes each chunk and converts it to 8 kHz μ-law. The output is a bytes object that can be sent to the Twilio API.
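For reference, here is roughly what the μ-law step does to each 16-bit sample. The audioop module was deprecated in Python 3.11 and removed in 3.13, so a small pure-Python fallback can be handy. This is a sketch following the widely used G.711 reference encoder; the function names `linear16_to_ulaw` and `pcm16_to_ulaw` are hypothetical, and the output is not guaranteed to match audioop byte for byte.

```python
import struct

BIAS = 0x84   # added before the segment search, as in the G.711 reference encoder
CLIP = 32635  # largest magnitude that still fits after adding BIAS


def linear16_to_ulaw(sample):
    """Encode one signed 16-bit sample as a single μ-law byte (0-255)."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    # Segment number (0-7): position of the highest set bit above bit 7.
    exponent = (magnitude >> 7).bit_length() - 1
    # Four bits of detail within the segment.
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # μ-law bytes are stored bit-inverted.
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def pcm16_to_ulaw(pcm):
    """Encode little-endian 16-bit mono PCM bytes as μ-law bytes."""
    count = len(pcm) // 2
    samples = struct.unpack("<%dh" % count, pcm[:count * 2])
    return bytes(linear16_to_ulaw(s) for s in samples)
```

Each 16-bit sample becomes one byte, so one second of 8 kHz mono audio is 8000 bytes of μ-law data.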

Strategy 2:

Decode the MP3 stream to uncompressed PCM (the sample data a WAV file holds) first, and then perform the necessary conversions. For example:

    def convert_chunk(self, chunk):
        # Add the chunk to the buffer
        self.buffer += chunk

        # Try to decode the buffered MP3 data
        try:
            audio = AudioSegment.from_mp3(io.BytesIO(self.buffer))
        except Exception:
            return None

        # If decoding was successful, empty the buffer
        self.buffer = b''

        # Ensure audio is mono and 16-bit before extracting samples
        if audio.channels != 1 or audio.sample_width != 2:
            audio = audio.set_channels(1).set_sample_width(2)

        # Uncompressed PCM samples (the WAV payload without its RIFF header);
        # bytes from audio.export(format='wav') would include that header
        pcm_data = audio.raw_data

        # Sample rate conversion
        chunk_8khz, self.state = audioop.ratecv(pcm_data, 2, 1, audio.frame_rate, 8000, self.state)

        # μ-law conversion
        chunk_ulaw = audioop.lin2ulaw(chunk_8khz, 2)

        return chunk_ulaw

Once the MP3 data has been decoded to uncompressed PCM, the resampling and μ-law steps operate on plain samples and no longer depend on MP3 frame boundaries. The decode step itself still needs complete frames, which is what the buffer is for.
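One detail to watch with the WAV route: a WAV file is a RIFF header followed by the PCM samples, so the bytes of an exported WAV should not be handed straight to sample-level functions such as audioop.ratecv. The sketch below uses only the standard-library wave module to show the difference; `pcm_to_wav_bytes` and `wav_bytes_to_pcm` are hypothetical helper names.

```python
import io
import wave


def pcm_to_wav_bytes(pcm, sample_rate=44100, channels=1, sample_width=2):
    """Wrap raw PCM bytes in an in-memory WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(sample_width)
        writer.setframerate(sample_rate)
        writer.writeframes(pcm)
    return buf.getvalue()


def wav_bytes_to_pcm(wav_bytes):
    """Strip the container and return only the PCM sample bytes."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
        return reader.readframes(reader.getnframes())


pcm = bytes(range(200))            # 100 dummy 16-bit mono frames
wav_bytes = pcm_to_wav_bytes(pcm)
header_size = len(wav_bytes) - len(pcm)   # bytes that are not audio samples
```

If the header bytes are resampled along with the audio, they show up as a short burst of noise at the start of every converted chunk.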

Note that the sample width is set to 2 (16-bit) during the conversions. If your MP3 audio stream has a different sample width, you may need to adjust it accordingly.
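Both strategies re-decode a buffer on every chunk. Since PyDub itself calls ffmpeg to decode MP3, another option that may be worth trying is to keep a single ffmpeg process running and pipe the stream through it, letting ffmpeg deal with frame boundaries. The following is a sketch under assumptions: ffmpeg is installed and on the PATH, `build_ffmpeg_command` and `stream_to_mulaw` are hypothetical names, and the 160-byte read size (20 ms of 8 kHz μ-law) is just an example value.

```python
import subprocess
import threading


def build_ffmpeg_command(ffmpeg_path="ffmpeg"):
    """Return an ffmpeg argument list: MP3 on stdin, raw 8 kHz mono μ-law on stdout."""
    return [
        ffmpeg_path,
        "-hide_banner", "-loglevel", "error",
        "-f", "mp3", "-i", "pipe:0",   # MP3 bytes arrive on stdin
        "-ac", "1",                     # downmix to mono
        "-ar", "8000",                  # resample to 8 kHz
        "-f", "mulaw", "pipe:1",        # headerless μ-law bytes on stdout
    ]


def stream_to_mulaw(mp3_chunks, read_size=160, ffmpeg_path="ffmpeg"):
    """Feed MP3 chunks to one ffmpeg process and yield μ-law byte blocks."""
    proc = subprocess.Popen(
        build_ffmpeg_command(ffmpeg_path),
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
    )

    def feed():
        # Write on a separate thread so a full pipe cannot block the reader.
        try:
            for chunk in mp3_chunks:
                proc.stdin.write(chunk)
        finally:
            proc.stdin.close()

    threading.Thread(target=feed, daemon=True).start()

    while True:
        block = proc.stdout.read(read_size)
        if not block:
            break  # ffmpeg closed stdout: the input stream has ended
        yield block
    proc.wait()
```

Usage would look like `for ulaw_block in stream_to_mulaw(audio_stream): ...`, sending each block to the Twilio API. ffmpeg buffers some input before producing output, so expect a small added delay.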