quality issue with offline voice-to-text using Sphinx4

Question

quality issue with offline voice-to-text using Sphinx4

788 Views Asked by Alex Woolford At 17 August 2025 at 01:25

I'd like to perform voice recognition on a large number of .wav files that are continually being generated.

There are a growing number of online voice-to-text API services (e.g. Google Cloud Speech, Amazon Lex, Twilio Speech Recognition, Nexmo Voice, etc.) which would work well for connected applications, but aren't suitable for this use case due to cost and bandwidth.

A quick google search suggested CMUSphinx (CMU = Carnegie Mellon University) is popular for speech recognition.

I tried the 'hello world' example:

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class Main {

    public static void main(String[] args) throws IOException {

        Configuration configuration = new Configuration();

        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        InputStream stream = new FileInputStream(new File("src/main/resources/test.wav"));

        recognizer.startRecognition(stream);
        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
        }
        recognizer.stopRecognition();

    }
}

The result was slightly disappointing. The 'test.wav' file contains the following audio:

This is the first interval of speaking. After the first moment of silent, this is the second interval of speaking. After the third moment of silence, this the third interval of speaking and the last one.

This was interpreted as:

this is the first interval speaking ... for the first moment of silence is the second of all speaking ... for the for the moment of silence this is the f***ing several speaking in the last

Most of the words have been captured, but the output is garbled to the extent that the meaning is lost. I then downloaded a news story where the enunciation was crystal clear, and the transcription was complete gibberish. It captured as much as a very drunk person would listening to a foreign language.

I'm curious to know if anyone's using Sphinx4 successfully and, if so, what tweaks were done to make it work? Are there alternative acoustic/language models, dictionaries etc... that perform better? Any other open source suggestions for offline speech-to-text I should consider?

Original Q&A

There are 1 best solutions below

**Alex Woolford** · Accepted Answer

This turned out to be a trivial issue that's documented in the FAQ: "Q: What is sample rate and how does it affect accuracy"

[...] we can not detect sample rate yet. So before using decoder you need to make sure that both sample rate of the decoder matches the sample rate of the input audio and the bandwidth of the audio matches the bandwidth that was used to train the model. A mismatch results in very bad accuracy.

The news footage was BBC audio stereo, recorded at 44.1 khz.

$ soxi GlobalNewsPodcast-20170828-CatastrophicFloodsRisin.wav

Input File     : 'GlobalNewsPodcast-20170828-CatastrophicFloodsRisin.wav'
Channels       : 2
Sample Rate    : 44100
Precision      : 16-bit
Duration       : 00:29:23.79 = 77783087 samples = 132284 CDDA sectors
File Size      : 311M
Bit Rate       : 1.41M
Sample Encoding: 16-bit Signed Integer PCM

I converted it to mono:

$ sox GlobalNewsPodcast-20170828-CatastrophicFloodsRisin.wav GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono.wav remix 1,2
$ soxi GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono.wav

Input File     : 'GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono.wav'
Channels       : 1
Sample Rate    : 44100
Precision      : 16-bit
Duration       : 00:29:23.79 = 77783087 samples = 132284 CDDA sectors
File Size      : 156M
Bit Rate       : 706k
Sample Encoding: 16-bit Signed Integer PCM

Then downsampled to 16khz:

$ sox GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono.wav -r 16k GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono16k.wav
$ soxi GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono16k.wav

Input File     : 'GlobalNewsPodcast-20170828-CatastrophicFloodsRisinMono16k.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:29:23.79 = 28220621 samples ~ 132284 CDDA sectors
File Size      : 56.4M
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM

Now it's working pretty well. Here's a snippet of transcribed audio from the news article:

emergency officials said they expect the hall from million people to seek assistance in texas bolton flashy thousand people already being cared for in temporary shelter is on the engine is a big on releasing water from two downs that protect houston city sense of ...

quality issue with offline voice-to-text using Sphinx4

There are 1 best solutions below

Related Questions in SPEECH-RECOGNITION

Related Questions in VOICE-RECOGNITION

Related Questions in SPEECH-TO-TEXT

Related Questions in CMUSPHINX

Related Questions in SPHINX4

Trending Questions

Popular # Hahtags

Popular Questions