I'd like to perform voice recognition on a large number of .wav
files that are continually being generated.
There is a growing number of online voice-to-text API services (e.g. Google Cloud Speech, Amazon Lex, Twilio Speech Recognition, Nexmo Voice) that would work well for connected applications, but they aren't suitable for this use case due to cost and bandwidth constraints.
A quick Google search suggested CMUSphinx (CMU = Carnegie Mellon University) is a popular choice for offline speech recognition.
I tried the 'hello world' example:
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class Main {
    public static void main(String[] args) throws IOException {
        // Point the recognizer at the default US English models bundled with sphinx4-data.
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        InputStream stream = new FileInputStream(new File("src/main/resources/test.wav"));
        recognizer.startRecognition(stream);

        // Each SpeechResult corresponds to one recognized utterance; null marks end of stream.
        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            System.out.format("Hypothesis: %s\n", result.getHypothesis());
        }
        recognizer.stopRecognition();
        stream.close();
    }
}
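For anyone reproducing this: the example assumes the Sphinx4 library and its bundled models are on the classpath. With Maven, that's roughly the following (coordinates as published on Maven Central at the time of writing):

```xml
<!-- Sphinx4 recognizer core -->
<dependency>
  <groupId>edu.cmu.sphinx</groupId>
  <artifactId>sphinx4-core</artifactId>
  <version>5prealpha</version>
</dependency>
<!-- Default en-us acoustic model, dictionary and language model -->
<dependency>
  <groupId>edu.cmu.sphinx</groupId>
  <artifactId>sphinx4-data</artifactId>
  <version>5prealpha</version>
</dependency>
```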
The result was slightly disappointing. The 'test.wav' file contains the following audio:
This is the first interval of speaking. After the first moment of silence, this is the second interval of speaking. After the third moment of silence, this is the third interval of speaking and the last one.
This was interpreted as:
this is the first interval speaking ... for the first moment of silence is the second of all speaking ... for the for the moment of silence this is the f***ing several speaking in the last
Most of the words have been captured, but the output is garbled to the extent that the meaning is lost. I then downloaded a news story where the enunciation was crystal clear, and the transcription was complete gibberish; it captured about as much as a very drunk person listening to a foreign language would.
I'm curious to know if anyone's using Sphinx4 successfully and, if so, what tweaks were needed to make it work. Are there alternative acoustic/language models or dictionaries that perform better? Any other open-source offline speech-to-text options I should consider?
This turned out to be a trivial issue that's documented in the FAQ: "Q: What is sample rate and how does it affect accuracy"
The news footage was stereo BBC audio recorded at 44.1 kHz, whereas the default en-us acoustic model expects 16 kHz, 16-bit mono audio.
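A quick way to check what a .wav file actually contains before feeding it to the recognizer (this helper is my own illustration using the JDK's javax.sound.sampled; it wasn't part of the original post):

```java
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import java.io.File;

public class WavInfo {
    // Summarizes a WAV file's format, e.g. "44100.0 Hz, 2 ch, 16-bit"
    // for a stereo 44.1 kHz recording. The default en-us model wants
    // "16000.0 Hz, 1 ch, 16-bit".
    public static String describe(File wav) throws Exception {
        AudioFormat f = AudioSystem.getAudioFileFormat(wav).getFormat();
        return f.getSampleRate() + " Hz, " + f.getChannels() + " ch, "
                + f.getSampleSizeInBits() + "-bit";
    }
}
```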
I converted it to mono:
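The exact command wasn't preserved in the post; one way to do the stereo-to-mono mixdown, assuming ffmpeg is installed (file names are placeholders):

```shell
# Mix both stereo channels down to one (placeholder file names).
ffmpeg -i news.wav -ac 1 news-mono.wav
```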
Then I downsampled it to 16 kHz:
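Again, the original command isn't shown; a plausible equivalent with ffmpeg (placeholder file names):

```shell
# Resample to the 16 kHz rate the default acoustic model expects.
ffmpeg -i news-mono.wav -ar 16000 news-16k.wav
```

sox can also do both steps in one go: `sox news.wav -c 1 -r 16000 news-16k.wav`.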
Now it's working pretty well. Here's a snippet of transcribed audio from the news article: