Identify start/stop times of spoken words within a phrase using Sphinx


I'm trying to identify the start/end time of individual words within a phrase. I have a WAV file of the phrase AND the text of the utterance.

Is there an intelligent way of combining these two sources of data (audio, text) to improve Sphinx's recognition? What I'd like as output are accurate start/stop times for each word within the phrase.

(I know you can pass -time yes to pocketsphinx to get the time data I'm looking for -- however, the speech recognition itself is not very accurate.)

The solution cannot be for a specific speaker, as the corpus I'm working with contains a lot of different speakers, although they are all using US English.

Best answer

We have a specific tool for that: the audio aligner in sphinx4. It takes both the audio and the known text as input, so it fits your case. You can read about it here:

http://cmusphinx.sourceforge.net/2014/07/long-audio-aligner-landed-in-trunk/
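
If you want to see the general idea of combining the known text with recognizer timings before setting up the aligner, here is a rough sketch in Python. To be clear, this is not the sphinx4 aligner and not how it works internally. It is a much simpler text-level fallback: take whatever (word, start, end) triples the recognizer gave you, match them against your known transcript with the standard library's `difflib.SequenceMatcher`, and copy the times over for the words that agree. The function name `transfer_timings` and the sample timings are made up for illustration, and words the recognizer got wrong simply come back without a time.

```python
from difflib import SequenceMatcher


def transfer_timings(transcript_words, hypothesis):
    """Copy recognizer timings onto the known transcript.

    transcript_words: list of words from the known text.
    hypothesis: list of (word, start_sec, end_sec) tuples from a recognizer
                (hypothetical input format, assumed for this sketch).
    Returns a list of (word, (start, end)) pairs; the timing is None for
    transcript words that had no matching recognized word.
    """
    ref = [w.lower() for w in transcript_words]
    hyp = [w.lower() for w, _, _ in hypothesis]
    timings = [None] * len(ref)

    # Find runs of words that appear in the same order in both sequences.
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = hypothesis[block.b + k]
            timings[block.a + k] = (start, end)

    return list(zip(transcript_words, timings))


# Made-up example: the recognizer misheard "brown" as "crown".
transcript = "the quick brown fox jumps".split()
recognized = [
    ("the", 0.10, 0.25),
    ("quick", 0.25, 0.60),
    ("crown", 0.60, 0.95),
    ("fox", 0.95, 1.30),
    ("jumps", 1.30, 1.80),
]

for word, timing in transfer_timings(transcript, recognized):
    print(word, timing)
```

The gaps left as `None` are the weak point of this approach: you only get times where the recognizer already agreed with the text. An aligner that uses the text during decoding, like the tool linked above, is the better fit when you need a time for every word.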