How to determine length of observation sequence for HMM in speech recognition


I'm re-learning how to use Hidden Markov Models for speech recognition and I have a question. It seems that most discussions of HMMs consider the case of a known sequence of observations [O1, O2, O3, ..., OT], where T is a known number. However, if we were to use a trained HMM on speech in real time, or on a WAV file where someone speaks one sentence after another, how exactly does one select the value of T? In other words, how does one know when the speaker has ended one sentence and started another? Does a practical HMM speech recognizer just use a fixed value for T and periodically recompute the optimal state sequence up to the current observation, using a fixed-size window of length T into the past? Or is there some better way of dynamically selecting T at any instant?

Answer by Nikolay Shmyrev:

Does a practical HMM for speech recognition just use a fixed value for T and periodically recompute the optimal state sequence up to the current observation using a fixed-size window of length T into the past?

The Viterbi decoding algorithm works frame by frame, so you simply iterate over frames; you can iterate indefinitely, until the backtracking matrix fills all available memory.
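To make the frame-by-frame nature concrete, here is a minimal Viterbi sketch (my own illustration, not code from any particular decoder). The names `viterbi_stream`, `log_A`, `log_B`, and `log_pi` are assumptions; observations are consumed one frame at a time and the backpointer table grows with each frame, which is exactly the structure that eventually fills memory if you never stop:

```python
import numpy as np

def viterbi_stream(log_A, log_B, log_pi, observations):
    """Frame-by-frame Viterbi over discrete observations.

    log_A:  (S, S) log transition probabilities
    log_B:  (S, V) log emission probabilities
    log_pi: (S,)   log initial-state probabilities
    Returns the best state path and its log score.
    """
    backptr = []  # one row of backpointers appended per incoming frame
    delta = log_pi + log_B[:, observations[0]]
    for obs in observations[1:]:
        # scores[i, j]: best path ending in state i, then moving to state j
        scores = delta[:, None] + log_A
        backptr.append(np.argmax(scores, axis=0))
        delta = np.max(scores, axis=0) + log_B[:, obs]
    # backtrack from the best final state
    state = int(np.argmax(delta))
    path = [state]
    for bp in reversed(backptr):
        state = int(bp[state])
        path.append(state)
    return list(reversed(path)), float(np.max(delta))
```

Note that only the forward pass is incremental; the backtrack still needs the whole backpointer table, which is why a practical decoder must decide where to cut the stream.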

The training algorithm works on audio clips prepared before training, usually 1-30 seconds long, so for training the audio length is already known.

how does one know when the speaker has ended one sentence and started another?

There are different strategies here. Decoders search for silence to delimit a decoding pass. Note that silence does not necessarily mean a break between sentences: there may be no break between sentences at all, and there can be a break in the middle of a sentence too.

So to find silence, the decoder can either use a standalone voice activity detection (VAD) algorithm and break when the VAD detects silence, or it can analyze the backtracking information to decide whether silence appeared. The second method is a bit more reliable.
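The first strategy can be sketched with a toy energy-based VAD (an illustration only; production VADs are statistical and adapt to the noise floor, and all names and thresholds here are assumptions of mine). The idea is to break decoding once enough consecutive low-energy frames have accumulated:

```python
import numpy as np

def detect_breaks(samples, sample_rate, frame_ms=30,
                  energy_thresh=1e-4, min_silence_ms=300):
    """Return sample positions where a silence-based break could be made.

    A break is flagged the moment `min_silence_ms` worth of consecutive
    frames all fall below the (illustrative) energy threshold.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    needed = min_silence_ms // frame_ms  # consecutive quiet frames required
    quiet = 0
    breaks = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = float(np.mean(frame ** 2))
        quiet = quiet + 1 if energy < energy_thresh else 0
        if quiet == needed:
            breaks.append(start + frame_len)  # end of the quiet run
    return breaks
```

In a real system the decoder would finalize its hypothesis at each break and restart, rather than keeping one ever-growing backtracking matrix.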