Why should a word-level language model help in beam search decoding in ASR?


I was experimenting with beam search decoding of an acoustic model trained with CTC loss on an automatic speech recognition task. The version I was using was based on this paper. However, even though many sources describe the integration of a similar word-level language model as beneficial to word error rate, in my case integrating the LM worsened the results.

This actually does not surprise me too much, because the language model only scores prefixes that end with a finished word, and scoring means multiplying the probability of the prefix by the LM probability, which can only decrease the probability of the whole prefix. This way, the probability of a prefix that ends with an in-vocabulary word is systematically lowered by the language model, while prefixes that do not yet end with a complete word are not scored by the LM at all. At each time step, the prefixes ending with complete words seem to be discarded due to their lowered score, while the incomplete prefixes survive in the beam.
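
To make the asymmetry I mean concrete, here is a minimal sketch of how I understand the word-level LM being applied in the prefix beam search (not the actual code from the paper; `lm`, `alpha`, and the space-as-word-boundary convention are my own placeholders):

```python
def score_extension(prefix_prob, char, prefix_text, lm, alpha=0.5):
    """Score of extending the decoded text `prefix_text` with `char`.

    `lm(word, history)` stands in for P(word | history) from a word-level
    n-gram model and `alpha` is the usual LM weight; both are assumptions.
    """
    if char == " " and prefix_text and not prefix_text.endswith(" "):
        words = prefix_text.split()
        # The prefix now ends in a complete word, so the LM factor (< 1)
        # is multiplied in, lowering the score of exactly this prefix.
        return prefix_prob * (lm(words[-1], words[:-1]) ** alpha)
    # Prefixes that still end mid-word are left untouched by the LM.
    return prefix_prob
```

Only the branch that completes a word ever gets penalized, which is why the word-ending prefixes fall out of the beam.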

My question is: why should word-level LM integration work if it decreases the probability of valid prefixes? I would understand that a character-level LM that scores every prefix at every step, or some look-ahead word-level LM, could help. For example, Graves describes integrating a word-level language model by using the sum of the probabilities of all possible words that share the current partial-word prefix and applying the LM update at each time step, which seems reasonable even though the computational cost could be much larger.
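
My reading of that look-ahead idea is roughly the following sketch (my own paraphrase, not Graves's implementation; `vocab` and `lm` are hypothetical, and a real decoder would use a prefix trie rather than scanning the vocabulary at every step):

```python
def lookahead_factor(partial_word, char, history, vocab, lm):
    """LM factor for extending the current partial word with `char`:
    the LM mass of all words still consistent with the extended prefix,
    normalized by the mass consistent with the prefix before the extension.
    """
    def mass(prefix):
        return sum(lm(w, history) for w in vocab if w.startswith(prefix))

    denom = mass(partial_word)
    return mass(partial_word + char) / denom if denom > 0 else 0.0
```

Under that scheme every character step is rescored, so incomplete prefixes do not get a free pass relative to completed words.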
