How to get NBest alternatives with Azure speech-to-text


I want to get more than one alternative transcription for a single speech utterance using Azure speech-to-text.

I have set the format=detailed argument, and the response does include a field called NBest, but that field only ever contains one transcription.

Is there something else I need to set on the input side?

Thx.


There are 3 best solutions below

I second the comment asking which mechanism you're using:

  1. REST API for short audio or
  2. Speech SDK

If you're using Speech CLI or would like to try it, then do this:

First set:

spx config recognize @default.output --set @@output.all.detailed

then:

spx recognize --file FILE --output all itn text --output all file type json

or

spx recognize --file FILE --output all lexical text --output all file type json

Answer to this question (if using the REST API):

According to a Microsoft Q&A reply written by a Microsoft employee on 1 September 2021, the REST API returns only one alternative:

"The rest API returns only the best result. There has been no change in the behavior of this api for a long while."

See: https://learn.microsoft.com/en-us/answers/questions/534368/azure-speech-to-text-how-to-receive-nbext-with-res.html

This is surprising, because Microsoft's own documentation for the REST API for short audio shows a sample response containing two alternatives.
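
To check how many alternatives a given response actually carries, it can help to parse the JSON and list every NBest entry with its confidence. The following is a minimal Python sketch and is not part of the original answer: the helper name `list_alternatives` and the sample payload are made up for illustration, and it assumes only that the detailed response is a JSON object with an `NBest` array whose entries have `Confidence` and `Display` fields.

```python
import json


def list_alternatives(response_text):
    """Return (confidence, display) pairs from a detailed-format response,
    highest confidence first. Returns an empty list if NBest is absent."""
    payload = json.loads(response_text)
    nbest = payload.get("NBest") or []
    pairs = [(alt.get("Confidence", 0.0), alt.get("Display", "")) for alt in nbest]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)


# Illustrative payload only (hypothetical values, not real service output).
SAMPLE = json.dumps({
    "RecognitionStatus": "Success",
    "NBest": [
        {"Confidence": 0.90, "Display": "What's the weather like?"},
        {"Confidence": 0.92, "Display": "What is the weather like?"},
    ],
})

if __name__ == "__main__":
    for confidence, display in list_alternatives(SAMPLE):
        print(f"{confidence:.2f}  {display}")
```

If a real REST response run through a helper like this yields a single entry, that would be consistent with the limitation quoted above rather than with a parsing problem on the client side.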

I believe you have defined everything that needs to be defined on the input side.

But with more information about the surrounding context, it would be easier to figure out how to answer precisely. For example, I'm not sure if it behaves the same in ContinuousRecognition mode or in RecognizeOnce mode.

With the following C# code, I get results in which the NBest array contains 5 entries. Note, however, that in the code sample I found (integrated with my own code below), the NBest property is declared as a List. I'm not sure whether the way this property is declared in the framework you're using could explain why your NBest object contains only a single result.

using System.Collections.Generic;
using System.Linq;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
using Newtonsoft.Json;

// Detailed output is what makes the JSON result carry the NBest array.
SpeechConfig _speechConfig = SpeechConfig.FromSubscription(SUBSCRIPTION_KEY, SUBSCRIPTION_REGION);
_speechConfig.SpeechRecognitionLanguage = SPEECH_RECOGNITION_LANGUAGE;
_speechConfig.OutputFormat = OutputFormat.Detailed;

AudioConfig _audioConfig = AudioConfig.FromDefaultMicrophoneInput();
// _recognizer is assumed to be a SpeechRecognizer field of the enclosing class.
_recognizer = new SpeechRecognizer(_speechConfig, _audioConfig);

_recognizer.Recognized += (s, e) => OnRecognized(e);

    private void OnRecognized(SpeechRecognitionEventArgs e)
    {
        if (e.Result.Reason == ResultReason.RecognizedSpeech)
        {
            SpeechRecognitionResult result = e.Result;
            // The raw service response, including every NBest alternative, is exposed as a JSON string.
            string jsonResult = result.Properties.GetProperty(PropertyId.SpeechServiceResponse_JsonResult);
            var structuredResult = JsonConvert.DeserializeObject<Result>(jsonResult);
            // FirstOrDefault avoids an exception if NBest is missing or empty.
            var bestResult = structuredResult?.NBest?.FirstOrDefault(); // <= pick your favorite NBest
            // Do something with the bestResult of your choice
        }
    }

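    // One recognized word with its timing, in 100-nanosecond units.
    public class Word
    {
        public long Duration { get; set; }
        public long Offset { get; set; }
        // Lowercase because a member cannot share the name of its enclosing class;
        // Json.NET matches JSON property names case-insensitively.
        public string word { get; set; }
    }

    // One alternative transcription of the utterance.
    public class NBest
    {
        public double Confidence { get; set; }
        public string Display { get; set; }
        public string ITN { get; set; }
        public string Lexical { get; set; }
        public string MaskedITN { get; set; }
        public List<Word> Words { get; set; }
    }

    // Top-level shape of the detailed JSON result.
    public class Result
    {
        public string DisplayText { get; set; }
        // long rather than int: values are in 100-nanosecond units and can exceed the int range.
        public long Duration { get; set; }
        public string Id { get; set; }
        public List<NBest> NBest { get; set; }
        public long Offset { get; set; }
        public string RecognitionStatus { get; set; }
    }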