Task: We are building a Dialogflow agent that will interact with callers via our Cisco telephony stack. We need to collect alphanumeric credentials from the caller.

Here is our proposed architecture:

[architecture diagram]

Problem: In order to send text inputs to Dialogflow, we are using Google Cloud's Speech-to-Text (STT) API to convert the caller's audio to text. However, the STT API does not always perform as desired. For example, if a caller says their DOB is 04-04-90, the transcribed audio may come back as "oh for oh 490". The transcription can be greatly improved by passing phrase hints to the API, so we would need to send these hints dynamically based on the scenario. Unfortunately, we're struggling to understand how to dynamically pass these phrase hints through the UniMRCP server, specifically the Google Speech Recognition plugin.
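For a DOB prompt, the hint list itself can be generated per call. A minimal sketch of what "dynamic" hints might look like on our side (the function name is ours, and how the phrases reach STT still depends on the plugin configuration discussed below):

```python
from datetime import date

def dob_hints(dob: date) -> list[str]:
    """Build phrase hints covering the spoken forms a caller
    is likely to use for a date of birth."""
    month_name = dob.strftime("%B")            # e.g. "April"
    return [
        dob.strftime("%m %d %Y"),              # "04 04 1990"
        dob.strftime("%m %d %y"),              # "04 04 90"
        f"{month_name} {dob.day} {dob.year}",  # "April 4 1990"
    ]

hints = dob_hints(date(1990, 4, 4))
```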

Question: Section 5.2 of the Google Speech Recognition manual outlines using Dynamic Speech Contexts.

The example provided is:

<grammar mode="voice" root="booking" version="1.0" xml:lang="en-US" xmlns="http://www.w3.org/2001/06/grammar">
    <meta name="scope" content="hint"/>
    <rule id="booking">
        <one-of>
            <item> 04 04 1990</item>
            <item> 04 04 90</item>
            <item> April 4th 1990</item>
        </one-of>
    </rule>
</grammar>

Does this still transcribe all user input, similar to how the built-in grammar builtin:speech/transcribe behaves?

For example, if I were to say March 5th 1980, would Google's STT return March 5th 1980, or only one of the provided items?

To be clear, I would want Google's STT to be able to return more than just the provided items, so if the user says March 5th 1980, I would want that returned through UniMRCP, VBB, and CVP and passed along to Dialogflow. I am being told that even if STT returned March 5th 1980, CVP or the voice browser might still evaluate it as a "no match".

Answer:
Dialogflow accepts more than text input: it can also perform intent detection directly on an audio file or on an audio stream.
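For instance, Dialogflow ES's detectIntent REST method accepts base64-encoded audio together with an audio config, so the STT step could be delegated to Dialogflow itself. A minimal sketch of building such a request body (the helper name is ours; the session URL, authentication, and actual HTTP call are out of scope here):

```python
import base64

def build_detect_intent_request(audio_bytes, language="en-US", rate=8000):
    """Build the JSON body for a Dialogflow ES v2 detectIntent call
    that submits raw audio instead of text."""
    return {
        "queryInput": {
            "audioConfig": {
                "audioEncoding": "AUDIO_ENCODING_LINEAR_16",
                "sampleRateHertz": rate,       # telephony audio is often 8 kHz
                "languageCode": language,
            }
        },
        # detectIntent expects the audio itself base64-encoded
        "inputAudio": base64.b64encode(audio_bytes).decode("ascii"),
    }
```

This body would be POSTed to the session's detectIntent endpoint; phrase hints can then be supplied via the query parameters of that same request rather than through the MRCP layer.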