How to use the AWS Transcribe JavaScript SDK


I am trying to use the @aws-sdk/client-transcribe-streaming in an Angular project, without any luck.

The code below is the only example provided by AWS:

// ES6+ example
import {
  TranscribeStreamingClient,
  StartStreamTranscriptionCommand,
} from "@aws-sdk/client-transcribe-streaming";

// a client can be shared by different commands.
const client = new TranscribeStreamingClient({ region: "REGION" });

const params = {
  /** input parameters */
};
const command = new StartStreamTranscriptionCommand(params);

As stated in the SDK's documentation, the StartStreamTranscriptionCommand object expects the params argument to be of type StartStreamTranscriptionCommandInput.

This StartStreamTranscriptionCommandInput object has an AudioStream field that is of type AsyncIterable<AudioStream>, which is, I assume, the audio stream that will be sent to be transcribed by AWS.

The problem is that I don't know how to create this AudioStream object, and the only hint that the documentation gives us is that it is a "PCM-encoded stream of audio blobs. The audio stream is encoded as an HTTP2 data frame."

Any help on how to create an AsyncIterable<AudioStream> will be greatly appreciated.

There are 2 answers below.

Accepted answer:

It turns out that they removed the only explanation on how to get this AsyncIterable<AudioStream> from their readme for some reason. By searching through their GitHub issues, someone pointed me to this old version of the readme from an old commit. This version contains a few examples on how to create this object.
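For convenience, the examples in that old readme boil down to wrapping each PCM chunk in an `{ AudioEvent: { AudioChunk } }` object yielded from an async generator. A minimal sketch (the `pcmChunks` array is a hypothetical stand-in for real microphone data; verify the exact field names against your SDK version):

```javascript
// Assumption: pcmChunks stands in for a real source of PCM-encoded Uint8Array chunks.
const pcmChunks = [new Uint8Array([0, 1]), new Uint8Array([2, 3])];

async function* audioStream() {
  for (const chunk of pcmChunks) {
    // Each yielded item wraps one chunk of PCM audio in an AudioEvent
    yield { AudioEvent: { AudioChunk: chunk } };
  }
}

// The generator's return value is what you would pass as AudioStream:
// const command = new StartStreamTranscriptionCommand({
//   LanguageCode: "en-US",
//   MediaEncoding: "pcm",
//   MediaSampleRateHertz: 16000,
//   AudioStream: audioStream(),
// });

// Consuming it shows the shape Transcribe receives:
(async () => {
  for await (const event of audioStream()) {
    console.log(event.AudioEvent.AudioChunk.length); // 2, then 2
  }
})();
```

In a real app the generator would `for await` over microphone chunks instead of a fixed array, but the envelope around each chunk stays the same.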

Second answer:

I don't have a direct answer to your question, but I might have some helpful pointers. I managed to implement Transcribe websocket audio streaming in a website I set up recently. I used VueJS, but the process should be very similar. I did NOT use the AWS Transcribe JavaScript SDK; instead I based my code on an AWS blog post and the GitHub repo it links to.

Both of those resources were crucial in getting it to work. If you clone the git repo and run the code, you should have a working example, if I remember correctly. To this day I don't fully understand how the code works, as I lack background in audio processing, but it works.

I ended up modifying the GitHub code and implementing it in some JS files, which I then added to my project. Then I had to compute an AWS Signature V4 and use it to build a presigned websocket URL for the Transcribe streaming API, which I could then open using JS. The data sent to the Transcribe websocket comes from an attached microphone, which can be accessed using MediaDevices.getUserMedia(). The GitHub code mentioned above contains files to convert the microphone audio to what Transcribe needs, as it only accepts sample rates of 8000 Hz or 16000 Hz depending on your selected language.

It was tricky to understand the Transcribe documentation and to find all the pieces I had to put together as it seems that streaming to Transcribe is a bit of an edge case, but I hope the resources I mentioned will make it a bit easier for you.

EDIT: Added source code

Getting a Transcribe websocket link.

I have this set up in an AWS Lambda function running Node, but you can copy everything inside exports.handler into a normal JS file. You will need the crypto-js, aws-sdk and moment node modules!

//THIS SCRIPT IS BASED ON https://docs.aws.amazon.com/transcribe/latest/dg/websocket.html
const crypto = require('crypto-js');
const moment = require('moment');
const aws = require('aws-sdk');
const awsRegion = '!YOUR-REGION!'
const accessKey = '!YOUR-IAM-ACCESS-KEY!';
const secretAccessKey = '!YOUR-IAM-SECRET-KEY!';

exports.handler = async (event) => {
    console.log(event);
    
    // let body = JSON.parse(event.body); I made a body object below for you to test with
    let body = {
        languageCode: "en-US", //or en-GB etc. I found en-US works better, even for British people due to the higher sample rate, which makes the audio clearer.
        sampleRate: 16000
    }
    
    
    
    let method = "GET"
    let region = awsRegion;
    let endpoint = "wss://transcribestreaming." + region + ".amazonaws.com:8443"
    let host = "transcribestreaming." + region + ".amazonaws.com:8443"
    let amz_date = moment().utc().format('YYYYMMDD[T]HHmmss') + 'Z'; // SigV4 dates must be UTC
    let datestamp = moment().utc().format('YYYYMMDD');
    let service = 'transcribe';
    let linkExpirationSeconds = 60;
    let languageCode = body.languageCode;
    let sampleRate = body.sampleRate
    let canonical_uri = "/stream-transcription-websocket"
    let canonical_headers = "host:" + host + "\n"
    let signed_headers = "host" 
    let algorithm = "AWS4-HMAC-SHA256"
    let credential_scope = datestamp + "%2F" + region + "%2F" + service + "%2F" + "aws4_request"
    // Date and time of request - NOT url formatted
    let credential_scope2 = datestamp + "/" + region + "/" + service + "/" + "aws4_request"
  
    
    let canonical_querystring  = "X-Amz-Algorithm=" + algorithm
    canonical_querystring += "&X-Amz-Credential="+ accessKey + "%2F" + credential_scope
    canonical_querystring += "&X-Amz-Date=" + amz_date 
    canonical_querystring += "&X-Amz-Expires=" + linkExpirationSeconds
    canonical_querystring += "&X-Amz-SignedHeaders=" + signed_headers
    canonical_querystring += "&language-code=" + languageCode + "&media-encoding=pcm&sample-rate=" + sampleRate
    
    //Empty hash as payload is unknown
    let emptyHash = crypto.SHA256("");
    let payload_hash = crypto.enc.Hex.stringify(emptyHash);
    
    let canonical_request = method + '\n' 
    + canonical_uri + '\n' 
    + canonical_querystring + '\n' 
    + canonical_headers + '\n' 
    + signed_headers + '\n' 
    + payload_hash
    
    let hashedCanonicalRequest = crypto.SHA256(canonical_request);
    
    let string_to_sign = algorithm + "\n"
    + amz_date + "\n"
    + credential_scope2 + "\n"
    + crypto.enc.Hex.stringify(hashedCanonicalRequest);
    
    //Create the signing key
    let signing_key = getSignatureKey(secretAccessKey, datestamp, region, service);
    
    //Sign the string_to_sign using the signing key
    let inBytes = crypto.HmacSHA256(string_to_sign, signing_key);
    
    let signature = crypto.enc.Hex.stringify(inBytes);
    
    canonical_querystring += "&X-Amz-Signature=" + signature;
    
    let request_url = endpoint + canonical_uri + "?" + canonical_querystring;
    
    //The final product
    console.log(request_url);
    
    let response = {
        statusCode: 200,
        headers: {
          "Access-Control-Allow-Origin": "*"  
        },
        body: JSON.stringify(request_url)
    };
    return response;    
};

function getSignatureKey(key, dateStamp, regionName, serviceName) {
    var kDate = crypto.HmacSHA256(dateStamp, "AWS4" + key);
    var kRegion = crypto.HmacSHA256(regionName, kDate);
    var kService = crypto.HmacSHA256(serviceName, kRegion);
    var kSigning = crypto.HmacSHA256("aws4_request", kService);
    return kSigning;
};

Code for opening a websocket, sending audio and receiving a response

Install npm modules: microphone-stream (not sure if that's still available, but it's in the source code of that GitHub repo; I might have just pasted it into the node_modules folder), @aws-sdk/util-utf8-node, @aws-sdk/eventstream-marshaller

import audioUtils from "../js/audioUtils.js"; //For encoding audio data as PCM
import mic from "microphone-stream"; //Collect microphone input as a stream of raw bytes
import * as util_utf8_node from "@aws-sdk/util-utf8-node"; //Utilities for encoding and decoding UTF8
import * as marshaller from "@aws-sdk/eventstream-marshaller"; //For converting binary event stream messages to and from JSON

let micStream;
let inputSampleRate; // The sample rate your mic is producing
let transcribeSampleRate = 16000; //The sample rate you requested from Transcribe
let transcribeLanguageCode = "en-US"; //The language you want Transcribe to use
let websocket;

// first we get the microphone input from the browser (as a promise)...
let mediaStream;
try {
    mediaStream = await window.navigator.mediaDevices.getUserMedia({
            video: false,
            audio: true
        })
}
catch (error) {
    console.log(error);
    alert("Error. Please make sure you allow this website to access your microphone");
    return;
}



this.eventStreamMarshaller = new marshaller.EventStreamMarshaller(util_utf8_node.toUtf8, util_utf8_node.fromUtf8);

//let's get the mic input from the browser, via the microphone-stream module
micStream = new mic();

micStream.on("format", data => {
    inputSampleRate = data.sampleRate;
});

micStream.setStream(mediaStream);


//THIS IS WHERE YOU NEED TO GET YOURSELF A LINK FROM TRANSCRIBE
//AS MENTIONED I USED AWS LAMBDA FOR THIS
//LOOK AT THE ABOVE CODE FOR GETTING A TRANSCRIBE LINK

getTranscribeLink(transcribeLanguageCode, transcribeSampleRate) // Not a real function, you need to make this! The options are what would be in the body object in AWS Lambda

let url = "!YOUR-GENERATED-URL!"


//Configure your websocket
websocket = new WebSocket(url);
websocket.binaryType = "arraybuffer";

websocket.onopen = () => {
    //Make the spinner disappear
    micStream.on('data', rawAudioChunk => {
        // the audio stream is raw audio bytes. Transcribe expects PCM with additional metadata, encoded as binary
        let binary = convertAudioToBinaryMessage(rawAudioChunk);

        if (websocket.readyState === websocket.OPEN)
            websocket.send(binary);
    });
};

// handle messages, errors, and close events
websocket.onmessage = async message => {

    //convert the binary event stream message to JSON
    var messageWrapper = this.eventStreamMarshaller.unmarshall(Buffer.from(message.data));

    var messageBody = JSON.parse(String.fromCharCode.apply(String, messageWrapper.body)); 
    
    //THIS IS WHERE YOU DO SOMETHING WITH WHAT YOU GET FROM TRANSCRIBE
    console.log("Got something from Transcribe!:");
    console.log(messageBody);
}
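For reference, the messageBody you get back is shaped roughly like the sketch below (field names taken from the Transcribe streaming API docs; treat the exact shape as an assumption to verify against real responses). A small helper to pull out the finalized text:

```javascript
// Sketch: extract finished transcript text from a Transcribe websocket message.
function extractTranscript(messageBody) {
  const results = (messageBody.Transcript && messageBody.Transcript.Results) || [];
  return results
    .filter(r => !r.IsPartial) // partial results keep updating until the segment is final
    .map(r => r.Alternatives[0].Transcript)
    .join(" ");
}

// Example message shaped like what Transcribe sends over the websocket:
const sample = {
  Transcript: {
    Results: [
      { IsPartial: false, Alternatives: [{ Transcript: "hello world" }] }
    ]
  }
};
console.log(extractTranscript(sample)); // "hello world"
```

You would call something like this inside the onmessage handler above instead of just logging the raw messageBody.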






// FUNCTIONS

function convertAudioToBinaryMessage(audioChunk) {
    var raw = mic.toRaw(audioChunk);
    if (raw == null) return;

    // downsample and convert the raw audio bytes to PCM
    var downsampledBuffer = audioUtils.downsampleBuffer(raw, inputSampleRate, transcribeSampleRate);
    var pcmEncodedBuffer = audioUtils.pcmEncode(downsampledBuffer);

    // add the right JSON headers and structure to the message
    var audioEventMessage = getAudioEventMessage(Buffer.from(pcmEncodedBuffer));

    // convert the JSON object + headers into a binary event stream message
    var binary = this.eventStreamMarshaller.marshall(audioEventMessage);
    return binary;
}

function getAudioEventMessage(buffer) {
    // wrap the audio data in a JSON envelope
    return {
        headers: {
            ':message-type': {
                type: 'string',
                value: 'event'
            },
            ':event-type': {
                type: 'string',
                value: 'AudioEvent'
            }
        },
        body: buffer
    };
}

audioUtils.js

export default {
    pcmEncode: pcmEncode,
    downsampleBuffer: downsampleBuffer
}

export function pcmEncode(input) {
    var offset = 0;
    var buffer = new ArrayBuffer(input.length * 2);
    var view = new DataView(buffer);
    for (var i = 0; i < input.length; i++, offset += 2) {
        var s = Math.max(-1, Math.min(1, input[i]));
        view.setInt16(offset, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
    }
    return buffer;
}

export function downsampleBuffer(buffer, inputSampleRate = 44100, outputSampleRate = 16000) {
        
    if (outputSampleRate === inputSampleRate) {
        return buffer;
    }

    var sampleRateRatio = inputSampleRate / outputSampleRate;
    var newLength = Math.round(buffer.length / sampleRateRatio);
    var result = new Float32Array(newLength);
    var offsetResult = 0;
    var offsetBuffer = 0;
    
    while (offsetResult < result.length) {

        var nextOffsetBuffer = Math.round((offsetResult + 1) * sampleRateRatio);

        var accum = 0,
        count = 0;
        
        for (var i = offsetBuffer; i < nextOffsetBuffer && i < buffer.length; i++ ) {
            accum += buffer[i];
            count++;
        }

        result[offsetResult] = accum / count;
        offsetResult++;
        offsetBuffer = nextOffsetBuffer;

    }

    return result;

}

I think that's all. It should certainly be enough for you to get it working.