Google Speech-to-Text for streaming audio with Java

1.4k Views Asked by At

I am trying to use the Google Speech-to-Text API to do some voice-to-voice translation (also using Translation and Text-to-Speech). I would like for a person to speak into the microphone and for that text to be transcribed to text. I used the streaming audio tutorial found in the google documentation as a base for this method. I would also like the audio stream to stop when the person has stopped speaking.

Here is the modified method:

public static String streamingMicRecognize(String language) throws Exception {

        ResponseObserver<StreamingRecognizeResponse> responseObserver = null;
        try (SpeechClient client = SpeechClient.create()) {

            responseObserver =
                    new ResponseObserver<StreamingRecognizeResponse>() {
                ArrayList<StreamingRecognizeResponse> responses = new ArrayList<>();

                public void onStart(StreamController controller) {}

                public void onResponse(StreamingRecognizeResponse response) {
                    responses.add(response);
                }

                public void onComplete() {
                    SPEECH_TO_TEXT_ANSWER = "";
                    for (StreamingRecognizeResponse response : responses) {
                        StreamingRecognitionResult result = response.getResultsList().get(0);
                        SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
                        System.out.printf("Transcript : %s\n", alternative.getTranscript());
                        SPEECH_TO_TEXT_ANSWER = SPEECH_TO_TEXT_ANSWER + alternative.getTranscript();
                    }
                }

                public void onError(Throwable t) {
                    System.out.println(t);
                }
            };

            ClientStream<StreamingRecognizeRequest> clientStream =
                    client.streamingRecognizeCallable().splitCall(responseObserver);

            RecognitionConfig recognitionConfig =
                    RecognitionConfig.newBuilder()
                    .setEncoding(RecognitionConfig.AudioEncoding.LINEAR16)
                    .setLanguageCode(language)
                    .setSampleRateHertz(16000)
                    .build();
            StreamingRecognitionConfig streamingRecognitionConfig =
                    StreamingRecognitionConfig.newBuilder().setConfig(recognitionConfig).build();

            StreamingRecognizeRequest request =
                    StreamingRecognizeRequest.newBuilder()
                    .setStreamingConfig(streamingRecognitionConfig)
                    .build(); // The first request in a streaming call has to be a config

            clientStream.send(request);
            // SampleRate:16000Hz, SampleSizeInBits: 16, Number of channels: 1, Signed: true,
            // bigEndian: false
            AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
            DataLine.Info targetInfo =
                    new Info(
                            TargetDataLine.class,
                            audioFormat); // Set the system information to read from the microphone audio stream

            if (!AudioSystem.isLineSupported(targetInfo)) {
                System.out.println("Microphone not supported");
                System.exit(0);
            }
            // Target data line captures the audio stream the microphone produces.
            TargetDataLine targetDataLine = (TargetDataLine) AudioSystem.getLine(targetInfo);
            targetDataLine.open(audioFormat);
            targetDataLine.start();
            System.out.println("Start speaking");
            playMP3("beep-07.mp3");
            long startTime = System.currentTimeMillis();
            // Audio Input Stream
            AudioInputStream audio = new AudioInputStream(targetDataLine);
            long estimatedTime = 0, estimatedTimeStoppedSpeaking = 0, startStopSpeaking = 0;
            int currentSoundLevel = 0;
            Boolean hasSpoken = false;
            while (true) {
                estimatedTime = System.currentTimeMillis() - startTime;
                byte[] data = new byte[6400];
                audio.read(data);

                currentSoundLevel = calculateRMSLevel(data);
                System.out.println(currentSoundLevel);

                if (currentSoundLevel > 20) {
                    estimatedTimeStoppedSpeaking = 0;
                    startStopSpeaking = 0;
                    hasSpoken = true;
                }
                else {
                    if (startStopSpeaking == 0) {
                        startStopSpeaking = System.currentTimeMillis();
                    }
                    estimatedTimeStoppedSpeaking = System.currentTimeMillis() - startStopSpeaking;
                }

                if ((estimatedTime > 15000) || (estimatedTimeStoppedSpeaking > 1000 && hasSpoken)) { // 15 seconds or stopped speaking for 1 second
                    playMP3("beep-07.mp3");
                    System.out.println("Stop speaking.");
                    targetDataLine.stop();
                    targetDataLine.drain();
                    targetDataLine.close();
                    break;
                }
                request =
                        StreamingRecognizeRequest.newBuilder()
                        .setAudioContent(ByteString.copyFrom(data))
                        .build();
                clientStream.send(request);
            }
        } catch (Exception e) {
            System.out.println(e);
        }
        responseObserver.onComplete();
        String ans = SPEECH_TO_TEXT_ANSWER;
        return ans;
    }

The output is supposed to be transcribed text in string form. However, it is very inconsistent. Most of the time, it returns an empty string. However, sometimes the program does work and does return the transcribed text.

I have also tried to record the audio seperately while the program is running. Although the method returned an empty string, when I saved the audio file recorded seperately and sent that directly through the api, it returned the correct transcribed text.

I do not understand why/how the program is only working some of the time.

0

There are 0 best solutions below