Here my settings of Google Speech to Text AI

enter image description here

Here is the output file of Speech to Text AI : https://justpaste.it/speechtotext2

Here is the output file of YouTube's auto caption: https://justpaste.it/ytautotranslate

This is the video link : https://www.youtube.com/watch?v=IOMO-kcqxJ8&ab_channel=SoftwareEngineeringCourses-SECourses

This is the audio file of the video provided to Google Speech AI : https://storage.googleapis.com/text_speech_furkan/machine_learning_lecture_1.flac

Here I am providing time assigned SRT files

YouTube's SRT : https://drive.google.com/file/d/1yPA1m0hPr9VF7oD7jv5KF7n1QnV3Z82d/view?usp=sharing

Google Speech to Text API's SRT (timing assigned by YouTube) : https://drive.google.com/file/d/1AGzkrxMEQJspYenCbohUM4iuXN7H89wH/view?usp=sharing

I made comparison for some sentences and definitely YouTube's auto translation is better

For example

Google Speech to Text : Represent the **doctor** representation is one of the hardest part of computer AI you will learn about more about that in the future lessons.

What does this mean? Do you think this means that we are not just focused on behavior and **into doubt**. It is more about the reasoning when a human takes an action. There is a reasoning behind it.

YouTube's auto captioning : represent the **data** representation is one of the hardest part of computer ai you will we will learn more about that in the future lessons

what does this mean do you think this means that we are not just focused on behavior and **input** it is more about the reasoning when a human takes an action there is a reasoning behind it

I checked many cases and YouTube's guessing correct words is much better. How is this even possible?

This is the command I used to extract audio of the video : ffmpeg -i "input.mkv" -af aformat=s16:48000:output.flac

1

There are 1 best solutions below

0
On

Both the automatic captions of the Youtube Auto Caption feature and the transcription of the Speech to Text Recognition are generated by machine learning algorithms, in which case the quality of the transcription may vary according to different aspects.

It is important to note that he Speech to Text API utilizes machine learning algorithms for its transcription, the ones that are improved over time and the results can vary according to the input file and the request configuration. One way of helping the models of Google transcription is by enabling data logging, this will allow Google to collect data from your audio transcription requests that will help to improve its machine learning models used for recognizing speech audio, including enhanced models.

Additionally, on the request configuration of the Speech to Text API, you can specify the RecognitionConfig settings. This parameter contains the encoding, sampleRateHertz, languageCode, maxAlternatives, profanityFilter and the speechContext, every parameter plays an important role on the accuracy of the transcription of the file.

Specifically for FLAC audio files, a lossless compression helps in the quality of the audio provided, since there is no degradation in quality of the original digital sample, FLAC uses a compression level parameter from 0 (fastest) to 8 (smallest file size).

Also, the Speech to Text API offers different ways to improve the accuracy of the transcription, such as:

  • Speech adaptation : This feature allows you to specify words and/or phrases that STT should recognize more frequently in your audio data
  • Speech adaptation boost : This feature allows allows you to add numerical weights to words and/or phrases according to how frequently they should be recognized in your audio data.
  • Phrases hints : Send a list of words and phrases that provide hints to the speech recognition task

These features might help you with the accuracy of the Speech to Text API recognizing your audio files.

Finally, please refer to the Speech to Text best practices to improve the transcription of your audio files, these recommendations are designed for greater efficiency and accuracy as well as reasonable response times from the API.