vosk in python: get location of transcribed text in audio file


Using a script very similar to test_ffmpeg.py in the Vosk repository, I am exploring what text information I can get out of an audio file.

Here is the whole script I'm using:

#!/usr/bin/env python3

from vosk import Model, KaldiRecognizer, SetLogLevel
import sys
import os
import wave
import subprocess
import json

SetLogLevel(0)

if not os.path.exists("model"):
    print("Please download the model from https://alphacephei.com/vosk/models and unpack as 'model' in the current folder.")
    sys.exit(1)

sample_rate = 16000
model = Model("model")
rec = KaldiRecognizer(model, sample_rate)

# Use ffmpeg to decode the input file to raw 16-bit mono PCM at 16 kHz,
# streamed to stdout so it can be fed to the recognizer in chunks.
process = subprocess.Popen(['ffmpeg', '-loglevel', 'quiet', '-i',
                            sys.argv[1],
                            '-ar', str(sample_rate), '-ac', '1', '-f', 's16le', '-'],
                            stdout=subprocess.PIPE)

# The transcript is written next to the input file, e.g. input.mp3 -> input.mp3.txt
file = open(sys.argv[1] + ".txt", "w+")

while True:
    data = process.stdout.read(4000)
    if len(data) == 0:
        break
    # AcceptWaveform() returns True once a complete utterance has been recognized
    if rec.AcceptWaveform(data):
        file.write(json.loads(rec.Result())['text'] + "\n\n")
        #print(rec.Result())
    #else:
        #print(rec.PartialResult())
#print(json.loads(rec.Result())['text'])
# FinalResult() flushes whatever audio is still buffered in the recognizer
file.write(json.loads(rec.FinalResult())['text'])
file.close()
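Assuming the script is saved as, say, transcribe.py (that filename is mine, not part of the question), it can be run like this:

python3 transcribe.py input.mp3
# transcript ends up in input.mp3.txt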

This example works well; however, the only return I can find from rec.PartialResult() and rec.Result() is a JSON string with the recognized text. Is there a way to query the KaldiRecognizer for the times at which individual words were found within the audio file?
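For reference, with default settings the two calls return JSON strings along these lines (the text here is only illustrative):

rec.PartialResult()  ->  {"partial": "the quick brown"}
rec.Result()         ->  {"text": "the quick brown fox"}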

As I'm typing this, I'm already thinking that working on the result, and detecting changes in the partial result relative to the samples read so far, will get me what I want, but I'm posting this here in case it's already implemented.

1 Answer

After some testing, it was pretty clear the output of ffmpeg was stable against the defined sample rate (16000), and each 4000-byte read turned out to be an eighth of a second of audio. I created a counter in the while loop and multiplied it by a constant derived from the sample rate. If you change the parameters passed to ffmpeg, it will probably throw this off.
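To spell out the arithmetic behind that eighth of a second (using the values from the question's script):

bytes_per_sample = 2                               # s16le = 16-bit samples
samples_per_read = 4000 // bytes_per_sample        # 2000 samples per 4000-byte read
seconds_per_read = samples_per_read / sample_rate  # 2000 / 16000 = 0.125 s

This is the same value the countinc constant below encodes.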

I used some very stone-age string comparison to print only when the partial result changes, and to print only the newly added characters.

# Drop-in replacement for the while loop in the question's script.
counter = 0
countinc = 2000 / sample_rate   # seconds per 4000-byte read (2000 16-bit samples)
lastPR = ""
thisPR = ""
while True:
    data = process.stdout.read(4000)
    counter += 1
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)
    thisPR = json.loads(rec.PartialResult())['partial']
    if lastPR != thisPR:
        # print the approximate time offset and only the newly added characters
        print(counter * countinc, thisPR[len(lastPR):])
        lastPR = thisPR
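Depending on the vosk version you have installed, there may be a more direct route: the Python KaldiRecognizer can be asked for word-level timestamps, in which case each final result carries a "result" list with per-word start and end times. A minimal sketch, assuming a vosk build that supports SetWords():

rec = KaldiRecognizer(model, sample_rate)
rec.SetWords(True)   # request per-word timing in the result JSON

while True:
    data = process.stdout.read(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        res = json.loads(rec.Result())
        # with SetWords(True) the result typically looks like
        # {"result": [{"word": "hello", "start": 1.23, "end": 1.56, "conf": 0.98}, ...], "text": "hello ..."}
        for w in res.get("result", []):
            print(w["start"], w["end"], w["word"])

That avoids the chunk counting entirely, but the counter-based approach above works fine if your installed version predates it.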