Split speech audio at spoken word


I have an audio file of a long text with different sections, each beginning with the spoken word "Chapter" (all narrated by the same speaker). Is there a way to split the audio file into smaller files at these words?

I am thinking of cutting out one occurrence of the word "chapter", putting it in a separate audio file, and then using some tool to fuzzy-match the original audio against that short snippet, so I can find the "chapter" occurrences and split the original file there.

Which tool can do this? SoX? Audacity?


There are 2 best solutions below


That would be doable. You need two steps:

  1. Detect the times where the word occurred.
  2. Cut the audio based on those times.

To detect the times, you can use the keyword spotting tool from pocketsphinx trunk: check out pocketsphinx from Subversion and build it. The build installs the pocketsphinx_kws binary for keyword spotting. You can then search for the word in an audio file, which must be in 16 kHz, 16-bit MS WAV format:

 pocketsphinx_kws -infile barnabyrudge_07_dickens.wav -kws "chapter"
 ...
 INFO: kws_search.c(229): >>>>DETECTED IN FRAME [2138]
 INFO: kws_search.c(229): >>>>DETECTED IN FRAME [2182]
 INFO: kws_search.c(229): >>>>DETECTED IN FRAME [92149]

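The frame rate is 100 frames/second, so the frame numbers above mean that "chapter" was detected at 21.38 s and 21.82 s (two hits less than half a second apart, most likely the same spoken word) and at 921.49 s (where the speaker said "end of chapter").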
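The frame-to-seconds conversion can be scripted. Below is a minimal sketch, assuming the log lines look exactly like the ones shown above. The function name `detection_times` and the `min_gap` parameter (used to merge hits that land close together) are my own illustrative choices, not part of pocketsphinx.

```python
import re

FRAMES_PER_SECOND = 100  # frame rate stated in the answer


def detection_times(log_text, min_gap=1.0):
    """Return detection times in seconds.

    min_gap is an assumed value: hits closer than this to the
    previously kept hit are treated as duplicates and dropped.
    """
    frames = [int(n) for n in re.findall(r"DETECTED IN FRAME \[(\d+)\]", log_text)]
    times = []
    for frame in sorted(frames):
        t = frame / FRAMES_PER_SECOND
        if not times or t - times[-1] >= min_gap:
            times.append(t)
    return times


log = """\
INFO: kws_search.c(229): >>>>DETECTED IN FRAME [2138]
INFO: kws_search.c(229): >>>>DETECTED IN FRAME [2182]
INFO: kws_search.c(229): >>>>DETECTED IN FRAME [92149]
"""
print(detection_times(log))  # [21.38, 921.49]
```

With `min_gap=0` every hit is kept, which may be useful while tuning the detector.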

It's better to use a longer phrase for detection: the longer the phrase, the more reliable the detection. For the best results you can also tune the detection threshold.

To cut the audio you can use sox: use the trim effect to delete the start, and trim combined with reverse to cut off the end.
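As a rough sketch of the cutting step, the snippet below builds one sox command per segment from a list of cut times. Note that it uses `trim START DURATION` rather than the trim + reverse approach described above, which I find simpler when both ends are known. The function name, the `part_NNN.wav` naming scheme, and the example file name `book.wav` are assumptions for illustration.

```python
def sox_split_commands(src, cut_times, prefix="part"):
    """Build one `sox SRC OUT trim START [DURATION]` command per segment.

    Segments run from 0 to the first cut, between consecutive cuts,
    and from the last cut to the end of the file (no duration given).
    """
    starts = sorted({0.0, *cut_times})
    commands = []
    for i, start in enumerate(starts):
        cmd = ["sox", src, f"{prefix}_{i:03d}.wav", "trim", f"{start:.2f}"]
        if i + 1 < len(starts):
            # Duration of this segment, up to the next cut point.
            cmd.append(f"{starts[i + 1] - start:.2f}")
        commands.append(cmd)
    return commands


for cmd in sox_split_commands("book.wav", [21.38, 921.49]):
    print(" ".join(cmd))
```

Each command is an argument list, so it can be passed to `subprocess.run(cmd, check=True)` once sox is installed.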


This could be implemented with a speech recognition system. The answer to "Audio signal split at word level boundary" contains working Python code that does word-level splitting. That code can easily be adapted to split only at the word "Chapter", which gives the functionality wanted here.
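The adaptation could look roughly like this. It is only a sketch: it assumes the recognizer's output has already been turned into `(word, start_seconds, end_seconds)` tuples, which is a hypothetical format and not necessarily what the linked code produces. The sample words and timings are made up for illustration.

```python
def chapter_boundaries(words, keyword="chapter"):
    """Return the start time of every recognized word matching keyword."""
    return [
        start
        for word, start, _end in words
        if word.strip(".,!?").lower() == keyword
    ]


# Made-up recognizer output, for illustration only.
words = [
    ("Chapter", 0.50, 1.00),
    ("one", 1.05, 1.40),
    ("It", 2.00, 2.10),
    ("chapter.", 95.20, 95.80),
    ("two", 95.90, 96.30),
]
print(chapter_boundaries(words))  # [0.5, 95.2]
```

The returned start times are the points at which to split the file.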