I want to predict the occurrences of the word "repeat" in a speech as well as the word's approximate duration. For this task, I'm planning to build a Deep Learning model. I've around 50 positive as well as 50 negative utterances (I couldn't collect more).
Initially I've searched for any pretrained models for keyword spotting, but I couldn't get a good one.
Then I tried Speech Recognition models (Deep Speech), but it couldn't predict the exact repeat words as my data follows Indian accent. Also, I've thought that going for ASR models for this task would be a over-killing one.
Now, I've split the entire audio into chunk of 1 secs with 50% overlapping and tried a binary audio classification in each chunk that is whether the chunk has the word "repeat" or not. For building the classification model, I calculated the MFCC features and build a sequence model on the top of it. Nothing seems to work for me.
If anyone already worked with this kind of task, please provide me with a correct method/resources to build a DL model for this task. Thanks in advance!