I have a question about speech similarity checking. Suppose I have two audio files that contain the same word, recorded by two different speakers. I would like to verify whether these two audio files are similar, but I don't want to go through speech-to-text (because some audio files don't contain a meaningful word).
I extracted MFCC vectors after pre-processing the audio and applied DTW (Dynamic Time Warping). I got a similarity score of 0 for the same audio (reference against reference), but when I applied it to two recordings from two different speakers I got a high score (indicating that they are not similar). Can anyone suggest a method to solve this problem? And what is the mistake in my approach? Here's the code, run after resampling the signals:
```python
from pydub import AudioSegment, silence
import numpy as np
import librosa
from scipy.signal import lfilter
from sklearn.preprocessing import StandardScaler
from dtw import dtw

# Load the audio file
audio_file = AudioSegment.from_wav('C://Users//10Rs6//Desktop//testapb.wav')

# Set the minimum length of a silent gap to split on
min_silence_len = 100  # in milliseconds

# Set the threshold for detecting silence
silence_thresh = -25  # in dBFS

# Split the audio into non-silent segments
non_silent_segments = silence.split_on_silence(audio_file,
                                               min_silence_len=min_silence_len,
                                               silence_thresh=silence_thresh)

# Concatenate the non-silent segments into a new audio file
trimmed_audio = AudioSegment.empty()
for segment in non_silent_segments:
    trimmed_audio += segment

# Export the trimmed audio file
trimmed_audio.export('C://Users//10Rs6//Desktop//trimmed_audio5.wav', format='wav')


def preemphasis(signal, alpha=0.97):
    """
    Applies a pre-emphasis filter to the input signal.

    Parameters:
        signal (array-like): The input signal to filter.
        alpha (float): The pre-emphasis coefficient. Default is 0.97.

    Returns:
        The filtered signal.
    """
    return lfilter([1, -alpha], [1], signal)


pre_emphasised_test = preemphasis(resampled_audio_test)
pre_emphasised_ref = preemphasis(resampled_audio_ref)

normalized_test = librosa.util.normalize(pre_emphasised_test)
normalized_ref = librosa.util.normalize(pre_emphasised_ref)

# Extract MFCCs for the test signal
mfccsT = librosa.feature.mfcc(y=pre_emphasised_test, sr=41100, n_mfcc=13)
# Average over time to get a single 13-dimensional vector
mfccsT = np.mean(mfccsT.T, axis=0)
print(mfccsT)
mfccsT.shape

# Extract MFCCs for the reference signal
mfccsR = librosa.feature.mfcc(y=pre_emphasised_ref, sr=41100, n_mfcc=13)
# Average over time to get a single 13-dimensional vector
mfccsR = np.mean(mfccsR.T, axis=0)
print(mfccsR)
mfccsR.shape

# Reshape the test MFCC vector to a 2D array and standardize it
mfccsT_2d = np.reshape(mfccsT, (mfccsT.shape[0], -1))
scaler = StandardScaler()
scaler.fit(mfccsT_2d)
normalized_mfccsT_2d = scaler.transform(mfccsT_2d)
# Reshape back to the original shape
normalized_mfccsT = np.reshape(normalized_mfccsT_2d, mfccsT.shape)
print(normalized_mfccsT)

# Reshape the reference MFCC vector to a 2D array and standardize it
mfccsR_2d = np.reshape(mfccsR, (mfccsR.shape[0], -1))
scaler = StandardScaler()
scaler.fit(mfccsR_2d)
normalized_mfccsR_2d = scaler.transform(mfccsR_2d)
# Reshape back to the original shape
normalized_mfccsR = np.reshape(normalized_mfccsR_2d, mfccsR.shape)
print(normalized_mfccsR)

# Align the two vectors with DTW, using the squared difference as the
# element-wise comparison distance
normalized_mfccsT = normalized_mfccsT.reshape(-1, 1)
normalized_mfccsR = normalized_mfccsR.reshape(-1, 1)

l2_norm = lambda x, y: (x - y) ** 2
dist, cost_matrix, acc_cost_matrix, path = dtw(normalized_mfccsT, normalized_mfccsR, dist=l2_norm)
dist
```
Thanks.
MFCCs are not a good representation for speech-content similarity, because they still carry a lot of "acoustic" information. Two different speakers saying the same word will produce quite different MFCCs, and so will the same speaker recorded with two different microphones or in two different locations (especially with reverberation). What you want here is a speaker-independent representation that is robust to device/environment/noise variation. A good Automatic Speech Recognition (ASR) system invariably has this property, and with some systems it is possible to get the learned vector representations, not just the predicted text sequence.
On top of such a feature-vector sequence, you would then build a similarity metric. Possibly reduce the feature dimensionality first, with a projection such as PCA, and then apply Dynamic Time Warping to the reduced sequences.
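As an illustrative sketch of that idea, assuming you already have two embedding sequences `emb_test` and `emb_ref` of shape `(time, dim)` (for example from one of the models below); the variable names, the number of PCA components, and the cosine metric are my choices, not a fixed recipe:

```python
import numpy as np
import librosa
from sklearn.decomposition import PCA

# Fit PCA on both sequences together, then project each one
pca = PCA(n_components=32)
pca.fit(np.vstack([emb_test, emb_ref]))
red_test = pca.transform(emb_test)   # (time_T, 32)
red_ref = pca.transform(emb_ref)     # (time_R, 32)

# librosa's DTW expects features as (dim, time)
D, wp = librosa.sequence.dtw(X=red_test.T, Y=red_ref.T, metric='cosine')

# Normalize the accumulated cost by the path length so that scores
# from clips of different lengths stay comparable
score = D[-1, -1] / len(wp)
print(score)  # lower = more similar
```

Normalizing by the warping-path length is one simple way to make the score length-independent; you would still need to pick a decision threshold empirically.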
Wav2Vec
Wav2Vec is a self-supervised speech model. It is commonly used as a feature extractor for a wide range of speech and non-speech audio tasks. The Hugging Face transformers library has a good, simple-to-use implementation; see Wav2Vec2FeatureExtractor.
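A rough sketch of how the embedding sequences could be extracted with transformers (the "facebook/wav2vec2-base" checkpoint, the 16 kHz resampling, and the file paths are my assumptions, not requirements):

```python
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Wav2Vec 2.0 base models expect 16 kHz mono audio
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
model.eval()

def embed(path):
    audio, _ = librosa.load(path, sr=16000, mono=True)
    inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # (time, hidden_dim) sequence of learned representations
    return outputs.last_hidden_state.squeeze(0).numpy()

emb_test = embed('test.wav')       # placeholder path
emb_ref = embed('reference.wav')   # placeholder path
```

Note that Wav2Vec2FeatureExtractor itself only normalizes and batches the raw waveform; the learned representations come from the hidden states of Wav2Vec2Model, as above.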
Allosaurus
Allosaurus is a pretrained universal phone recognizer. It outputs a representation in terms of phones, which should work for any language in the world, and will probably also work quite well for speech sounds that are not meaningful words.
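A minimal sketch using Allosaurus' Python API with its default universal model (the file paths are placeholders, and comparing the phone sequences with an edit distance is my suggestion, not part of Allosaurus itself):

```python
from allosaurus.app import read_recognizer

# Load the pretrained universal phone recognizer
model = read_recognizer()

# Each call returns a space-separated string of recognized phones
phones_test = model.recognize('test.wav')       # placeholder path
phones_ref = model.recognize('reference.wav')   # placeholder path

print(phones_test)
print(phones_ref)

# The two phone sequences can then be compared, for example with an
# edit distance, instead of comparing raw waveforms or MFCCs.
```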