I am trying to preprocess audio clips for a keyword spotting task that uses machine learning models.
The first step is to compute the spectrogram from the waveform, and I have found two ways to do this within the TensorFlow framework.
The first is to use the tf.signal module:
stft = tf.signal.stft(signals, frame_length, frame_step)
spectrogram = tf.abs(stft)
# linear_to_mel_weight_matrix is computed beforehand
mel_spectrogram = tf.tensordot(spectrogram, linear_to_mel_weight_matrix, 1)
log_mel_spectrogram = tf.math.log(mel_spectrogram + 1.e-6)
mfccs = tf.signal.mfccs_from_log_mel_spectrograms(log_mel_spectrogram)
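For completeness, here is a minimal, self-contained sketch of the tf.signal pipeline above, including how the mel weight matrix could be built with tf.signal.linear_to_mel_weight_matrix. All numeric values (16 kHz sample rate, 40 ms window, 20 ms hop, 40 mel bins, 80-7600 Hz band, 10 coefficients) and the random input are assumptions for illustration, not the parameters from my actual benchmark.

```python
import tensorflow as tf

# Assumed parameters, for illustration only.
sample_rate = 16000
frame_length = 640        # 40 ms window at 16 kHz
frame_step = 320          # 20 ms hop at 16 kHz
num_mel_bins = 40
num_coefficients = 10

# Dummy batch containing one second of audio: shape [batch, samples].
signals = tf.random.uniform([1, sample_rate], minval=-1.0, maxval=1.0)

# Magnitude spectrogram: shape [batch, frames, fft_bins].
stft = tf.signal.stft(signals, frame_length, frame_step)
spectrogram = tf.abs(stft)

# The mel matrix depends only on the parameters, so it can be built once.
num_spectrogram_bins = spectrogram.shape[-1]
linear_to_mel_weight_matrix = tf.signal.linear_to_mel_weight_matrix(
    num_mel_bins=num_mel_bins,
    num_spectrogram_bins=num_spectrogram_bins,
    sample_rate=sample_rate,
    lower_edge_hertz=80.0,
    upper_edge_hertz=7600.0,
)

# Project onto the mel scale, take the log, then apply the DCT.
mel_spectrogram = tf.tensordot(spectrogram, linear_to_mel_weight_matrix, 1)
log_mel_spectrogram = tf.math.log(mel_spectrogram + 1.e-6)
mfccs = tf.signal.mfccs_from_log_mel_spectrograms(log_mel_spectrogram)

# Keep only the first few coefficients.
mfccs = mfccs[..., :num_coefficients]
```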
The second is to use the tf.raw_ops module, which results in the following code:
# spectrogram computation
spectrogram = tf.raw_ops.AudioSpectrogram(
    input=sample,
    window_size=window_size_samples,
    stride=window_stride_samples
)
# mfcc computation
mfcc_features = tf.raw_ops.Mfcc(
    spectrogram=spectrogram,
    sample_rate=sample_rate,
    dct_coefficient_count=dct_coefficient_count
)
What puzzles me is that the second approach is much faster (roughly 10x), as the following table shows:
| Operation | tf.signal | tf.raw_ops |
|---|---|---|
| STFT | 5.09 ms | 0.47 ms |
| Mel + MFCC | 3.05 ms | 0.25 ms |
In both cases the same parameters were used (window size, hop size, number of coefficients, and so on). In my tests, the two outputs agree to the third decimal place.
My question is: does anyone have experience with these functions, or can anyone explain this difference in speed?
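To make the measurement and the output comparison reproducible, something like the sketch below could be used. The helper names `time_op` and `max_abs_diff` are hypothetical, and the warm-up and run counts are arbitrary assumptions. The warm-up calls are there so that one-off costs on the first call do not end up in the average. When timing TensorFlow code, the callable passed in should also force the result to be materialized (for example by calling `.numpy()` on it).

```python
import time


def time_op(fn, warmup=5, runs=100):
    """Return the mean wall-clock time of fn() in milliseconds."""
    # Warm-up calls are excluded from the measurement.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed * 1e3 / runs


def max_abs_diff(a, b):
    """Return the largest absolute element-wise difference of two flat sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


# Placeholder workload; replace with the tf.signal or tf.raw_ops call.
mean_ms = time_op(lambda: sum(range(1000)))
print(f"mean time: {mean_ms:.4f} ms")

# Placeholder values; replace with the flattened outputs of both pipelines.
diff = max_abs_diff([1.0, 2.0], [1.0005, 2.0])
print(f"max abs diff: {diff:.6f}")
```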