I have been trying to run the UBM.EM_split() function. I created a feature file feat.h5 (3.8 MB) which stores features from 24 audio files. I tried to use this feature file as input for the feature_list argument of the function. However, the code has been running for over 72 hours with no output or response. On closer inspection, the line where it is frozen is the following:
# Wait for all the tasks to finish
queue_in.join()
Here is the code I used (it is based on the UBM tutorial on the sidekit website):
import sidekit
import os
#Read all the files in the directory
all_files = os.listdir("D:/DatabaseFiles/Sidekit/")
extractor = sidekit.FeaturesExtractor(audio_filename_structure="D:/DatabaseFiles/Sidekit/{}",
                                      feature_filename_structure="D:/Sidekit/Trial/feat.h5",
                                      sampling_frequency=16000,
                                      lower_frequency=200,
                                      higher_frequency=3800,
                                      filter_bank="log",
                                      filter_bank_size=24,
                                      window_size=0.04,
                                      shift=0.01,
                                      ceps_number=20,
                                      vad="snr",
                                      snr=40,
                                      pre_emphasis=0.97,
                                      save_param=["vad", "energy", "cep", "fb"],
                                      keep_all_features=True)
#To iterate through a whole list
for x in all_files:
    extractor.save(x)
server = sidekit.FeaturesServer(feature_filename_structure="D:/Sidekit/Trial/feat.h5",
                                sources=None,
                                dataset_list=["vad", "energy", "cep", "fb"],
                                feat_norm="cmvn",
                                global_cmvn=None,
                                dct_pca=False,
                                dct_pca_config=None,
                                sdc=False,
                                sdc_config=None,
                                delta=True,
                                double_delta=True,
                                delta_filter=None,
                                context=None,
                                traps_dct_nb=None,
                                rasta=True,
                                keep_all_features=True)
ubm = sidekit.Mixture()
ubm.EM_split(features_server=server,
             feature_list="D:/Sidekit/Trial/feat.h5",
             distrib_nb=32,
             iterations=(1, 2, 2, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8),
             num_thread=10,
             save_partial=True,
             ceil_cov=10,
             floor_cov=1e-2)
I also tried the following call, based on a suggestion from an experienced user (feature_list=all_files), but that didn't solve the problem either.
ubm.EM_split(features_server=server,
             feature_list=all_files,
             distrib_nb=32,
             iterations=(1, 2, 2, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8),
             num_thread=10,
             save_partial=True,
             ceil_cov=10,
             floor_cov=1e-2)
I had the same problem in both Windows and Linux environments. Both systems have 32 GB of RAM, and mpi is set to true.
Do you know what I am doing wrong? Should it take this long for an h5 file with features from 24 audio files (feat.h5 is 3.8 MB)?
I did some tweaking on your code and managed to train the UBM using some wav files I had lying around as arbitrary training data.
After editing the directory paths to point to my data, your code extracted the features successfully. When running the EM_split part, it failed, probably with the same error as yours.
The problem is rather simple and has to do with the internal directory structure of the HDF5 file produced by the feature extractor. The FeaturesServer object is not very flexible in how it interprets the file list. One option would therefore be to edit the source code (features_server.py). However, the simplest workaround is to change your list of feature files into something the FeaturesServer can interpret as it is.
Feature extraction:
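Roughly along these lines, as a sketch that reuses the extractor parameters from your question. The key change is the "{}" placeholder in both filename structures, so each show gets its own feature file. The per-show output directory, the stripping of the ".wav" extension, and the exact save_list() keywords are my assumptions, so adjust them to your setup.

import os
import numpy as np
import sidekit

all_files = os.listdir("D:/DatabaseFiles/Sidekit/")
# Show names without the ".wav" extension; one feature file is written per show
show_list = [f[:-4] for f in all_files if f.lower().endswith(".wav")]
channel_list = np.zeros(len(show_list), dtype=int)  # channel 0 of each (mono) file

extractor = sidekit.FeaturesExtractor(audio_filename_structure="D:/DatabaseFiles/Sidekit/{}.wav",
                                      feature_filename_structure="D:/Sidekit/Trial/feat/{}.h5",
                                      sampling_frequency=16000,
                                      lower_frequency=200,
                                      higher_frequency=3800,
                                      filter_bank="log",
                                      filter_bank_size=24,
                                      window_size=0.04,
                                      shift=0.01,
                                      ceps_number=20,
                                      vad="snr",
                                      snr=40,
                                      pre_emphasis=0.97,
                                      save_param=["vad", "energy", "cep", "fb"],
                                      keep_all_features=True)

# Extract features for all shows, optionally in parallel
extractor.save_list(show_list=show_list,
                    channel_list=channel_list,
                    num_thread=10)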
Now you have one HDF5 file for each wav file in the training data. It is not really elegant, since you could have managed with only one, but it works. The function extractor.save_list() is useful because it allows running multiple processes, which speeds up feature extraction a lot.
We can now train the UBM:
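Again only a sketch, keeping the FeaturesServer options from your question. What changes is that feature_filename_structure now points at the per-show feature files, and feature_list is the list of show names rather than a path to an HDF5 file (the directory names carry over my assumptions from the extraction step above).

server = sidekit.FeaturesServer(feature_filename_structure="D:/Sidekit/Trial/feat/{}.h5",
                                sources=None,
                                dataset_list=["vad", "energy", "cep", "fb"],
                                feat_norm="cmvn",
                                global_cmvn=None,
                                dct_pca=False,
                                dct_pca_config=None,
                                sdc=False,
                                sdc_config=None,
                                delta=True,
                                double_delta=True,
                                delta_filter=None,
                                context=None,
                                traps_dct_nb=None,
                                rasta=True,
                                keep_all_features=True)

ubm = sidekit.Mixture()
ubm.EM_split(features_server=server,
             feature_list=show_list,  # list of show names, not a path to the HDF5 file
             distrib_nb=32,
             iterations=(1, 2, 2, 4, 4, 4, 4, 8, 8, 8, 8, 8, 8),
             num_thread=10,
             save_partial=True,
             ceil_cov=10,
             floor_cov=1e-2)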
I recommend adding the following line at the end to save your UBM:
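(Mixture.write() stores the model as an HDF5 file; the file name below is only an example.)

ubm.write("D:/Sidekit/Trial/ubm_32g.h5")  # example path; pick any writable location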
There it is! Let me know if this works for you. Feature extraction and model training took less than 10 minutes (Ubuntu 14.04, Python 3.5.3, SIDEKIT 1.2, 30 minutes of training data at a 16 kHz sample rate).