HTK - What do MFCCs of an HMM model and Input WAV File represent?

Question

815 Views Asked by Ajay H At 28 July 2025 at 08:30

While creating MFCCs by following VoxForge's tutorial for a speech-to-text system using HTK (the Hidden Markov Model Toolkit), we are required to define a prototype model for our phones. I am trying to wrap my head around this file.

~o <VecSize> 25 <MFCC_0_D_N_Z>
~h "proto"
<BeginHMM>
  <NumStates> 5
  <State> 2
    <Mean> 25
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0  
    <Variance> 25
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 
  <State> 3
    <Mean> 25
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0  
    <Variance> 25
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  <State> 4
    <Mean> 25
      0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 
    <Variance> 25
      1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
  <TransP> 5
    0.0 1.0 0.0 0.0 0.0
    0.0 0.6 0.4 0.0 0.0
    0.0 0.0 0.6 0.4 0.0
    0.0 0.0 0.0 0.7 0.3
    0.0 0.0 0.0 0.0 0.0
<EndHMM>

In this case, we are using a feature vector of length 25 to represent every state of the HMM. However, I don't quite understand why we have 25 "means" and "variances" for every state. Do they represent the mean and variance of every feature vector?

Furthermore, why do we have only 3 states defined when <NumStates> is 5? Are <State> 1 and <State> 5 simply entry and exit points, so they do not require a mean and variance?

Also, while taking sample wav files, I printed the MFCCs which displayed as below:

    0:     -15.769  -2.168   8.605   4.979   5.283   1.012   9.631  -0.619   3.622  10.977
             5.733   3.260  44.447  -0.153  -0.281  -0.810  -1.176   0.363  -0.658   0.676
            -1.569   1.363  -1.221   0.815  -0.759   1.427
    1:     -18.345  -3.220   7.177   0.293   7.232   3.111  17.942  -6.957   8.197   6.579
             9.102  -0.569  49.537   0.378  -0.337  -1.277  -1.709   0.623  -0.450   0.162
             0.315   2.088  -1.175   0.624   0.762   1.018
    2:     -15.244  -3.046   5.269   1.441   6.121  -3.326   8.854  -5.297   8.151   7.072
             8.122   1.379  49.036   0.543  -0.119  -1.162  -1.263   1.261  -0.388  -0.234
             0.816   1.195  -1.237  -0.288   1.600   0.244
    3:     -14.143  -3.413   3.887  -1.796   7.981   0.930  10.826   3.294  11.797   7.055
             7.661   8.011  47.243   0.613  -0.020  -0.568  -0.364   1.034  -0.165  -0.812
             2.525   0.351  -1.670  -1.086   1.493  -0.716
    4:     -15.156  -2.669   4.440  -0.293  11.213   0.162  12.020  -1.667   7.794   4.553
             5.013   6.968  46.813  -0.050  -0.092  -0.050  -0.329   0.325   0.585   0.751
             1.253  -0.008  -1.852  -0.845   0.058  -0.430
    5:     -15.323  -3.510   4.750  -0.660   9.856   0.545  12.301   3.855  10.132  -0.511
             5.224   4.104  47.068   0.073   0.151   0.163  -0.180  -0.186  -0.242  -0.335
            -0.577  -0.479  -0.745  -0.167  -1.565   0.013

For every "window", why do we have 26 coefficients instead of 25? What do they all represent? I believe:

1-12 are cepstral coefficients
14-25 are delta coefficients
26 is also a delta coefficient, for the 13th number

But I have no idea what the 13th number in each of these samples represents. They should be in the format <MFCC_0_D_N_Z>, as defined in the prototype file shown at the beginning, which is not explained well in the HTK manual. From page 80 of the manual, I can gather that:

MFCC_0 : MFCC Coefficients
_D : Delta Coefficients
_N : absolute Energy Suppressed
_Z : has Zero Mean Static Coef.

Any explanations would be appreciated.


**Nikolay Shmyrev** · Accepted Answer

Furthermore, why do we have only 3 states defined when <NumStates> is 5? Are <State> 1 and <State> 5 simply entry and exit points, so they do not require a mean and variance?

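Yes, the boundary states (1 and 5) are dummy states. They are non-emitting entry and exit points, so they have no mean or variance; only states 2, 3 and 4 emit feature vectors.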
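
As a rough illustration of that structure, the sketch below copies the transition matrix from the prototype in the question and lists which states emit. The helper names (`emitting_states`, `expected_frames`) are made up for this example and are not part of HTK; the expected duration uses the standard geometric formula 1 / (1 - a_ii) for a state with self-loop probability a_ii.

```python
# Transition matrix copied from the <TransP> 5 block of the prototype.
# Row i, column j holds the probability of moving from state i+1 to state j+1.
TRANS_P = [
    [0.0, 1.0, 0.0, 0.0, 0.0],  # state 1: entry, jumps straight to state 2
    [0.0, 0.6, 0.4, 0.0, 0.0],  # state 2: emitting
    [0.0, 0.0, 0.6, 0.4, 0.0],  # state 3: emitting
    [0.0, 0.0, 0.0, 0.7, 0.3],  # state 4: emitting
    [0.0, 0.0, 0.0, 0.0, 0.0],  # state 5: exit, no outgoing transitions
]


def emitting_states(num_states):
    """Return the 1-based indices of emitting states (all but first and last)."""
    return list(range(2, num_states))


def expected_frames(trans_p, state):
    """Expected number of frames spent in a 1-based state with a self-loop."""
    a_ii = trans_p[state - 1][state - 1]
    return 1.0 / (1.0 - a_ii)


def row_sums(trans_p):
    """Sum of outgoing probabilities for each state."""
    return [sum(row) for row in trans_p]


if __name__ == "__main__":
    print("emitting states:", emitting_states(len(TRANS_P)))
    for s in emitting_states(len(TRANS_P)):
        print("state", s, "expected frames:", round(expected_frames(TRANS_P, s), 2))
    print("row sums:", row_sums(TRANS_P))
```

Every row except the last sums to 1, and the last row is all zeros because nothing leaves the exit state inside this model.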

For every "window", why do we have 26 coefficients instead of 25? What do they all represent? I believe:

The MFCC type is MFCC_0_D, as in step 5 of the tutorial, so those are 13 cepstral coefficients (12 MFCCs plus C0) and 13 deltas. You can also use HList -o -h to print the exact layout:

---------------------------------- Source: ar-03.mfc -----------------------------------
  Sample Bytes:  52       Sample Kind:   MFCC_D_C_K_0
  Num Comps:     26       Sample Period: 10000.0 us
  Num Samples:   648      File Format:   HTK
-------------------------------- Observation Structure ---------------------------------
x:      MFCC-1  MFCC-2  MFCC-3  MFCC-4  MFCC-5  MFCC-6  MFCC-7  MFCC-8  MFCC-9 MFCC-10
       MFCC-11 MFCC-12      C0   Del-1   Del-2   Del-3   Del-4   Del-5   Del-6   Del-7
         Del-8   Del-9  Del-10  Del-11  Del-12   DelC0
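
The header fields printed above come from the 12-byte header at the start of an HTK parameter file. The following is a minimal sketch of decoding such a header in Python, assuming the big-endian layout and the octal qualifier bit values as I recall them from the HTK Book; `decode_header` is a hypothetical helper, not an HTK tool, and the demo header is packed by hand rather than read from a real file.

```python
import struct

# Base parameter kinds, indexed by the low 6 bits of the parmKind field.
BASE_KINDS = ["WAVEFORM", "LPC", "LPREFC", "LPCEPSTRA", "LPDELCEP", "IREFC",
              "MFCC", "FBANK", "MELSPEC", "USER", "DISCRETE", "PLP"]

# Qualifier bits (octal) and the suffix each one adds to the kind name.
QUALIFIERS = [(0o100, "E"), (0o200, "N"), (0o400, "D"), (0o1000, "A"),
              (0o2000, "C"), (0o4000, "Z"), (0o10000, "K"), (0o20000, "0")]


def decode_header(header_bytes):
    """Decode the 12-byte header: nSamples, sampPeriod, sampSize, parmKind."""
    n_samples, samp_period, samp_size, parm_kind = struct.unpack(
        ">iiHH", header_bytes[:12])
    suffixes = [name for bit, name in QUALIFIERS if parm_kind & bit]
    compressed = "C" in suffixes
    bytes_per_value = 2 if compressed else 4  # shorts if compressed, else floats
    return {
        "kind": "_".join([BASE_KINDS[parm_kind & 0o77]] + suffixes),
        "num_samples": n_samples,
        "period_us": samp_period / 10.0,  # stored in units of 100 ns
        "num_comps": samp_size // bytes_per_value,
    }


# Hand-built header for an uncompressed MFCC_0_D file with 26 floats per frame.
demo = struct.pack(">iiHH", 648, 100000, 26 * 4, 6 | 0o400 | 0o20000)

if __name__ == "__main__":
    print(decode_header(demo))
```

The suffix order in the decoded name follows the bit order, so it may not match the order HList prints.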

The type of features stored in the .mfc file may differ from the type of features used in HMM training. The HMM features are computed from the .mfc file on the fly according to the proto specification. So on disk you have 26 MFCC_0_D coefficients, and at training time they are converted to 25 MFCC_0_D_N_Z coefficients by dropping the absolute energy term (C0, the 13th number, while its delta is kept) and normalizing the mean of the static coefficients.
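
To make the 26-to-25 conversion concrete, here is a small Python sketch of the idea. It is not HTK's actual implementation: `to_mfcc_0_d_n_z` is a hypothetical helper, the mean is taken over only the two frames shown (HTK would normalize over the whole file), and the input rows are frames 0 and 1 from the question, assumed to follow the MFCC-1..12, C0, Del-1..12, DelC0 layout printed by HList.

```python
N_STATIC = 12  # MFCC-1 .. MFCC-12
C0_INDEX = 12  # position of the absolute C0 term in a 26-value frame

# Frames 0 and 1 copied from the question's printed MFCCs.
FRAMES = [
    [-15.769, -2.168, 8.605, 4.979, 5.283, 1.012, 9.631, -0.619, 3.622, 10.977,
     5.733, 3.260, 44.447, -0.153, -0.281, -0.810, -1.176, 0.363, -0.658, 0.676,
     -1.569, 1.363, -1.221, 0.815, -0.759, 1.427],
    [-18.345, -3.220, 7.177, 0.293, 7.232, 3.111, 17.942, -6.957, 8.197, 6.579,
     9.102, -0.569, 49.537, 0.378, -0.337, -1.277, -1.709, 0.623, -0.450, 0.162,
     0.315, 2.088, -1.175, 0.624, 0.762, 1.018],
]


def to_mfcc_0_d_n_z(frames):
    """Drop absolute C0 (_N) and subtract the per-coefficient mean (_Z)."""
    means = [sum(f[i] for f in frames) / len(frames) for i in range(N_STATIC)]
    converted = []
    for f in frames:
        static = [f[i] - means[i] for i in range(N_STATIC)]  # zero-mean statics
        deltas = list(f[C0_INDEX + 1:])  # Del-1 .. Del-12 plus DelC0, unchanged
        converted.append(static + deltas)
    return converted


if __name__ == "__main__":
    for row in to_mfcc_0_d_n_z(FRAMES):
        print(len(row), [round(v, 3) for v in row])
```

Each output row has 12 zero-mean static coefficients followed by 13 deltas, which gives the 25 values the prototype expects.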

I don't quite understand why we have 25 "Means" and "Variances" for every state. Do they represent the Mean and Variance of every Feature Vector?

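The means and variances are the Gaussian parameters of the emitting distribution of each HMM state; they are not the mean of a feature vector. Each state has one mean and one variance per feature dimension, which is why there are 25 of each. It is worth reading up on how an HMM works.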
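
As a hedged illustration of what those parameters are used for, the sketch below scores one 25-dimensional feature vector against a state's diagonal-covariance Gaussian, using the all-zero means and unit variances from the prototype. The function name `log_likelihood` is chosen for this example; the formula is the usual log density of independent Gaussians, one per dimension.

```python
import math

VEC_SIZE = 25
PROTO_MEAN = [0.0] * VEC_SIZE      # <Mean> 25 from the prototype
PROTO_VARIANCE = [1.0] * VEC_SIZE  # <Variance> 25 from the prototype


def log_likelihood(x, mean, variance):
    """Log density of x under a Gaussian with a diagonal covariance matrix."""
    total = 0.0
    for xi, mi, vi in zip(x, mean, variance):
        total += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return total


if __name__ == "__main__":
    at_mean = [0.0] * VEC_SIZE
    shifted = [1.0] * VEC_SIZE
    print("score at the mean:", log_likelihood(at_mean, PROTO_MEAN, PROTO_VARIANCE))
    print("score one unit away:", log_likelihood(shifted, PROTO_MEAN, PROTO_VARIANCE))
```

A vector close to a state's mean scores higher than one far from it, and training replaces the placeholder zeros and ones with values estimated from the data.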
