I am currently attending a lecture on automatic speech recognition (ASR). The last lecture was about vector quantization (VQ) and k-nearest neighbors (kNN), as well as binary trees and Gaussian mixture models (GMMs).
According to the lecturer, VQ is used to speed up the evaluation of GMMs by computing only an approximate value of the GMM. This is done by finding the Gaussian in a GMM which would have the highest value and looking that Gaussian's value up (from a previously built dictionary, stored as a binary tree). Each GMM has about 42 Gaussians. According to the lecturer, this should speed the calculation up, because evaluating the natural exponential function exp is computationally expensive.
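To make sure I describe this correctly, here is a minimal sketch of what I think the approximation does (plain Python, diagonal covariances, the codebook/tree construction left out; the names and structure are my own illustration, not from the lecture):
import math

def gmm_log_full(x, weights, means, variances):
    # Exact value: log of the weighted sum over all ~42 diagonal-covariance Gaussians.
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        log_p = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - mi) ** 2 / v)
                    for xi, mi, v in zip(x, mu, var))
        total += w * math.exp(log_p)
    return math.log(total)

def gmm_log_vq(x, weights, means, variances):
    # Approximation as I understood it: pick the Gaussian whose mean is closest
    # to x (in practice via a prebuilt codebook / binary tree rather than this
    # linear scan) and use only that single component's value.
    k = min(range(len(means)),
            key=lambda j: sum((xi - mi) ** 2 for xi, mi in zip(x, means[j])))
    log_p = sum(-0.5 * (math.log(2 * math.pi * v) + (xi - mi) ** 2 / v)
                for xi, mi, v in zip(x, means[k], variances[k]))
    return math.log(weights[k]) + log_p
In the full version, exp is evaluated once per component, which is where the claimed cost would come from.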
I was curious whether this is (still) true, searched for the Python implementation, and found this answer, which explains that exp is calculated by the hardware.
Today's CPUs (and GPUs) are complex and I have very limited knowledge of them. It could still be true that exp is much more expensive than, e.g., comparisons of floats, additions, or multiplications.
Questions
- How expensive is exp in comparison to float comparisons, additions, multiplications, and similar basic operations?
- Did I perhaps misunderstand why VQ is done in ASR?
Experimental evaluation
I tried to get an answer by running an experiment. But it is difficult for me to keep other effects from distorting my numbers (e.g. caches, variable lookup times, the cost of the random number generator, ...).
Currently, I have
#!/usr/bin/env python
import math
import time
import random
# Experiment settings
numbers = 5000000
seed = 0
repetitions = 10
# Experiment
random.seed(seed)
values = [random.uniform(-5, 5) for _ in range(numbers)]
v2 = [random.uniform(-5, 5) for _ in range(numbers)]
# Exp
for i in range(repetitions):
    t0 = time.time()
    ret = [math.exp(x) for x in values]
    t1 = time.time()
    time_delta = t1 - t0
    print("Exp time: %0.4fs (%0.4f per second)" % (time_delta, numbers/time_delta))
# Comparison
for i in range(repetitions):
    t0 = time.time()
    ret = [x+y for x, y in zip(values, v2)]
    t1 = time.time()
    time_delta = t1 - t0
    print("x+y time: %0.4fs (%0.4f per second)" % (time_delta, numbers/time_delta))
But I guess the zip call skews this comparison, because the result is:
Exp time: 1.3640s (3665573.5997 per second)
Exp time: 1.7404s (2872978.6149 per second)
Exp time: 1.5441s (3238178.6480 per second)
Exp time: 1.5161s (3297876.5227 per second)
Exp time: 1.9912s (2511009.5658 per second)
Exp time: 1.3086s (3820818.9478 per second)
Exp time: 1.4770s (3385254.5642 per second)
Exp time: 1.5179s (3294040.1828 per second)
Exp time: 1.3198s (3788392.1744 per second)
Exp time: 1.5752s (3174296.9903 per second)
x+y time: 9.1045s (549179.7651 per second)
x+y time: 2.2017s (2270981.5582 per second)
x+y time: 2.0781s (2406097.0233 per second)
x+y time: 2.1386s (2338005.6240 per second)
x+y time: 1.9963s (2504681.1570 per second)
x+y time: 2.1617s (2313042.3523 per second)
x+y time: 2.3166s (2158293.4313 per second)
x+y time: 2.2966s (2177155.9497 per second)
x+y time: 2.2939s (2179730.8867 per second)
x+y time: 2.3094s (2165055.9488 per second)
This is a correct description. You can find an interesting description of an optimized Gaussian computation in the following paper (see in particular the likelihood computation section):
George Saon, Daniel Povey & Geoffrey Zweig, "Anatomy of an extremely fast LVCSR decoder," Interspeech 2005. http://www.danielpovey.com/files/eurospeech05_george_decoder.pdf
On this point you probably misunderstood the lecturer. The exp itself is not a very significant issue. The Gaussian computation is expensive for other reasons: several thousand Gaussians are scored every frame, each with a few dozen components of about 40 floats each. It is expensive to process all this data because of the amount of memory you have to fetch and store. Gaussian selection helps here by reducing the number of Gaussians several-fold and thus speeds up the computation.
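To get a rough feeling for the numbers (my own back-of-the-envelope estimate, not from the answer above): with, say, 4000 Gaussians of 32 components in 40 dimensions, storing a mean and a variance per dimension in 4-byte floats already gives 4000 * 32 * 40 * 2 * 4 bytes, roughly 41 MB of parameters, and at the usual 100 frames per second a large share of that data is touched again and again.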
Using a GPU is another solution to this problem: by moving scoring to the GPU you can speed it up significantly. However, the HMM search cannot be easily parallelized. It is another important part of decoding, and even if you reduce the scoring cost to zero, decoding will still be slow due to the search.
This is not a meaningful comparison. There are many things you ignore here, such as the cost of the Python zip call (izip would be better). That way you can easily demonstrate any result you like.
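A sketch of a somewhat less skewed measurement (still dominated by Python interpreter overhead, so treat it only as an illustration; pre-pairing the data is my own workaround for the zip cost, not something suggested above):
import math
import random
import timeit

random.seed(0)
values = [random.uniform(-5, 5) for _ in range(1000000)]
pairs = list(zip(values, values))  # pair up front so the timed loop does not pay for zip

exp_time = timeit.timeit(lambda: [math.exp(x) for x in values], number=10)
add_time = timeit.timeit(lambda: [x + y for x, y in pairs], number=10)
mul_time = timeit.timeit(lambda: [x * y for x, y in pairs], number=10)
print("exp: %.3fs  add: %.3fs  mul: %.3fs" % (exp_time, add_time, mul_time))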