I'm trying to understand how a linear classification SVM works (like the one used for HOG people detection). I feel that I'm missing an essential part, and I can't find a clear description that would help me understand it better. I know there are ready-to-use implementations, and in the end I will probably use one of them, but I would like to understand what I'm doing.
As far as I understand, an SVM is trained with a number of feature vectors and their correct classifications. After training, the SVM is fully defined by a set of hyperplanes (each with as many dimensions as the feature vector is long), typically a very small number of them. I would (naively?) expect my trained SVM to be something like:
ax >= b: 0
ax < b: 1

where x is the feature vector and ax = b is the hyperplane. Here I run into problems because:
- I don't understand how, in the aforementioned paper, they end up with a trained SVM that is 1.7 GB. Mine would be something like 64 bits per float * (length of the feature vector + 1).
- Classifying with this SVM would be trivial: one dot product and one comparison (see the sketch after this list). Even though I can't find much information on how long matching with an SVM takes, people seem to be looking for fast implementations.
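To make my mental model concrete, this is roughly what I expect the entire trained classifier and the matching step to look like (a sketch with made-up values; 3780 is just the HOG descriptor length for a 64x128 detection window, and w and b stand in for whatever training actually produces):

    import numpy as np

    # Stand-ins for the trained weights and bias; in reality these
    # would come out of the SVM training step.
    w = np.random.rand(3780)  # one weight per feature dimension
    b = 0.5

    def classify(x):
        # One dot product and one comparison, as described above.
        return 0 if np.dot(w, x) >= b else 1

    x = np.random.rand(3780)      # a feature vector to classify
    print(classify(x))
    print(w.nbytes + 8, "bytes")  # ~30 KB for the whole model, not 1.7 GB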
I'm sure that at some point I misunderstood what I read, but I would like to know where my thinking went wrong. I guess I'm just stuck in the wrong mindset, because the more I read about SVMs, the more I see the above description confirmed, and that just can't be right.
It looks like in the paper they needed 1.7 GB of RAM to train the classifier. To do that they had to load about 14,000 64x128 RGB image patches, which comes to about 1.5 GB when stored as integers.
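Roughly: 14,000 patches * 64 * 128 pixels * 3 channels * 4 bytes per value (assuming each channel is stored as a 32-bit integer) comes to about 1.4 GB, in line with that estimate.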
Once the classifier is computed, you're right: only one weight vector is needed to check on which side of the hyperplane a given sample lies.
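You can check this yourself, for example with scikit-learn (the random data here stands in for real HOG descriptors, and the sizes are made up for illustration):

    import numpy as np
    from sklearn.svm import LinearSVC

    # Toy data in place of real HOG descriptors.
    rng = np.random.RandomState(0)
    X = rng.rand(200, 3780)
    y = rng.randint(0, 2, 200)

    clf = LinearSVC().fit(X, y)

    # The whole trained model: one weight vector plus one bias.
    w = clf.coef_.ravel()     # shape (3780,)
    b = clf.intercept_[0]

    # Classifying really is one dot product and one comparison.
    x = rng.rand(3780)
    label = int(np.dot(w, x) + b > 0)
    assert label == clf.predict(x.reshape(1, -1))[0]
    print(w.nbytes, "bytes")  # about 30 KB, nowhere near 1.7 GB

Note that np.dot(w, x) + b > 0 is the same ax >= b test from your question, just with the bias moved to the left-hand side.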