How does int8 inference really work?

Not sure if this is the right place to ask this kind of question, but I can't really find an example of how int8 inference works at runtime. Here is what I understand so far: given that we are performing uniform symmetric quantization, we calibrate the model, i.e. we find the best scale parameter for each weight tensor (channel-wise) and for the activations (which correspond to the outputs of the activation functions, if I understood correctly). After calibration we can quantize the model by applying these scale parameters and clipping the values that end up outside the dynamic range of the given layer. At this point we have a new neural network where all the weights are int8 in the range [-127, 127], plus some scale parameters for the activations.

What I don't understand is how we perform inference on this new network. Do we feed the input as float32 or directly as int8? Are all the computations done in int8, or do we sometimes cast from int8 to float32 and vice versa? It would be nice to see a real example of e.g. a CONV2D + BIAS + ReLU layer. If you could point me to some useful resources, that would be appreciated.
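To make the question concrete, here is a rough NumPy sketch of what I *think* happens for a single layer. I used a plain matmul instead of a real conv2d (I assume the convolution case is analogous, just with dot products over patches), and all the scale values and shapes are made up for illustration. Please correct me where this is wrong:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- float32 model + scales obtained from calibration ---
W_fp32 = rng.standard_normal((4, 8)).astype(np.float32)   # weights [out, in]
b_fp32 = rng.standard_normal(4).astype(np.float32)        # bias
x_fp32 = rng.standard_normal(8).astype(np.float32)        # input activation

s_w = np.abs(W_fp32).max(axis=1) / 127.0   # per-channel weight scales
s_x = np.abs(x_fp32).max() / 127.0         # per-tensor input scale (from calibration)
s_y = 0.05                                 # output activation scale (made up, from calibration)

# --- "quantize the model" (done once, offline) ---
W_int8 = np.clip(np.round(W_fp32 / s_w[:, None]), -127, 127).astype(np.int8)
b_int32 = np.round(b_fp32 / (s_w * s_x)).astype(np.int32)  # bias in int32 at scale s_w * s_x

# --- inference ---
# quantize the incoming activation on the fly
x_int8 = np.clip(np.round(x_fp32 / s_x), -127, 127).astype(np.int8)

# int8 multiplies, accumulated in int32, bias added in int32
acc_int32 = W_int8.astype(np.int32) @ x_int8.astype(np.int32) + b_int32

# ReLU applied directly in the integer domain
acc_int32 = np.maximum(acc_int32, 0)

# requantize to int8 for the next layer ...
y_int8 = np.clip(np.round(acc_int32 * (s_w * s_x) / s_y), -127, 127).astype(np.int8)

# ... or dequantize back to float32 if the next op runs in float
y_fp32 = y_int8.astype(np.float32) * s_y
```

Is this roughly what frameworks like TensorRT do internally, or does the data flow between layers differently (e.g. staying in float32 between layers)?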

Thanks
