I was able to successfully quantise a PyTorch model for Hugging Face text classification with Intel LPOT (Neural Compressor).
I now have both the original fp32 model and the quantised int8 model on my machine. For inference I loaded the quantised LPOT model with the code below:
from transformers import AutoModelForSequenceClassification
from lpot.utils.pytorch import load

# load the original fp32 model, then restore the quantised weights on top of it
model = AutoModelForSequenceClassification.from_pretrained('fp32/model/path')
modellpot = load("path/to/lpotmodel/", model)
I can see some speed improvement, but I wanted to confirm that the model weights have actually been quantised and use data types such as int8, fp16, etc., which should ideally be the reason for the speed-up. When I iterate over the model weights and print their dtypes, all of them show up as fp32:
for param in modellpot.parameters():
    print(param.data.dtype)
Output:
torch.float32
torch.float32
torch.float32
torch.float32
torch.float32
torch.float32
torch.float32
..
...
How do I verify whether my PyTorch model has been quantised?
Use print(modellpot) to check whether the model is quantized. For example, a Linear layer will be converted to a QuantizedLinear layer. Only layers that have quantized implementations in PyTorch are converted, so not all parameters end up as int8/uint8.
When the model is printed, each quantized layer shows its data type, e.g. dtype=torch.qint8 if int8 quantization has been performed. The quantized weights are stored in packed form inside those layers rather than as regular parameters, which is why the loop over modellpot.parameters() above only shows the remaining fp32 tensors.
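If you want to check this programmatically instead of reading the printed model, a minimal sketch along these lines should work, assuming the converted layers are PyTorch's native quantized Linear modules (exact module paths can differ between PyTorch versions):

import torch.nn.quantized as nnq
import torch.nn.quantized.dynamic as nnqd

# Quantized Linear modules keep their weights packed, so they are not
# returned by modellpot.parameters(); use the weight() accessor instead.
quantized_linear_types = (nnq.Linear, nnqd.Linear)

for name, module in modellpot.named_modules():
    if isinstance(module, quantized_linear_types):
        # weight() returns the quantized weight tensor
        print(name, type(module).__name__, module.weight().dtype)  # expect torch.qint8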