Int8 quantized model performs worse than or similar to non-quantized fp32 or fp16 model #4180

Open
Raj-vivid opened this issue Oct 2, 2024 · 5 comments

@Raj-vivid

I am using a pretrained ConvNeXtV2 model from timm. It comprises LayerNorm and GlobalResponseNorm layers, but even after adding custom quant modules for LayerNorm, LayerNorm2d and GlobalResponseNorm (GRN), I still can't make my model run faster than the base model with an fp16 engine. I am using the Python extension for TensorRT and ModelOpt to perform the quantization.
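
For context, a minimal sketch of the baseline setup being described (the exact ConvNeXtV2 variant name is just a placeholder):

import timm
import torch

# "convnextv2_tiny" is a placeholder variant; any timm ConvNeXtV2 checkpoint has the same layer mix
model = timm.create_model("convnextv2_tiny", pretrained=True)
model = model.eval().cuda()

dummy_input = torch.randn(1, 3, 224, 224, device="cuda")
with torch.no_grad():
    _ = model(dummy_input)  # sanity-check the fp32 forward pass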

My code for creating the custom quant modules is as follows:

import torch
import torch.nn.functional as F
# LayerNorm, LayerNorm2d and GlobalResponseNorm here are the timm layers used by ConvNeXtV2
from timm.layers import LayerNorm, LayerNorm2d, GlobalResponseNorm
from modelopt.torch.quantization.nn import TensorQuantizer
import modelopt.torch.quantization as mtq


class QuantLayerNorm(LayerNorm):
    def __init__(self, normalized_shape):
        super().__init__(normalized_shape)
        self._setup()

    def _setup(self):
        self.input_quantizer = TensorQuantizer()
        self.weight_quantizer = TensorQuantizer()

    def forward(self, input):
        input = self.input_quantizer(input)
        weight = self.weight_quantizer(self.weight)
        return F.layer_norm(input, self.normalized_shape, weight, self.bias, self.eps)
    

class QuantLayerNorm2d(LayerNorm2d):
    def __init__(self, normalized_shape):
        super().__init__(normalized_shape)
        self._setup()

    def _setup(self):
        self.input_quantizer = TensorQuantizer()
        self.weight_quantizer = TensorQuantizer()

    def forward(self, input):
        input = self.input_quantizer(input)
        weight = self.weight_quantizer(self.weight)
        input = input.permute(0, 2, 3, 1)
        input = F.layer_norm(input, self.normalized_shape, weight, self.bias, self.eps)
        input = input.permute(0, 3, 1, 2)
        return input
    

class QuantGlobalResponseNorm(GlobalResponseNorm):
    """Quantized Global Response Normalization layer with Tensor Quantizers."""

    def __init__(self, dim, eps=1e-6, channels_last=True):
        # timm's GlobalResponseNorm.__init__ requires dim; it already sets eps,
        # spatial_dim, channel_dim, wb_shape and the weight/bias parameters,
        # so call it with the arguments and then attach the quantizers
        super().__init__(dim, eps=eps, channels_last=channels_last)

        # Setup quantizers
        self._setup()
    
    def _setup(self):
        self.input_quantizer = TensorQuantizer()
        self.weight_quantizer = TensorQuantizer()
        
        # self.bias_quantizer = TensorQuantizer()
    
    def forward(self, x):
        x = self.input_quantizer(x)
        quant_weight = self.weight_quantizer(self.weight)

        # quant_bias = self.bias_quantizer(self.bias)
        quant_bias = self.bias  

        x_g = x.norm(p=2, dim=self.spatial_dim, keepdim=True)
        x_n = x_g / (x_g.mean(dim=self.channel_dim, keepdim=True) + self.eps)
        return x + torch.addcmul(
            quant_bias.view(self.wb_shape), 
            quant_weight.view(self.wb_shape), 
            x * x_n
        )
    

mtq.register(original_cls=LayerNorm, quantized_cls=QuantLayerNorm)
mtq.register(original_cls=LayerNorm2d, quantized_cls=QuantLayerNorm2d)
mtq.register(original_cls=GlobalResponseNorm, quantized_cls=QuantGlobalResponseNorm)

I am using the following config:

{'quant_cfg': {'*weight_quantizer': {'num_bits': 8, 'axis': 0},
  '*input_quantizer': {'num_bits': 8, 'axis': None},
  '*lm_head*': {'enable': False},
  '*block_sparse_moe.gate*': {'enable': False},
  '*router*': {'enable': False},
  '*output_layer*': {'enable': False},
  'output.*': {'enable': False},
  'nn.BatchNorm1d': {'*': {'enable': False}},
  'nn.BatchNorm2d': {'*': {'enable': False}},
  'nn.BatchNorm3d': {'*': {'enable': False}},
  'nn.LeakyReLU': {'*': {'enable': False}},
  'default': {'enable': False},
  '*output_quantizer': {'num_bits': 8, 'axis': None},
  'LayerNorm2d': {'*': {'enable': True}},
  'LayerNorm': {'*': {'enable': True}},
  'GlobalResponseNorm': {'*': {'enable': True}}},
 'algorithm': 'max'}
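
For completeness, a minimal sketch of how this config would be applied with ModelOpt (here 'config' refers to the dict above, while 'model' and 'calib_loader' are placeholders for my model and calibration dataloader):

import torch
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # run a handful of calibration batches through the model for 'max' calibration
    with torch.no_grad():
        for i, (images, _) in enumerate(calib_loader):
            model(images.cuda())
            if i >= 32:
                break

model = mtq.quantize(model, config, forward_loop)

# export with opset 17 so LayerNorm comes out as a single fused node
torch.onnx.export(
    model,
    torch.randn(1, 3, 224, 224, device="cuda"),
    "finetuned_convnext_int8.onnx",
    opset_version=17,
)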

I am going by the documentation, but it is not clear to me whether I am doing something wrong. Help is much appreciated.

@lix19937

lix19937 commented Oct 5, 2024

How does your PTQ model perform at prediction (accuracy)?

@Rajjeshwar

I tried this with ViT-B and ConvNeXt. I set up a ConvNeXt pipeline in this notebook: https://drive.google.com/file/d/1LTfJsAcTgJ3Rb8BXiAuC66OD-_9zEWSy/view?usp=drive_link

This is actually using PTQ. Int8 does not seem to give any boost in performance over fp16, and in some cases it causes a slight slowdown.
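
A rough sketch of how such an fp16 vs. int8 comparison can be built with the TensorRT Python API (assuming a recent TensorRT 10 release; the ONNX file names are placeholders):

import tensorrt as trt

def build_engine(onnx_path, int8=False):
    # build a serialized engine from an ONNX file (sketch only)
    logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(logger)
    network = builder.create_network()
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parse failed")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)
    if int8:
        # allow int8 kernels for the explicitly quantized (Q/DQ) graph
        config.set_flag(trt.BuilderFlag.INT8)
    return builder.build_serialized_network(network, config)

fp16_engine = build_engine("convnext_fp16.onnx")
int8_engine = build_engine("convnext_int8.onnx", int8=True)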

@Raj-vivid
Author

I tried a few more things. Exporting to ONNX with opset 17 allowed LayerNorm to be exported as a fused node, but even with Q/DQ before the Conv operations, before the GELU activations and before the LayerNorms in my graph, I still get much slower speeds than fp16.

Before: [image]

After: [image]
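
One quick way to double-check the Q/DQ placement is to walk the exported graph, roughly like this (a sketch; the file name is a placeholder):

from collections import Counter
import onnx

m = onnx.load("convnext_int8.onnx")
op_counts = Counter(node.op_type for node in m.graph.node)

# confirm LayerNorm was exported as a fused node and count the Q/DQ pairs
print("LayerNormalization:", op_counts.get("LayerNormalization", 0))
print("QuantizeLinear:", op_counts.get("QuantizeLinear", 0))
print("DequantizeLinear:", op_counts.get("DequantizeLinear", 0))

# list which op types directly consume a DequantizeLinear output
dq_outputs = {o for n in m.graph.node if n.op_type == "DequantizeLinear" for o in n.output}
consumers = Counter(n.op_type for n in m.graph.node if any(i in dq_outputs for i in n.input))
print("Ops fed by DQ:", dict(consumers))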

@Raj-vivid
Author

Hello, I appreciate the help you give the community by answering all the issues in this tracker. Could you please have a look at this?

No valid tactics for finetuned_convnext.convnext.stages.0.blocks.0.mlp.fc1.weight + /finetuned_convnext/convnext/stages/stages.0/blocks/blocks.0/mlp/fc1/weight_quantizer/QuantizeLinear + /finetuned_convnext/convnext/stages/stages.0/blocks/blocks.0/mlp/fc1/Conv + PWN(PWN(PWN(PWN(PWN(PWN(PWN(/finetuned_convnext/convnext/stages/stages.0/blocks/blocks.0/mlp/act/Mul,

I keep getting this when using mtq.INT8_DEFAULT_CFG. Does it mean TensorRT does not support fusing LayerNorm with Conv and GELU activation? I still haven't been able to find the cause of the slowdown.
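
For what it's worth, a crude way to check how many layers actually end up running in int8 is the engine inspector (a sketch; it assumes the engine was built with config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED, and serialized_engine is a placeholder for the built engine bytes):

import json
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)
# serialized_engine: placeholder for an engine built with DETAILED profiling verbosity
engine = runtime.deserialize_cuda_engine(serialized_engine)

inspector = engine.create_engine_inspector()
info = json.loads(inspector.get_engine_information(trt.LayerInformationFormat.JSON))

layers = info.get("Layers", [])
# crude check: count the layer entries whose description mentions Int8
int8_layers = [l for l in layers if "Int8" in json.dumps(l)]
print(f"{len(int8_layers)} / {len(layers)} layers mention Int8")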

@kevinch-nv added the Module:Quantization, triaged and waiting for feedback labels on Feb 11, 2025
@kevinch-nv self-assigned this on Feb 11, 2025
@kevinch-nv
Collaborator

Are you still facing this problem? For faster triage, are you able to provide your fp16 and int8 models?
