Int8 quantized model performs worse than or similar to non-quantized fp32 or fp16 model #4180

Open
Raj-vivid opened this issue Oct 2, 2024 · 5 comments

@Raj-vivid

I am using a pretrained ConvNeXtV2 model from timm. It comprises LayerNorm and GlobalResponseNorm layers, but even after adding custom quant modules for LayerNorm, LayerNorm2d and GlobalResponseNorm (GRN), I still can't make my model run faster than the base model with an fp16 engine. I am using the Python extension for TensorRT and ModelOpt to perform the quantization.
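
For context, a minimal sketch of the baseline setup being described (the exact ConvNeXtV2 variant name is just a placeholder):

import timm
import torch

# "convnextv2_tiny" is a placeholder variant; any timm ConvNeXtV2 checkpoint has the same layer mix
model = timm.create_model("convnextv2_tiny", pretrained=True)
model = model.eval().cuda()

dummy_input = torch.randn(1, 3, 224, 224, device="cuda")
with torch.no_grad():
    _ = model(dummy_input)  # sanity-check the fp32 forward pass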

My code for creating the custom quant modules is as follows:

import torch
import torch.nn.functional as F
# LayerNorm, LayerNorm2d and GlobalResponseNorm here are the timm layers used by ConvNeXtV2
from timm.layers import LayerNorm, LayerNorm2d, GlobalResponseNorm
from modelopt.torch.quantization.nn import TensorQuantizer
import modelopt.torch.quantization as mtq


class QuantLayerNorm(LayerNorm):
    def __init__(self, normalized_shape):
        super().__init__(normalized_shape)
        self._setup()

    def _setup(self):
        self.input_quantizer = TensorQuantizer()
        self.weight_quantizer = TensorQuantizer()

    def forward(self, input):
        input = self.input_quantizer(input)
        weight = self.weight_quantizer(self.weight)
        return F.layer_norm(input, self.normalized_shape, weight, self.bias, self.eps)
    

class QuantLayerNorm2d(LayerNorm2d):
    def __init__(self, normalized_shape):
        super().__init__(normalized_shape)
        self._setup()

    def _setup(self):
        self.input_quantizer = TensorQuantizer()
        self.weight_quantizer = TensorQuantizer()

    def forward(self, input):
        input = self.input_quantizer(input)
        weight = self.weight_quantizer(self.weight)
        input = input.permute(0, 2, 3, 1)
        input = F.layer_norm(input, self.normalized_shape, weight, self.bias, self.eps)
        input = input.permute(0, 3, 1, 2)
        return input
    

class QuantGlobalResponseNorm(GlobalResponseNorm):
    """Quantized Global Response Normalization layer with Tensor Quantizers."""

    def __init__(self, dim, eps=1e-6, channels_last=True):
        # timm's GlobalResponseNorm.__init__ requires dim; it already sets eps,
        # spatial_dim, channel_dim, wb_shape and the weight/bias parameters,
        # so call it with the arguments and then attach the quantizers
        super().__init__(dim, eps=eps, channels_last=channels_last)

        # Setup quantizers
        self._setup()
    
    def _setup(self):
        self.input_quantizer = TensorQuantizer()
        self.weight_quantizer = TensorQuantizer()
        
        # self.bias_quantizer = TensorQuantizer()
    
    def forward(self, x):
        x = self.input_quantizer(x)
        quant_weight = self.weight_quantizer(self.weight)

        # quant_bias = self.bias_quantizer(self.bias)
        quant_bias = self.bias  

        x_g = x.norm(p=2, dim=self.spatial_dim, keepdim=True)
        x_n = x_g / (x_g.mean(dim=self.channel_dim, keepdim=True) + self.eps)
        return x + torch.addcmul(
            quant_bias.view(self.wb_shape), 
            quant_weight.view(self.wb_shape), 
            x * x_n
        )
    

mtq.register(original_cls=LayerNorm, quantized_cls=QuantLayerNorm)
mtq.register(original_cls=LayerNorm2d, quantized_cls=QuantLayerNorm2d)
mtq.register(original_cls=GlobalResponseNorm, quantized_cls=QuantGlobalResponseNorm)

I am using the following config:

{'quant_cfg': {'*weight_quantizer': {'num_bits': 8, 'axis': 0},
  '*input_quantizer': {'num_bits': 8, 'axis': None},
  '*lm_head*': {'enable': False},
  '*block_sparse_moe.gate*': {'enable': False},
  '*router*': {'enable': False},
  '*output_layer*': {'enable': False},
  'output.*': {'enable': False},
  'nn.BatchNorm1d': {'*': {'enable': False}},
  'nn.BatchNorm2d': {'*': {'enable': False}},
  'nn.BatchNorm3d': {'*': {'enable': False}},
  'nn.LeakyReLU': {'*': {'enable': False}},
  'default': {'enable': False},
  '*output_quantizer': {'num_bits': 8, 'axis': None},
  'LayerNorm2d': {'*': {'enable': True}},
  'LayerNorm': {'*': {'enable': True}},
  'GlobalResponseNorm': {'*': {'enable': True}}},
 'algorithm': 'max'}
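
For completeness, a minimal sketch of how this config would be applied with ModelOpt (here 'config' refers to the dict above, while 'model' and 'calib_loader' are placeholders for my model and calibration dataloader):

import torch
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # run a handful of calibration batches through the model for 'max' calibration
    with torch.no_grad():
        for i, (images, _) in enumerate(calib_loader):
            model(images.cuda())
            if i >= 32:
                break

model = mtq.quantize(model, config, forward_loop)

# export with opset 17 so LayerNorm comes out as a single fused node
torch.onnx.export(
    model,
    torch.randn(1, 3, 224, 224, device="cuda"),
    "finetuned_convnext_int8.onnx",
    opset_version=17,
)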

I am going by the documentation, but it is not clear to me whether I am doing something wrong. Help is much appreciated.

@lix19937

lix19937 commented Oct 5, 2024

How does your PTQ model perform at prediction (accuracy)?

@Rajjeshwar

I tried this with ViT-B and ConvNeXt. I set up a ConvNeXt pipeline in this notebook: https://drive.google.com/file/d/1LTfJsAcTgJ3Rb8BXiAuC66OD-_9zEWSy/view?usp=drive_link

This is actually using PTQ. Int8 does not seem to give any boost in performance over fp16, and in some cases it causes a slight slowdown.
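
A rough sketch of how such an fp16 vs. int8 comparison can be built with the TensorRT Python API (assuming a recent TensorRT 10 release; the ONNX file names are placeholders):

import tensorrt as trt

def build_engine(onnx_path, int8=False):
    # build a serialized engine from an ONNX file (sketch only)
    logger = trt.Logger(trt.Logger.INFO)
    builder = trt.Builder(logger)
    network = builder.create_network()
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("ONNX parse failed")

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)
    if int8:
        # allow int8 kernels for the explicitly quantized (Q/DQ) graph
        config.set_flag(trt.BuilderFlag.INT8)
    return builder.build_serialized_network(network, config)

fp16_engine = build_engine("convnext_fp16.onnx")
int8_engine = build_engine("convnext_int8.onnx", int8=True)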

@Raj-vivid
Author

I tried a few more things. Exporting to ONNX with opset 17 allowed LayerNorm to be exported as a fused node, but even with Q/DQ before the Conv operations, before the GELU activations and before the LayerNorms in my graph, I still get much slower speeds than fp16.

Before: [image]

After: [image]
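
One quick way to double-check the Q/DQ placement is to walk the exported graph, roughly like this (a sketch; the file name is a placeholder):

from collections import Counter
import onnx

m = onnx.load("convnext_int8.onnx")
op_counts = Counter(node.op_type for node in m.graph.node)

# confirm LayerNorm was exported as a fused node and count the Q/DQ pairs
print("LayerNormalization:", op_counts.get("LayerNormalization", 0))
print("QuantizeLinear:", op_counts.get("QuantizeLinear", 0))
print("DequantizeLinear:", op_counts.get("DequantizeLinear", 0))

# list which op types directly consume a DequantizeLinear output
dq_outputs = {o for n in m.graph.node if n.op_type == "DequantizeLinear" for o in n.output}
consumers = Counter(n.op_type for n in m.graph.node if any(i in dq_outputs for i in n.input))
print("Ops fed by DQ:", dict(consumers))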

@Raj-vivid
Author

Hello, I appreciate the help you give the community by answering all the issues in this tracker. Could you please have a look at this?

No valid tactics for finetuned_convnext.convnext.stages.0.blocks.0.mlp.fc1.weight + /finetuned_convnext/convnext/stages/stages.0/blocks/blocks.0/mlp/fc1/weight_quantizer/QuantizeLinear + /finetuned_convnext/convnext/stages/stages.0/blocks/blocks.0/mlp/fc1/Conv + PWN(PWN(PWN(PWN(PWN(PWN(PWN(/finetuned_convnext/convnext/stages/stages.0/blocks/blocks.0/mlp/act/Mul,

I keep getting this when using mtq.INT8_DEFAULT_CFG. Does it mean TensorRT does not support fusing LayerNorm with Conv and GELU activation? I still haven't been able to find the cause of the slowdown.
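
For what it's worth, a crude way to check how many layers actually end up running in int8 is the engine inspector (a sketch; it assumes the engine was built with config.profiling_verbosity = trt.ProfilingVerbosity.DETAILED, and serialized_engine is a placeholder for the built engine bytes):

import json
import tensorrt as trt

logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(logger)
# serialized_engine: placeholder for an engine built with DETAILED profiling verbosity
engine = runtime.deserialize_cuda_engine(serialized_engine)

inspector = engine.create_engine_inspector()
info = json.loads(inspector.get_engine_information(trt.LayerInformationFormat.JSON))

layers = info.get("Layers", [])
# crude check: count the layer entries whose description mentions Int8
int8_layers = [l for l in layers if "Int8" in json.dumps(l)]
print(f"{len(int8_layers)} / {len(layers)} layers mention Int8")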

@kevinch-nv added the Module:Quantization, triaged and waiting for feedback labels on Feb 11, 2025
@kevinch-nv self-assigned this on Feb 11, 2025
@kevinch-nv
Collaborator

Are you still facing this problem? For faster triage, are you able to provide your fp16 and int8 models?
