properly handle non-standard bias quantization in FuseQuantizedOps #4059

zjgarvey · 2025-02-27T18:53:29Z

The IR from iree-org/iree#19416 produces numerical accuracy errors because the bias is not quantized with a scale equal to input_scale*weight_scale, which FuseQuantizedOps tacitly assumes. We currently re-quantize the bias with this standard product scale, which does not match the framework implementation.
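To make the mismatch concrete, here is a minimal Python sketch of the re-quantization round trip. It is not the pass's actual code. The three scales are the constants from the IR in this issue (%13, %14, %17); the int8 bias values and the helper names `quantize` and `dequantize` are illustrative assumptions.

```python
# Scales are the constants from the IR in this issue; bias values are illustrative.
INPUT_SCALE = 0.5    # %13
WEIGHT_SCALE = 1.0   # %14
BIAS_SCALE = 0.25    # %17, not equal to INPUT_SCALE * WEIGHT_SCALE


def dequantize(q, scale, zero_point=0):
    """Map a quantized integer back to a real value."""
    return (q - zero_point) * scale


def quantize(x, scale, zero_point=0):
    """Map a real value to an integer (no clamping, as for an i32 accumulator).

    Python's round() rounds half to even, matching onnx.QuantizeLinear.
    """
    return round(x / scale) + zero_point


bias_q = [18, 4, 11, -11]                      # hypothetical int8 bias values
product_scale = INPUT_SCALE * WEIGHT_SCALE     # the scale the fusion assumes

bias_f32 = [dequantize(q, BIAS_SCALE) for q in bias_q]
requantized = [quantize(b, product_scale) for b in bias_f32]
round_trip = [dequantize(q, product_scale) for q in requantized]
errors = [abs(a - b) for a, b in zip(bias_f32, round_trip)]

print("original bias :", bias_f32)
print("after requant :", round_trip)
print("abs error     :", errors)
```

Because the product scale (0.5) is coarser than the bias scale (0.25), every odd int8 bias value lands halfway between two representable values and is rounded, so it cannot survive the round trip exactly.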

My best idea for handling this is to factor out the bias addition when the bias is already quantized with a non-standard scheme. The addition would then default to f32, but we can always add support for mixed-scale quantized add operations later.
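The proposal can be sketched in plain Python on a single dot product standing in for the convolution. This is only an illustration under stated assumptions, not the FuseQuantizedOps implementation: the function names and the integer inputs are hypothetical, and only the three scales come from the IR in this issue.

```python
INPUT_SCALE = 0.5    # %13 in the IR
WEIGHT_SCALE = 1.0   # %14 in the IR
BIAS_SCALE = 0.25    # %17 in the IR
PRODUCT_SCALE = INPUT_SCALE * WEIGHT_SCALE


def reference_f32(x_q, w_q, b_q):
    """Framework behavior: dequantize everything, then compute in f32."""
    acc = sum((x * INPUT_SCALE) * (w * WEIGHT_SCALE) for x, w in zip(x_q, w_q))
    return acc + b_q * BIAS_SCALE


def fused_requantized_bias(x_q, w_q, b_q):
    """Current approach: re-quantize the bias to the product scale and
    add it inside the integer accumulator."""
    b_requant = round(b_q * BIAS_SCALE / PRODUCT_SCALE)
    acc = sum(x * w for x, w in zip(x_q, w_q)) + b_requant
    return acc * PRODUCT_SCALE


def fused_factored_bias(x_q, w_q, b_q):
    """Proposed approach: integer accumulation without bias, dequantize
    the result, then add the dequantized bias in f32."""
    acc = sum(x * w for x, w in zip(x_q, w_q))
    return acc * PRODUCT_SCALE + b_q * BIAS_SCALE


# Hypothetical quantized inputs, weights, and one bias value.
x_q, w_q, b_q = [2, -3, 4], [1, 2, -1], 11

print("reference :", reference_f32(x_q, w_q, b_q))
print("requant   :", fused_requantized_bias(x_q, w_q, b_q))
print("factored  :", fused_factored_bias(x_q, w_q, b_q))
```

With these inputs the factored version reproduces the f32 reference, while the re-quantized bias is off by one rounding step of the bias.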

Here is the IR from the linked issue for reference:

module {
  func.func @main_graph(%arg0: !torch.vtensor<[1,3,224,224],f32>, %arg1: !torch.vtensor<[1,24,112,112],f32>) -> !torch.vtensor<[1,24,112,112],f32> attributes {torch.onnx_meta.ir_version = 8 : si64, torch.onnx_meta.opset_version = 21 : si64, torch.onnx_meta.opset_versions = {ai.onnx.contrib = 1 : si64, ai.onnx.ml = 4 : si64, ai.onnx.preview.training = 1 : si64, ai.onnx.training = 1 : si64, com.microsoft = 1 : si64, com.microsoft.experimental = 1 : si64, com.microsoft.nchwc = 1 : si64, org.pytorch.aten = 1 : si64}, torch.onnx_meta.producer_name = "vai_q_onnx", torch.onnx_meta.producer_version = "1.17.0+43059a7"} {
    %12 = torch.operator "onnx.Constant"() {torch.onnx.value = dense<0> : tensor<si8>} : () -> !torch.vtensor<[],si8> 
    %13 = torch.operator "onnx.Constant"() {torch.onnx.value = dense<5.000000e-01> : tensor<f32>} : () -> !torch.vtensor<[],f32> 
    %14 = torch.operator "onnx.Constant"() {torch.onnx.value = dense<1.000000e+00> : tensor<f32>} : () -> !torch.vtensor<[],f32> 
    %15 = torch.operator "onnx.Constant"() {torch.onnx.value = dense<0> : tensor<si8>} : () -> !torch.vtensor<[],si8> 
    %16 = torch.operator "onnx.Constant"() {torch.onnx.value = dense_resource<_onnx__Conv_1060_quantized> : tensor<24x1x3x3xsi8>} : () -> !torch.vtensor<[24,1,3,3],si8> 
    %17 = torch.operator "onnx.Constant"() {torch.onnx.value = dense<2.500000e-01> : tensor<f32>} : () -> !torch.vtensor<[],f32> 
    %18 = torch.operator "onnx.Constant"() {torch.onnx.value = dense<0> : tensor<si8>} : () -> !torch.vtensor<[],si8> 
    %19 = torch.operator "onnx.Constant"() {torch.onnx.value = dense_resource<_onnx__Conv_1061_quantized> : tensor<24xsi8>} : () -> !torch.vtensor<[24],si8> 
    %24 = torch.operator "onnx.DequantizeLinear"(%16, %14, %15) : (!torch.vtensor<[24,1,3,3],si8>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[24,1,3,3],f32> 
    %25 = torch.operator "onnx.DequantizeLinear"(%19, %17, %18) : (!torch.vtensor<[24],si8>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[24],f32> 
    %35 = torch.operator "onnx.QuantizeLinear"(%arg1, %13, %12) : (!torch.vtensor<[1,24,112,112],f32>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[1,24,112,112],si8> 
    %36 = torch.operator "onnx.DequantizeLinear"(%35, %13, %12) : (!torch.vtensor<[1,24,112,112],si8>, !torch.vtensor<[],f32>, !torch.vtensor<[],si8>) -> !torch.vtensor<[1,24,112,112],f32> 
    %37 = torch.operator "onnx.Conv"(%36, %24, %25) {torch.onnx.auto_pad = "NOTSET", torch.onnx.dilations = [1 : si64, 1 : si64], torch.onnx.group = 24 : si64, torch.onnx.kernel_shape = [3 : si64, 3 : si64], torch.onnx.pads = [1 : si64, 1 : si64, 1 : si64, 1 : si64], torch.onnx.strides = [1 : si64, 1 : si64]} : (!torch.vtensor<[1,24,112,112],f32>, !torch.vtensor<[24,1,3,3],f32>, !torch.vtensor<[24],f32>) -> !torch.vtensor<[1,24,112,112],f32> 
    return %37 : !torch.vtensor<[1,24,112,112],f32>
  }
}

{-#
  dialect_resources: {
    builtin: {
      _onnx__Conv_1060_quantized: "0x0800000000000000FF00000000000000000000000000000000000000000000000000000000000000FBE208EAA4F91B7A0100000000FE0000000000010000010000FE0000320700CEF703FD0200000000FF0000020003F9FDF529FCFEFB0200000001FF0000000000000000000000000000000000020000000000000000000000010000000000010000000000000000000000000000000000000000000000000000FC0100000300000000010000000000000000030000000000000000FF00000000FF0001FE000200000000000000FF00000000000000000000000000",
      _onnx__Conv_1061_quantized: "0x0800000012044E020B59F50B030B0B0F020114FBFE0800FE040B1014"
    }
  }
#-}


zjgarvey mentioned this issue Feb 27, 2025

[numeric] Numeric error for Conv operator with quantize/dequantize iree-org/iree#19416

Closed

zjgarvey marked this as a duplicate of iree-org/iree#19416 Feb 27, 2025
