
ZeroPointDomain as an argument #1264

Open
airMeng opened this issue Nov 12, 2024 · 7 comments

Comments

@airMeng

airMeng commented Nov 12, 2024

Context

Currently, the ZeroPointDomain is bound to the layout:

zero_point_domain = ZeroPointDomain.FLOAT
# Sparse Marlin only supports symmetric quantization.
# NOTE: If we start having lots of layouts that require different configurations,
# we should consider moving this logic somewhere else.
if isinstance(layout, MarlinSparseLayout):
mapping_type = MappingType.SYMMETRIC
preserve_zero = True
zero_point_domain = ZeroPointDomain.INT

Ideally, we should allow the data types of zero points to be specified as arguments. There are two main benefits:

  1. Memory Efficiency: Integer zero points can significantly reduce the memory footprint with a suitably designed kernel. For example, with group_size=32, using int4 zero points instead of bf16 saves 0.375 bits per element (see the quick check after this list).
  2. Community Compatibility: Integer zero points have a well-established ecosystem of recipes and kernels. Making this option available in TorchAO would let us leverage these resources more effectively.
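
A quick back-of-the-envelope check of the 0.375-bit figure in point 1 (one zero point is stored per group of group_size elements):

group_size = 32
bf16_zp_bits_per_elem = 16 / group_size  # 0.5 bits/element for a bf16 zero point
int4_zp_bits_per_elem = 4 / group_size   # 0.125 bits/element for an int4 zero point
print(bf16_zp_bits_per_elem - int4_zp_bits_per_elem)  # 0.375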

Proposals

Add an optional argument to let users specify the data types of zero points:

diff --git a/torchao/quantization/quant_api.py b/torchao/quantization/quant_api.py
index 476cc229..5cd35648 100644
--- a/torchao/quantization/quant_api.py
+++ b/torchao/quantization/quant_api.py
@@ -568,8 +568,8 @@ def int8_dynamic_activation_int4_weight(


 def int4_weight_only(
-    group_size=128, layout=TensorCoreTiledLayout(inner_k_tiles=8), use_hqq=False
-):
+    group_size=128, layout=TensorCoreTiledLayout(inner_k_tiles=8), use_hqq=False,
+    zero_point_dtype=torch.bfloat16, zero_point_domain=ZeroPointDomain.INT):
     """
     Applies uint4 weight-only asymmetric per-group quantization to linear layers, using
     "tensor_core_tiled" layout for speedup with tinygemm kernel
@@ -587,6 +587,8 @@ def int4_weight_only(
          size is more fine grained, choices are [256, 128, 64, 32]
         `layout`: layout type for quantized tensor, default is `TensorCoreTiledLayout(inner_k_tiles=8)`
         `use_hqq`: whether to use hqq or default quantization mode, default is False
+        `zero_point_dtype`: the dtype of zero point, default is torch.bfloat16
+        `zero_point_domain`: the domain of zero point, default is ZeroPointDomain.INT
     """

     def apply_int4_weight_only_quant(weight):
@@ -603,8 +605,8 @@ def int4_weight_only(
         quant_max = 15
         eps = 1e-6
         preserve_zero = False
-        zero_point_dtype = torch.bfloat16
-        zero_point_domain = ZeroPointDomain.FLOAT
+        # zero_point_dtype and zero_point_domain now come from the
+        # int4_weight_only arguments (closed over by this inner function);
+        # re-assigning them here would shadow the outer names and raise
+        # UnboundLocalError, so the hard-coded values are simply removed.

         # Sparse Marlin only supports symmetric quantization.
         # NOTE: If we start having lots of layouts that require different configurations,

Meanwhile, we will overload _weight_int4pack_mm so that zero points and scales can be passed as separate tensors.

An example usage:

import torch
from torchao.quantization.quant_api import (
    quantize_,
    int4_weight_only,
)
from torchao.quantization.quant_primitives import ZeroPointDomain

# request integer zero points instead of the current bf16 / FLOAT default
quantize_(m, int4_weight_only(zero_point_dtype=torch.int8,
                              zero_point_domain=ZeroPointDomain.INT))
@jerryzh168
Contributor

makes sense, I think we can just expose zero_point_domain as an argument; it now has 3 options:

INT = auto()
FLOAT = auto()
NONE = auto()

I think there might be use cases that do not have a zero_point as well.
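
A minimal sketch (illustrative only, not TorchAO's exact code) of how each domain changes dequantization for a uint4 value q in [0, 15]:

def dequantize(q, scale, zero_point, domain):
    if domain == "INT":
        # integer zero point lives in the quantized domain
        return (q - zero_point) * scale
    if domain == "FLOAT":
        # float zero point lives in the dequantized domain (tinygemm-style);
        # the mid-point of the quant range is subtracted first
        mid_point = (15 + 0 + 1) / 2
        return (q - mid_point) * scale + zero_point
    # NONE: symmetric quantization, no zero point at all
    return q * scale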

@HDCharles
Contributor

we can do that, but the packing format of int4_weight_only is not normal.

@airMeng
Author

airMeng commented Nov 13, 2024

> we can do that, but the packing format of int4_weight_only is not normal.

Hi @HDCharles, could you give more details about what "normal" means here?

@airMeng
Author

airMeng commented Nov 15, 2024

@HDCharles The packing of scales and zero points into one tensor might be a limitation of _weight_int4pack_mm. We plan to extend _weight_int4pack_mm to decouple scales and zero points:

- func: _weight_int4pack_mm(Tensor self, Tensor mat2, int qGroupSize, Tensor qScaleAndZeros) -> Tensor

- func: _weight_int4pack_mm_with_scale_and_zeros(Tensor self, Tensor mat2, int qGroupSize, Tensor qScale, Tensor qZeros) -> Tensor
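
A rough sketch of the difference at the call site; shapes are illustrative, and the second op is hypothetical (it does not exist in ATen today, and the exact packed layout may differ):

import torch

in_features, out_features, group_size = 256, 512, 32  # illustrative shapes

# Today: scales and zero points are interleaved into one tensor with a
# trailing dim of size 2 before calling the existing fused op.
scales = torch.randn(in_features // group_size, out_features, dtype=torch.bfloat16)
zeros = torch.randn_like(scales)                        # bf16 zeros, FLOAT domain
scale_and_zeros = torch.stack([scales, zeros], dim=-1)  # (K // group_size, N, 2)
# out = torch.ops.aten._weight_int4pack_mm(x, packed_w, group_size, scale_and_zeros)

# Proposed (hypothetical): keep them separate, so zeros can stay integer
# while scales stay bf16.
int_zeros = torch.zeros_like(scales, dtype=torch.int8)
# out = torch.ops.aten._weight_int4pack_mm_with_scale_and_zeros(
#     x, packed_w, group_size, scales, int_zeros)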

Does it make sense to you?

cc @jgong5

@jerryzh168
Contributor

jerryzh168 commented Nov 15, 2024

> @HDCharles The packing of scales and zero points into one tensor might be a limitation of _weight_int4pack_mm. We plan to extend _weight_int4pack_mm to decouple scales and zero points

this is easy to do with a different layout, like #1278; you can use a different op as well if the packing format is different.

what backend are you planning to develop? xpu?

@airMeng
Author

airMeng commented Nov 15, 2024

> what backend are you planning to develop? xpu?

Yes, but the decoupling itself will also help us leverage existing recipes from the community.

@jerryzh168
Contributor

OK, please let us know your plan and whether #1278 is enough to address the concern here.
