
Commit 0126707

SunMarc and MekkCyber authored
small cleaning of quantization class (#42633)
* small cleaning
* fix
* Apply suggestions from code review

Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
1 parent 626875b commit 0126707

23 files changed (+298, -1173 lines)

docs/source/en/quantization/contribute.md

Lines changed: 11 additions & 7 deletions
@@ -46,26 +46,30 @@ Some quantization methods may require "pre-quantizing" the model through data ca

 ## Create new HFQuantizer class

+0. The best starting point would be to have a look at another quantization method such as fine-grained FP8. You will have to update or create three files in total: the [config file](https://github.com/huggingface/transformers/blob/main/src/transformers/utils/quantization_config.py), the [integration file](https://github.com/huggingface/transformers/blob/main/src/transformers/integrations/finegrained_fp8.py) and the [quantizer file](https://github.com/huggingface/transformers/blob/main/src/transformers/quantizers/quantizer_finegrained_fp8.py).
+
 1. Create a new quantization config class inside [src/transformers/utils/quantization_config.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/utils/quantization_config.py). Add the new quantization config to the [_import_structure](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/__init__.py#L1088) inside Transformers' [src/transformers/__init__.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/__init__.py) file.

 2. Create a new file inside [src/transformers/quantizers/](https://github.com/huggingface/transformers/tree/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers) named `quantizer_your_method.py`, and make it inherit from [`~quantizers.HfQuantizer`]. Make sure to add the new quantizer and quantization config in the quantization auto-mapping in [src/transformers/quantizers/auto.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers/auto.py).

-3. Define the following class attributes and property methods for your quantization method.
+3. Define the following class attributes and property methods for your quantization method:

 - `requires_calibration`: Whether the quantization method requires a data calibration process. If set to `True`, you can only support inference (with quantized weights) and not inference and quantization.
-- `required_packages`: A list of strings of the required packages to use the quantized weights. You might need to define some new utility methods such as `is_auto_awq_available` in [transformers/src/utils/import_utils.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/utils/import_utils.py).
-- `requires_parameters_quantization`: Only required if your quantization method requires extra attention to the underlying [nn.Parameter](https://pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html) object. For example, bitsandbytes uses [`~bitsandbytes.nn.Params4bit`] and [`~bitsandbytes.nn.Int8Params`], which requires some extra attention when quantizing the model. Most of the recent quantization method packs int2 and int4 weights inside [torch.uint8](https://pytorch.org/docs/stable/tensors.html) weights, so this flag should not be really required (set to `False` by default).
 - `is_serializable`: A property method to determine whether the method is serializable or not.
 - `is_trainable`: A property method to determine whether you can fine-tune models on top of the quantization method (with or without PEFT approaches).

 4. Write the `validate_environment` and `update_dtype` methods. These methods are called before creating the quantized model to ensure users use the right configuration. Refer to other quantizers for an example of how this is implemented.

 5. Write the `_process_model_before_weight_loading` method. In Transformers, the quantized models are initialized first on the `"meta"` device before loading the weights. This means the `_process_model_before_weight_loading` method takes care of manipulating the model skeleton to replace some modules ([nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)) with the target modules (quantization modules).

-You can define module replacement logic or any other utility method by creating a new file in [transformers/src/integrations/](https://github.com/huggingface/transformers/tree/abbffc4525566a48a9733639797c812301218b83/src/transformers/integrations) and exposing the relevant methods in that folder's `__init__.py` file. The best starting point would be to have a look at another quantization method such as [quantizer_awq.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers/quantizer_awq.py).
+You can define module replacement logic or any other utility method by creating a new file in [transformers/src/integrations/](https://github.com/huggingface/transformers/tree/abbffc4525566a48a9733639797c812301218b83/src/transformers/integrations) and exposing the relevant methods in that folder's `__init__.py` file.
+
+6. Add the `get_quantize_ops` method to the quantizer class if the quantization method supports quantizing on the fly. In Transformers, we materialize each tensor and apply a sequence of different operations on it; in this case, the quantization operation happens at the end. You need to create an `XXXQuantize` class, a subclass of `ConversionOps`, and add a `convert` method. In the `convert` method, you need to quantize the weights and return a dictionary of quantized params.
+
+7. Add the `get_weight_conversions` method to the quantizer class if the quantization method supports loading pre-quantized weights. In Transformers, we can collect multiple tensors and apply operations on them. This is particularly useful when tensors in the checkpoint need to be regrouped to re-create the quantized tensors.

-6. Write the `_process_model_after_weight_loading` method. This method enables implementing additional features that require manipulating the model after loading the weights.
+8. Write the `_process_model_after_weight_loading` method if needed. This method enables implementing additional features that require manipulating the model after loading the weights.

-7. Document everything! Make sure your quantization method is documented by adding a new file under `docs/source/en/quantization`.
+9. Document everything! Make sure your quantization method is documented by adding a new file under `docs/source/en/quantization`.

-8. You should add tests by adding the package in our nightly Dockerfile inside `docker/transformers-quantization-latest-gpu` and then adding a new test file in `tests/quantization/xxx`. Feel free to check out existing quantization methods to see how it is implemented.
+10. You should add tests by adding the package to our nightly Dockerfile inside `docker/transformers-quantization-latest-gpu` and then adding a new test file in `tests/quantization/xxx`. Feel free to check out existing quantization methods to see how they are implemented.
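The numbered steps above map fairly directly onto a small amount of code. Below is a minimal, illustrative sketch of what a new quantizer might look like: `MyMethodConfig`, `MyMethodHfQuantizer`, and the `my_method_backend` package are hypothetical names, and the exact hook signatures should be checked against the existing quantizers in `src/transformers/quantizers/`.

```python
# Hypothetical sketch only -- names are invented and the hooks are stubbed.
from transformers.quantizers.base import HfQuantizer
from transformers.utils.quantization_config import QuantizationConfigMixin


class MyMethodConfig(QuantizationConfigMixin):
    """Step 1: config class, normally added to src/transformers/utils/quantization_config.py."""

    def __init__(self, bits: int = 4, **kwargs):
        self.bits = bits
        self.quant_method = "my_method"  # assumed field name, mirroring existing configs


class MyMethodHfQuantizer(HfQuantizer):
    """Step 2: quantizer class, normally added as quantizer_my_method.py and registered in auto.py."""

    # Step 3: class attributes.
    requires_calibration = False  # weights can be quantized on the fly

    def validate_environment(self, *args, **kwargs):
        # Step 4: fail early if the (hypothetical) backend package is missing.
        try:
            import my_method_backend  # noqa: F401
        except ImportError as exc:
            raise ImportError("my_method quantization requires the `my_method_backend` package.") from exc

    def _process_model_before_weight_loading(self, model, **kwargs):
        # Step 5: swap nn.Linear modules for quantized modules while the model is still on the
        # "meta" device. The replacement helper would typically live in src/transformers/integrations/.
        ...

    def _process_model_after_weight_loading(self, model, **kwargs):
        # Step 8: optional post-loading fixups.
        return model

    # Step 3: property methods.
    @property
    def is_trainable(self) -> bool:
        return False

    def is_serializable(self, safe_serialization=None):
        return True
```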
src/transformers/integrations/bitsandbytes.py

Lines changed: 1 addition & 1 deletion
@@ -44,7 +44,7 @@ def convert(
         we need to store some parameters to create the quantized weight. For example, bnb requires 6 values that are stored in the checkpoint to recover the quantized weight. So we store them in a dict that is stored in hf_quantizer for now, as we can't save it in the op since we create an op per tensor.
         """
         value = list(input_dict.values())[0]
-        value = value[0] if isinstance(value, list) else value
+        value = value[0]

         # update param name to get the weights instead of the quantized stats
         module, _ = get_module_from_name(model, full_layer_name)
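For the on-the-fly path described in step 6 of the contribution guide above, the quantize op's `convert` receives the materialized tensor(s) and returns every tensor needed to rebuild the quantized parameter (as the bitsandbytes docstring above notes, several extra values may be required). The following is a rough, self-contained sketch of that shape: `MyMethodQuantize` is hypothetical, it uses plain symmetric int8 quantization rather than the real bitsandbytes scheme, and a real implementation would subclass `ConversionOps` and follow the exact input layout the loader passes.

```python
import torch


class MyMethodQuantize:
    """Hypothetical quantize op; a real one would subclass ConversionOps (see step 6 above)."""

    def convert(self, input_dict: dict[str, torch.Tensor], **kwargs) -> dict[str, torch.Tensor]:
        # Assume one full-precision weight per op, keyed by its target parameter name.
        target_name, weight = next(iter(input_dict.items()))

        # Simple symmetric per-tensor int8 quantization (illustrative only).
        scale = weight.abs().max().clamp(min=1e-8) / 127.0
        q_weight = torch.round(weight / scale).clamp(-128, 127).to(torch.int8)

        # Return every tensor needed to rebuild the quantized parameter at load time.
        return {
            target_name: q_weight,
            target_name.replace("weight", "weight_scale"): scale,
        }
```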
src/transformers/quantizers/base.py

Lines changed: 7 additions & 61 deletions
@@ -75,26 +75,14 @@ class HfQuantizer(ABC):
     Attributes
         quantization_config (`transformers.utils.quantization_config.QuantizationConfigMixin`):
             The quantization config that defines the quantization parameters of your model that you want to quantize.
-        modules_to_not_convert (`list[str]`, *optional*):
-            The list of module names to not convert when quantizing the model.
-        required_packages (`list[str]`, *optional*):
-            The list of required pip packages to install prior to using the quantizer
         requires_calibration (`bool`):
             Whether the quantization method requires to calibrate the model before using it.
-        requires_parameters_quantization (`bool`):
-            Whether the quantization method requires to create a new Parameter. For example, for bitsandbytes, it is
-            required to create a new xxxParameter in order to properly quantize the model.
     """

     requires_calibration = False
-    required_packages = None
-    requires_parameters_quantization = False

     def __init__(self, quantization_config: QuantizationConfigMixin, **kwargs):
         self.quantization_config = quantization_config
-
-        # -- Handle extra kwargs below --
-        self.modules_to_not_convert = kwargs.pop("modules_to_not_convert", [])
         self.pre_quantized = kwargs.pop("pre_quantized", True)

         if not self.pre_quantized and self.requires_calibration:
@@ -157,53 +145,16 @@ def param_element_size(self, model: "PreTrainedModel", param_name: str, param: "
             return mapping[custom_dtype]
         return param.element_size()

-    def update_missing_keys(self, model, missing_keys: list[str], prefix: str) -> list[str]:
-        """
-        Override this method if you want to adjust the `missing_keys`.
-
-        Args:
-            missing_keys (`list[str]`, *optional*):
-                The list of missing keys in the checkpoint compared to the state dict of the model
-        """
-        return missing_keys
-
-    def update_expected_keys(self, model, expected_keys: list[str], loaded_keys: list[str]) -> list[str]:
-        """
-        Override this method if you want to adjust the `update_expected_keys`.
-
-        Args:
-            expected_keys (`list[str]`, *optional*):
-                The list of the expected keys in the initialized model.
-            loaded_keys (`list[str]`, *optional*):
-                The list of the loaded keys in the checkpoint.
-        """
-        return expected_keys
-
-    def update_unexpected_keys(self, model, unexpected_keys: list[str]) -> list[str]:
-        return unexpected_keys
-
     def adjust_max_memory(self, max_memory: dict[str, int | str]) -> dict[str, int | str]:
         """adjust max_memory argument for infer_auto_device_map() if extra memory is needed for quantization"""
         return max_memory

     def param_needs_quantization(self, model: "PreTrainedModel", param_name: str, **kwargs) -> bool:
         """
-        Check whether a given param needs quantization as defined by `create_quantized_param`.
+        Check whether a given param needs to be quantized.
         """
         return False

-    def create_quantized_param(self, *args, **kwargs):
-        """
-        Take needed components from state_dict (those from which `param_needs_quantization` is True) and create
-        quantized param.
-        It usually also load the new param directly in the `model`.
-        Note: only applicable if requires_parameters_quantization == True.
-        """
-        if not self.requires_parameters_quantization:
-            raise AttributeError(
-                f"`.create_quantized_param()` method is not supported by quantizer class {self.__class__.__name__}."
-            )
-
     def validate_environment(self, *args, **kwargs):
         """
         This method is used to potentially check for potential conflicts with arguments that are
@@ -263,6 +214,11 @@ def postprocess_model(self, model: "PreTrainedModel", **kwargs):
             kwargs (`dict`, *optional*):
                 The keyword arguments that are passed along `_process_model_after_weight_loading`.
         """
+        model.config.quantization_config = self.quantization_config
+
+        if self.pre_quantized and getattr(self.quantization_config, "dequantize", False):
+            self.remove_quantization_config(model)
+
         return self._process_model_after_weight_loading(model, **kwargs)

     def remove_quantization_config(self, model):
@@ -285,13 +241,7 @@ def dequantize(self, model):
         Note not all quantization schemes support this.
         """
         model = self._dequantize(model)
-
-        # Delete quantizer and quantization config
-        del model.hf_quantizer
-        del model.config.quantization_config
-        del model.config._pre_quantization_dtype
-        del model.quantization_method
-        model.is_quantized = False
+        self.remove_quantization_config(model)

         return model

@@ -353,10 +303,6 @@ def get_state_dict_and_metadata(self, model, safe_serialization=False):
         """Get state dict and metadata. Useful when we need to modify a bit the state dict due to quantization"""
         return None, {}

-    def update_state_dict_with_metadata(self, state_dict, metadata):
-        """Update state dict with metadata. Default behaviour returns state_dict"""
-        return state_dict
-
     @abstractmethod
     def is_serializable(self, safe_serialization=None): ...

src/transformers/quantizers/quantizer_aqlm.py

Lines changed: 1 addition & 5 deletions
@@ -39,12 +39,9 @@ class AqlmHfQuantizer(HfQuantizer):
     """

     requires_calibration = True
-    required_packages = ["aqlm"]
-    optimum_quantizer = None

     def __init__(self, quantization_config: QuantizationConfigMixin, **kwargs):
         super().__init__(quantization_config, **kwargs)
-        self.quantization_config = quantization_config

     def validate_environment(self, *args, **kwargs):
         if not is_accelerate_available():
@@ -77,7 +74,6 @@ def _process_model_before_weight_loading(
             quantization_config=self.quantization_config,
             linear_weights_not_to_quantize=self.quantization_config.linear_weights_not_to_quantize,
         )
-        model.config.quantization_config = self.quantization_config

     @property
     def is_trainable(self) -> bool:
@@ -90,5 +86,5 @@ def is_trainable(self) -> bool:
             )
         return False

-    def is_serializable(self, safe_serialization=None):
+    def is_serializable(self, **kwargs):
         return True

src/transformers/quantizers/quantizer_auto_round.py

Lines changed: 0 additions & 1 deletion
@@ -36,7 +36,6 @@ class AutoRoundQuantizer(HfQuantizer):

     # AutoRound requires data calibration - we support only inference
     requires_calibration = True
-    required_packages = ["auto_round"]

     def __init__(self, quantization_config: QuantizationConfigMixin, **kwargs):
         super().__init__(quantization_config, **kwargs)

src/transformers/quantizers/quantizer_awq.py

Lines changed: 0 additions & 2 deletions
@@ -40,8 +40,6 @@ class AwqQuantizer(HfQuantizer):
     # AWQ requires data calibration - we support only inference
     requires_calibration = True

-    required_packages = ["awq", "accelerate"]
-
     def __init__(self, quantization_config, **kwargs):
         super().__init__(quantization_config, **kwargs)

src/transformers/quantizers/quantizer_bitnet.py

Lines changed: 2 additions & 6 deletions
@@ -37,14 +37,10 @@ class BitNetHfQuantizer(HfQuantizer):
     Check out the paper introducing this method: https://huggingface.co/papers/2402.17764
     """

-    requires_parameters_quantization = False
     requires_calibration = True

-    required_packages = ["accelerate"]
-
     def __init__(self, quantization_config, **kwargs):
         super().__init__(quantization_config, **kwargs)
-        self.quantization_config = quantization_config

     def validate_environment(self, *args, **kwargs):
         if not is_accelerate_available():
@@ -62,8 +58,8 @@ def validate_environment(self, *args, **kwargs):
                 "You have loaded a BitNet model on CPU and have a CUDA device available, make sure to set "
                 "your model on a GPU device in order to run your model."
             )
-        elif device_map is not None:
-            if isinstance(device_map, dict) and ("cpu" in device_map.values() or "disk" in device_map.values()):
+        elif isinstance(device_map, dict):
+            if len(device_map) > 1 and "cpu" in device_map.values() or "disk" in device_map.values():
                 raise ValueError(
                     "You are attempting to load a BitNet model with a device_map that contains a CPU or disk device."
                     "This is not supported. Please remove the CPU or disk device from the device_map."
