
Commit 5f6c8db

fp8 awq examples (#2145)
SUMMARY: Added examples for FP8 AWQ, which now work after the AWQ generalization.

TEST PLAN:
python $REPOS/llm-compressor/examples/awq/fp8_dynamic_llama_example.py 2>&1 | tee fp8_dynamic.log
python $REPOS/llm-compressor/examples/awq/fp8_block_llama_example.py 2>&1 | tee fp8_block.log

<details>
<summary>fp8_dynamic.log</summary>

/home/HDCharles/rhdev/lib/python3.11/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 8.95it/s]
2025-12-17T20:56:18.271169+0000 | reset | INFO - Compression lifecycle reset
2025-12-17T20:56:18.271896+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-12-17T20:56:18.292591+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-12-17T20:56:18.292874+0000 | IndependentPipeline | INFO - Inferred `DataFreePipeline` for `QuantizationModifier`
Updating global scales: 100%|██████████| 224/224 [00:00<00:00, 648394.82it/s]
Fusing global scales: 647it [00:00, 511346.28it/s]
Calibrating weights: 100%|██████████| 224/224 [00:00<00:00, 1596.33it/s]
2025-12-17T20:56:53.594142+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
2025-12-17T20:56:57.580914+0000 | post_process | WARNING - Optimized model is not saved. To save, please provide `output_dir` as input arg. Ex. `oneshot(..., output_dir=...)`
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
========== SAMPLE GENERATION ==============
<|begin_of_text|>Hello my name is Sarah and I am a 30-year-old woman who has been diagnosed with multiple sclerosis (MS). I am here to share my story and to help raise awareness about this chronic and often debilitating disease. I was diagnosed with MS in 2010, when I was 25 years old. At the time, I was working as a teacher and living a normal life. But suddenly, I started experiencing strange symptoms such as numbness in my hands and feet, blurred vision, and fatigue. I went
==========================================
2025-12-17T20:57:24.962901+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 224it [00:12, 18.36it/s]
</details>

<details>
<summary>fp8_block.log</summary>

/home/HDCharles/rhdev/lib/python3.11/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 136.99it/s]
2025-12-17T20:57:53.946116+0000 | reset | INFO - Compression lifecycle reset
2025-12-17T20:57:53.946848+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-12-17T20:57:53.966319+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-12-17T20:57:53.966658+0000 | IndependentPipeline | INFO - Inferred `DataFreePipeline` for `QuantizationModifier`
Updating global scales: 100%|██████████| 224/224 [00:00<00:00, 637397.62it/s]
Fusing global scales: 647it [00:00, 486415.97it/s]
Calibrating weights: 100%|██████████| 224/224 [00:00<00:00, 943.96it/s]
2025-12-17T20:58:00.043737+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
2025-12-17T20:58:03.951940+0000 | post_process | WARNING - Optimized model is not saved. To save, please provide `output_dir` as input arg. Ex. `oneshot(..., output_dir=...)`
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
========== SAMPLE GENERATION ==============
<|begin_of_text|>Hello my name is Kaitlyn and I am a 24-year-old freelance writer and editor. I have a passion for storytelling and a knack for crafting compelling narratives. I have a degree in English Literature and have been writing professionally for over 5 years. I have experience writing articles, blog posts, and website content for a variety of clients, including businesses, non-profits, and individuals. I am also skilled in editing and proofreading, and have worked with clients to refine their writing and ensure it is error
==========================================
2025-12-17T20:58:34.036482+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 224it [00:11, 19.04it/s]
</details>

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
1 parent 3f25fd1 commit 5f6c8db

File tree

2 files changed (+162, -0 lines)
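The two new scripts are identical apart from the AWQ quantization scheme they configure. As a quick orientation before the full diffs below, this is a minimal distillation of the recipes they exercise; nothing here goes beyond what the diffs themselves contain.

    from llmcompressor.modifiers.awq import AWQModifier

    # FP8 AWQ recipes used by the two new example scripts below.
    recipe_fp8_block = [
        AWQModifier(
            ignore=["lm_head"], scheme="FP8_BLOCK", targets=["Linear"], duo_scaling="both"
        )
    ]
    recipe_fp8_dynamic = [
        AWQModifier(
            ignore=["lm_head"], scheme="FP8_DYNAMIC", targets=["Linear"], duo_scaling="both"
        )
    ]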
Lines changed: 81 additions & 0 deletions
examples/awq/fp8_block_llama_example.py (new file)
@@ -0,0 +1,81 @@

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.utils import dispatch_for_generation

# Select model and load it.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


# Configure the quantization algorithm to run.
recipe = [
    AWQModifier(
        ignore=["lm_head"], scheme="FP8_BLOCK", targets=["Linear"], duo_scaling="both"
    ),
]

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-awq-asym"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
Lines changed: 81 additions & 0 deletions
examples/awq/fp8_dynamic_llama_example.py (new file)
@@ -0,0 +1,81 @@

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.utils import dispatch_for_generation

# Select model and load it.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


# Configure the quantization algorithm to run.
recipe = [
    AWQModifier(
        ignore=["lm_head"], scheme="FP8_DYNAMIC", targets=["Linear"], duo_scaling="both"
    ),
]

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-awq-asym"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
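Both examples end by writing a compressed-tensors checkpoint via save_compressed=True but do not show how that checkpoint is consumed. A minimal follow-up sketch, not part of this commit: it assumes a vLLM installation that supports compressed-tensors FP8 checkpoints, and reuses the Meta-Llama-3-8B-Instruct-awq-asym directory that the scripts above produce.

    from vllm import LLM, SamplingParams

    # Hypothetical smoke test (not part of this commit): load the compressed
    # checkpoint written by either example and generate a short completion.
    llm = LLM(model="Meta-Llama-3-8B-Instruct-awq-asym")
    outputs = llm.generate(["Hello my name is"], SamplingParams(max_tokens=100))
    print(outputs[0].outputs[0].text)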
