
Commit 5f6c8db

fp8 awq examples (#2145)
SUMMARY: Added examples for FP8 AWQ, which now work after the AWQ generalization.

TEST PLAN:
python $REPOS/llm-compressor/examples/awq/fp8_dynamic_llama_example.py 2>&1 | tee fp8_dynamic.log
python $REPOS/llm-compressor/examples/awq/fp8_block_llama_example.py 2>&1 | tee fp8_block.log

<details>
<summary>fp8_dynamic.log</summary>

/home/HDCharles/rhdev/lib/python3.11/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 8.95it/s]
2025-12-17T20:56:18.271169+0000 | reset | INFO - Compression lifecycle reset
2025-12-17T20:56:18.271896+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-12-17T20:56:18.292591+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-12-17T20:56:18.292874+0000 | IndependentPipeline | INFO - Inferred `DataFreePipeline` for `QuantizationModifier`
Updating global scales: 100%|██████████| 224/224 [00:00<00:00, 648394.82it/s]
Fusing global scales: 647it [00:00, 511346.28it/s]
Calibrating weights: 100%|██████████| 224/224 [00:00<00:00, 1596.33it/s]
2025-12-17T20:56:53.594142+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
2025-12-17T20:56:57.580914+0000 | post_process | WARNING - Optimized model is not saved. To save, please provide `output_dir` as input arg. Ex. `oneshot(..., output_dir=...)`
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
========== SAMPLE GENERATION ==============
<|begin_of_text|>Hello my name is Sarah and I am a 30-year-old woman who has been diagnosed with multiple sclerosis (MS). I am here to share my story and to help raise awareness about this chronic and often debilitating disease. I was diagnosed with MS in 2010, when I was 25 years old. At the time, I was working as a teacher and living a normal life. But suddenly, I started experiencing strange symptoms such as numbness in my hands and feet, blurred vision, and fatigue. I went
==========================================
2025-12-17T20:57:24.962901+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 224it [00:12, 18.36it/s]
</details>

<details>
<summary>fp8_block.log</summary>

/home/HDCharles/rhdev/lib/python3.11/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 4/4 [00:00<00:00, 136.99it/s]
2025-12-17T20:57:53.946116+0000 | reset | INFO - Compression lifecycle reset
2025-12-17T20:57:53.946848+0000 | from_modifiers | INFO - Creating recipe from modifiers
2025-12-17T20:57:53.966319+0000 | initialize | INFO - Compression lifecycle initialized for 1 modifiers
2025-12-17T20:57:53.966658+0000 | IndependentPipeline | INFO - Inferred `DataFreePipeline` for `QuantizationModifier`
Updating global scales: 100%|██████████| 224/224 [00:00<00:00, 637397.62it/s]
Fusing global scales: 647it [00:00, 486415.97it/s]
Calibrating weights: 100%|██████████| 224/224 [00:00<00:00, 943.96it/s]
2025-12-17T20:58:00.043737+0000 | finalize | INFO - Compression lifecycle finalized for 1 modifiers
2025-12-17T20:58:03.951940+0000 | post_process | WARNING - Optimized model is not saved. To save, please provide `output_dir` as input arg. Ex. `oneshot(..., output_dir=...)`
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
========== SAMPLE GENERATION ==============
<|begin_of_text|>Hello my name is Kaitlyn and I am a 24-year-old freelance writer and editor. I have a passion for storytelling and a knack for crafting compelling narratives. I have a degree in English Literature and have been writing professionally for over 5 years. I have experience writing articles, blog posts, and website content for a variety of clients, including businesses, non-profits, and individuals. I am also skilled in editing and proofreading, and have worked with clients to refine their writing and ensure it is error
==========================================
2025-12-17T20:58:34.036482+0000 | get_model_compressor | INFO - skip_sparsity_compression_stats set to True. Skipping sparsity compression statistic calculations. No sparsity compressor will be applied.
Compressing model: 224it [00:11, 19.04it/s]
</details>

Signed-off-by: HDCharles <charlesdavidhernandez@gmail.com>
1 parent 3f25fd1 commit 5f6c8db

File tree

2 files changed (+162, -0 lines)
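The two new scripts are identical apart from the AWQ quantization scheme they configure. As a quick orientation before the full diffs below, this is a minimal distillation of the recipes they exercise; nothing here goes beyond what the diffs themselves contain.

    from llmcompressor.modifiers.awq import AWQModifier

    # FP8 AWQ recipes used by the two new example scripts below.
    recipe_fp8_block = [
        AWQModifier(
            ignore=["lm_head"], scheme="FP8_BLOCK", targets=["Linear"], duo_scaling="both"
        )
    ]
    recipe_fp8_dynamic = [
        AWQModifier(
            ignore=["lm_head"], scheme="FP8_DYNAMIC", targets=["Linear"], duo_scaling="both"
        )
    ]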
Lines changed: 81 additions & 0 deletions
examples/awq/fp8_block_llama_example.py (new file)
@@ -0,0 +1,81 @@

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.utils import dispatch_for_generation

# Select model and load it.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


# Configure the quantization algorithm to run.
recipe = [
    AWQModifier(
        ignore=["lm_head"], scheme="FP8_BLOCK", targets=["Linear"], duo_scaling="both"
    ),
]

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-awq-asym"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
Lines changed: 81 additions & 0 deletions
examples/awq/fp8_dynamic_llama_example.py (new file)
@@ -0,0 +1,81 @@

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.utils import dispatch_for_generation

# Select model and load it.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]")
ds = ds.shuffle(seed=42)


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


# Configure the quantization algorithm to run.
recipe = [
    AWQModifier(
        ignore=["lm_head"], scheme="FP8_DYNAMIC", targets=["Linear"], duo_scaling="both"
    ),
]

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = MODEL_ID.rstrip("/").split("/")[-1] + "-awq-asym"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
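Both examples end by writing a compressed-tensors checkpoint via save_compressed=True but do not show how that checkpoint is consumed. A minimal follow-up sketch, not part of this commit: it assumes a vLLM installation that supports compressed-tensors FP8 checkpoints, and reuses the Meta-Llama-3-8B-Instruct-awq-asym directory that the scripts above produce.

    from vllm import LLM, SamplingParams

    # Hypothetical smoke test (not part of this commit): load the compressed
    # checkpoint written by either example and generate a short completion.
    llm = LLM(model="Meta-Llama-3-8B-Instruct-awq-asym")
    outputs = llm.generate(["Hello my name is"], SamplingParams(max_tokens=100))
    print(outputs[0].outputs[0].text)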
