# Mistral-format model compression (experimental)

This folder contains tools for compressing Mistral-format models, like `mistralai/Devstral-Small-2505` and `mistralai/Magistral-Small-2506`.

## FP8 W8A8 Quantization

The `fp8_quantize.py` script quantizes Mistral-format models to FP8. It is not for use with HuggingFace-format models.

### 1. Download the model

Download the model and save it to a new "FP8" folder. We use `mistralai/Magistral-Small-2506` as an example.

```bash
huggingface-cli download mistralai/Magistral-Small-2506 --local-dir Magistral-Small-2506-FP8
```

### 2. Clean up HuggingFace-specific files

Models from the Hub often include files for both the native Mistral format and the HuggingFace `transformers` format. This script works on the native format, so the `transformers` files should be removed to avoid confusion.

The HuggingFace-specific files are typically `config.json`, `model-000*-of-000*.safetensors`, and `model.safetensors.index.json`. The `params.json`, `tekken.json`, and `consolidated.safetensors` files are for the native format.

Before deleting, it's a good idea to look at the files in the directory to understand what you're removing.
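
For example, a quick listing shows both file sets side by side:

```bash
ls -lh Magistral-Small-2506-FP8
```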

Once you're ready, remove the `transformers`-specific files:

```bash
rm Magistral-Small-2506-FP8/config.json Magistral-Small-2506-FP8/model.safetensors.index.json Magistral-Small-2506-FP8/model-000*
```

### 3. Run the quantization script

Now, run the FP8 quantization script on the directory. It modifies the checkpoint in place, updating both `consolidated.safetensors` and `params.json`.

```bash
python fp8_quantize.py Magistral-Small-2506-FP8
```
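
For intuition, FP8 (E4M3) quantization stores weights as 8-bit floats plus a scale factor; in a W8A8 scheme, activations are also quantized to FP8 at runtime. The snippet below is a minimal per-tensor sketch of the weight side, not the actual `fp8_quantize.py` implementation, which may use a different scale granularity:

```python
# Illustrative sketch of per-tensor FP8 (E4M3) weight quantization.
# This is NOT the fp8_quantize.py implementation; the real script may
# use a different granularity (e.g., per-channel scales) and metadata.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3


def quantize_fp8(weight: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Return an FP8 weight tensor and the per-tensor scale to undo it."""
    scale = weight.abs().max().float() / FP8_MAX
    qweight = (weight.float() / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return qweight, scale


w = torch.randn(4096, 4096)
qw, s = quantize_fp8(w)
# Dequantize to check the round-trip error:
print((qw.float() * s - w).abs().max())
```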

### 4. Use the quantized model

The model should now be ready to use in vLLM!

```bash
vllm serve Magistral-Small-2506-FP8 --tokenizer-mode mistral --config-format mistral --load-format mistral --tool-call-parser mistral --enable-auto-tool-choice
```
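
Once the server is up, you can sanity-check it through vLLM's OpenAI-compatible API (port 8000 is vLLM's default; the model name matches the path passed to `vllm serve`):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Magistral-Small-2506-FP8",
        "messages": [{"role": "user", "content": "Say hello."}]
      }'
```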

Alternatively, Mistral models that do not have a HuggingFace model definition, such as `mistralai/Devstral-Small-2505`, `mistralai/Magistral-Small-2506`, and `mistralai/mistral-large-3`, can be quantized with the [`model_free_ptq`](/src/llmcompressor/entrypoints/model_free/) entrypoint.