
Conversation

@fishcrap (Collaborator)

Description

This PR adds comprehensive FP8 (8-bit floating point) training support to AReaL, enabling memory-efficient training with low precision while maintaining training stability. The implementation includes:

  • FP8 quantization/dequantization utilities: New fp8_utils.py and fp8_kernels.py modules providing blockwise and per-tensor quantization support
  • CLI configuration: Extended TrainEngineConfig with FP8-related options (fp8 mode, recipe, parameter quantization, etc.)
  • Model loading/saving: Updated HuggingFace model loading and saving to handle FP8 weights with proper conversion between PyTorch FP8 and Transformer Engine FP8 formats
  • Megatron engine integration: Enhanced MegatronEngine to support FP8 training with proper configuration propagation
  • Comprehensive test suite: Added extensive tests for FP8 conversion, BF16 comparison, and gradient correctness

The implementation supports the blockwise quantization scheme and integrates with Transformer Engine's FP8 infrastructure for efficient GEMM operations.
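For illustration, here is a minimal pure-PyTorch sketch of the blockwise e4m3 quantize/dequantize round trip. This is a simplified stand-in, not the actual fp8_utils.py code; the function names and the 128x128 tile size are illustrative assumptions. The real fp8_kernels.py implements this with Triton kernels for performance.

import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3

def quantize_blockwise(x: torch.Tensor, block: int = 128):
    """Quantize a 2D tensor to FP8 e4m3 with one scale per (block x block) tile."""
    rows, cols = x.shape
    assert rows % block == 0 and cols % block == 0
    tiles = x.float().reshape(rows // block, block, cols // block, block)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = FP8_MAX / amax                                   # multiply before casting down
    q = (tiles * scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q.reshape(rows, cols), scale.squeeze(1).squeeze(-1)

def dequantize_blockwise(q: torch.Tensor, scale: torch.Tensor, block: int = 128):
    """Invert quantize_blockwise: divide each tile by its scale and restore the shape."""
    rows, cols = q.shape
    tiles = q.float().reshape(rows // block, block, cols // block, block)
    return (tiles / scale.unsqueeze(1).unsqueeze(-1)).reshape(rows, cols)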

Related Issue

Fixes #(issue)

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

Breaking Change Details (if applicable):

N/A - This is a new feature that adds optional FP8 support without breaking existing functionality.

Additional Context

TODO:

  • Memory profiling
  • Training time reduction
  • Fix MLA

@gemini-code-assist (Contributor)

Summary of Changes

Hello @fishcrap, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades AReaL by integrating comprehensive FP8 training support. The primary goal is to enable memory-efficient training with reduced precision without compromising model stability. This is achieved through the introduction of new FP8 quantization and dequantization utilities, extensive configuration options via the CLI, and updates to model loading and saving processes to handle FP8 weights. The core MegatronEngine has been adapted to leverage these FP8 capabilities, and new tests ensure the reliability of these low-precision operations.

Highlights

  • Comprehensive FP8 Training Support: This PR introduces full 8-bit floating point (FP8) training capabilities to AReaL, enabling more memory-efficient training while striving to maintain training stability.
  • FP8 Quantization Utilities: New modules fp8_utils.py and fp8_kernels.py have been added, providing blockwise and per-tensor quantization and dequantization functionalities, including Triton-based kernels for efficient operations.
  • Extended CLI Configuration: The TrainEngineConfig and MegatronEngineConfig have been significantly extended with numerous FP8-related options, allowing users to configure FP8 mode, scaling recipes, parameter quantization, and other precision-related settings via the command-line interface.
  • Enhanced Model Loading and Saving: HuggingFace model loading and saving mechanisms have been updated to correctly handle FP8 weights, including proper conversion between PyTorch FP8 and Transformer Engine FP8 formats, and dequantization when necessary.
  • MegatronEngine Integration: The MegatronEngine has been enhanced to seamlessly support FP8 training, ensuring that FP8 configurations are correctly propagated and applied throughout the training process.
  • New Test Suite: A comprehensive test suite (test_fp8_conversion.py) has been added to verify the correctness of FP8 conversion, compare results with BF16 baselines, and ensure gradient accuracy.
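For a concrete sense of what such a check looks like, here is a minimal round-trip test sketch. It is not taken from test_fp8_conversion.py; the shape and tolerances are assumptions sized to e4m3's roughly 6% relative precision.

import torch

def test_fp8_e4m3_roundtrip_close_to_bf16():
    torch.manual_seed(0)
    ref = torch.randn(256, 256, dtype=torch.bfloat16)
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = fp8_max / ref.abs().amax().float()               # per-tensor scale for simplicity
    q = (ref.float() * scale).clamp(-fp8_max, fp8_max).to(torch.float8_e4m3fn)
    deq = (q.float() / scale).to(torch.bfloat16)
    # e4m3 has only 3 mantissa bits, so allow a coarse relative tolerance.
    torch.testing.assert_close(deq, ref, rtol=0.125, atol=1e-2)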

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces comprehensive FP8 training support, including new utilities for quantization/dequantization, CLI configurations, and updates to model loading/saving to handle FP8 weights. The changes are extensive and well-structured. I've identified a few areas with TODO or FIXME comments in the new code, particularly in tests and utility functions, that should be addressed to ensure correctness and clarity. The overall implementation seems robust, with good integration into the existing MegatronEngine and the addition of a comprehensive test suite.

@fishcrap changed the title from "Sxj/fp8 train" to "[Feat] Add FP8 training support" on Dec 24, 2025
@garrett4wade (Collaborator) left a comment


The critical issue is that we should enforce an FP8 HF base model when FP8 training is enabled.
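A sketch of what such a guard could look like: the quantization_config.quant_method == "fp8" convention below is the one used by FP8 HF checkpoints, but the helper name and call site are illustrative, not the PR's actual API.

import json
import os

def assert_hf_fp8_checkpoint(model_path: str) -> None:
    """Refuse to start FP8 training unless the HF base model already ships FP8 weights."""
    with open(os.path.join(model_path, "config.json")) as f:
        hf_config = json.load(f)
    quant_method = hf_config.get("quantization_config", {}).get("quant_method")
    if quant_method != "fp8":
        raise ValueError(
            f"FP8 training is enabled but {model_path} is not an FP8 HF checkpoint "
            f"(quant_method={quant_method!r}). Use an FP8 base model or disable fp8."
        )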

    bucket_size: int | None = None
    average_in_collective: bool = False
    fp8_param_gather: bool = False
    data_parallel_sharding_strategy: str = field(

Is this for FSDP or DDP? Does no_shard mean no sharding for optimizer states, or no sharding for parameters?


delete this field

    recompute_modules: list[str] | None = None

    # MoE
    moe_router_dtype: str | None = None

default to float32?


    def get_device_stats(self) -> DeviceRuntimeInfo:
        return DeviceRuntimeInfo.get_current()

    def _check_and_apply_fp8_config(self):

We should also check the transformer_engine installation here. If transformer_engine is not installed (e.g., in a plain uv pip install environment), a runtime error should be raised.
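A minimal sketch of this check, assuming it sits at the top of _check_and_apply_fp8_config (the surrounding attribute names are illustrative):

import importlib.util

def _check_and_apply_fp8_config(self):
    if self.config.fp8 is None:
        return
    # Transformer Engine provides the FP8 GEMM path, so fail fast when it is
    # missing, e.g. in an environment installed without it.
    if importlib.util.find_spec("transformer_engine") is None:
        raise RuntimeError(
            "FP8 training requires transformer_engine, but it is not installed. "
            "Install it or disable fp8."
        )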


should also revert the above change

Comment on lines +476 to +484
    # FP8 Training Configuration
    fp8: str | None = field(
        default=None,
        metadata={
            "help": "Enable FP8 precision training. Options: "
            "'e4m3' (uniform e4m3), "
            "'hybrid' (e4m3 for activations/weights, e5m2 for output activation gradients)."
        },
    )
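For context, this is roughly how the 'e4m3'/'hybrid' strings map onto Transformer Engine's recipe API. DelayedScaling is shown only as the simplest recipe; the PR's blockwise scheme would construct a different recipe object, so this is illustrative rather than the PR's actual mapping code.

from transformer_engine.common.recipe import DelayedScaling, Format

def fp8_recipe_from_config(fp8: str | None):
    # None disables FP8; otherwise pick the TE format matching the CLI string.
    if fp8 is None:
        return None
    fmt = {"e4m3": Format.E4M3, "hybrid": Format.HYBRID}[fp8]
    return DelayedScaling(fp8_format=fmt)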

Can we provide an example YAML config for FP8 Qwen3 training? We should also provide a learning curve obtained with that config (FP8 vs. BF16 training curves).
