Conversation
@tpopp tpopp commented Dec 8, 2025

This fuses a set of kernels that are bound by launch latency into a single aiter.topk_sigmoid call; the fused path is used whenever aiter is available.
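For reference, a minimal sketch of the unfused routing path that the fused call replaces, assuming top-k selection plus sigmoid scoring; the helper name and the renormalize handling here are illustrative, not the PR's exact code:

```python
import torch

def topk_sigmoid_reference(
    router_logits: torch.Tensor, topk: int, renormalize: bool
) -> tuple[torch.Tensor, torch.Tensor]:
    # Two to three separate kernel launches (topk, sigmoid, optional
    # renormalization); each launch pays the latency that the fused op avoids.
    topk_values, topk_ids = torch.topk(router_logits, k=topk, dim=-1)
    topk_weights = torch.sigmoid(topk_values)
    if renormalize:
        topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids
```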

Test Plan:
Server Launch:
vllm serve ${MODEL} --port ${PORT} --swap-space 32 --max-model-len ${MAX_MODEL_LEN} --tensor-parallel-size ${TP} --max-num-seqs ${MAX_NUM_SEQS} --gpu-memory-utilization 0.93 --kv-cache-dtype fp8 --max-num-batched-tokens ${MAX_NUM_BATCHED_TOKENS} --compilation-config "{\"custom_ops\": [\"-rms_norm\", \"-quant_fp8\", \"-silu_and_mul\"] }" --no-enable-prefix-caching --async-scheduling

Example benchmark command:
vllm bench serve --model meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 --host localhost --dataset-name random --random-input-len 1024 --random-output-len 1024 --max-concurrency 4 --num-prompts 48 --ignore-eos

Example correctness check:
lm_eval --model local-completions --tasks gsm8k --model_args base_url=http://0.0.0.0:${PORT}/v1/completions,model=meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8,tokenized_requests=False,tokenizer_backend=None,num_concurrent=128,timeout=120,max_retries=5

Test Result (using 8x MI325X):

Throughput increases by about 4%, and accuracy is comparable or better.

Before:

============ Serving Benchmark Result ============
Successful requests:                     48
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  119.76
Total input tokens:                      49104
Total generated tokens:                  49152
Request throughput (req/s):              0.40
Output token throughput (tok/s):         410.41
Peak output token throughput (tok/s):    428.00
Peak concurrent requests:                8.00
Total Token throughput (tok/s):          820.42
---------------Time to First Token----------------
Mean TTFT (ms):                          88.79
Median TTFT (ms):                        99.22
P99 TTFT (ms):                           111.84
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.67
Median TPOT (ms):                        9.65
P99 TPOT (ms):                           9.80
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.67
Median ITL (ms):                         9.46
P99 ITL (ms):                            14.17
==================================================

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9333|±  |0.0069|
|     |       |strict-match    |     5|exact_match|↑  |0.9356|±  |0.0068|

After:

============ Serving Benchmark Result ============
Successful requests:                     48
Failed requests:                         0
Maximum request concurrency:             4
Benchmark duration (s):                  115.14
Total input tokens:                      49104
Total generated tokens:                  49152
Request throughput (req/s):              0.42
Output token throughput (tok/s):         426.89
Peak output token throughput (tok/s):    444.00
Peak concurrent requests:                8.00
Total Token throughput (tok/s):          853.36
---------------Time to First Token----------------
Mean TTFT (ms):                          87.66
Median TTFT (ms):                        96.94
P99 TTFT (ms):                           118.98
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          9.29
Median TPOT (ms):                        9.29
P99 TPOT (ms):                           9.37
---------------Inter-token Latency----------------
Mean ITL (ms):                           9.29
Median ITL (ms):                         9.09
P99 ITL (ms):                            13.76
==================================================

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.9287|±  |0.0071|
|     |       |strict-match    |     5|exact_match|↑  |0.9310|±  |0.0070|


Signed-off-by: Tres Popp <tres.popp@amd.com>
@mergify mergify bot added llama Related to Llama models rocm Related to AMD ROCm labels Dec 8, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment
Code Review

This pull request introduces an optimization for Llama-4 models on ROCm platforms by using aiter.topk_sigmoid to fuse the topk and sigmoid operations. This is a good optimization for latency-bound kernels. The changes are well-implemented, adding a new rocm_aiter_topk_sigmoid op and conditionally using it when aiter is available. The implementation is consistent with existing aiter ops in the codebase. The provided benchmarks demonstrate a performance improvement. The code is clean and correct, and I have no suggestions for improvement.
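A sketch of the conditional dispatch the review describes, using the unfused reference path above as the fallback (the import path, the torch.ops namespace, and the exact call signature are assumptions, not the PR's code):

```python
import torch
from vllm._aiter_ops import rocm_aiter_ops  # import path is an assumption

def topk_sigmoid(
    router_logits: torch.Tensor, topk: int, renormalize: bool
) -> tuple[torch.Tensor, torch.Tensor]:
    if rocm_aiter_ops.is_enabled():
        # One fused launch on ROCm; the op name comes from the review text,
        # but its namespace and signature here are assumptions.
        return torch.ops.vllm.rocm_aiter_topk_sigmoid(
            router_logits, topk, renormalize
        )
    # Fall back to the unfused sigmoid/topk path sketched in the description.
    return topk_sigmoid_reference(router_logits, topk, renormalize)
```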

Author

tpopp commented Dec 8, 2025

I attempted this as a separate optimization pass, which I believe is the better approach in general, but had no luck. I can gladly change it to that, but I would need guidance. I see topk+sigmoid in traces, but I do not see those operations in dumped PyTorch graphs, and I couldn't figure out how to check whether custom_routing is included in a PyTorch graph. It is part of a hipGraph, but I don't know whether hipGraphs are per PyTorch graph or can include more.
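One generic way to check whether an op survives into a captured graph, independent of any vLLM-specific dump tooling (this is a plain torch.compile sketch, not vLLM's mechanism), is a pass-through backend that prints every FX node:

```python
import torch

def graph_inspector(gm: torch.fx.GraphModule, example_inputs):
    # Print every captured node so topk/sigmoid (or a custom routing op)
    # can be grepped for in the FX graph torch.compile actually sees.
    for node in gm.graph.nodes:
        print(node.op, node.target)
    return gm.forward  # run the graph unmodified

model = torch.nn.Sequential(torch.nn.Linear(8, 8))
compiled = torch.compile(model, backend=graph_inspector)
compiled(torch.randn(2, 8))
```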

12521ae

Author

tpopp commented Dec 8, 2025

Proof of the kernel being used on this PR's branch:

Screenshot 2025-12-08 113300

@tpopp tpopp marked this pull request as ready for review December 8, 2025 11:37
@tpopp tpopp requested a review from tjtanaa as a code owner December 8, 2025 11:37
Collaborator

tjtanaa commented Dec 8, 2025

@tpopp please include accuracy results like gsm8k.

Author

tpopp commented Dec 8, 2025

@tjtanaa I think I included what you are asking for in the two code blocks of the description. Or are you asking for accuracy numbers from a benchmark other than gsm8k?

Collaborator

@tjtanaa tjtanaa left a comment

LGTM

topk: int,
renormalize: bool,
) -> tuple[torch.Tensor, torch.Tensor]:
if rocm_aiter_ops.is_enabled():
Collaborator

Nit: since this is related to MoE, let's change this to use rocm_aiter_ops.is_fused_moe_enabled().
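Applied, the highlighted guard would read as follows (a suggestion-style sketch; this thread does not show whether is_fused_moe_enabled() takes arguments):

```python
if rocm_aiter_ops.is_fused_moe_enabled():
```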

Collaborator

tjtanaa commented Dec 9, 2025

@houseroad If you are free, could you take a quick look? Thank you.


mergify bot commented Dec 10, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tpopp.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 10, 2025