
How to Reproduce 1000 TPS Results from Paper & Ablation Study Details #8

@JamesYen220

Description

Environment

  • Hardware: Single A100-40GB GPU
  • Model: LLaDA 1.5 (demo in README) / LLaDA-MoE-7B-A1B-Instruct-fused (benchmark)

Current Results

We tested the following configurations (the exact commands are reproduced after the list; full logs are attached at the end of this issue):

  • README demo: 40.15 TPS
  • benchmarks/benchmark.py: 47.29 TPS
  • benchmarks/benchmark.py with tp=2 (two GPUs, launched with --gpu 0,1): 22 TPS

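For reference, these are the exact invocations behind the three numbers above; the third one is the two-GPU run whose log is attached below, which is where the 22 TPS figure comes from.

```bash
# Commands used for the three measurements above, in order.
python demo.py                             # README demo            -> 40.15 TPS
python benchmarks/benchmark.py             # single-GPU benchmark   -> 47.29 TPS
python benchmarks/benchmark.py --gpu 0,1   # two-GPU run            -> ~22 TPS
```
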
Questions

1. Reproduction Setup

What is the exact hardware and software setup used to achieve the 1000 TPS reported in the paper?
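For comparison, here is roughly what I can report from my side about the software stack; happy to post the output if that helps (standard commands only, nothing dInfer-specific):

```bash
# GPU model, memory, and driver version (my machine has a single A100-PCIE-40GB).
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
# Python-side library versions that seem most relevant to the TPS numbers.
python -c "import torch, vllm; print('torch', torch.__version__, 'cuda', torch.version.cuda, 'vllm', vllm.__version__)"
```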

2. Ablation Studies

Could you provide details on the individual contribution of each design component to the TPS improvement? Specifically (a sketch of the sweep I could run locally follows this list):

  • Decoding strategy
  • Model architecture/selection
  • Distillation techniques
  • Other optimizations mentioned in the paper
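If it is easier, I can run part of this ablation locally. Below is a sketch of the sweep I had in mind, using the knobs visible in the benchmark's printed Namespace; I'm assuming the CLI flags match those field names (block_length, threshold), and the value grids are only illustrative.

```bash
# Hypothetical sweep over decoding-related knobs exposed by benchmarks/benchmark.py.
# Flag names are inferred from the Namespace dump in the logs; values are guesses.
for bl in 32 64 128; do
  for th in 0.8 0.9 0.95; do
    python benchmarks/benchmark.py \
      --block_length "$bl" \
      --threshold "$th" \
      --exp_name "ablate_bl${bl}_th${th}"
  done
done
```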

3. Reproduction Guide

Are there additional instructions or configurations needed to reproduce the paper's results?

Logs

~/dInfer$ python demo.py 
INFO 10-15 06:43:07 [__init__.py:244] Automatically detected platform cuda.
WARNING 10-15 06:43:12 [config.py:4703] Current vLLM config is not set.
WARNING 10-15 06:43:17 [config.py:4703] Current vLLM config is not set.
INFO 10-15 06:43:17 [parallel_state.py:1065] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
WARNING 10-15 06:43:58 [fused_moe.py:683] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/tonywei/miniconda3/envs/dinfer/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=1024,device_name=NVIDIA_A100-PCIE-40GB.json
Here are all the prime numbers between 1 and 100:

2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97

There are **25 prime numbers** in this range.<|role_end|>
Tokens per second (TPS): 40.15 tokens/sec
[rank0]:[W1015 06:44:01.228481383 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
~/dInfer$ python benchmarks/benchmark.py
INFO 10-15 06:49:19 [__init__.py:244] Automatically detected platform cuda.
Namespace(model_name='/nfsdata/sjtu/huggingface/LLaDA-1.5/', input_data=None, gpu='0', batch_size=1, num_test_iter=2, gen_len=1024, block_length=64, threshold=0.9, exp_name='exp', cache='', tp=False, sliding=False, prefix_look=0, after_look=0, warmup_steps=1)
started 1 0 0 Namespace(model_name='/nfsdata/sjtu/huggingface/LLaDA-1.5/', input_data=None, gpu='0', batch_size=1, num_test_iter=2, gen_len=1024, block_length=64, threshold=0.9, exp_name='exp', cache='', tp=False, sliding=False, prefix_look=0, after_look=0, warmup_steps=1)
rank=0, world size=1
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 119.78it/s]
prompt len: 50 , total len: 1074
[rank0]:[W1015 06:49:56.166279944 ProcessGroupNCCL.cpp:4715] [PG ID 0 PG GUID 0 Rank 0]  using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.80s/it]
Iter: 1, Forward: 158, cache updates: 0, Time: 15.60623288154602, FPS: 10.124160083938708, TPS: 47.28879836675169
To determine how many kilometers Lily can run in 8 hours, we need to break down the problem into two parts: the distance she runs in the first 4 hours and the distance she runs in the next 4 hours.

**Step 1: Calculate the distance Lily runs in the first 4 hours.**

Lily runs at a speed of 12 kilometers per hour for 4 hours. The distance she runs in the first 4 hours is:
\[
\text{Distance} = \text{Speed} \times \text{Time} = 12 \, \text{km/h} \times 4 \, \text{h} = 48 \, \text{km}
\]

**Step 2: Calculate the distance Lily runs in the next 4 hours.**

After the first 4 hours, Lily runs at a speed of 6 kilometers per hour for the next 4 hours. The distance she runs in the next 4 hours is:
\[
\text{Distance} = \text{Speed} \times \text{Time} = 6 \, \text{km/h} \times 4 \, \text{h} = 24 \, \text{km}
\]

**Step 3: Calculate the total distance Lily runs in 8 hours.**

To find the total distance, we add the distance run in the first 4 hours to the distance run in the next 4 hours:
\[
\text{Total Distance} = 48 \, \text{km} + 24 \, \text{km} = 72 \, \text{km}
\]

Therefore, the total distance Lily can run in 8 hours is \boxed{72} kilometers.<|eot_id|>
~/dInfer$ python benchmarks/benchmark.py --gpu 0,1
INFO 10-15 07:11:21 [__init__.py:244] Automatically detected platform cuda.
Namespace(model_name='/nfsdata/sjtu/huggingface/LLaDA-1.5/', input_data=None, gpu='0,1', batch_size=1, num_test_iter=2, gen_len=1024, block_length=64, threshold=0.9, exp_name='exp', cache='', tp=False, sliding=False, prefix_look=0, after_look=0, warmup_steps=1)
WARNING 10-15 07:11:37 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 10-15 07:11:38 [__init__.py:244] Automatically detected platform cuda.
WARNING 10-15 07:11:38 [env_override.py:17] NCCL_CUMEM_ENABLE is set to 0, skipping override. This may increase memory overhead with cudagraph+allreduce: https://github.com/NVIDIA/nccl/issues/1234
INFO 10-15 07:11:38 [__init__.py:244] Automatically detected platform cuda.
started 2 1 1 Namespace(model_name='/nfsdata/sjtu/huggingface/LLaDA-1.5/', input_data=None, gpu='0,1', batch_size=1, num_test_iter=2, gen_len=1024, block_length=64, threshold=0.9, exp_name='exp', cache='', tp=False, sliding=False, prefix_look=0, after_look=0, warmup_steps=1)
started 2 0 0 Namespace(model_name='/nfsdata/sjtu/huggingface/LLaDA-1.5/', input_data=None, gpu='0,1', batch_size=1, num_test_iter=2, gen_len=1024, block_length=64, threshold=0.9, exp_name='exp', cache='', tp=False, sliding=False, prefix_look=0, after_look=0, warmup_steps=1)
rank=1, world size=2
rank=0, world size=2
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 65.70it/s]
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 36.60it/s]
prompt len: 50 , total len: 1074
prompt len: 50 , total len: 1074
/home/miniconda3/envs/dinfer/lib/python3.12/site-packages/torch/_inductor/lowering.py:7007: UserWarning: 
Online softmax is disabled on the fly since Inductor decides to
split the reduction. Cut an issue to PyTorch if this is an
important use case and you want to speed it up with online
softmax.
