
[BUG] Training-only run with allocation_mode=fsdp:* fails due to mandatory inference backend check (sglang / vllm) #752

@lixiaofei123

Description

Checklist

  • The error occurs when using our provided Docker image.
  • I can consistently reproduce the bug across multiple trials or random seeds.
  • If the error causes experiment abortion, I've verified that this error is the root
    cause, not a secondary error caused by peer workers.

Detailed Information

Describe the bug

When following the AReaL debugging best practices and launching the inference and training processes separately,
the inference server starts successfully with allocation_mode=sglang:d4p1t1, but the training process fails to start with allocation_mode=fsdp:d2p1t1.

Specifically, the training process crashes early with the following error:

ValueError: Invalid backend: None, expected sglang or vllm
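
Judging from the traceback in the logs below, the failure comes from PPOTrainer._init_rollout in areal/experimental/trainer/rl.py, which apparently resolves the inference backend from allocation_mode and raises when none is found. The following is only a minimal sketch reconstructed from the traceback; the function name and the "+"-separated parsing are assumptions, not AReaL's actual code:

# Hypothetical reconstruction of the failing check, based only on the traceback.
def _resolve_inference_backend(allocation_mode: str) -> str:
    backend = None
    for segment in allocation_mode.split("+"):
        name = segment.split(":", 1)[0]
        if name in ("sglang", "vllm"):
            backend = name
    if backend is None:
        # Matches the error observed when allocation_mode is "fsdp:d2p1t1".
        raise ValueError(f"Invalid backend: {backend}, expected sglang or vllm")
    return backend

With allocation_mode=fsdp:d2p1t1 there is no sglang/vllm segment, so the backend resolves to None and PPOTrainer aborts during construction, right after the FSDP engines begin initializing.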

Expected behavior

When training and inference run as separate processes, the training-only workers should start successfully with FSDP, without requiring an inference backend in allocation_mode.
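
Purely as an illustration of the expected behavior (this guard is an assumption about one possible direction, not the project's actual design), a training-only launch could simply skip rollout initialization when allocation_mode contains no inference segment:

# Illustrative guard only; names are hypothetical, not AReaL's API.
def maybe_init_rollout(allocation_mode: str, init_rollout):
    has_inference = any(
        segment.split(":", 1)[0] in ("sglang", "vllm")
        for segment in allocation_mode.split("+")
    )
    # Training-only runs (e.g. "fsdp:d2p1t1") get no rollout client.
    return init_rollout() if has_inference else None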

Full logs

[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
20251219-09:39:34.788 [FSDP Engine Rank 1] INFO: Initializing device mesh with parallel dims (dp=2, sp=1, tp=1, ep=1, etp=1, world_size=2).
20251219-09:39:34.788 [FSDP Engine Rank 0] INFO: Initializing device mesh with parallel dims (dp=2, sp=1, tp=1, ep=1, etp=1, world_size=2).
20251219-09:39:34.790 [FSDP Engine Rank 1] INFO: Data parallel head 1 and rank 1
[rank1]: Traceback (most recent call last):
[rank1]:   File "/data/AReaL-0.5.1/examples/math/gsm8k_rl.py", line 62, in <module>
[rank1]:     main(sys.argv[1:])
[rank1]:   File "/data/AReaL-0.5.1/examples/math/gsm8k_rl.py", line 34, in main
[rank1]:     with PPOTrainer(
[rank1]:          ^^^^^^^^^^^
[rank1]:   File "/data/AReaL-0.5.1/areal/experimental/trainer/rl.py", line 95, in __init__
[rank1]:     self.rollout = self._init_rollout(config.rollout, is_eval=False)
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/AReaL-0.5.1/areal/experimental/trainer/rl.py", line 413, in _init_rollout
[rank1]:     raise ValueError(
[rank1]: ValueError: Invalid backend: None, expected sglang or vllm
20251219-09:39:34.791 [FSDP Engine Rank 0] INFO: Data parallel head 0 and rank 0
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/AReaL-0.5.1/examples/math/gsm8k_rl.py", line 62, in <module>
[rank0]:     main(sys.argv[1:])
[rank0]:   File "/data/AReaL-0.5.1/examples/math/gsm8k_rl.py", line 34, in main
[rank0]:     with PPOTrainer(
[rank0]:          ^^^^^^^^^^^
[rank0]:   File "/data/AReaL-0.5.1/areal/experimental/trainer/rl.py", line 95, in __init__
[rank0]:     self.rollout = self._init_rollout(config.rollout, is_eval=False)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data/AReaL-0.5.1/areal/experimental/trainer/rl.py", line 413, in _init_rollout
[rank0]:     raise ValueError(
[rank0]: ValueError: Invalid backend: None, expected sglang or vllm
[rank0]:[W1219 09:39:35.433559464 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank1]:[W1219 09:39:36.464258515 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1219 09:39:36.911000 5172 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 5307 closing signal SIGTERM
E1219 09:39:37.025000 5172 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 5306) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 143, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
examples/math/gsm8k_rl.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-12-19_09:39:36
  host      : li-X640-G40
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5306)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
20251219-09:39:39.022 Local Scheduler INFO: Stopping local process with signal SIGTERM, pid: [5171]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/data/AReaL-0.5.1/areal/launcher/local.py", line 419, in <module>
    main()
  File "/data/AReaL-0.5.1/areal/launcher/local.py", line 261, in main
    local_main(config, run_id=0)
  File "/data/AReaL-0.5.1/areal/launcher/local.py", line 413, in local_main
    raise e
  File "/data/AReaL-0.5.1/areal/launcher/local.py", line 389, in local_main
    launcher.wait(
  File "/data/AReaL-0.5.1/areal/launcher/local.py", line 236, in wait
    raise JobException(
areal.utils.launcher.JobException: Job test_test:trainer JobState.COMPLETED at node local
20251219-09:39:39.972 Local Scheduler INFO: Waiting for 0 local running processes, pids:

To Reproduce

Commit ID

Not provided. The release in use is v0.5.1 (see the /data/AReaL-0.5.1 paths in the logs and the Docker image below).

Environment

Using the released Docker image:

  • Image: ghcr.io/inclusionai/areal-runtime:v0.5.1
  • GPUs: 4 × NVIDIA A10 (24 GB)

Script

nohup python -m areal.launcher.local examples/math/gsm8k_rl.py \
    --config examples/math/gsm8k_grpo.yaml \
    allocation_mode=sglang:d4p1t1 \
    > llm_server.log 2>&1 &

python3 -m areal.launcher.local examples/math/gsm8k_rl.py \
    --config examples/math/gsm8k_grpo.yaml \
    experiment_name=test \
    trial_name=test \
    allocation_mode=fsdp:d2p1t1 \
    cluster.n_nodes=1 \
    cluster.n_gpus_per_node=4 \
    gconfig.max_new_tokens=2048 \
    train_dataset.batch_size=1024 \
    +sglang.attention_backend=triton
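
For clarity, the first command starts the inference-only server job (allocation_mode=sglang:d4p1t1) and the second starts the training-only job (allocation_mode=fsdp:d2p1t1); the second is the one that crashes. A small sketch of how I read such a string, backend prefix plus parallel dims (the exact parsing inside AReaL may differ; this is only an assumption):

# Illustrative parsing of an allocation_mode string; not AReaL's actual code.
import re

def split_allocation_mode(mode: str):
    backend, _, spec = mode.partition(":")
    dims = {key: int(val) for key, val in re.findall(r"([a-z])(\d+)", spec)}
    return backend, dims

# split_allocation_mode("sglang:d4p1t1") -> ("sglang", {"d": 4, "p": 1, "t": 1})
# split_allocation_mode("fsdp:d2p1t1")   -> ("fsdp",   {"d": 2, "p": 1, "t": 1})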
