
[BUG] Training-only run with allocation_mode=fsdp:* fails due to mandatory inference backend check (sglang / vllm) #752

@lixiaofei123

Description

Checklist

  • The error occurs when using our provided Docker image.
  • I can consistently reproduce the bug across multiple trials or random seeds.
  • If the error causes experiment abortion, I've verified that this error is the root
    cause, not a secondary error caused by peer workers.

Detailed Information

Describe the bug

When following the AReaL debugging best practices and launching the inference and training processes separately,
the inference server starts successfully with allocation_mode=sglang:d4p1t1, but the training process fails to start with allocation_mode=fsdp:d2p1t1.

Specifically, the training process crashes early with the following error:

ValueError: Invalid backend: None, expected sglang or vllm
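
Judging from the traceback in the logs below, the failure comes from PPOTrainer._init_rollout in areal/experimental/trainer/rl.py, which apparently resolves the inference backend from allocation_mode and raises when none is found. The following is only a minimal sketch reconstructed from the traceback; the function name and the "+"-separated parsing are assumptions, not AReaL's actual code:

# Hypothetical reconstruction of the failing check, based only on the traceback.
def _resolve_inference_backend(allocation_mode: str) -> str:
    backend = None
    for segment in allocation_mode.split("+"):
        name = segment.split(":", 1)[0]
        if name in ("sglang", "vllm"):
            backend = name
    if backend is None:
        # Matches the error observed when allocation_mode is "fsdp:d2p1t1".
        raise ValueError(f"Invalid backend: {backend}, expected sglang or vllm")
    return backend

With allocation_mode=fsdp:d2p1t1 there is no sglang/vllm segment, so the backend resolves to None and PPOTrainer aborts during construction, right after the FSDP engines begin initializing.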

Expected behavior

When training and inference run as separate processes, the training-only workers should start successfully with FSDP, without requiring an inference backend in allocation_mode.
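
Purely as an illustration of the expected behavior (this guard is an assumption about one possible direction, not the project's actual design), a training-only launch could simply skip rollout initialization when allocation_mode contains no inference segment:

# Illustrative guard only; names are hypothetical, not AReaL's API.
def maybe_init_rollout(allocation_mode: str, init_rollout):
    has_inference = any(
        segment.split(":", 1)[0] in ("sglang", "vllm")
        for segment in allocation_mode.split("+")
    )
    # Training-only runs (e.g. "fsdp:d2p1t1") get no rollout client.
    return init_rollout() if has_inference else None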

Full logs

[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
20251219-09:39:34.788 [FSDP Engine Rank 1] INFO: Initializing device mesh with parallel dims (dp=2, sp=1, tp=1, ep=1, etp=1, world_size=2).
20251219-09:39:34.788 [FSDP Engine Rank 0] INFO: Initializing device mesh with parallel dims (dp=2, sp=1, tp=1, ep=1, etp=1, world_size=2).
20251219-09:39:34.790 [FSDP Engine Rank 1] INFO: Data parallel head 1 and rank 1
[rank1]: Traceback (most recent call last):
[rank1]:   File "/data/AReaL-0.5.1/examples/math/gsm8k_rl.py", line 62, in <module>
[rank1]:     main(sys.argv[1:])
[rank1]:   File "/data/AReaL-0.5.1/examples/math/gsm8k_rl.py", line 34, in main
[rank1]:     with PPOTrainer(
[rank1]:          ^^^^^^^^^^^
[rank1]:   File "/data/AReaL-0.5.1/areal/experimental/trainer/rl.py", line 95, in __init__
[rank1]:     self.rollout = self._init_rollout(config.rollout, is_eval=False)
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/data/AReaL-0.5.1/areal/experimental/trainer/rl.py", line 413, in _init_rollout
[rank1]:     raise ValueError(
[rank1]: ValueError: Invalid backend: None, expected sglang or vllm
20251219-09:39:34.791 [FSDP Engine Rank 0] INFO: Data parallel head 0 and rank 0
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/AReaL-0.5.1/examples/math/gsm8k_rl.py", line 62, in <module>
[rank0]:     main(sys.argv[1:])
[rank0]:   File "/data/AReaL-0.5.1/examples/math/gsm8k_rl.py", line 34, in main
[rank0]:     with PPOTrainer(
[rank0]:          ^^^^^^^^^^^
[rank0]:   File "/data/AReaL-0.5.1/areal/experimental/trainer/rl.py", line 95, in __init__
[rank0]:     self.rollout = self._init_rollout(config.rollout, is_eval=False)
[rank0]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/data/AReaL-0.5.1/areal/experimental/trainer/rl.py", line 413, in _init_rollout
[rank0]:     raise ValueError(
[rank0]: ValueError: Invalid backend: None, expected sglang or vllm
[rank0]:[W1219 09:39:35.433559464 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank1]:[W1219 09:39:36.464258515 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W1219 09:39:36.911000 5172 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 5307 closing signal SIGTERM
E1219 09:39:37.025000 5172 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 5306) of binary: /usr/bin/python3
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 143, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
examples/math/gsm8k_rl.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-12-19_09:39:36
  host      : li-X640-G40
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 5306)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
20251219-09:39:39.022 Local Scheduler INFO: Stopping local process with signal SIGTERM, pid: [5171]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/data/AReaL-0.5.1/areal/launcher/local.py", line 419, in <module>
    main()
  File "/data/AReaL-0.5.1/areal/launcher/local.py", line 261, in main
    local_main(config, run_id=0)
  File "/data/AReaL-0.5.1/areal/launcher/local.py", line 413, in local_main
    raise e
  File "/data/AReaL-0.5.1/areal/launcher/local.py", line 389, in local_main
    launcher.wait(
  File "/data/AReaL-0.5.1/areal/launcher/local.py", line 236, in wait
    raise JobException(
areal.utils.launcher.JobException: Job test_test:trainer JobState.COMPLETED at node local
20251219-09:39:39.972 Local Scheduler INFO: Waiting for 0 local running processes, pids:

To Reproduce

Commit ID

Not provided. The release in use is v0.5.1 (see the /data/AReaL-0.5.1 paths in the logs and the Docker image below).

Environment

Using the released Docker image:

  • Image: ghcr.io/inclusionai/areal-runtime:v0.5.1
  • GPUs: 4 × NVIDIA A10 (24 GB)

Script

nohup python -m areal.launcher.local examples/math/gsm8k_rl.py \
    --config examples/math/gsm8k_grpo.yaml \
    allocation_mode=sglang:d4p1t1 \
    > llm_server.log 2>&1 &

python3 -m areal.launcher.local examples/math/gsm8k_rl.py \
    --config examples/math/gsm8k_grpo.yaml \
    experiment_name=test \
    trial_name=test \
    allocation_mode=fsdp:d2p1t1 \
    cluster.n_nodes=1 \
    cluster.n_gpus_per_node=4 \
    gconfig.max_new_tokens=2048 \
    train_dataset.batch_size=1024 \
    +sglang.attention_backend=triton
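
For clarity, the first command starts the inference-only server job (allocation_mode=sglang:d4p1t1) and the second starts the training-only job (allocation_mode=fsdp:d2p1t1); the second is the one that crashes. A small sketch of how I read such a string, backend prefix plus parallel dims (the exact parsing inside AReaL may differ; this is only an assumption):

# Illustrative parsing of an allocation_mode string; not AReaL's actual code.
import re

def split_allocation_mode(mode: str):
    backend, _, spec = mode.partition(":")
    dims = {key: int(val) for key, val in re.findall(r"([a-z])(\d+)", spec)}
    return backend, dims

# split_allocation_mode("sglang:d4p1t1") -> ("sglang", {"d": 4, "p": 1, "t": 1})
# split_allocation_mode("fsdp:d2p1t1")   -> ("fsdp",   {"d": 2, "p": 1, "t": 1})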
