Description
Checklist
- The error occurs when using our provided Docker image.
- I can consistently reproduce the bug across multiple trials or random seeds.
- If the error causes experiment abortion, I've verified that this error is the root
cause, not a secondary error caused by peer workers.
Detailed Information
Describe the bug
During LoRA-based RL fine-tuning with vLLM v0.11.0, the training crashes with a KeyError in vLLM’s async output handling. Specifically, when finishing a request, vLLM attempts to remove a request ID from lora_stats.running_requests, but the key does not exist. This suggests that the LoRA request state is not properly tracked or has already been cleared.
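The immediate mechanism is Python's set semantics: set.remove raises KeyError when the element is absent. A minimal illustration of the double-finish scenario (hypothetical request ID, not vLLM's actual state handling):

running_requests = {"cmpl-example-0"}
running_requests.remove("cmpl-example-0")  # first finish: OK
running_requests.remove("cmpl-example-0")  # second finish: raises KeyError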
Expected behavior
LoRA RL fine-tuning should complete without errors, and request states should be consistently added to and removed from running_requests without raising a KeyError.
Potential fix
It seems we can fix this bug in AReal by managing the clearing of request state more carefully, though where this should be done is not yet clear. However, similar issues have been raised in vLLM itself due to the brittle nature of this code, and recent versions (v0.12.0+) have fixed these kinds of bugs by rewriting the logic more defensively. See vLLM PR #26801.
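For illustration only, here is a minimal sketch of that defensive pattern (hypothetical class and method names, not vLLM's actual implementation): replacing set.remove, which raises KeyError for a missing element, with set.discard, which is a no-op in that case, so finishing an already-cleared request cannot crash the output handler.

class LoRAStats:
    # Hypothetical stand-in for vLLM's per-LoRA stats tracking.
    def __init__(self) -> None:
        self.running_requests: set[str] = set()

    def add_request(self, request_id: str) -> None:
        self.running_requests.add(request_id)

    def finish_request(self, request_id: str) -> None:
        # discard() instead of remove(): a no-op if the id was
        # already removed, tolerating a double-finish.
        self.running_requests.discard(request_id)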
Full logs
ERROR [async_llm.py:480] AsyncLLM output_handler failed.
Traceback (most recent call last):
File ".../async_llm.py", line 457, in output_handler
processed_outputs = output_processor.process_outputs(...)
File ".../output_processor.py", line 470, in process_outputs
self._update_stats_from_finished(...)
File ".../output_processor.py", line 572, in _update_stats_from_finished
self.lora_states.finish_request(req_state)
File ".../stats.py", line 222, in finish_request
lora_stats.running_requests.remove(req_state.request_id)
KeyError: 'cmpl-247631329f8f4eaf905fbbe34a96b13c-0'
To Reproduce
Commit ID
N/A (observed on released versions)
Environment
- vLLM v0.11.0
- AReal v0.5.1
Script
python3 -m areal.launcher.local \
examples/lora/gsm8k_grpo_lora_vllm.py \
--config examples/lora/gsm8k_grpo_lora_vllm.yaml
Notes:
- The bug happens during LoRA RL fine-tuning with vLLM v0.11.0.
- The immediate cause is a missing request ID in running_requests when removing a finished request.
- This likely occurs because AReal does not properly clear or synchronize LoRA states with vLLM.
- Similar issues have been reported in vLLM and led to a rewrite of this logic (see vLLM PR #26801).
- In newer versions of vLLM, this issue appears to be resolved.