[BUG] KeyError in vLLM LoRA request cleanup during LoRA RL fine-tuning (missing request_id in running_requests) #751

@gursimar

Description

Checklist

  • The error occurs when using our provided Docker image.
  • I can consistently reproduce the bug across multiple trials or random seeds.
  • If the error causes experiment abortion, I've verified that this error is the root
    cause, not a secondary error caused by peer workers.

Detailed Information

Describe the bug

During LoRA-based RL fine-tuning with vLLM v0.11.0, the training crashes with a KeyError in vLLM’s async output handling. Specifically, when finishing a request, vLLM attempts to remove a request ID from lora_stats.running_requests, but the key does not exist. This suggests that the LoRA request state is not properly tracked or has already been cleared.
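For context, here is a minimal standalone illustration of the failure mode in plain Python (this is not vLLM's code): running_requests behaves like a set, and calling set.remove() with an id that was never added, or was already removed, raises exactly this KeyError.

# Standalone illustration (not vLLM code): removing an id that is
# absent from a set raises KeyError.
running_requests: set[str] = set()
request_id = "cmpl-247631329f8f4eaf905fbbe34a96b13c-0"

running_requests.add(request_id)
running_requests.remove(request_id)  # first removal succeeds

# A duplicate "finish" for the same request, or a finish for a request
# that was never tracked, hits the same line again and crashes:
running_requests.remove(request_id)  # KeyError: 'cmpl-2476...-0'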

Expected behavior

LoRA RL fine-tuning should complete without errors, and request states should be consistently added to and removed from running_requests without raising a KeyError.

Potential fix

It seems we could fix this bug in AReal by managing request-state cleanup more carefully, though where to do so is not yet clear. Similar issues have been raised against vLLM itself, owing to the brittleness of this code path; recent versions (v0.12.0+) fix this class of bug by rewriting the logic more defensively. See vLLM PR #26801.
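As a sketch of that defensive pattern (illustrative only; finish_request_defensive is a hypothetical name, not vLLM's actual implementation), the cleanup can tolerate an already-removed id instead of raising:

import logging

logger = logging.getLogger(__name__)

def finish_request_defensive(running_requests: set[str], request_id: str) -> None:
    # Assumes running_requests is a set, as the traceback suggests.
    # discard() is a no-op when the id is absent, so a duplicate or
    # out-of-order finish is logged instead of crashing the output handler.
    if request_id in running_requests:
        running_requests.discard(request_id)
    else:
        logger.warning("finish_request called for untracked request %s", request_id)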

Full logs

ERROR [async_llm.py:480] AsyncLLM output_handler failed.
Traceback (most recent call last):
  File ".../async_llm.py", line 457, in output_handler
    processed_outputs = output_processor.process_outputs(...)
  File ".../output_processor.py", line 470, in process_outputs
    self._update_stats_from_finished(...)
  File ".../output_processor.py", line 572, in _update_stats_from_finished
    self.lora_states.finish_request(req_state)
  File ".../stats.py", line 222, in finish_request
    lora_stats.running_requests.remove(req_state.request_id)
KeyError: 'cmpl-247631329f8f4eaf905fbbe34a96b13c-0'

To Reproduce

Commit ID

N/A (observed on released versions)

Environment

  • vLLM v0.11.0
  • AReal v0.5.1

Script

python3 -m areal.launcher.local \
  examples/lora/gsm8k_grpo_lora_vllm.py \
  --config examples/lora/gsm8k_grpo_lora_vllm.yaml

Notes:

  • The bug happens during LoRA RL fine-tuning with vLLM v0.11.0.
  • The immediate cause is a missing request ID in running_requests when removing a finished request.
  • This likely occurs because AReal does not properly clear or synchronize LoRA request state with vLLM; a sketch of an idempotent cleanup guard follows this list.
  • Similar issues have been reported in vLLM and led to a rewrite of this logic (see vLLM PR #26801).
  • In newer versions of vLLM, this issue appears to be resolved.
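A minimal sketch of one possible AReal-side mitigation, assuming the root cause is a duplicate finish for the same request id (all names below are illustrative, not AReal's actual API): make cleanup idempotent by recording which ids have already been finished.

# Illustrative idempotent tracker; names are hypothetical, not AReal's API.
class RequestTracker:
    def __init__(self) -> None:
        self.running: set[str] = set()
        self.finished: set[str] = set()

    def start(self, request_id: str) -> None:
        self.running.add(request_id)

    def finish(self, request_id: str) -> None:
        # A second finish for the same id is a no-op instead of a KeyError.
        if request_id in self.finished:
            return
        self.running.discard(request_id)
        self.finished.add(request_id)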
