[BUG] The advantage of EOS is always 0 #760

@dirtyDan0

Checklist

  • The error occurs when using our provided Docker image.
  • I can consistently reproduce the bug across multiple trials or random seeds.
  • If the error causes experiment abortion, I've verified that this error is the root
    cause, not a secondary error caused by peer workers.

Detailed Information

Describe the bug

In the current implementation, advantages are computed only for non-EOS tokens. However, the EOS token is also generated by the LLM, and frameworks such as verl and OpenRLHF compute an advantage for it. Lines 207-208 are the main cause.

# Compute KL-regularized rewards.
attn_mask = data["attention_mask"]
seqlens = attn_mask.sum(-1).long()
seq_no_eos_mask = seqlens == attn_mask.shape[1]
rewards = -self.kl_ctl * self.kl_estimator(old_logp, ref_logp)
kl_rewards = rewards.clone()
# KL rewards at the next token after eos is zero.
rewards[batch_indices, seqlens - 1] = 0
indices = torch.clip(seqlens - 2, min=0)
if self.mask_no_eos_with_zero:
    rewards[batch_indices, indices] += torch.where(
        seq_no_eos_mask, 0, reward_score
    )
else:
    rewards[batch_indices, indices] += reward_score
# Compute GAE.
if "values" not in data:
    values = torch.zeros_like(rewards)
else:
    values = data["values"]
advantages_reversed = [
    torch.zeros(bs, dtype=torch.float32, device=values.device)
]
lastgaelam = 0
nextvalues = values[:, max_seqlen - 1] * seq_no_eos_mask
for t in reversed(range(max_seqlen - 1)):
    delta = rewards[:, t] + self.discount * nextvalues - values[:, t]
    newgaelam = delta + self.discount * self.gae_lambda * lastgaelam
    # Skip tokens that do not contribute to the loss
    mask = loss_mask[:, t]
    nextvalues = nextvalues * (1 - mask) + values[:, t] * mask
    lastgaelam = lastgaelam * (1 - mask) + newgaelam * mask
    advantages_reversed.append(lastgaelam)
advantages = torch.stack(advantages_reversed[::-1], dim=1)
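
To make the behavior easier to inspect, here is a minimal sketch that mirrors the reward placement and GAE loop above for a single unpadded sequence, using plain Python lists instead of tensors. The helper names `place_terminal_reward` and `gae_single_sequence`, the toy inputs, and the choice of `loss_mask` are assumptions made for illustration; they are not part of the repository. How positions map to tokens (for example, whether log-probs and the loss mask are shifted by one) is not visible in the snippet, so this only shows which indices can receive a non-zero advantage.

```python
def place_terminal_reward(kl_rewards, seqlen, reward_score):
    """Mirror the reward placement in the quoted snippet for one sequence."""
    rewards = list(kl_rewards)
    rewards[seqlen - 1] = 0.0                    # last valid position is zeroed
    rewards[max(seqlen - 2, 0)] += reward_score  # score lands one position earlier
    return rewards


def gae_single_sequence(rewards, values, loss_mask,
                        discount=1.0, gae_lambda=1.0, bootstrap_value=0.0):
    """Mirror the quoted GAE loop for one sequence of length T.

    bootstrap_value stands in for values[:, max_seqlen - 1] * seq_no_eos_mask.
    """
    T = len(rewards)
    advantages = [0.0] * T  # index T - 1 is never written by the loop
    lastgaelam = 0.0
    nextvalues = bootstrap_value
    for t in reversed(range(T - 1)):
        delta = rewards[t] + discount * nextvalues - values[t]
        newgaelam = delta + discount * gae_lambda * lastgaelam
        mask = loss_mask[t]
        nextvalues = nextvalues * (1 - mask) + values[t] * mask
        lastgaelam = lastgaelam * (1 - mask) + newgaelam * mask
        advantages[t] = lastgaelam
    return advantages


# Toy case: 5 positions, no padding, zero KL reward, zero values, reward_score = 1.
rewards = place_terminal_reward([0.0] * 5, seqlen=5, reward_score=1.0)
advantages = gae_single_sequence(rewards, [0.0] * 5, loss_mask=[0, 0, 1, 1, 1])
print(rewards)     # -> [0.0, 0.0, 0.0, 1.0, 0.0]
print(advantages)  # -> [1.0, 1.0, 1.0, 1.0, 0.0]
```

In this sketch the last index always ends up with an advantage of 0, because the loop starts at `T - 2` and the reward is added at `seqlen - 2`. If the EOS token's loss is read from that last index, its advantage is zero regardless of `reward_score`, which matches the behavior described above.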

