Description
Checklist
- The error occurs when using our provided Docker image.
- I can consistently reproduce the bug across multiple trials or random seeds.
- If the error causes experiment abortion, I've verified that this error is the root
cause, not a secondary error caused by peer workers.
Detailed Information
Describe the bug
In the current implementation, advantages are only computed for non-EOS tokens. However, the EOS token is also generated by the LLM, and frameworks such as verl and OpenRLHF compute an advantage for the EOS token as well. Lines 207-208 below (`rewards[batch_indices, seqlens - 1] = 0` and `indices = torch.clip(seqlens - 2, min=0)`) are the main cause: the KL reward at the EOS index is zeroed out and the reward score is credited one token earlier, so in the critic-free path (zero values) the EOS token always receives a zero advantage.
AReaL/areal/engine/ppo/actor.py, lines 200 to 236 at commit d5093d7:
```python
# Compute KL-regularized rewards.
attn_mask = data["attention_mask"]
seqlens = attn_mask.sum(-1).long()
seq_no_eos_mask = seqlens == attn_mask.shape[1]
rewards = -self.kl_ctl * self.kl_estimator(old_logp, ref_logp)
kl_rewards = rewards.clone()
# KL rewards at the next token after eos is zero.
rewards[batch_indices, seqlens - 1] = 0
indices = torch.clip(seqlens - 2, min=0)
if self.mask_no_eos_with_zero:
    rewards[batch_indices, indices] += torch.where(
        seq_no_eos_mask, 0, reward_score
    )
else:
    rewards[batch_indices, indices] += reward_score
# Compute GAE.
if "values" not in data:
    values = torch.zeros_like(rewards)
else:
    values = data["values"]
advantages_reversed = [
    torch.zeros(bs, dtype=torch.float32, device=values.device)
]
lastgaelam = 0
nextvalues = values[:, max_seqlen - 1] * seq_no_eos_mask
for t in reversed(range(max_seqlen - 1)):
    delta = rewards[:, t] + self.discount * nextvalues - values[:, t]
    newgaelam = delta + self.discount * self.gae_lambda * lastgaelam
    # Skip tokens that do not contribute to the loss
    mask = loss_mask[:, t]
    nextvalues = nextvalues * (1 - mask) + values[:, t] * mask
    lastgaelam = lastgaelam * (1 - mask) + newgaelam * mask
    advantages_reversed.append(lastgaelam)
advantages = torch.stack(advantages_reversed[::-1], dim=1)
```
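To make the effect concrete, here is a standalone trace of the recursion above (toy numbers, not AReaL code): one sequence of five valid tokens ending in EOS, padded to length six, with zero values (the critic-free path) and `discount = gae_lambda = 1`. The EOS position comes out with zero advantage.

```python
import torch

# Toy replay of the excerpt's GAE loop: 5 valid tokens (EOS at index 4),
# padded to length 6, values == 0, discount == gae_lambda == 1.
bs, max_seqlen = 1, 6
seqlens = torch.tensor([5])              # EOS is the last valid token, index 4
batch_indices = torch.arange(bs)
seq_no_eos_mask = seqlens == max_seqlen  # False: this sequence does end with EOS
loss_mask = torch.tensor([[1., 1., 1., 1., 1., 0.]])
values = torch.zeros(bs, max_seqlen)
rewards = torch.zeros(bs, max_seqlen)    # KL term omitted for clarity
reward_score = torch.tensor([1.0])

rewards[batch_indices, seqlens - 1] = 0          # line 207: zero at the EOS index
indices = torch.clip(seqlens - 2, min=0)         # line 208
rewards[batch_indices, indices] += reward_score  # reward credited one token early

advantages_reversed = [torch.zeros(bs)]
lastgaelam = torch.zeros(bs)
nextvalues = values[:, max_seqlen - 1] * seq_no_eos_mask
for t in reversed(range(max_seqlen - 1)):
    delta = rewards[:, t] + nextvalues - values[:, t]
    newgaelam = delta + lastgaelam
    mask = loss_mask[:, t]
    nextvalues = nextvalues * (1 - mask) + values[:, t] * mask
    lastgaelam = lastgaelam * (1 - mask) + newgaelam * mask
    advantages_reversed.append(lastgaelam)
advantages = torch.stack(advantages_reversed[::-1], dim=1)
print(advantages)  # tensor([[1., 1., 1., 1., 0., 0.]]) -- zero advantage at EOS
```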
Expected behavior
The EOS token should also receive an advantage: the sequence-level reward should be credited at the EOS index (`seqlens - 1`) so that the GAE recursion propagates a nonzero advantage to the EOS token, consistent with verl and OpenRLHF.
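A minimal sketch of the crediting this would imply, with argument names mirroring the excerpt above (illustrative only, not a tested patch; whether the KL reward at the EOS index should still be zeroed beforehand is a separate question):

```python
import torch

# Hypothetical sketch: credit the sequence-level reward at the EOS index itself
# so GAE propagates a nonzero advantage to the EOS token.
def credit_reward_at_eos(
    rewards: torch.Tensor,          # (bs, max_seqlen) per-token KL rewards
    reward_score: torch.Tensor,     # (bs,) sequence-level reward
    seqlens: torch.Tensor,          # (bs,) valid lengths; EOS at seqlens - 1
    seq_no_eos_mask: torch.Tensor,  # (bs,) True where generation was truncated
    mask_no_eos_with_zero: bool,
) -> torch.Tensor:
    batch_indices = torch.arange(rewards.shape[0], device=rewards.device)
    eos_indices = seqlens - 1  # last generated token (EOS when present)
    if mask_no_eos_with_zero:
        rewards[batch_indices, eos_indices] += torch.where(
            seq_no_eos_mask, torch.zeros_like(reward_score), reward_score
        )
    else:
        rewards[batch_indices, eos_indices] += reward_score
    return rewards
```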
Full logs
If possible, provide logs for more detailed information.
To Reproduce
Commit ID
d5093d7 (the commit referenced by the permalink above).
Environment
Please provide your software and hardware information if you're not using a
containerized environment.
Script
The bash script or YAML configuration to run: