Description
Checklist
- The error occurs when using our provided Docker image.
- I can consistently reproduce the bug across multiple trials or random seeds.
- If the error causes experiment abortion, I've verified that this error is the root
cause, not a secondary error caused by peer workers.
Detailed Information
Describe the bug
In the current implementation, advantages are only computed for non-EOS tokens. However, the EOS token is also generated by the LLM, and frameworks such as verl and OpenRLHF compute an advantage for the EOS token as well. Lines 207-208 below (`rewards[batch_indices, seqlens - 1] = 0` and `indices = torch.clip(seqlens - 2, min=0)`) are the main cause: the KL reward at the EOS index is zeroed out and the reward score is credited one token earlier, so in the critic-free path (zero values) the EOS token always receives a zero advantage.
AReaL/areal/engine/ppo/actor.py, lines 200 to 236 at commit d5093d7:
```python
# Compute KL-regularized rewards.
attn_mask = data["attention_mask"]
seqlens = attn_mask.sum(-1).long()
seq_no_eos_mask = seqlens == attn_mask.shape[1]
rewards = -self.kl_ctl * self.kl_estimator(old_logp, ref_logp)
kl_rewards = rewards.clone()
# KL rewards at the next token after eos is zero.
rewards[batch_indices, seqlens - 1] = 0
indices = torch.clip(seqlens - 2, min=0)
if self.mask_no_eos_with_zero:
    rewards[batch_indices, indices] += torch.where(
        seq_no_eos_mask, 0, reward_score
    )
else:
    rewards[batch_indices, indices] += reward_score
# Compute GAE.
if "values" not in data:
    values = torch.zeros_like(rewards)
else:
    values = data["values"]
advantages_reversed = [
    torch.zeros(bs, dtype=torch.float32, device=values.device)
]
lastgaelam = 0
nextvalues = values[:, max_seqlen - 1] * seq_no_eos_mask
for t in reversed(range(max_seqlen - 1)):
    delta = rewards[:, t] + self.discount * nextvalues - values[:, t]
    newgaelam = delta + self.discount * self.gae_lambda * lastgaelam
    # Skip tokens that do not contribute to the loss
    mask = loss_mask[:, t]
    nextvalues = nextvalues * (1 - mask) + values[:, t] * mask
    lastgaelam = lastgaelam * (1 - mask) + newgaelam * mask
    advantages_reversed.append(lastgaelam)
advantages = torch.stack(advantages_reversed[::-1], dim=1)
```
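To make the effect concrete, here is a standalone trace of the recursion above (toy numbers, not AReaL code): one sequence of five valid tokens ending in EOS, padded to length six, with zero values (the critic-free path) and `discount = gae_lambda = 1`. The EOS position comes out with zero advantage.

```python
import torch

# Toy replay of the excerpt's GAE loop: 5 valid tokens (EOS at index 4),
# padded to length 6, values == 0, discount == gae_lambda == 1.
bs, max_seqlen = 1, 6
seqlens = torch.tensor([5])              # EOS is the last valid token, index 4
batch_indices = torch.arange(bs)
seq_no_eos_mask = seqlens == max_seqlen  # False: this sequence does end with EOS
loss_mask = torch.tensor([[1., 1., 1., 1., 1., 0.]])
values = torch.zeros(bs, max_seqlen)
rewards = torch.zeros(bs, max_seqlen)    # KL term omitted for clarity
reward_score = torch.tensor([1.0])

rewards[batch_indices, seqlens - 1] = 0          # line 207: zero at the EOS index
indices = torch.clip(seqlens - 2, min=0)         # line 208
rewards[batch_indices, indices] += reward_score  # reward credited one token early

advantages_reversed = [torch.zeros(bs)]
lastgaelam = torch.zeros(bs)
nextvalues = values[:, max_seqlen - 1] * seq_no_eos_mask
for t in reversed(range(max_seqlen - 1)):
    delta = rewards[:, t] + nextvalues - values[:, t]
    newgaelam = delta + lastgaelam
    mask = loss_mask[:, t]
    nextvalues = nextvalues * (1 - mask) + values[:, t] * mask
    lastgaelam = lastgaelam * (1 - mask) + newgaelam * mask
    advantages_reversed.append(lastgaelam)
advantages = torch.stack(advantages_reversed[::-1], dim=1)
print(advantages)  # tensor([[1., 1., 1., 1., 0., 0.]]) -- zero advantage at EOS
```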
Expected behavior
The EOS token should also receive an advantage: the sequence-level reward should be credited at the EOS index (`seqlens - 1`) so that the GAE recursion propagates a nonzero advantage to the EOS token, consistent with verl and OpenRLHF.
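A minimal sketch of the crediting this would imply, with argument names mirroring the excerpt above (illustrative only, not a tested patch; whether the KL reward at the EOS index should still be zeroed beforehand is a separate question):

```python
import torch

# Hypothetical sketch: credit the sequence-level reward at the EOS index itself
# so GAE propagates a nonzero advantage to the EOS token.
def credit_reward_at_eos(
    rewards: torch.Tensor,          # (bs, max_seqlen) per-token KL rewards
    reward_score: torch.Tensor,     # (bs,) sequence-level reward
    seqlens: torch.Tensor,          # (bs,) valid lengths; EOS at seqlens - 1
    seq_no_eos_mask: torch.Tensor,  # (bs,) True where generation was truncated
    mask_no_eos_with_zero: bool,
) -> torch.Tensor:
    batch_indices = torch.arange(rewards.shape[0], device=rewards.device)
    eos_indices = seqlens - 1  # last generated token (EOS when present)
    if mask_no_eos_with_zero:
        rewards[batch_indices, eos_indices] += torch.where(
            seq_no_eos_mask, torch.zeros_like(reward_score), reward_score
        )
    else:
        rewards[batch_indices, eos_indices] += reward_score
    return rewards
```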
Full logs
If possible, provide logs for more detailed information.
To Reproduce
Commit ID
d5093d7 (the commit referenced by the permalink above).
Environment
Please provide your software and hardware information if you're not using a
containerized environment.
Script
The bash script or YAML configuration to run: