
Conversation


@Luodian Luodian commented Feb 7, 2026

Summary

This PR fixes two correctness bugs in the training loop in training/train.py (lines 655-669). Both only affect runs with backward_passes_per_step > 1, i.e. Stage-2 video training, where it is set to 4.


Bug 1: lr_scheduler.step() called on every micro-step instead of every optimizer step

Before (broken)

        if is_accumulation_step:
            with backbone_ddp_compiled.no_sync():
                scaled_loss.backward()
        else:
            scaled_loss.backward()
            ...
            opt.step()
            opt.zero_grad()

        lr_scheduler.step()          # <-- OUTSIDE the else branch

lr_scheduler.step() is called unconditionally on every loop iteration (every micro-step), but opt.step() is only called on actual optimizer-step iterations. When backward_passes_per_step=4, the scheduler advances 4x faster than the optimizer.

Concrete impact

The PolynomialLRWarmup scheduler (see training/lr_scheduler.py) is initialized with:

lr_scheduler = PolynomialLRWarmup(opt, int(total_steps * warmup_ratio), total_steps, power=2)

where total_steps = num_sampled_data / batch_size / world_size. This total_steps counts micro-steps (loop iterations), not optimizer steps.

With backward_passes_per_step=4:

  • The loop runs for total_steps micro-step iterations.
  • The optimizer steps total_steps / 4 times.
  • The scheduler is stepped total_steps times, which happens to match the total_iters it was given, so the LR trajectory over the full run keeps roughly the intended shape (see the counting sketch below).
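A minimal counting sketch of that mismatch (pure Python; the numbers are illustrative, not taken from the repo):

    total_steps = 1000                # micro-steps, i.e. loop iterations (illustrative value)
    backward_passes_per_step = 4

    optimizer_steps = total_steps // backward_passes_per_step   # 250 opt.step() calls
    scheduler_calls_old = total_steps                            # old code: stepped every loop iteration
    scheduler_calls_fixed = optimizer_steps                      # fixed code: stepped only with opt.step()

    print(optimizer_steps, scheduler_calls_old, scheduler_calls_fixed)   # 250 1000 250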

However, the scheduler should still only step when the optimizer steps, because:

  1. The scheduler's last_epoch counter is meant to track optimizer updates, not micro-steps. Stepping it on non-update iterations means the LR changes between micro-batches within the same accumulation window: the 4 micro-batches that contribute to one optimizer step see 4 different learning rates (see the toy example below), which is semantically wrong.
  2. If total_steps is meant to count optimizer steps (as the variable name suggests), the 4x over-stepping causes the scheduler to exhaust its budget in 1/4 of the training run, after which LR stays at the terminal value.
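To make point 1 concrete, here is a toy warmup-plus-polynomial-decay schedule; it only mimics the shape described above and is not the repo's PolynomialLRWarmup:

    # Toy stand-in for PolynomialLRWarmup: linear warmup, then polynomial decay (power=2).
    def toy_lr(step, warmup_iters=100, total_iters=1000, base_lr=1e-4, power=2):
        if step < warmup_iters:
            return base_lr * step / warmup_iters
        remaining = 1 - (step - warmup_iters) / (total_iters - warmup_iters)
        return base_lr * remaining ** power

    # Old placement: the scheduler advances on every micro-step, so the 4 micro-batches
    # that feed a single optimizer update each see a different learning rate.
    print([toy_lr(s) for s in range(200, 204)])   # four distinct values

    # Fixed placement: the scheduler is frozen inside the accumulation window,
    # so all 4 micro-batches see the LR of the one optimizer step they belong to.
    print([toy_lr(50)] * 4)                       # one value, repeated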

Fix

Move lr_scheduler.step() inside the else branch, so it only fires when opt.step() fires.

Note: once the scheduler steps only with the optimizer, it is called total_steps / backward_passes_per_step times, so the total_iters passed to it should also be divided by backward_passes_per_step if the LR curve is meant to span the full run (the follow-up commit below does exactly that). Aligning scheduler stepping with optimizer stepping remains the simpler and more standard core fix.
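After (fixed) — a sketch of the intended control flow, with the same names and elision as the snippet above:

        if is_accumulation_step:
            with backbone_ddp_compiled.no_sync():
                scaled_loss.backward()
        else:
            scaled_loss.backward()
            ...
            opt.step()
            opt.zero_grad()
            lr_scheduler.step()      # <-- now fires only when opt.step() fires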


Bug 2: Off-by-one in accumulation step detection causes first optimizer step to use only 1 micro-batch

Before (broken)

is_accumulation_step = global_step % args.backward_passes_per_step != 0

When global_step=0 (first iteration) and backward_passes_per_step=4:

  • 0 % 4 == 0 → is_accumulation_step = False
  • The optimizer steps immediately on the very first micro-batch, before any accumulation happens.

This means the first optimizer update uses gradients from only 1 micro-batch instead of 4. All subsequent accumulation windows are also shifted by 1.

Concrete impact with backward_passes_per_step=4

| global_step | Old check: step % 4 != 0 | Optimizer steps? | Expected |
| --- | --- | --- | --- |
| 0 | False (0%4=0) | ✅ YES (only 1 micro-batch!) | ❌ should accumulate |
| 1 | True (1%4=1) | ❌ accumulate | ✅ accumulate |
| 2 | True (2%4=2) | ❌ accumulate | ✅ accumulate |
| 3 | True (3%4=3) | ❌ accumulate | should step here |
| 4 | False (4%4=0) | ✅ YES | ❌ accumulate |

Fix

is_accumulation_step = (global_step + 1) % args.backward_passes_per_step != 0

Now the optimizer steps at global_step = 3, 7, 11, ... — i.e. after every 4 micro-batches.
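A quick sanity check of the old vs. fixed condition (pure Python, backward_passes_per_step=4):

    bpps = 4
    old_steps = [s for s in range(12) if s % bpps == 0]          # iterations where the old code steps
    new_steps = [s for s in range(12) if (s + 1) % bpps == 0]    # iterations where the fixed code steps
    print(old_steps)   # [0, 4, 8]   -> steps on the very first micro-batch
    print(new_steps)   # [3, 7, 11]  -> steps after every 4 accumulated micro-batches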


Changes

- is_accumulation_step = global_step % args.backward_passes_per_step != 0
+ is_accumulation_step = (global_step + 1) % args.backward_passes_per_step != 0
  ...
          opt.step()
          opt.zero_grad()
+         lr_scheduler.step()

- lr_scheduler.step()

Risk Assessment

  • Low risk: Both changes are minimal (2 lines changed) and only affect the training loop control flow.
  • Checkpoint compatibility: Existing checkpoints store global_step and lr_scheduler state. Resuming from a checkpoint saved by the old code will shift the accumulation window by 1 step (negligible impact) and the scheduler will be correctly aligned going forward.
  • Stage-1 unaffected: Stage-1 scripts use backward_passes_per_step=1, where both bugs are no-ops (every global_step % 1 == 0, and stepping the scheduler on every iteration is then the same as stepping it on every optimizer step).

@Luodian Luodian force-pushed the fix/training-loop-grad-accum-and-lr-scheduler branch from b604da3 to 19eea97 on February 7, 2026 at 10:41
@Luodian Luodian changed the title from "fix: correct gradient accumulation off-by-one and lr_scheduler over-stepping in training loop" to "fix: training loop - grad accum off-by-one, lr_scheduler over-stepping, zero_grad optimization, checkpoint I/O reduction" on Feb 7, 2026
@Luodian Luodian force-pushed the fix/training-loop-grad-accum-and-lr-scheduler branch from 19eea97 to 52a1915 on February 7, 2026 at 11:16
@Luodian Luodian changed the title from "fix: training loop - grad accum off-by-one, lr_scheduler over-stepping, zero_grad optimization, checkpoint I/O reduction" back to "fix: correct gradient accumulation off-by-one and lr_scheduler over-stepping" on Feb 7, 2026
…accumulation

lr_scheduler total_iters was set to micro-step count (total_steps), but
after moving lr_scheduler.step() to only fire on optimizer steps, the
scheduler would only traverse 1/backward_passes_per_step of its budget.

Divide total_iters by backward_passes_per_step so the full LR curve
(warmup + polynomial decay) completes over the actual optimizer steps.
No-op when backward_passes_per_step=1 (Stage-1).
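A sketch of the scheduler construction this commit describes (names follow the PR description; treat it as illustrative rather than the exact repo code):

    # Budget the scheduler in optimizer steps, not micro-steps.
    optimizer_steps = total_steps // args.backward_passes_per_step
    lr_scheduler = PolynomialLRWarmup(
        opt,
        int(optimizer_steps * warmup_ratio),   # warmup_iters, counted in optimizer steps
        optimizer_steps,                       # total_iters, counted in optimizer steps
        power=2,
    )
    # No-op for Stage-1: with backward_passes_per_step == 1, optimizer_steps == total_steps.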