
Conversation


@Luodian Luodian commented Feb 7, 2026

Summary

This PR fixes two correctness bugs in the training loop in training/train.py (lines 655-669). Both only affect runs with backward_passes_per_step > 1, i.e. Stage-2 video training, where it is set to 4.


Bug 1: lr_scheduler.step() called on every micro-step instead of every optimizer step

Before (broken)

        if is_accumulation_step:
            with backbone_ddp_compiled.no_sync():
                scaled_loss.backward()
        else:
            scaled_loss.backward()
            ...
            opt.step()
            opt.zero_grad()

        lr_scheduler.step()          # <-- OUTSIDE the else branch

lr_scheduler.step() is called unconditionally on every loop iteration (every micro-step), but opt.step() is only called on actual optimizer-step iterations. When backward_passes_per_step=4, the scheduler advances 4x faster than the optimizer.

Concrete impact

The PolynomialLRWarmup scheduler (see training/lr_scheduler.py) is initialized with:

lr_scheduler = PolynomialLRWarmup(opt, int(total_steps * warmup_ratio), total_steps, power=2)

where total_steps = num_sampled_data / batch_size / world_size. This total_steps counts micro-steps (loop iterations), not optimizer steps.

With backward_passes_per_step=4:

  • The loop runs for total_steps micro-step iterations.
  • The optimizer steps total_steps / 4 times.
  • The scheduler is stepped total_steps times, which happens to match the total_iters it was given, so the LR trajectory over the full run keeps roughly the intended shape (see the counting sketch below).
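A minimal counting sketch of that mismatch (pure Python; the numbers are illustrative, not taken from the repo):

    total_steps = 1000                # micro-steps, i.e. loop iterations (illustrative value)
    backward_passes_per_step = 4

    optimizer_steps = total_steps // backward_passes_per_step   # 250 opt.step() calls
    scheduler_calls_old = total_steps                            # old code: stepped every loop iteration
    scheduler_calls_fixed = optimizer_steps                      # fixed code: stepped only with opt.step()

    print(optimizer_steps, scheduler_calls_old, scheduler_calls_fixed)   # 250 1000 250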

However, the scheduler should still only step when the optimizer steps, because:

  1. The scheduler's last_epoch counter is meant to track optimizer updates, not micro-steps. Stepping it on non-update iterations means the LR changes between micro-batches within the same accumulation window: the 4 micro-batches that contribute to one optimizer step see 4 different learning rates (see the toy example below), which is semantically wrong.
  2. If total_steps is meant to count optimizer steps (as the variable name suggests), the 4x over-stepping causes the scheduler to exhaust its budget in 1/4 of the training run, after which LR stays at the terminal value.
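To make point 1 concrete, here is a toy warmup-plus-polynomial-decay schedule; it only mimics the shape described above and is not the repo's PolynomialLRWarmup:

    # Toy stand-in for PolynomialLRWarmup: linear warmup, then polynomial decay (power=2).
    def toy_lr(step, warmup_iters=100, total_iters=1000, base_lr=1e-4, power=2):
        if step < warmup_iters:
            return base_lr * step / warmup_iters
        remaining = 1 - (step - warmup_iters) / (total_iters - warmup_iters)
        return base_lr * remaining ** power

    # Old placement: the scheduler advances on every micro-step, so the 4 micro-batches
    # that feed a single optimizer update each see a different learning rate.
    print([toy_lr(s) for s in range(200, 204)])   # four distinct values

    # Fixed placement: the scheduler is frozen inside the accumulation window,
    # so all 4 micro-batches see the LR of the one optimizer step they belong to.
    print([toy_lr(50)] * 4)                       # one value, repeated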

Fix

Move lr_scheduler.step() inside the else branch, so it only fires when opt.step() fires.

Note: once the scheduler steps only with the optimizer, it is called total_steps / backward_passes_per_step times, so the total_iters passed to it should also be divided by backward_passes_per_step if the LR curve is meant to span the full run (the follow-up commit below does exactly that). Aligning scheduler stepping with optimizer stepping remains the simpler and more standard core fix.
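After (fixed) — a sketch of the intended control flow, with the same names and elision as the snippet above:

        if is_accumulation_step:
            with backbone_ddp_compiled.no_sync():
                scaled_loss.backward()
        else:
            scaled_loss.backward()
            ...
            opt.step()
            opt.zero_grad()
            lr_scheduler.step()      # <-- now fires only when opt.step() fires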


Bug 2: Off-by-one in accumulation step detection causes first optimizer step to use only 1 micro-batch

Before (broken)

is_accumulation_step = global_step % args.backward_passes_per_step != 0

When global_step=0 (first iteration) and backward_passes_per_step=4:

  • 0 % 4 == 0 → is_accumulation_step = False
  • The optimizer steps immediately on the very first micro-batch, before any accumulation happens.

This means the first optimizer update uses gradients from only 1 micro-batch instead of 4. All subsequent accumulation windows are also shifted by 1.

Concrete impact with backward_passes_per_step=4

| global_step | Old check: step % 4 != 0 | Optimizer steps? | Expected |
| --- | --- | --- | --- |
| 0 | False (0%4=0) | ✅ YES (only 1 micro-batch!) | ❌ should accumulate |
| 1 | True (1%4=1) | ❌ accumulate | ✅ accumulate |
| 2 | True (2%4=2) | ❌ accumulate | ✅ accumulate |
| 3 | True (3%4=3) | ❌ accumulate | should step here |
| 4 | False (4%4=0) | ✅ YES | ❌ accumulate |

Fix

is_accumulation_step = (global_step + 1) % args.backward_passes_per_step != 0

Now the optimizer steps at global_step = 3, 7, 11, ... — i.e. after every 4 micro-batches.
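A quick sanity check of the old vs. fixed condition (pure Python, backward_passes_per_step=4):

    bpps = 4
    old_steps = [s for s in range(12) if s % bpps == 0]          # iterations where the old code steps
    new_steps = [s for s in range(12) if (s + 1) % bpps == 0]    # iterations where the fixed code steps
    print(old_steps)   # [0, 4, 8]   -> steps on the very first micro-batch
    print(new_steps)   # [3, 7, 11]  -> steps after every 4 accumulated micro-batches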


Changes

- is_accumulation_step = global_step % args.backward_passes_per_step != 0
+ is_accumulation_step = (global_step + 1) % args.backward_passes_per_step != 0
  ...
          opt.step()
          opt.zero_grad()
+         lr_scheduler.step()

- lr_scheduler.step()

Risk Assessment

  • Low risk: Both changes are minimal (2 lines changed) and only affect the training loop control flow.
  • Checkpoint compatibility: Existing checkpoints store global_step and lr_scheduler state. Resuming from a checkpoint saved by the old code will shift the accumulation window by 1 step (negligible impact) and the scheduler will be correctly aligned going forward.
  • Stage-1 unaffected: Stage-1 scripts use backward_passes_per_step=1, where both bugs are no-ops (every global_step % 1 == 0, and stepping the scheduler on every iteration is then the same as stepping it on every optimizer step).

@Luodian Luodian force-pushed the fix/training-loop-grad-accum-and-lr-scheduler branch from b604da3 to 19eea97 on February 7, 2026 at 10:41
@Luodian Luodian changed the title from "fix: correct gradient accumulation off-by-one and lr_scheduler over-stepping in training loop" to "fix: training loop - grad accum off-by-one, lr_scheduler over-stepping, zero_grad optimization, checkpoint I/O reduction" on Feb 7, 2026
@Luodian Luodian force-pushed the fix/training-loop-grad-accum-and-lr-scheduler branch from 19eea97 to 52a1915 on February 7, 2026 at 11:16
@Luodian Luodian changed the title from "fix: training loop - grad accum off-by-one, lr_scheduler over-stepping, zero_grad optimization, checkpoint I/O reduction" back to "fix: correct gradient accumulation off-by-one and lr_scheduler over-stepping" on Feb 7, 2026
…accumulation

lr_scheduler total_iters was set to micro-step count (total_steps), but
after moving lr_scheduler.step() to only fire on optimizer steps, the
scheduler would only traverse 1/backward_passes_per_step of its budget.

Divide total_iters by backward_passes_per_step so the full LR curve
(warmup + polynomial decay) completes over the actual optimizer steps.
No-op when backward_passes_per_step=1 (Stage-1).
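A sketch of the scheduler construction this commit describes (names follow the PR description; treat it as illustrative rather than the exact repo code):

    # Budget the scheduler in optimizer steps, not micro-steps.
    optimizer_steps = total_steps // args.backward_passes_per_step
    lr_scheduler = PolynomialLRWarmup(
        opt,
        int(optimizer_steps * warmup_ratio),   # warmup_iters, counted in optimizer steps
        optimizer_steps,                       # total_iters, counted in optimizer steps
        power=2,
    )
    # No-op for Stage-1: with backward_passes_per_step == 1, optimizer_steps == total_steps.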