fix: correct gradient accumulation off-by-one and lr_scheduler over-stepping #82
## Summary

Two correctness bugs in the training loop at `training/train.py` (lines 655-669) that affect runs with `backward_passes_per_step > 1` (i.e. Stage-2 video training, where it is set to 4).

## Bug 1: `lr_scheduler.step()` called on every micro-step instead of every optimizer step

### Before (broken)
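A minimal runnable sketch of the pattern being described; the model, data, and scheduler are toy stand-ins, and the loop structure is assumed from the names this PR uses (`opt`, `lr_scheduler`, `global_step`, `is_accumulation_step`, `backward_passes_per_step`), not copied from `train.py`:

```python
import torch

# Toy stand-ins; train.py uses its own model, data pipeline, and PolynomialLRWarmup.
model = torch.nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
lr_scheduler = torch.optim.lr_scheduler.PolynomialLR(opt, total_iters=16)
backward_passes_per_step = 4

for global_step, batch in enumerate(torch.randn(16, 4)):
    loss = model(batch).sum() / backward_passes_per_step
    loss.backward()

    is_accumulation_step = global_step % backward_passes_per_step != 0  # (also Bug 2, see below)
    if is_accumulation_step:
        pass                  # keep accumulating gradients
    else:
        opt.step()            # actual optimizer update
        opt.zero_grad()

    lr_scheduler.step()       # Bug 1: fires on every micro-step, 4x faster than opt.step()
```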
`lr_scheduler.step()` is called unconditionally on every loop iteration (every micro-step), but `opt.step()` is only called on actual optimizer-step iterations. When `backward_passes_per_step=4`, the scheduler advances 4x faster than the optimizer.
### Concrete impact

The `PolynomialLRWarmup` scheduler (see `training/lr_scheduler.py`) is constructed with `total_steps` as its `total_iters` budget, where `total_steps = num_sampled_data / batch_size / world_size`. This `total_steps` counts micro-steps (loop iterations), not optimizer steps.
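For concreteness, a sketch with PyTorch's built-in `PolynomialLR` standing in for the repo's `PolynomialLRWarmup` (whose exact signature this PR doesn't show); the point is only that the `total_iters` budget is fed the micro-step count:

```python
import torch

model = torch.nn.Linear(4, 1)                                 # toy stand-in
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

num_sampled_data, batch_size, world_size = 40_000, 8, 4       # toy numbers
total_steps = num_sampled_data // batch_size // world_size    # 1250 micro-steps (loop iterations)

# Stand-in for PolynomialLRWarmup: its total_iters budget is the micro-step count,
# not the optimizer-step count.
lr_scheduler = torch.optim.lr_scheduler.PolynomialLR(opt, total_iters=total_steps)
```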
With `backward_passes_per_step=4`:

- the training loop runs `total_steps` micro-step iterations;
- the optimizer steps `total_steps / 4` times;
- the scheduler steps `total_steps` times, which matches the `total_iters` it was given, so the schedule shape is correct in terms of LR trajectory over the full training run.

However, the scheduler should still only step when the optimizer steps, because:
- The scheduler's `last_epoch` counter tracks optimizer updates, not micro-steps. Calling `step()` on non-update iterations means the LR changes between micro-batches within the same accumulation window: the 4 micro-batches that contribute to one optimizer step see 4 different learning rates. This is semantically wrong.
- If `total_steps` is meant to count optimizer steps (as the variable name suggests), the 4x over-stepping causes the scheduler to exhaust its budget in the first 1/4 of the training run, after which the LR stays at the terminal value.

### Fix
Move `lr_scheduler.step()` inside the `else` branch, so it only fires when `opt.step()` fires.

Note: if the intent is to have the schedule span the full micro-step count, the `total_steps` passed to the scheduler should instead be divided by `backward_passes_per_step`, so that its budget is expressed in optimizer steps. The simpler and more standard fix, though, is to align scheduler stepping with optimizer stepping, as sketched below.
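A sketch of the corrected ordering, using the same toy stand-ins as the sketch above (the accumulation check shown here already includes the Bug 2 fix described below):

```python
for global_step, batch in enumerate(torch.randn(16, 4)):
    loss = model(batch).sum() / backward_passes_per_step
    loss.backward()

    # Bug 2 fix (see below): the window closes after backward_passes_per_step micro-batches.
    is_accumulation_step = (global_step + 1) % backward_passes_per_step != 0
    if is_accumulation_step:
        pass                  # keep accumulating gradients
    else:
        opt.step()            # actual optimizer update
        opt.zero_grad()
        lr_scheduler.step()   # Bug 1 fix: the scheduler advances only when the optimizer does
```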
## Bug 2: Off-by-one in accumulation step detection causes first optimizer step to use only 1 micro-batch

### Before (broken)
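A sketch of the broken check, using the variable names this PR quotes (the exact line in `train.py` may differ):

```python
# Broken: at global_step = 0, 0 % 4 == 0, so the flag is False and the optimizer
# steps after the very first micro-batch.
is_accumulation_step = global_step % backward_passes_per_step != 0
```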
When `global_step=0` (first iteration) and `backward_passes_per_step=4`:

- `0 % 4 == 0` → `is_accumulation_step = False`

This means the first optimizer update uses gradients from only 1 micro-batch instead of 4. All subsequent accumulation windows are also shifted by 1.
### Concrete impact with `backward_passes_per_step=4`

| `global_step` | `step % 4 != 0` (`is_accumulation_step`) |
| --- | --- |
| 0 | `False` (0 % 4 = 0) |
| 1 | `True` (1 % 4 = 1) |
| 2 | `True` (2 % 4 = 2) |
| 3 | `True` (3 % 4 = 3) |
| 4 | `False` (4 % 4 = 0) |

### Fix
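A sketch of the corrected check, consistent with the step indices quoted below:

```python
# Fixed: the flag stays True for the first backward_passes_per_step - 1 micro-batches,
# so with backward_passes_per_step = 4 the optimizer first steps at global_step = 3.
is_accumulation_step = (global_step + 1) % backward_passes_per_step != 0
```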
Now the optimizer steps at `global_step = 3, 7, 11, ...`, i.e. after every 4 micro-batches.

## Changes

Both changes are confined to the training loop in `training/train.py` (lines 655-669): `lr_scheduler.step()` moves inside the branch that calls `opt.step()`, and the accumulation-step check is shifted by one so that the first window accumulates a full `backward_passes_per_step` micro-batches.
## Risk Assessment

- Checkpoints carry `global_step` and `lr_scheduler` state. Resuming from a checkpoint saved by the old code will shift the accumulation window by 1 step (negligible impact), and the scheduler will be correctly aligned going forward.
- Runs with `backward_passes_per_step=1` are unaffected: both bugs are no-ops there (`1 % 1 == 0` always, and the scheduler steps every iteration = every optimizer step).
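A standalone sanity check of the old and new conditions (plain Python, no training code required; the helper name is made up for this illustration):

```python
def accumulation_flags(num_steps: int, n: int, fixed: bool) -> list[bool]:
    """is_accumulation_step for each global_step under the old or the fixed check."""
    return [((s + 1) % n != 0) if fixed else (s % n != 0) for s in range(num_steps)]

# backward_passes_per_step=4: old check steps the optimizer at 0, 4, 8, ...; fixed at 3, 7, 11, ...
print([s for s, acc in enumerate(accumulation_flags(12, 4, fixed=False)) if not acc])  # [0, 4, 8]
print([s for s, acc in enumerate(accumulation_flags(12, 4, fixed=True)) if not acc])   # [3, 7, 11]

# backward_passes_per_step=1: both variants step the optimizer on every iteration (bugs are no-ops).
assert accumulation_flags(8, 1, fixed=False) == accumulation_flags(8, 1, fixed=True) == [False] * 8
```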