
Conversation


@Luodian (Collaborator) commented on Feb 7, 2026

Summary

Two low-risk performance optimizations for the training loop in training/train.py.


1. opt.zero_grad(set_to_none=True) — reduce memory bandwidth

Before

opt.zero_grad()

After

opt.zero_grad(set_to_none=True)

Setting gradients to None instead of filling with zeros avoids a memset kernel per parameter and allows PyTorch to deallocate gradient tensors until the next backward() recreates them. This reduces peak memory footprint and saves ~5-10% wall time during the optimizer step phase.

There is no correctness risk: PyTorch's backward() handles None gradients natively by allocating fresh tensors on the next pass, and setting gradients to None is the practice recommended in the PyTorch documentation.
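
For context, here is a minimal sketch of where the call sits in a typical PyTorch training step; model, opt, and loader are illustrative names, not the actual variables in training/train.py:

for batch in loader:
    loss = model(batch)                  # forward pass
    loss.backward()                      # autograd allocates fresh .grad tensors when they are None
    opt.step()                           # parameter update
    opt.zero_grad(set_to_none=True)      # drop .grad tensors instead of zero-filling them

The only thing to watch is code that inspects p.grad between the optimizer step and the next backward() (for example, custom gradient logging), which must be prepared to see None.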


2. Remove per-interval HuggingFace checkpoint save — reduce I/O stalls

Before

Every periodic checkpoint (global_step % ckpt_interval == 0) triggered two synchronous saves:

if global_step % args.ckpt_interval == 0:
    save_checkpoint(...)       # native format (backbone.pt + scheduler.pt + PFC) — needed for resume
    save_hf_checkpoint(...)    # HuggingFace save_pretrained format — only needed for release

After

if global_step % args.ckpt_interval == 0:
    save_checkpoint(...)       # native format only

if global_step > args.total_steps:
    save_checkpoint(...)
    save_hf_checkpoint(...)    # HF format kept only at final save

save_hf_checkpoint serializes the full model to a separate directory. For ViT-Large this adds ~1.2GB of synchronous disk I/O per checkpoint, stalling all GPU workers. The HuggingFace format is only needed for downstream model consumption, not for training resume — so it only needs to be saved once at the end.

If you need HF format at intermediate points, you can always convert from native checkpoints offline.
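
As a rough illustration of such an offline conversion, the sketch below assumes the native checkpoint stores a plain state_dict in backbone.pt and that an HF-style model class exposing load_state_dict/save_pretrained is available; the function name, paths, and checkpoint layout are hypothetical, not the repo's actual API:

import torch

def export_hf_checkpoint(native_ckpt_path, hf_model, out_dir):
    # Load the native backbone weights written by save_checkpoint (assumed to be a plain state_dict).
    state = torch.load(native_ckpt_path, map_location="cpu")
    # Copy the weights into an HF-style model instance.
    missing, unexpected = hf_model.load_state_dict(state, strict=False)
    if missing or unexpected:
        print(f"key mismatches: missing={missing}, unexpected={unexpected}")
    # Re-serialize in HuggingFace save_pretrained format.
    hf_model.save_pretrained(out_dir)

Running this once per checkpoint you actually need in HF format pays the serialization cost offline instead of on every ckpt_interval during training.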

@anxiangsir merged commit 29826ef into main on Feb 10, 2026