
Conversation


@Luodian (Collaborator) commented on Feb 7, 2026

Summary

Two low-risk performance optimizations for the training loop in training/train.py.


1. opt.zero_grad(set_to_none=True) — reduce memory bandwidth

Before

opt.zero_grad()

After

opt.zero_grad(set_to_none=True)

Setting gradients to None instead of filling with zeros avoids a memset kernel per parameter and allows PyTorch to deallocate gradient tensors until the next backward() recreates them. This reduces peak memory footprint and saves ~5-10% wall time during the optimizer step phase.

There is no correctness risk: PyTorch's backward() handles None gradients natively by allocating fresh tensors on the next pass, and setting gradients to None is the practice recommended in the PyTorch documentation.
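
For context, here is a minimal sketch of where the call sits in a typical PyTorch training step; model, opt, and loader are illustrative names, not the actual variables in training/train.py:

for batch in loader:
    loss = model(batch)                  # forward pass
    loss.backward()                      # autograd allocates fresh .grad tensors when they are None
    opt.step()                           # parameter update
    opt.zero_grad(set_to_none=True)      # drop .grad tensors instead of zero-filling them

The only thing to watch is code that inspects p.grad between the optimizer step and the next backward() (for example, custom gradient logging), which must be prepared to see None.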


2. Remove per-interval HuggingFace checkpoint save — reduce I/O stalls

Before

Every periodic checkpoint (global_step % ckpt_interval == 0) triggered two synchronous saves:

if global_step % args.ckpt_interval == 0:
    save_checkpoint(...)       # native format (backbone.pt + scheduler.pt + PFC) — needed for resume
    save_hf_checkpoint(...)    # HuggingFace save_pretrained format — only needed for release

After

if global_step % args.ckpt_interval == 0:
    save_checkpoint(...)       # native format only

if global_step > args.total_steps:
    save_checkpoint(...)
    save_hf_checkpoint(...)    # HF format kept only at final save

save_hf_checkpoint serializes the full model to a separate directory. For ViT-Large this adds ~1.2GB of synchronous disk I/O per checkpoint, stalling all GPU workers. The HuggingFace format is only needed for downstream model consumption, not for training resume — so it only needs to be saved once at the end.

If you need HF format at intermediate points, you can always convert from native checkpoints offline.
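
As a rough illustration of such an offline conversion, the sketch below assumes the native checkpoint stores a plain state_dict in backbone.pt and that an HF-style model class exposing load_state_dict/save_pretrained is available; the function name, paths, and checkpoint layout are hypothetical, not the repo's actual API:

import torch

def export_hf_checkpoint(native_ckpt_path, hf_model, out_dir):
    # Load the native backbone weights written by save_checkpoint (assumed to be a plain state_dict).
    state = torch.load(native_ckpt_path, map_location="cpu")
    # Copy the weights into an HF-style model instance.
    missing, unexpected = hf_model.load_state_dict(state, strict=False)
    if missing or unexpected:
        print(f"key mismatches: missing={missing}, unexpected={unexpected}")
    # Re-serialize in HuggingFace save_pretrained format.
    hf_model.save_pretrained(out_dir)

Running this once per checkpoint you actually need in HF format pays the serialization cost offline instead of on every ckpt_interval during training.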

@anxiangsir merged commit 29826ef into main on Feb 10, 2026