@lbliii lbliii commented Nov 19, 2025

This PR provides:

  • Additional site config needed for the Sphinx docs
  • Basic skeleton for starter sections (about, get started, references)
  • Initial content with directive examples for MyST list tables, dropdowns, tab sets, and admonitions
  • Index overview pages for each directory and subdirectory

Note: the content staged here can be used either as a starting point or simply as a reference to be deleted. I tried my best to make the content realistic, but ultimately these articles need SME input and direction. Future sections are yet to be determined.

To preview the docs:

```shell
cd docs
make docs-live
```

lbliii and others added 15 commits November 19, 2025 12:00
Signed-off-by: Lawrence Lane <llane@nvidia.com>
…language from training paradigms

- Remove specific node count limits (16 nodes) lacking code evidence
- Change 'Unlimited nodes' to 'Large multi-node clusters' for accuracy
- Replace 'webdataset format' with 'Energon data loader' (verified in code)
- Remove subjective time estimates (minutes/hours) from setup complexity
- Improve precision of scalability descriptions throughout

Signed-off-by: Lawrence Lane <llane@nvidia.com>
- Add complete fsdp config examples showing all 4 parallelism dimensions
- Replace specific bandwidth numbers with general high-bandwidth requirement
- Clarify pipeline bubble efficiency without unverified percentages
- Remove unverified 2× memory claim for optimizer state sharding
- Add runtime verification examples for checking parallelism config
- Add note about automatic DP calculation in automodel
- Improve DP calculation example with concrete numbers
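The bullets above refer to data-parallel (DP) size being derived from the world size and the other parallelism dimensions. A minimal sketch of that relationship, assuming the standard convention that world size = DP × TP × PP × CP (names here are illustrative, not this repo's actual API):

```python
# Hypothetical sketch of the DP calculation described above: data-parallel size
# is whatever remains of the world size after the other parallelism dimensions
# (tensor, pipeline, context) are accounted for.
def infer_dp_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel

# Concrete numbers: 64 GPUs with tp=4, pp=2, cp=1 leaves dp=8.
print(infer_dp_size(64, tp=4, pp=2, cp=1))
```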

Signed-off-by: Lawrence Lane <llane@nvidia.com>
… structure, progressive disclosure, and clearer examples

Signed-off-by: Lawrence Lane <llane@nvidia.com>
…bleshooting detail

- Change content_type from tutorial to how-to (correct classification)
- Improve progressive disclosure with clearer step labels
- Add verified configuration parameters from source code
- Enhance troubleshooting with specific symptoms and actionable solutions
- Add checkpoint structure details and contents
- Improve configuration override explanation with three-layer precedence
- Add missing checkpoint configuration options
- Fix list spacing for markdown lint compliance
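The "three-layer precedence" mentioned above can be sketched as a simple merge in which later layers win. This is an illustrative sketch of the general pattern, not the repo's actual config loader:

```python
# Illustrative three-layer config precedence: package defaults are overridden
# by the user's YAML file, which is in turn overridden by CLI flags.
def merge_config(defaults: dict, yaml_cfg: dict, cli_overrides: dict) -> dict:
    merged = dict(defaults)
    merged.update(yaml_cfg)       # file values beat defaults
    merged.update(cli_overrides)  # CLI values beat everything
    return merged

cfg = merge_config(
    {"lr": 1e-4, "max_steps": 1000},   # layer 1: defaults
    {"lr": 3e-4},                      # layer 2: YAML file
    {"max_steps": 50},                 # layer 3: CLI
)
print(cfg)  # → {'lr': 0.0003, 'max_steps': 50}
```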

Signed-off-by: Lawrence Lane <llane@nvidia.com>
…l vs megatron); add automodel track (training + inference); add megatron track (data prep + training + inference); update index to route users by use case

Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* init

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add sigma_min/max

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add sigma_min/max

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rename finetune.py to train.py

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add from_config

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* pass scheduler and model

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update param

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* introduce NeMoWanPipeline

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add mode

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update build_model_and_optimizer

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update NeMoWanPipeline

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rename

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move examples

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* move

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix imports

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* lint

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* more lint

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix import

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix 3rdparty & pyproject

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add torch

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update uv.lock

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* revert 3rdparty

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update uv.lock

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* update uv.lock

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* add tests

Signed-off-by: linnan wang <wangnan318@gmail.com>

* update test

Signed-off-by: linnan wang <wangnan318@gmail.com>

* update

Signed-off-by: linnan wang <wangnan318@gmail.com>

* update

Signed-off-by: linnan wang <wangnan318@gmail.com>

---------

Signed-off-by: linnan wang <wangnan318@gmail.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* adding tests

* ruff lint

* ruff lint

* ruff lint

* Explicit mcore path override to use Megatron-Bridge's pinned submodule commit

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Update Megatron-Bridge submodule to latest main with correct Megatron-LM commit (3cbe5c68)

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Add Mcore WAN pretrain mock test to CI/CD

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Fix slow Docker build from Megatron-LM source

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* ci: Update gpu runners to use self-hosted-nemo (#48)

* ci: Update gpu runners to use self-hosted-nemo

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Use uv run in test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Ensure uv group megatron-bridge is used for test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain

* Update TRANSFORMERS_OFFLINE environment variable to 0 and increase timeout in test_mcore_wan_pretrain

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Revert GHA changes

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Move uv run group call to L2_Mcore_Mock_Tests_GPU

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Set test back to 5 minute timeout

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

* Megatron fixes (#49)

* Enhance DiT and Wan layer specifications

- Updated `get_query_key_value_tensors` method in `dit_attention.py` to include an `output_gate` parameter and set `split_qkv` to default to `True`.
- Modified `WanLayerWithAdaLN` class in `wan_layer_spec.py` to add `rotary_pos_cos_sin` parameter for improved positional encoding handling.

* Implement ProcessGroupCollection initialization in DiT and Wan models

- Added initialization of `pg_collection` in both `DiTCrossAttentionModel` and `WanModel` to ensure proper handling of process groups.
- This change checks if `pg_collection` exists and is not None before assigning it, enhancing the robustness of the models.

* Update CONTRIBUTING.md to include detailed setup instructions for development environment and Docker container usage. Added sections for building and running the container, as well as setting the PYTHONPATH for DFM.

* Refactor import statements in dit_model.py to streamline dependencies. Removed redundant import of ProcessGroupCollection, enhancing code clarity and maintainability.

* Refactor code style in DiT and Wan models

- Updated string quotes in `dit_model.py` and `wan_model.py` for consistency, changing from single to double quotes.
- Reformatted the `get_query_key_value_tensors` method call in `dit_attention.py` for improved readability by breaking it into multiple lines.

* Revert M4 changes

* Ruff

* Ruff

* Lint

---------

Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>

* Revert "Revert GHA changes"

This reverts commit d7ad1ab.

* tempfortest: timeout setting

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* workflow dispatch

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* update

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* add logging

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Update test configuration for Mcore WAN pretraining

- Increased the number of processes per node from 1 to 2 for distributed training.
- Set the number of training iterations to 10 to enhance the training process.

* More changes

* Lint

---------

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* Reapply "Revert GHA changes"

This reverts commit fdb911f.

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* update path per request

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* update CONTRIBUTING.md

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* lintfix

Signed-off-by: Pablo Garay <pagaray@nvidia.com>

* adding uv run --group megatron-bridge

* update test

* ruff lint

* restore Dockerfile.ci

* update .github/workflows/cicd-main.yml

---------

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Huy Vu2 <huvu@login-eos02.eos.clusters.nvidia.com>
Co-authored-by: Pablo Garay <pagaray@nvidia.com>
Co-authored-by: Charlie Truong <chtruong@nvidia.com>
Co-authored-by: Abhinav Garg <abhinavg@stanford.edu>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* introduce step_scheduler section

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* add step_scheduler section

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* lint

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* rm dead code

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
* replace torch.stack with torch.cat

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

* fix

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

---------

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
@lbliii lbliii force-pushed the llane/site-config-and-skeleton branch from 0ffe23d to f2c7c94 on November 19, 2025 at 17:00
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>
Signed-off-by: Lawrence Lane <llane@nvidia.com>