
Conversation

@pablo-garay
Collaborator

@pablo-garay pablo-garay commented Nov 7, 2025

feat: add megatron-bridge (as dependency)

Passing tests: https://github.com/NVIDIA-NeMo/DFM/actions/runs/19188264333/job/54859363462?pr=32

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
@pablo-garay pablo-garay requested a review from a team as a code owner November 7, 2025 08:35
@copy-pr-bot

copy-pr-bot bot commented Nov 7, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@pablo-garay
Collaborator Author

/ok to test 0547378

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
… dependency group)

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
@pablo-garay
Collaborator Author

/ok to test de72a5b

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
@pablo-garay
Collaborator Author

/ok to test b75984f

@copy-pr-bot

copy-pr-bot bot commented Nov 7, 2025

/ok to test b75984f

@pablo-garay, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@pablo-garay
Collaborator Author

/ok to test b75984f

@copy-pr-bot

copy-pr-bot bot commented Nov 7, 2025

/ok to test b75984f

@pablo-garay, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@pablo-garay
Collaborator Author

/ok to test 069c536

@pablo-garay
Collaborator Author

/ok to test 6c08f5e

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
@pablo-garay
Collaborator Author

/ok to test 90f32c8

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
@pablo-garay
Collaborator Author

/ok to test 8e0c998

@github-actions
Contributor

github-actions bot commented Nov 8, 2025

uv.lock is up to date

The lockfile is in sync with pyproject.toml.

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
@pablo-garay
Collaborator Author

/ok to test 8b8adff

@github-actions
Contributor

github-actions bot commented Nov 8, 2025

uv.lock is up to date

The lockfile is in sync with pyproject.toml.

@pablo-garay pablo-garay changed the title from "feat: add megatron-bridge" to "feat: add megatron-bridge as dependency" Nov 8, 2025
@abhinavg4
Contributor

abhinavg4 commented Nov 8, 2025

  1. (Done) Can we change nemo-vfm to nemo-dfm ?

  2. (Abhinav) Make the keywords something better (I can do this later)

  3. (Done) Core dependencies should not have Automodel or anything megatron related. Let's add megatron-energon there though

  4. (Done) We need to have 2 deps groups: automodel and mcore.

    1. Only use 3rd party git submodule and no pypi
  5. For CI testing we should have 3 dockers:

    1. (p0) Dockerfile.all.ci : Our current CI file
    2. (p1) Dockerfile.mcore.ci: Install Only Mcore deps
    3. (p1) Dockerfile.automodel.ci: Install Only Automodel deps
  6. (Done) Let's use 3rd party and have these 3 things in 3rd party:

    1. Automodel
    2. Megatron Bridge
    3. These can point to TOT and we use them to build docker CI file
  7. Can you add a test for:

    1. Megatron Bridge: Use their README simplest example, maybe this:
# NOTE: the quoted example omits the recipe import; in Megatron-Bridge the
# config helper lives under the recipes package (exact path per their README)
from megatron.bridge.recipes.llama import llama32_1b_pretrain_config
from megatron.bridge.training.gpt_step import forward_step
from megatron.bridge.training.pretrain import pretrain

if __name__ == "__main__":
    # The recipe uses the Llama 3.2 1B model configuration from HuggingFace
    cfg = llama32_1b_pretrain_config(seq_length=1024)

    # Override training parameters
    cfg.train.train_iters = 10
    cfg.scheduler.lr_decay_iters = 10000
    cfg.model.vocab_size = 8192
    cfg.tokenizer.vocab_size = cfg.model.vocab_size

    pretrain(cfg, forward_step)
    2. Automodel: Use their README simplest example, maybe:
torchrun --nproc-per-node=2 examples/llm_finetune/finetune.py --config examples/llm_finetune/llama3_2/llama3_2_1b_hellaswag.yaml

OR

python -c "import nemo_automodel; print('AutoModel ready')"
  8. Tests should have 3 folders: common, mcore, automodel
  9. Dockerfile.all.ci should be tested with all 3 folders above. Dockerfile.automodel.ci should test with only common and automodel, and Dockerfile.mcore.ci should test with common and mcore.
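The common/mcore/automodel split described above could be implemented by gating each test suite on whether its dependency group is installed. A minimal sketch, assuming a folder-per-suite layout; the helper names and the suite-to-package mapping are hypothetical, not from this PR:

```python
import importlib.util

# Map each hypothetical test folder to the imports its dependency group provides.
SUITE_REQUIREMENTS = {
    "common": [],                   # always runs
    "mcore": ["megatron.bridge"],   # mcore deps group
    "automodel": ["nemo_automodel"] # automodel deps group
}

def installed(module_name: str) -> bool:
    """True if the top-level package can be found without importing it."""
    return importlib.util.find_spec(module_name.split(".")[0]) is not None

def selectable_suites() -> list[str]:
    """Suites whose required packages are all present in this environment."""
    return [
        suite
        for suite, mods in SUITE_REQUIREMENTS.items()
        if all(installed(m) for m in mods)
    ]

if __name__ == "__main__":
    # The result could be passed to pytest, e.g.: pytest tests/common tests/mcore
    print(selectable_suites())
```

With this shape, Dockerfile.all.ci would see all three suites as selectable, while the mcore- and automodel-only images would each select common plus their own suite.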

Contributor

@abhinavg4 abhinavg4 left a comment


Please see comments

RUN uv venv ${UV_PROJECT_ENVIRONMENT} --system-site-packages

# Copy dependency files and source code (needed for dynamic version resolution)
COPY pyproject.toml uv.lock ./
Contributor


I don't think we should do this. Why are we doing this? We should copy these to /opt/DFM or something.

Similar to https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/docker/Dockerfile.ci
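For illustration, the layout that comment points at would look roughly like the sketch below, modeled on Megatron-Bridge's Dockerfile.ci; the directory name and uv flags are assumptions, not taken from this PR:

```dockerfile
# Sketch: copy the project into a fixed directory instead of the image's
# default working directory, mirroring Megatron-Bridge's Dockerfile.ci.
WORKDIR /opt/DFM

# Dependency files first, so this layer stays cached until they change
COPY pyproject.toml uv.lock /opt/DFM/

RUN uv venv ${UV_PROJECT_ENVIRONMENT} --system-site-packages && \
    uv sync --locked --no-install-project

# Source code last; editing it does not invalidate the dependency layer
COPY . /opt/DFM/
RUN uv sync --locked
```

The main benefit is layer caching: dependency resolution only re-runs when pyproject.toml or uv.lock change, not on every source edit.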

@pablo-garay
Collaborator Author

pablo-garay commented Nov 8, 2025

  1. Can we change nemo-vfm to nemo-dfm ?

  2. Make the keywords something better (I can do this later)

  3. Core dependencies should not have Automodel or anything megatron related. Let's add megatron-energon there though

  4. We need to have 2 deps groups: automodel and mcore.

    1. Automodel: Has pypi automodel as deps not the TOT (https://pypi.org/project/nemo-automodel)
    2. Megatron bridge should have pypi megatron-bridge
  5. For CI testing we should have 3 dockers:

    1. Dockerfile.all.ci : Our current CI file
    2. Dockerfile.mcore.ci: Install Only Mcore deps
    3. Dockerfile.automodel.ci: Install Only Automodel deps
  6. Let's use 3rd party and have these 3 things in 3rd party:

    1. Automodel
    2. Megatron Bridge
    3. Megatron LM
    4. These can point to TOT and we use them to build docker CI file
  7. Can you add a test for:

    1. Megatron Bridge: Use their README simplest example, maybe this:
# NOTE: the quoted example omits the recipe import; in Megatron-Bridge the
# config helper lives under the recipes package (exact path per their README)
from megatron.bridge.recipes.llama import llama32_1b_pretrain_config
from megatron.bridge.training.gpt_step import forward_step
from megatron.bridge.training.pretrain import pretrain

if __name__ == "__main__":
    # The recipe uses the Llama 3.2 1B model configuration from HuggingFace
    cfg = llama32_1b_pretrain_config(seq_length=1024)

    # Override training parameters
    cfg.train.train_iters = 10
    cfg.scheduler.lr_decay_iters = 10000
    cfg.model.vocab_size = 8192
    cfg.tokenizer.vocab_size = cfg.model.vocab_size

    pretrain(cfg, forward_step)
    2. Automodel: Use their README simplest example, maybe:
torchrun --nproc-per-node=2 examples/llm_finetune/finetune.py --config examples/llm_finetune/llama3_2/llama3_2_1b_hellaswag.yaml

OR

python -c "import nemo_automodel; print('AutoModel ready')"
  8. Tests should have 3 folders: common, mcore, automodel
  9. Dockerfile.all.ci should be tested with all 3 folders above. Dockerfile.automodel.ci should test with only common and automodel, and Dockerfile.mcore.ci should test with common and mcore.

@abhinavg4 I agree with the proposal above. While this PR looks small, it already took 30 commits; the real effort was getting the different pieces to work together (even if the changes in the diff look simple). I recommend we take an incremental/iterative approach: merge this, and I can create follow-up PRs to address the proposal above.

"ftfy",
"imageio-ffmpeg",
"opencv-python-headless==4.10.0.84",
"megatron-bridge",
Collaborator


MBridge has a tricky problem, I think, because it installs MCore from a submodule rather than from MCore's pypi package. It's unfortunately an issue with how we decided to use the MCore dev branch: there is no dev-branch pypi package. So if we just list MBridge here, it will not install the expected MCore dependency. Please work with @ko3n1g on a suggested path forward.

It may be cleaner to handle both Automodel and MBridge in a similar fashion.

Contributor


Hi @chtruong814, I think MBridge does install MCore here:

https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/pyproject.toml#L81

Nonetheless, we can use Megatron and MBridge as 3rd party for our dev work and containers (see point 6 here), but to release on pypi, we can only use packages from pypi?

Collaborator


The problem is that MBridge uses a commit of MCore off of the MCore dev branch, which is referenced as a 3rd party submodule. That's only installed correctly using uv with this:

https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/pyproject.toml#L117

Otherwise, uv would by default install whatever pypi package is available, which is not the same as the referenced commit.
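Concretely, the pinning mechanism being described is uv's `[tool.uv.sources]` table, which overrides where a dependency is resolved from. A DFM-side equivalent might look like this sketch; the submodule path is hypothetical, not from this PR:

```toml
# Force uv to resolve megatron-core from the vendored submodule commit
# instead of whatever version is published on PyPI.
[tool.uv.sources]
megatron-core = { path = "3rdparty/Megatron-LM", editable = true }
```

Note that `[tool.uv.sources]` only applies when building with uv itself; a plain `pip install` of a published wheel would fall back to the PyPI version, which is the release concern raised above.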

git push origin HEAD:${{ github.ref_name }}
- name: Comment on PR with lockfile status
if: github.event_name == 'pull_request'
Collaborator


This new workflow isn't running on PR events.

Collaborator Author

@pablo-garay pablo-garay Nov 10, 2025


Right, this was intentional: it only runs on manual trigger. This workflow is super helpful, but optional, for when people want it.
