
Conversation

@pablo-garay
Collaborator

@pablo-garay pablo-garay commented Nov 7, 2025

feat: add megatron-bridge (as dependency)

Passing tests: https://github.com/NVIDIA-NeMo/DFM/actions/runs/19188264333/job/54859363462?pr=32

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
@pablo-garay pablo-garay requested a review from a team as a code owner November 7, 2025 08:35
@copy-pr-bot

copy-pr-bot bot commented Nov 7, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@pablo-garay
Collaborator Author

/ok to test 0547378

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
… dependency group)

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
@pablo-garay
Collaborator Author

/ok to test de72a5b

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
@pablo-garay
Collaborator Author

/ok to test b75984f

@copy-pr-bot

copy-pr-bot bot commented Nov 7, 2025

/ok to test b75984f

@pablo-garay, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@pablo-garay
Collaborator Author

/ok to test b75984f

@copy-pr-bot

copy-pr-bot bot commented Nov 7, 2025

/ok to test b75984f

@pablo-garay, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@pablo-garay
Collaborator Author

/ok to test 069c536

@pablo-garay
Collaborator Author

/ok to test 6c08f5e

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
@pablo-garay
Collaborator Author

/ok to test 90f32c8

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
@pablo-garay
Collaborator Author

/ok to test 8e0c998

@github-actions
Contributor

github-actions bot commented Nov 8, 2025

uv.lock is up to date

The lockfile is in sync with pyproject.toml.

Signed-off-by: Pablo Garay <pagaray@nvidia.com>
@pablo-garay
Collaborator Author

/ok to test 8b8adff

@github-actions
Contributor

github-actions bot commented Nov 8, 2025

uv.lock is up to date

The lockfile is in sync with pyproject.toml.

@pablo-garay pablo-garay changed the title from "feat: add megatron-bridge" to "feat: add megatron-bridge as dependency" Nov 8, 2025
@abhinavg4
Contributor

abhinavg4 commented Nov 8, 2025

  1. (Done) Can we change nemo-vfm to nemo-dfm ?

  2. (Abhinav) Make the keywords something better (I can do this later)

  3. (Done) Core dependencies should not have Automodel or anything megatron related. Let's add megatron-energon there though

  4. (Done) We need to have 2 deps groups: automodel and mcore.

    1. Only use 3rd party git submodule and no pypi
  5. For CI testing we should have 3 dockers:

    1. (p0) Dockerfile.all.ci : Our current CI file
    2. (p1) Dockerfile.mcore.ci: Install Only Mcore deps
    3. (p1) Dockerfile.automodel.ci: Install Only Automodel deps
  6. (Done) Let's use 3rd party and have these 3 things in 3rd party:

    1. Automodel
    2. Megatron Bridge
    3. These can point to TOT and we use them to build docker CI file
  7. Can you add a test for:

    1. Megatron Bridge: Use their README simplest example, maybe this:
# NOTE: the quoted example omits the recipe import; in Megatron-Bridge the
# config helper lives under the recipes package (exact path per their README)
from megatron.bridge.recipes.llama import llama32_1b_pretrain_config
from megatron.bridge.training.gpt_step import forward_step
from megatron.bridge.training.pretrain import pretrain

if __name__ == "__main__":
    # The recipe uses the Llama 3.2 1B model configuration from HuggingFace
    cfg = llama32_1b_pretrain_config(seq_length=1024)

    # Override training parameters
    cfg.train.train_iters = 10
    cfg.scheduler.lr_decay_iters = 10000
    cfg.model.vocab_size = 8192
    cfg.tokenizer.vocab_size = cfg.model.vocab_size

    pretrain(cfg, forward_step)
    2. Automodel: Use their README simplest example, maybe:
torchrun --nproc-per-node=2 examples/llm_finetune/finetune.py --config examples/llm_finetune/llama3_2/llama3_2_1b_hellaswag.yaml

OR

python -c "import nemo_automodel; print('AutoModel ready')"
  8. Tests should have 3 folders: common, mcore, automodel
  9. Dockerfile.all.ci should be tested with all 3 folders above. Dockerfile.automodel.ci should test with only common and automodel, and Dockerfile.mcore.ci should test with common and mcore.
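The common/mcore/automodel split described above could be implemented by gating each test suite on whether its dependency group is installed. A minimal sketch, assuming a folder-per-suite layout; the helper names and the suite-to-package mapping are hypothetical, not from this PR:

```python
import importlib.util

# Map each hypothetical test folder to the imports its dependency group provides.
SUITE_REQUIREMENTS = {
    "common": [],                   # always runs
    "mcore": ["megatron.bridge"],   # mcore deps group
    "automodel": ["nemo_automodel"] # automodel deps group
}

def installed(module_name: str) -> bool:
    """True if the top-level package can be found without importing it."""
    return importlib.util.find_spec(module_name.split(".")[0]) is not None

def selectable_suites() -> list[str]:
    """Suites whose required packages are all present in this environment."""
    return [
        suite
        for suite, mods in SUITE_REQUIREMENTS.items()
        if all(installed(m) for m in mods)
    ]

if __name__ == "__main__":
    # The result could be passed to pytest, e.g.: pytest tests/common tests/mcore
    print(selectable_suites())
```

With this shape, Dockerfile.all.ci would see all three suites as selectable, while the mcore- and automodel-only images would each select common plus their own suite.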

Contributor

@abhinavg4 abhinavg4 left a comment


Please see comments

RUN uv venv ${UV_PROJECT_ENVIRONMENT} --system-site-packages

# Copy dependency files and source code (needed for dynamic version resolution)
COPY pyproject.toml uv.lock ./
Contributor


I don't think we should do this. Why are we doing this? We should copy these to /opt/DFM or something.

Similar to https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/docker/Dockerfile.ci
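For illustration, the layout that comment points at would look roughly like the sketch below, modeled on Megatron-Bridge's Dockerfile.ci; the directory name and uv flags are assumptions, not taken from this PR:

```dockerfile
# Sketch: copy the project into a fixed directory instead of the image's
# default working directory, mirroring Megatron-Bridge's Dockerfile.ci.
WORKDIR /opt/DFM

# Dependency files first, so this layer stays cached until they change
COPY pyproject.toml uv.lock /opt/DFM/

RUN uv venv ${UV_PROJECT_ENVIRONMENT} --system-site-packages && \
    uv sync --locked --no-install-project

# Source code last; editing it does not invalidate the dependency layer
COPY . /opt/DFM/
RUN uv sync --locked
```

The main benefit is layer caching: dependency resolution only re-runs when pyproject.toml or uv.lock change, not on every source edit.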

@pablo-garay
Collaborator Author

pablo-garay commented Nov 8, 2025

  1. Can we change nemo-vfm to nemo-dfm ?

  2. Make the keywords something better (I can do this later)

  3. Core dependencies should not have Automodel or anything megatron related. Let's add megatron-energon there though

  4. We need to have 2 deps groups: automodel and mcore.

    1. Automodel: Has pypi automodel as deps not the TOT (https://pypi.org/project/nemo-automodel)
    2. Megatron bridge should have pypi megatron-bridge
  5. For CI testing we should have 3 dockers:

    1. Dockerfile.all.ci : Our current CI file
    2. Dockerfile.mcore.ci: Install Only Mcore deps
    3. Dockerfile.automodel.ci: Install Only Automodel deps
  6. Let's use 3rd party and have these 3 things in 3rd party:

    1. Automodel
    2. Megatron Bridge
    3. Megatron LM
    4. These can point to TOT and we use them to build docker CI file
  7. Can you add a test for:

    1. Megatron Bridge: Use their README simplest example, maybe this:
# NOTE: the quoted example omits the recipe import; in Megatron-Bridge the
# config helper lives under the recipes package (exact path per their README)
from megatron.bridge.recipes.llama import llama32_1b_pretrain_config
from megatron.bridge.training.gpt_step import forward_step
from megatron.bridge.training.pretrain import pretrain

if __name__ == "__main__":
    # The recipe uses the Llama 3.2 1B model configuration from HuggingFace
    cfg = llama32_1b_pretrain_config(seq_length=1024)

    # Override training parameters
    cfg.train.train_iters = 10
    cfg.scheduler.lr_decay_iters = 10000
    cfg.model.vocab_size = 8192
    cfg.tokenizer.vocab_size = cfg.model.vocab_size

    pretrain(cfg, forward_step)
    2. Automodel: Use their README simplest example, maybe:
torchrun --nproc-per-node=2 examples/llm_finetune/finetune.py --config examples/llm_finetune/llama3_2/llama3_2_1b_hellaswag.yaml

OR

python -c "import nemo_automodel; print('AutoModel ready')"
  8. Tests should have 3 folders: common, mcore, automodel
  9. Dockerfile.all.ci should be tested with all 3 folders above. Dockerfile.automodel.ci should test with only common and automodel, and Dockerfile.mcore.ci should test with common and mcore.

@abhinavg4 I agree with the proposal above. While this PR looks small, it already took 30 commits; the real effort was getting the different pieces to work together (even if the changes in the diff look simple). I recommend we take an incremental/iterative approach: merge this, and I can create follow-up PRs to address the proposal above.

"ftfy",
"imageio-ffmpeg",
"opencv-python-headless==4.10.0.84",
"megatron-bridge",
Collaborator


MBridge has a tricky problem, I think, because it installs MCore from a submodule rather than from MCore's pypi package. It's unfortunately an issue with how we decided to use the MCore dev branch: there is no dev-branch pypi package. So if we just list MBridge here, it will not install the expected MCore dependency. Please work with @ko3n1g on a suggested path forward.

It may be cleaner to handle both Automodel and MBridge in a similar fashion.

Contributor


Hi @chtruong814, I think MBridge does install MCore here:

https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/pyproject.toml#L81

Nonetheless, we can use Megatron and MBridge as 3rd party for our dev work and containers (see point 6 here), but to release on pypi, we can only use packages from pypi?

Collaborator


The problem is that MBridge uses a commit of MCore off of the MCore dev branch, which is referenced as a 3rd party submodule. That's only installed correctly using uv with this:

https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/main/pyproject.toml#L117

Otherwise, uv would by default install whatever pypi package is available, which is not the same as the referenced commit.
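Concretely, the pinning mechanism being described is uv's `[tool.uv.sources]` table, which overrides where a dependency is resolved from. A DFM-side equivalent might look like this sketch; the submodule path is hypothetical, not from this PR:

```toml
# Force uv to resolve megatron-core from the vendored submodule commit
# instead of whatever version is published on PyPI.
[tool.uv.sources]
megatron-core = { path = "3rdparty/Megatron-LM", editable = true }
```

Note that `[tool.uv.sources]` only applies when building with uv itself; a plain `pip install` of a published wheel would fall back to the PyPI version, which is the release concern raised above.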

git push origin HEAD:${{ github.ref_name }}
- name: Comment on PR with lockfile status
if: github.event_name == 'pull_request'
Collaborator


This new workflow isn't running on PR events.

Collaborator Author

@pablo-garay pablo-garay Nov 10, 2025


Right, this was intentional: it only runs on manual trigger. This workflow is super helpful, but optional, for when people want it.
