
Conversation

Collaborator

@romitjain romitjain commented Dec 8, 2025

Updated tox.ini and pyproject.toml to support GPU-based tests. Currently

tox -e accel

will run all the unit tests.

Apart from this, I have made changes to monkeypatch global variables instead of passing them directly to the trainer, and I have also monkeypatched the relevant environment variables.
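
For illustration, here is a minimal sketch of that monkeypatch pattern using pytest's monkeypatch fixture; the fixture name and SOME_GLOBAL_FLAG are placeholders, not the actual fms-hf-tuning globals:

# Sketch only: SOME_GLOBAL_FLAG and patched_trainer_env are placeholder names.
import pytest

from tuning import sft_trainer


@pytest.fixture
def patched_trainer_env(monkeypatch):
    # Patch a module-level variable instead of passing it through the trainer call.
    monkeypatch.setattr(sft_trainer, "SOME_GLOBAL_FLAG", True, raising=False)
    # Patch environment variables; pytest restores the originals after each test.
    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0")
    yield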

Signed-off-by: romitjain <romit@ibm.com>

github-actions bot commented Dec 8, 2025

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the fix label Dec 8, 2025
ashokponkumar previously approved these changes Dec 8, 2025
pytest>=7
.[aim,mlflow,clearml,scanner-dev,fms-accel-all]
setenv =
CUDA_VISIBLE_DEVICES=0
Collaborator

Do we need to assume only one GPU is available?

Collaborator Author

None of these tests requires a multi-GPU setup, so I restricted it to a single GPU.
Without this, internal Hugging Face code was raising an error when computing the loss.

Collaborator

Got it, let's keep it like this.
Is it possible to complain if we detect that the tests are running on a multi-GPU setup and essentially wasting GPU hours?

Collaborator Author

I don't see a way to do that via tox. I think we should handle it at a layer above, by adding validation that disallows multi-GPU runs for this template.
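
To illustrate the idea being discussed (not part of this PR), a conftest.py hook could warn when more than one device is visible. Note that since tox already sets CUDA_VISIBLE_DEVICES=0, a check like this would only ever see one device, so it would have to run before that variable takes effect, which is part of why tox alone can't do it cleanly:

# conftest.py sketch (illustrative only, not part of this PR)
import os
import warnings

import torch


def pytest_sessionstart(session):
    # Warn if multiple GPUs are visible: the suite only exercises one GPU,
    # so any extra devices would sit idle for the whole run.
    if torch.cuda.is_available() and torch.cuda.device_count() > 1:
        warnings.warn(
            f"{torch.cuda.device_count()} GPUs visible "
            f"(CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES', 'unset')}); "
            "the GPU test suite only uses one device."
        )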

Collaborator

@dushyantbehl dushyantbehl left a comment

Requesting clarification

tox.ini Outdated
setenv =
    CUDA_VISIBLE_DEVICES=0
commands =
    pytest {posargs:tests/test_sft_trainer.py}
Collaborator

Shall we not run all the tests from inside fms-hf-tuning?
We are currently limiting this to only the top-level tuning.sft_trainer unit tests.

Collaborator Author

We can do that, but none of the other tests were being skipped for lack of a GPU; they will just take more time.
I am okay either way.

Collaborator Author

Another approach is to mark specific tests and run only those. That would be the most focused and time-saving option.
I am not in favour of it, though, because we should run all tuning-based tests on GPUs.
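
For reference, the marker-based alternative would look roughly like this (hypothetical gpu marker and test name, not adopted here); tox could then select only the marked subset with pytest -m gpu:

# Hypothetical marker-based selection (the alternative discussed above, not adopted).
# The "gpu" marker would need to be registered, e.g. under
# [tool.pytest.ini_options] markers in pyproject.toml.
import pytest


@pytest.mark.gpu
def test_run_causallm_lora_on_gpu():
    ...  # placeholder body, for illustration only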

Collaborator

@dushyantbehl dushyantbehl Dec 9, 2025

The testing time is around <10 mins for all the tests without GPUs, right?

Just another thought: it would be better if we make this a single source of truth for all the unit tests, so users need not worry about which unit tests to execute on their end; they can run them all together and, if successful, share a screenshot along with the PR.

On the workflow we can keep running the non-GPU-based tests.

Collaborator Author

@dushyantbehl Yes, for all the tests without a GPU the testing time is ~10 mins, but running with GPUs easily goes north of 30 mins.

Let me enable all the unit tests under this environment.

Signed-off-by: romitjain <romit@ibm.com>
-    checkpoint_path=os.path.join(
-        _get_checkpoint_path(tempdir), "hf_converted_checkpoint"
-    )
+    checkpoint_path=_get_hf_converted_path(tempdir)
Collaborator

@romitjain shouldn't it point to the safetensors-{step number} folder?

Collaborator Author

@kmehant yes, I have added a new function _get_hf_converted_path that helps in retrieving the path.

@dushyantbehl
Collaborator

@romitjain the lint seems to be failing, can you please check?

Signed-off-by: romit <romit@ibm.com>
Signed-off-by: romit <romit@ibm.com>
Signed-off-by: romit <romit@ibm.com>
Signed-off-by: romit <romit@ibm.com>
     return os.path.join(dir_path, checkpoint_dirs[-1])
+
+
+def _get_hf_converted_path(dir_path):
Collaborator Author

New function to get the HF-converted checkpoint path directly.
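
For context, a minimal sketch of what such a helper might look like, based on the call site it replaces above; the actual implementation in the test file may differ, e.g. to resolve the safetensors-{step number} folder mentioned earlier:

import os


def _get_hf_converted_path(dir_path):
    # Sketch only: joins the latest checkpoint directory (via the existing
    # _get_checkpoint_path helper) with the HF-converted subdirectory name,
    # mirroring the inline os.path.join call it replaces.
    return os.path.join(_get_checkpoint_path(dir_path), "hf_converted_checkpoint")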

Signed-off-by: romit <romit@ibm.com>
 train_args.output_dir = tempdir
 train_args.save_strategy = "no"
-train_args.fp16 = True
+train_args.bf16 = True
Collaborator Author

fp16 upcasting is not allowed, so this test uses bf16 instead.

 train_args.output_dir = tempdir
 train_args.save_strategy = "no"
-train_args.fp16 = True
+train_args.bf16 = True
Collaborator Author

same reason as above

@romitjain romitjain requested a review from kmehant December 15, 2025 05:22
@romitjain
Collaborator Author

@dushyantbehl @kmehant Please have a look

tox -e accel is passing, but tox -e gpu is failing due to some state leakage.

@dushyantbehl
Collaborator

#649 (comment)

@romitjain we can wait for this to be fixed before merging the PR.

@romitjain
Collaborator Author

romitjain commented Dec 15, 2025

@dushyantbehl
Tests are passing when run individually, which means both the test suite and the code it is testing are fine; only running them under a single process is problematic. Why do we need to wait for the state leakage fix?

@dushyantbehl
Collaborator

@dushyantbehl Tests are passing when run individually, which means both the test suite and the code it is testing are fine; only running them under a single process is problematic. Why do we need to wait for the state leakage fix?

What would be the mode for running them individually on the pod?

@romitjain
Collaborator Author

What would be the mode for running them individually on the pod?

tox -e accel

which will run the tests for each folder individually.

IMO, we can figure out the state leakage independently

@dushyantbehl
Collaborator

What would be the mode for running them individually on the pod?

tox -e accel

which will run the tests for each folder individually.

IMO, we can figure out the state leakage independently

Thanks... sure, I will review the PR later then.

@romitjain
Collaborator Author

@dushyantbehl Actually, I was suggesting that we review/merge this PR. Since the scope of this PR has already expanded, we can fix the state leakage issue in another PR. Let me know what you think.

Collaborator

@dushyantbehl dushyantbehl left a comment

LGTM

@dushyantbehl dushyantbehl merged commit 13b05a9 into foundation-model-stack:main Dec 16, 2025
9 checks passed