
Conversation

Collaborator

@romitjain romitjain commented Dec 8, 2025

Updated tox.ini and pyproject.toml to support GPU-based tests. Currently

tox -e accel

will run all the unit tests.

Apart from this, I have made changes to monkeypatch global variables instead of passing them directly to the trainer, and I have also monkeypatched the relevant environment variables.
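
For illustration, here is a minimal sketch of that monkeypatch pattern using pytest's monkeypatch fixture; the fixture name and SOME_GLOBAL_FLAG are placeholders, not the actual fms-hf-tuning globals:

# Sketch only: SOME_GLOBAL_FLAG and patched_trainer_env are placeholder names.
import pytest

from tuning import sft_trainer


@pytest.fixture
def patched_trainer_env(monkeypatch):
    # Patch a module-level variable instead of passing it through the trainer call.
    monkeypatch.setattr(sft_trainer, "SOME_GLOBAL_FLAG", True, raising=False)
    # Patch environment variables; pytest restores the originals after each test.
    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0")
    yield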

Signed-off-by: romitjain <romit@ibm.com>

github-actions bot commented Dec 8, 2025

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the fix label Dec 8, 2025
ashokponkumar previously approved these changes Dec 8, 2025
pytest>=7
.[aim,mlflow,clearml,scanner-dev,fms-accel-all]
setenv =
CUDA_VISIBLE_DEVICES=0
Collaborator

Do we need to assume only one GPU is available?

Collaborator Author

None of these tests requires a multi-GPU setup, so I restricted it to a single GPU.
Without this, internal Hugging Face code was raising an error when computing the loss.

Collaborator

Got it, let's keep it like this.
Is it possible to complain if we detect that the tests are running on a multi-GPU setup and essentially wasting GPU hours?

Collaborator Author

I don't see a way to do that via tox. I think we should handle it at a layer above, by adding validation that disallows multi-GPU runs for this template.
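
To illustrate the idea being discussed (not part of this PR), a conftest.py hook could warn when more than one device is visible. Note that since tox already sets CUDA_VISIBLE_DEVICES=0, a check like this would only ever see one device, so it would have to run before that variable takes effect, which is part of why tox alone can't do it cleanly:

# conftest.py sketch (illustrative only, not part of this PR)
import os
import warnings

import torch


def pytest_sessionstart(session):
    # Warn if multiple GPUs are visible: the suite only exercises one GPU,
    # so any extra devices would sit idle for the whole run.
    if torch.cuda.is_available() and torch.cuda.device_count() > 1:
        warnings.warn(
            f"{torch.cuda.device_count()} GPUs visible "
            f"(CUDA_VISIBLE_DEVICES={os.environ.get('CUDA_VISIBLE_DEVICES', 'unset')}); "
            "the GPU test suite only uses one device."
        )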

Collaborator

@dushyantbehl dushyantbehl left a comment

Requesting clarification

tox.ini Outdated
setenv =
    CUDA_VISIBLE_DEVICES=0
commands =
    pytest {posargs:tests/test_sft_trainer.py}
Collaborator

Shall we not run all the tests from inside fms-hf-tuning?
We are currently limiting this to only the top-level tuning.sft_trainer unit tests.

Collaborator Author

We can do that, but none of the other tests were being skipped for lack of a GPU; they will just take more time.
I am okay either way.

Collaborator Author

Another approach is to mark specific tests and run only those. That would be the most focused and time-saving option.
I am not in favour of it, though, because we should run all tuning-based tests on GPUs.
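
For reference, the marker-based alternative would look roughly like this (hypothetical gpu marker and test name, not adopted here); tox could then select only the marked subset with pytest -m gpu:

# Hypothetical marker-based selection (the alternative discussed above, not adopted).
# The "gpu" marker would need to be registered, e.g. under
# [tool.pytest.ini_options] markers in pyproject.toml.
import pytest


@pytest.mark.gpu
def test_run_causallm_lora_on_gpu():
    ...  # placeholder body, for illustration only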

Collaborator

@dushyantbehl dushyantbehl Dec 9, 2025

The testing time is around <10 mins for all the tests without GPUs, right?

Just another thought: it would be better if we make this a single source of truth for all the unit tests, so users need not worry about which unit tests to execute on their end; they can run them all together and, if successful, share a screenshot along with the PR.

On the workflow we can keep running the non-GPU-based tests.

Collaborator Author

@dushyantbehl Yes, for all the tests without a GPU the testing time is ~10 mins, but running with GPUs easily goes north of 30 mins.

Let me enable all the unit tests under this environment.

Signed-off-by: romitjain <romit@ibm.com>
-    checkpoint_path=os.path.join(
-        _get_checkpoint_path(tempdir), "hf_converted_checkpoint"
-    )
+    checkpoint_path=_get_hf_converted_path(tempdir)
Collaborator

@romitjain shouldn't it point to the safetensors-{step number} folder?

Collaborator Author

@kmehant yes, I have added a new function _get_hf_converted_path that helps in retrieving the path.

@dushyantbehl
Collaborator

@romitjain the lint seems to be failing, can you please check?

Signed-off-by: romit <romit@ibm.com>
Signed-off-by: romit <romit@ibm.com>
Signed-off-by: romit <romit@ibm.com>
Signed-off-by: romit <romit@ibm.com>
     return os.path.join(dir_path, checkpoint_dirs[-1])
+
+
+def _get_hf_converted_path(dir_path):
Collaborator Author

New function to get the HF-converted checkpoint path directly.
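
For context, a minimal sketch of what such a helper might look like, based on the call site it replaces above; the actual implementation in the test file may differ, e.g. to resolve the safetensors-{step number} folder mentioned earlier:

import os


def _get_hf_converted_path(dir_path):
    # Sketch only: joins the latest checkpoint directory (via the existing
    # _get_checkpoint_path helper) with the HF-converted subdirectory name,
    # mirroring the inline os.path.join call it replaces.
    return os.path.join(_get_checkpoint_path(dir_path), "hf_converted_checkpoint")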

Signed-off-by: romit <romit@ibm.com>
 train_args.output_dir = tempdir
 train_args.save_strategy = "no"
-train_args.fp16 = True
+train_args.bf16 = True
Collaborator Author

fp16 upcasting is not allowed, so this test uses bf16 instead.

 train_args.output_dir = tempdir
 train_args.save_strategy = "no"
-train_args.fp16 = True
+train_args.bf16 = True
Collaborator Author

same reason as above

@romitjain romitjain requested a review from kmehant December 15, 2025 05:22
@romitjain
Collaborator Author

@dushyantbehl @kmehant Please have a look

tox -e accel is passing, but tox -e gpu is failing due to some state leakage.

@dushyantbehl
Collaborator

#649 (comment)

@romitjain we can wait for this to be fixed before merging the PR.

@romitjain
Collaborator Author

romitjain commented Dec 15, 2025

@dushyantbehl
Tests are passing when run individually, which means both the test suite and the code it is testing are fine; only running them under a single process is problematic. Why do we need to wait for the state leakage fix?

@dushyantbehl
Collaborator

@dushyantbehl Tests are passing when run individually, which means both the test suite and the code it is testing are fine; only running them under a single process is problematic. Why do we need to wait for the state leakage fix?

What would be the mode for running them individually on the pod?

@romitjain
Collaborator Author

What would be the mode for running them individually on the pod?

tox -e accel

which will run the tests for each folder individually.

IMO, we can figure out the state leakage independently

@dushyantbehl
Collaborator

What would be the mode for running them individually on the pod?

tox -e accel

which will run the tests for each folder individually.

IMO, we can figure out the state leakage independently

Thanks... sure, I will review the PR later then.

@romitjain
Collaborator Author

@dushyantbehl Actually, I was suggesting that we review/merge this PR. Since the scope of this PR has already expanded, we can fix the state leakage issue in another PR. Let me know what you think.

Collaborator

@dushyantbehl dushyantbehl left a comment

LGTM

@dushyantbehl dushyantbehl merged commit 13b05a9 into foundation-model-stack:main Dec 16, 2025
9 checks passed