fix: Enabled GPU based tests #649
Conversation
Signed-off-by: romitjain <romit@ibm.com>
Thanks for making a pull request! 😃
pytest>=7
.[aim,mlflow,clearml,scanner-dev,fms-accel-all]
setenv =
    CUDA_VISIBLE_DEVICES=0
Do we need to assume only one GPU is available?
None of these tests requires a multi-GPU setup, so I set it to a single GPU.
Without this, internal Hugging Face code was raising an error (when computing the loss).
Got it, let's keep it like this.
Is it possible to complain if we detect that the tests are running on a multi-GPU setup and essentially wasting GPU hours?
I don't see a way to do that via tox. I think we should handle it at a layer above, where we add some validation to prevent multi-GPU runs for this template.
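One hedged way to do this outside tox would be a session-level check in pytest; this is only a sketch, assuming torch is importable in the test environment, and adding such a conftest.py hook is not part of this PR:

```python
# conftest.py (hypothetical addition, not part of this PR)
import warnings

import torch


def pytest_sessionstart(session):
    # Complain if more GPUs are visible than the single device these tests use,
    # so idle GPU hours are at least flagged.
    if torch.cuda.is_available() and torch.cuda.device_count() > 1:
        warnings.warn(
            f"{torch.cuda.device_count()} GPUs are visible but the test suite "
            "uses only one; consider setting CUDA_VISIBLE_DEVICES=0."
        )
```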
dushyantbehl left a comment
Requesting clarification
tox.ini
Outdated
setenv =
    CUDA_VISIBLE_DEVICES=0
commands =
    pytest {posargs:tests/test_sft_trainer.py}
Shall we not run all the tests from inside fms-hf-tuning?
We are currently limiting this to only the top-level tuning.sft_trainer unit tests.
We can do that, but no tests in the other files were being skipped for lack of GPUs; they will just take more time to run.
I am okay either way.
Another approach is to mark specific tests and only run those. That would be the most focused and time-saving approach, but I am not in favour of it because we should run all tuning-based tests on GPUs.
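For reference, the marker-based alternative would look roughly like this; the `gpu` marker name and test name are assumptions, and this is not what the PR ends up doing:

```python
import pytest


# Hypothetical marker-based selection, not adopted in this PR.
# The "gpu" marker would need to be registered (e.g. under
# [tool.pytest.ini_options] markers in pyproject.toml) to avoid warnings.
@pytest.mark.gpu
def test_run_causallm_ft_on_gpu():
    ...
```

Only the marked tests would run with `pytest -m gpu`, while `pytest -m "not gpu"` would keep the CPU-only flow unchanged.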
The testing time is around 10 minutes for all the tests without GPUs, right?
Just another thought: it would be better to make this the single source of truth for all the unit tests, so users need not worry about executing the unit tests on their end, but can instead run them all together and, if successful, share a screenshot along with the PR.
In the workflow we can keep running the non-GPU-based tests.
@dushyantbehl Yes, for all the tests without a GPU the testing time is ~10 minutes, but running with GPUs easily goes north of 30 minutes.
Let me enable all the unit tests under this.
tests/test_sft_trainer.py
Outdated
checkpoint_path=os.path.join(
    _get_checkpoint_path(tempdir), "hf_converted_checkpoint"
)
checkpoint_path=_get_hf_converted_path(tempdir)
@romitjain shouldn't it point to the safetensors-{step number} folder?
@kmehant yes, I have added a new function _get_hf_converted_path that helps in retrieving the path.
@romitjain the lint seems to be failing, can you please check?
Signed-off-by: romit <romit@ibm.com>
return os.path.join(dir_path, checkpoint_dirs[-1])


def _get_hf_converted_path(dir_path):
New function to get the HF-converted checkpoint path directly.
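The body of the helper is outside this snippet; a plausible sketch, assuming it simply joins the latest checkpoint directory with the hf_converted_checkpoint folder seen earlier in the diff (the implementation below is an assumption, not the PR's exact code):

```python
import os


def _get_hf_converted_path(dir_path):
    # Sketch: reuse the existing _get_checkpoint_path helper from the test file
    # to find the latest checkpoint directory, then descend into the converted folder.
    return os.path.join(_get_checkpoint_path(dir_path), "hf_converted_checkpoint")
```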
Signed-off-by: romit <romit@ibm.com>
train_args.output_dir = tempdir
train_args.save_strategy = "no"
train_args.fp16 = True
train_args.bf16 = True
fp16 upcasting is not allowed
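If a test ever needs to run on hardware without bf16 support, one hedged way to pick a workable precision at runtime would be the following sketch; `train_args` stands for the TrainingArguments instance already built in these tests, and this is not part of the PR:

```python
import torch

# Prefer bf16 when the GPU supports it, since enabling fp16 here triggers the
# upcasting error mentioned above; otherwise fall back to fp16.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    train_args.bf16 = True
    train_args.fp16 = False
else:
    train_args.fp16 = True
    train_args.bf16 = False
```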
train_args.output_dir = tempdir
train_args.save_strategy = "no"
train_args.fp16 = True
train_args.bf16 = True
same reason as above
@dushyantbehl @kmehant Please have a look

@romitjain we can wait to fix this before merging the PR

@dushyantbehl
What would be the mode for running them individually on the pod?
which will run the tests for each folder individually. IMO, we can figure out the state leakage independently.

thanks... sure, I will review the PR later then

@dushyantbehl Actually, I was suggesting that we can review/merge this PR. Since the scope of this PR has already expanded, we can fix the state leakage issue in another PR. Let me know what you think.
dushyantbehl left a comment
LGTM
Updated tox.ini and pyproject.toml to support GPU-based tests. Currently it will run all the unit tests.
Apart from this, I have made changes to monkeypatch global variables instead of passing them directly to the trainer, and to monkeypatch environment variables.
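As an illustration of the monkeypatching pattern described above (a sketch only; `monkeypatch` is pytest's built-in fixture, while the import path and variable name below are hypothetical):

```python
import tests.test_sft_trainer as sft_tests  # hypothetical import path


def test_with_patched_state(monkeypatch):
    # Both changes are scoped to this test; pytest restores the environment
    # variable and the module attribute automatically afterwards.
    monkeypatch.setenv("CUDA_VISIBLE_DEVICES", "0")
    monkeypatch.setattr(sft_tests, "SOME_GLOBAL_CONFIG", {"bf16": True}, raising=False)
```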