fix: add global step number in hf moe chpt #643
Conversation
Thanks for making a pull request! 😃
  if is_intermediate:
      hf_converted_output_dir = os.path.join(
-         save_dir, "hf_converted_checkpoint"
+         save_dir, f"hf_converted_checkpoint-{state.global_step}"
      )
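A minimal, self-contained sketch of the naming change in this diff (the function name and example values are illustrative, not taken from the repo):

  import os

  # Illustrative only: build the converted-checkpoint directory name so that
  # it carries the trainer's global step, as in the diff above.
  def build_converted_dir(save_dir: str, global_step: int) -> str:
      return os.path.join(save_dir, f"hf_converted_checkpoint-{global_step}")

  # e.g. "output/checkpoint-1234/hf_converted_checkpoint-1234"
  print(build_converted_dir("output/checkpoint-1234", 1234))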
Can you rename this to safetensors_converted_checkpoint-{state.global_step}? If this would require big changes to llmb, we can ignore it.
There will be no change required in llmb at all. But a long name might create issues at times with COS and other file systems. How about safetensors-{state.global_step} or something simpler?
safetensors-{state.global_step} is good for me too, thanks @ashokponkumar
tuning/trainercontroller/callback.py
- hf_converted_path = os.path.join(base_path, "hf_converted_checkpoint")
+ hf_converted_path = os.path.join(
+     base_path, f"hf_converted_checkpoint-{state.global_step}"
+ )
Can we check if the path exists and the files exist before printing this message, or print some error in the other case?
We do just that in the code right below the os.path.join. If the directory does not exist, it means the accelerate callback didn't create it, so we fall back to the base directory. Also, since the accelerate callback always runs before the trainer controller callback, this check doubles as the logic for deciding whether an HF checkpoint was produced. In what scenario would we need an error, by the way?

  if os.path.isdir(hf_converted_path):
      kwargs["hf_path"] = hf_converted_path
  else:
      kwargs["hf_path"] = base_path
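For reference, a minimal sketch of how the fallback could also surface the warning the review asks about (the logger and the wrapper function are assumptions, not part of the PR):

  import logging
  import os

  logger = logging.getLogger(__name__)

  # Sketch only: same fallback as the snippet above, plus a warning for the
  # case where the converted directory is missing.
  def resolve_hf_path(base_path: str, hf_converted_path: str, kwargs: dict) -> None:
      if os.path.isdir(hf_converted_path):
          kwargs["hf_path"] = hf_converted_path
      else:
          logger.warning(
              "Converted checkpoint dir %s not found; falling back to %s",
              hf_converted_path,
              base_path,
          )
          kwargs["hf_path"] = base_path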
Description of the change
Add the global step number to the checkpoint path for MoE models finetuned with sharded_state.
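For illustration, a save at global step 1234 would produce a layout roughly like the following (directory names are illustrative and assume save_dir is the per-step checkpoint directory); see the example below for the final naming:

  output/
      checkpoint-1234/          <- sharded_state checkpoint written by the trainer
          safetensors-1234/     <- HF-converted safetensors checkpoint for step 1234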
Example:
hf_converted_checkpoint -> safetensors-1234

Related issue number
How to verify the PR
Was the PR tested