
Conversation

Contributor

@YashasviChaurasia commented Nov 24, 2025

Description of the change

Add the global step to the checkpoint path for MoE models fine-tuned with `sharded_state`.
Example: `hf_converted_checkpoint` -> `safetensors-1234`
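
For illustration, a minimal sketch of the new naming scheme (the `save_dir` value and step count below are hypothetical; `state.global_step` comes from the HF `TrainerState` at save time):

```python
import os

save_dir = "/tmp/output"  # hypothetical output directory
global_step = 1234        # stands in for state.global_step at save time

# Old: a single fixed directory name, overwritten on every save.
old_path = os.path.join(save_dir, "hf_converted_checkpoint")
# New: one directory per save, suffixed with the global step.
new_path = os.path.join(save_dir, f"safetensors-{global_step}")
print(new_path)  # /tmp/output/safetensors-1234
```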

Related issue number

How to verify the PR

Was the PR tested

  • I have added >=1 unit test(s) for every new method I have added.
  • I have ensured all unit tests pass

@github-actions

Thanks for making a pull request! 😃
One of the maintainers will review and advise on the next steps.

@github-actions github-actions bot added the fix label Nov 24, 2025
```diff
 if is_intermediate:
     hf_converted_output_dir = os.path.join(
-        save_dir, "hf_converted_checkpoint"
+        save_dir, f"hf_converted_checkpoint-{state.global_step}"
     )
```
Collaborator

Can you rename this to `safetensors_converted_checkpoint-{state.global_step}`?

If this would require big changes to llmb, we can ignore it.

Collaborator

There would be no change required in llmb at all, but a long name might create issues at times with COS and other file systems. How about `safetensors-{state.global_step}` or something simpler?

Collaborator

`safetensors-{state.global_step}` is good for me too, thanks @ashokponkumar.

```diff
-hf_converted_path = os.path.join(base_path, "hf_converted_checkpoint")
+hf_converted_path = os.path.join(
+    base_path, f"hf_converted_checkpoint-{state.global_step}"
+)
```
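
For readers wondering where `state` comes from: in an HF `TrainerCallback`, every hook receives a `transformers.TrainerState`, whose `global_step` field holds the current optimizer step. A minimal sketch (the callback class itself is illustrative, not from this PR):

```python
from transformers import TrainerCallback

class CheckpointPathLogger(TrainerCallback):
    # Illustrative callback; shows only where state.global_step comes from.
    def on_save(self, args, state, control, **kwargs):
        # state is a transformers.TrainerState; global_step is the number
        # of optimizer steps completed when the save fires.
        print(f"saving at global step {state.global_step}")
```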
Collaborator

Can we check that the path exists and the files exist before printing this message? Or print some error in the other case?

Contributor Author

@YashasviChaurasia commented Nov 24, 2025

We do just that in the code below the `path.join`: if the directory does not exist, that means the accelerate callback didn't create it, so we fall back to the base directory. Also, since the accelerate callback always runs before the trainer controller callback, this check also tells us whether we have an HF checkpoint or not. In what scenario would we need an error, by the way?

```python
if os.path.isdir(hf_converted_path):
    # Accelerate callback already produced an HF-converted checkpoint.
    kwargs["hf_path"] = hf_converted_path
else:
    # Fall back to the base checkpoint directory.
    kwargs["hf_path"] = base_path
```

Signed-off-by: yashasvi <yashasvi@ibm.com>
@ashokponkumar merged commit facbce8 into foundation-model-stack:main Nov 24, 2025
9 checks passed