Skip to content

Conversation

@weiji14
Copy link

@weiji14 weiji14 commented Aug 15, 2024

Edit: Bug fix was applied in #119. This PR thus only adds two unit tests to prevent bug regression.


Code was calling os.path.join() too many times and causing the path to be repeated unnecessarily, resulting in errors like:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/user/projects/FM-model/finetune_script.py", line 288, in <module>
[rank0]:     train(cfg=cfg, rank=local_rank)
[rank0]:   File "/home/user/projects/FM-model/finetune_script.py", line 188, in train
[rank0]:     model, optimizer, _, start_step, tokens_seen, is_resuming = checkpointer.load(
[rank0]:                                                                 ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/mambaforge/envs/fm/lib/python3.11/site-packages/fms_fsdp/utils/checkpointing_utils.py", line 185, in load
[rank0]:     if self._validate_ckp_path(self.ckp_path) is not None:
[rank0]:        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/user/mambaforge/envs/fm/lib/python3.11/site-packages/fms_fsdp/utils/checkpointing_utils.py", line 169, in _validate_ckp_path
[rank0]:     elif "metadata.pth" in os.listdir(latest):
[rank0]:                            ^^^^^^^^^^^^^^^^^^
[rank0]: FileNotFoundError: [Errno 2] No such file or directory: 'checkpoints/downstream_task/v1.0.0/checkpoints/checkpoints/downstream_task/v1.0.0/checkpoints/step_000020_ckp'

This patch removes one of the os.path.join() calls, and also the path.split("/")[-1] part which was needed because a path with "/" dividers was used in the list comprehension.

Code was calling `os.path.join()` too many times and causing the path to be repeated unnecessarily.

Signed-off-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
@weiji14 weiji14 force-pushed the repeated-ckpt-path branch from 3d2364c to d0919fa Compare August 15, 2024 02:58
@weiji14 weiji14 marked this pull request as ready for review August 15, 2024 03:05
],
key=lambda path: int(path.split("/")[-1].split("_")[1]),
[x for x in os.listdir(targdir) if qualifier(os.path.join(targdir, x))],
key=lambda path: int(path.split("_")[1]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably the same issue below with get_oldest ?

maybe we could add a unit test?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, removed os.path.join in the get_oldest function in commit 4a2b4b8, and added unit tests in commit f082164.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks better, though now the ctime key fails in get_oldest

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, have pushed a fix, can you run the tests again to see if it works?

Signed-off-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
When a list of 3 files are passed into the get_oldest and get_latest functions, ensure that the files that were created first and last are returned respectively.

Signed-off-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
@weiji14 weiji14 force-pushed the repeated-ckpt-path branch from 20c95a4 to f082164 Compare August 28, 2024 00:00
Signed-off-by: Wei Ji <23487320+weiji14@users.noreply.github.com>
@daviswer daviswer mentioned this pull request Oct 10, 2024
daviswer added a commit that referenced this pull request Oct 11, 2024
Updates `get_latest` and `get_oldest` to use the same sorting function, and allows the dataloader ckp handler to pass in its custom sort manually. Removes the bug where excessive path joins lead to repeated path prefixes in dataloader ckp loading. Fixes GPTBigCode signatures used for speculator training, to match superclass signatures (currently preventing other PRs from landing).

Includes and subsumes #110 and #96. Full credit to @weiji14 and @Akash-Nayak respectively
@weiji14 weiji14 changed the title Fix bug with repeated checkpoint path Add regression tests for get_latest and get_oldest functions Oct 14, 2024
@weiji14
Copy link
Author

weiji14 commented Oct 14, 2024

Bug fix has been applied in #119, so this PR only add unit tests now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants