Description
When the dataloader loads from checkpoint, it expects a path to the checkpoints directory, from which it pulls the most recent checkpoint folder and loads the relevant data.
This is a problem when continuing a completed run, because the final step of a completed run saves a single-file checkpoint to the checkpoints directory. The dataloader then fails to resume, since the most recent item in the checkpoints directory is a file rather than a folder.
Model checkpointing handles this by accepting either the checkpoints path, in which case it pulls the latest item, or a path to a particular checkpoint directory. The dataloader does not currently support the latter. We can either add this capability, or change the single-file save at the end of the run so that it goes outside the checkpoints directory, which should probably contain only checkpoint folders anyhow.
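
As a rough illustration of the first option, here is a minimal sketch of how the dataloader could resolve its load path. The function names `latest_checkpoint_dir` and `resolve_dataloader_checkpoint` are hypothetical and not taken from the codebase. The sketch also assumes that a checkpoint folder contains only files, so a path with no subdirectories is treated as a specific checkpoint; the real layout may need a different test.

```python
import os


def latest_checkpoint_dir(checkpoints_root):
    """Return the most recently modified subdirectory of checkpoints_root.

    Plain files (such as a single-file final checkpoint) are ignored.
    Returns None if there are no subdirectories.
    """
    candidates = [
        entry.path for entry in os.scandir(checkpoints_root) if entry.is_dir()
    ]
    if not candidates:
        return None
    return max(candidates, key=os.path.getmtime)


def resolve_dataloader_checkpoint(path):
    """Accept either the checkpoints root or one specific checkpoint folder.

    Assumption (hypothetical layout): checkpoint folders hold only files,
    so a path without subdirectories is taken to be a checkpoint itself.
    """
    latest = latest_checkpoint_dir(path)
    return latest if latest is not None else path
```

With this, passing the checkpoints directory of a completed run would skip the final single-file checkpoint and return the newest checkpoint folder, while passing a specific checkpoint folder would return that folder unchanged.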