Parallel checkpointing #1810

joshqsumner · 2025-10-01T20:22:09Z

Describe your changes
Implements parallel checkpointing through a new attribute, checkpoint, on WorkflowConfig and jupyterconfig. Most changes follow from converting parallel.workflow_inputs from a function into a class. When the class is initialized with the CLI arguments from a parallel job, it touches a dummy file to record that the workflow was attempted. When workflow_inputs.result is used (the getter is called), the checkpointing file is updated to "complete". These changes are used throughout parallel to support two use cases: (1) checkpointing, where your jobs all get killed by a server error or similar, and (2) continuous/interim analysis, where you add images over time and run the same workflow on them but don't need to reanalyze images that were already analyzed.
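A minimal sketch of the attempted/complete marker idea described above. This is an illustration, not the code in this PR: the class name WorkflowInputsSketch, the constructor arguments, and the "_attempted" file suffix are all assumptions made for the example.

```python
import tempfile
import uuid
from pathlib import Path


class WorkflowInputsSketch:
    """Hypothetical stand-in for parallel.workflow_inputs (not the PlantCV class)."""

    def __init__(self, checkpoint_dir, result_path):
        self.checkpoint_dir = Path(checkpoint_dir)
        self.checkpoint_dir.mkdir(parents=True, exist_ok=True)
        self._result = str(result_path)
        # One marker per job; the UUID-based naming is an assumption of this sketch.
        self.job_id = uuid.uuid4().hex
        self._attempted = self.checkpoint_dir / f"{self.job_id}_attempted"
        self._complete = self.checkpoint_dir / f"{self.job_id}_complete"
        # Touch the dummy file as soon as the job is set up.
        self._attempted.touch()

    @property
    def result(self):
        # Reading the result path is taken to mean the workflow reached its output step,
        # so the marker is flipped from "attempted" to "complete" on first access.
        if self._attempted.exists():
            self._attempted.rename(self._complete)
        return self._result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        inputs = WorkflowInputsSketch(tmp, Path(tmp) / "out.json")
        print(sorted(p.name for p in Path(tmp).iterdir()))  # one "*_attempted" file
        _ = inputs.result
        print(sorted(p.name for p in Path(tmp).iterdir()))  # one "*_complete" file
```

Tying the status change to the getter means a workflow script needs no extra checkpointing call; the trade-off is that a workflow that reads result early and then fails would still be marked complete.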

Maybe the checkpoint directory should be renamed to something less likely to exist, like _PCV_PARALLEL_CHECKPOINT_? I'd hate to accidentally delete someone's folder because they didn't read the docs.
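One way to reduce that risk further, sketched here as an assumption rather than something this PR does, is to refuse to delete any directory that does not carry the reserved name. The helper remove_checkpoint_dir is hypothetical.

```python
import shutil
from pathlib import Path

# Directory name proposed in this PR; the helper below is hypothetical.
CHECKPOINT_DIRNAME = "_PCV_PARALLEL_CHECKPOINT_"


def remove_checkpoint_dir(path):
    """Delete a checkpoint directory only if it has the reserved name."""
    path = Path(path)
    if path.name != CHECKPOINT_DIRNAME:
        raise ValueError(f"Refusing to delete {path}: not named {CHECKPOINT_DIRNAME}")
    if not path.is_dir():
        # Nothing to clean up.
        return False
    shutil.rmtree(path)
    return True
```

With a guard like this, a misconfigured path raises an error instead of removing an unrelated folder.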

Type of update
This is a new feature.

Associated issues
Closes #1807

Additional context
See #1807 for some comments and examples on the implementation. This builds on changes from the jupyter-parallelization branch from PR 1803.

I'm pretty sure the data frame creation on lines 71:84 is less than ideal; someone with more pandas/json.load experience might have a much better way to do that (i.e., reviewer, please help).

For the reviewer
See this page for instructions on how to review the pull request.

PR functionality reviewed in a Jupyter Notebook
All tests pass
Test coverage remains 100%
Documentation tested
New documentation pages added to plantcv/mkdocs.yml
Changes to function input/output signatures added to updating.md
Code reviewed
PR approved

deepsource-io · 2025-10-01T20:24:46Z

Here's the code health analysis summary for commits 33f8429..0a6a438. View details on DeepSource ↗.

Analysis Summary

Analyzer	Status	Summary	Link
Python	✅ Success		View Check ↗
Test coverage	✅ Success		View Check ↗

Code Coverage Report

Metric	Aggregate	Python
Branch Coverage	100%	100%
Composite Coverage	100%	100%
Line Coverage	100%	100%
New Branch Coverage	100%	100%
New Composite Coverage	100%	100%
New Line Coverage	100%, ✅ Above Threshold	100%, ✅ Above Threshold



nfahlgren · 2025-12-23T17:08:07Z

Checkpointing is not working for me (tested on macOS). To test it I created a new configuration file using plantcv-run-workflow --template v5-config.json:

{
    "input_dir": "./images",
    "json": "pcv5.output.json",
    "filename_metadata": ["imgtype", "camera", "rotation", "zoom", "lifter", "gain", "exposure", "id"],
    "workflow": "./vis_nir_sv_z1_L1_e82.py",
    "img_outdir": "./output_images",
    "include_all_subdirs": true,
    "tmp_dir": ".",
    "start_date": null,
    "end_date": null,
    "imgformat": "all",
    "delimiter": "_",
    "metadata_filters": {"camera": "SV", "zoom": "z1", "rotation": "0"},
    "metadata_regex": {},
    "timestampformat": "%Y-%m-%d %H:%M:%S.%f",
    "writeimg": false,
    "other_args": {},
    "groupby": [
        "timestamp"
    ],
    "group_name": "imgtype",
    "checkpoint": true,
    "cleanup": false,
    "append": false,
    "cluster": "LocalCluster",
    "cluster_config": {
        "n_workers": 5,
        "cores": 1,
        "memory": "1GB",
        "disk": "1GB",
        "log_directory": null,
        "local_directory": null,
        "job_extra_directives": null
    },
    "metadata_terms": {
        "timestamp": {
            "label": "datetime of image",
            "datatype": "<class 'datetime.datetime'>",
            "value": null
        }
    }
}

Then I ran the workflow on a dataset with 19 images: plantcv-run-workflow --config v5-config.json. Halfway through I killed the run, which produced the directory _PCV_PARALLEL_CHECKPOINT_ with a subdirectory 2025-12-23_11-00-36_2u14sqd3. I started plantcv-run-workflow again and it said it found 19 images to analyze again. I killed the run halfway through again and it added a second temporary directory inside _PCV_PARALLEL_CHECKPOINT_ rather than seeing the existing directory.

This is what I see inside the temp directories:

-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 009fbc6b-dc79-4002-b63e-3019a1d4a41a_complete
-rw-r--r--@ 1 nfahlgren  701685721    22K Dec 23 11:01 009fbc6b-dc79-4002-b63e-3019a1d4a41a.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 020737c4-8b8e-48bb-953f-7395d6d41d2a.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 2439c6e5-d49d-4f19-8bb5-69409836dbf8.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 2beadaf2-48de-4e26-9220-06118cda57aa_complete
-rw-r--r--@ 1 nfahlgren  701685721    22K Dec 23 11:01 2beadaf2-48de-4e26-9220-06118cda57aa.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 2e066c89-ef24-4f2a-a052-03072aaaab93_complete
-rw-r--r--@ 1 nfahlgren  701685721    23K Dec 23 11:01 2e066c89-ef24-4f2a-a052-03072aaaab93.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 32a961ed-499a-4fa6-b1ad-06d84e159148_complete
-rw-r--r--@ 1 nfahlgren  701685721    23K Dec 23 11:01 32a961ed-499a-4fa6-b1ad-06d84e159148.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 340a86eb-a14e-454f-8bd3-74974db21c1b_complete
-rw-r--r--@ 1 nfahlgren  701685721    23K Dec 23 11:01 340a86eb-a14e-454f-8bd3-74974db21c1b.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 3a929d41-e096-4bcb-8efd-beda043dd448.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 45e42853-df23-4e44-9a9f-ee5b20f3ac51.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 498859bd-d72c-45ba-9d8d-4caf314fbc15.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 5770b820-1b0c-4180-8865-e850c07dcac9.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 81f819ae-0521-4a5f-b12b-9f83f062ffd5_complete
-rw-r--r--@ 1 nfahlgren  701685721    23K Dec 23 11:01 81f819ae-0521-4a5f-b12b-9f83f062ffd5.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 a12b0191-a2c7-4c58-83d4-e8755a8c0249_complete
-rw-r--r--@ 1 nfahlgren  701685721    23K Dec 23 11:01 a12b0191-a2c7-4c58-83d4-e8755a8c0249.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 aac1ffd9-2010-418b-a53a-0dcdec15bcd1_complete
-rw-r--r--@ 1 nfahlgren  701685721    23K Dec 23 11:01 aac1ffd9-2010-418b-a53a-0dcdec15bcd1.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 c9f5ac38-dab2-46c8-a237-19af89e50f2f.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 da5c6ab5-ef28-4760-b953-c89ad97c3815_complete
-rw-r--r--@ 1 nfahlgren  701685721    23K Dec 23 11:01 da5c6ab5-ef28-4760-b953-c89ad97c3815.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 dd85539e-e6ee-4fd6-a953-fffeafa29fa8.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 e1ff0886-f78b-4c60-98ce-9d37500bed80.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 f6ac7f15-6153-4280-a4ef-a3f05a5c8a50_complete
-rw-r--r--@ 1 nfahlgren  701685721    22K Dec 23 11:01 f6ac7f15-6153-4280-a4ef-a3f05a5c8a50.json
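For reference, the listing above contains 19 .json files, 10 of which have a matching zero-byte _complete marker. A small sketch for tallying this across all run subdirectories follows; the function name summarize_checkpoints is hypothetical, and it assumes every started job writes an "<id>.json" file and only finished jobs get an "<id>_complete" marker.

```python
from pathlib import Path


def summarize_checkpoints(checkpoint_root):
    """Count finished and unfinished jobs under a checkpoint directory."""
    root = Path(checkpoint_root)
    # Job IDs that have a "<id>_complete" marker anywhere below the root.
    complete = {p.name[: -len("_complete")] for p in root.rglob("*_complete")}
    # Job IDs that have a "<id>.json" file; assumed to be every started job.
    started = {p.stem for p in root.rglob("*.json")}
    return {"complete": len(complete), "incomplete": len(started - complete)}
```

Calling summarize_checkpoints("_PCV_PARALLEL_CHECKPOINT_") from the working directory would report both counts for every run folder at once.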

joshqsumner · 2026-01-05T22:05:47Z

@nfahlgren Thank you for catching that. I think I broke it at some point around when I switched the order of config.tmp_dir and the checkpoint folder.
I have it working locally as I would expect and will add it to the dev agenda for this week.

The fix I went with is to define a new attribute, chkpt_start_dir, on the fly in run_parallel, and in parsers._read_checkpoint_data if it is missing. That keeps the checkpoint reader from failing to find completed jobs when config.tmp_dir is overwritten in run_parallel with the directory flagged with the start time.
This does mean that {tmp_dir}/_PCV_PARALLEL_CHECKPOINT_/ can contain many folders, one per attempted run. I think that should be fine, since process_results grabs the same chkpt_start_dir attribute (defaulting to tmp_dir if called directly) and looks from there at _PCV_PARALLEL_CHECKPOINT_.
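A rough sketch of the lookup described above, under stated assumptions: find_completed_jobs is a hypothetical helper (not the actual parsers._read_checkpoint_data), the config object is a plain namespace rather than WorkflowConfig, and completed jobs are identified by files ending in _complete.

```python
import os
from types import SimpleNamespace

CHECKPOINT_DIRNAME = "_PCV_PARALLEL_CHECKPOINT_"


def find_completed_jobs(config):
    """Return the IDs of jobs marked complete across all previous runs."""
    # Fall back to tmp_dir when chkpt_start_dir was never set (e.g. direct calls).
    start_dir = getattr(config, "chkpt_start_dir", config.tmp_dir)
    checkpoint_root = os.path.join(start_dir, CHECKPOINT_DIRNAME)
    completed = set()
    # Walk every per-run subdirectory, not only the current run's folder.
    for _dirpath, _dirnames, filenames in os.walk(checkpoint_root):
        for name in filenames:
            if name.endswith("_complete"):
                completed.add(name[: -len("_complete")])
    return completed


if __name__ == "__main__":
    # Stand-in config; the real object would be a WorkflowConfig.
    config = SimpleNamespace(tmp_dir=".")
    print(len(find_completed_jobs(config)))
```

Because the walk starts at the original directory rather than the per-run folder, markers from every earlier attempt are visible no matter how many run subdirectories accumulate.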

Testing with a workflow that should find 38 images:

(plantcv) josh@Precision-7550:~/scripts/fahlgren_lab/pcv5/checkpointing_tests$ plantcv-run-workflow --config parallel_local_no_regex.json 
  Starting run 2026-01-06_07-53-44

  Reading image metadata...
  Reading image metadata took 0.030014514923095703 seconds.
  Building job list... 
  Task list includes 38 workflows
  Building job list took 0.014407634735107422 seconds.
  Processing images... 
  [#############                           ] | 34% Completed | 21.1s^CTraceback (most recent call last):

  KeyboardInterrupt

  (plantcv) josh@Precision-7550:~/scripts/fahlgren_lab/pcv5/checkpointing_tests$ find _PCV_PARALLEL_CHECKPOINT_/ -iname "*_complete" | wc -l
  15
  (plantcv) josh@Precision-7550:~/scripts/fahlgren_lab/pcv5/checkpointing_tests$ plantcv-run-workflow --config parallel_local_no_regex.json 
  Starting run 2026-01-06_07-54-51

  Reading image metadata...
  Found 15 existing results in checkpoint directory, excluding those jobs.
  Reading image metadata took 0.040940046310424805 seconds.
  Building job list... 
  Task list includes 23 workflows
  Building job list took 0.008803367614746094 seconds.
  Processing images... 
  [##########################              ] | 65% Completed | 21.7s^CTraceback (most recent call last):

  KeyboardInterrupt

  (plantcv) josh@Precision-7550:~/scripts/fahlgren_lab/pcv5/checkpointing_tests$ find _PCV_PARALLEL_CHECKPOINT_/ -iname "*_complete" | wc -l
  30
  (plantcv) josh@Precision-7550:~/scripts/fahlgren_lab/pcv5/checkpointing_tests$ plantcv-run-workflow --config parallel_local_no_regex.json 
  Starting run 2026-01-06_07-55-24

  Reading image metadata...
  Found 30 existing results in checkpoint directory, excluding those jobs.
  Reading image metadata took 0.05278158187866211 seconds.
  Building job list... 
  Task list includes 8 workflows
  Building job list took 0.003099679946899414 seconds.
  Processing images... 
  Processing images took 14.32779598236084 seconds.
  Processing results... 
  Processing results took 0.06502532958984375 seconds.
  Converting json to csv... 
  Processing results took 0.0503840446472168 seconds.
  (plantcv) josh@Precision-7550:~/scripts/fahlgren_lab/pcv5/checkpointing_tests$ cat results-single-value-traits.csv | wc -l
  39
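As a quick consistency check on the numbers in that log, each resumed run should schedule the total minus the jobs already marked complete. The CSV check assumes one header line plus one row per workflow, which the log itself does not confirm.

```python
total_workflows = 38
# "_complete" counts found before each of the three runs (0 before the first).
completed_before_run = [0, 15, 30]

# Expected size of each task list: everything not yet marked complete.
expected_task_lists = [total_workflows - done for done in completed_before_run]
print(expected_task_lists)  # [38, 23, 8]

# wc -l reported 39 lines; assuming one header line, that leaves 38 data rows.
csv_lines = 39
print(csv_lines - 1 == total_workflows)  # True
```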

joshqsumner added 23 commits September 29, 2025 13:47

don't group in parser

dac52fe

group in job_builder

76f7977

accounting for ungrouped data

952ebf8

importing pd for methods

f40fd32

save metadata csv

a4e3a02

checkpointing column to metadata

7d52f8f

adding option to resume using csv metadata

caa5020

checkpoint file name

6435363

workflow_inputs as a class

349a208

using dummy files in temp directory to monitor checkpointing

58154c9

minimal working example

16a1ee5

fixed mvp

4a198a9

flexibility for processing results

648b13c

removing files after checkpointing metadata is made

95597cc

finishing docstring

cf7c091

typo

11e0056

informative error if checkpointing files are all completed

3286488

cleaning up from "checkpoint" dir instead of checkpoint/tmp

f9655dc

refactoring to allow for interim analyses use case

3227966

docs and cleaning

e905809

flipping logic for easier testing

fd4b73c

making name images a function available to WorkflowInputs again

f6d32a1

updating tests

773ced7

joshqsumner added the new feature, work in progress, and merge in order labels Oct 1, 2025

joshqsumner added 3 commits October 1, 2025 15:28

deepsource linting

162d3be

covering workflow inputs

865e55a

unused argument removal

7d3260d

joshqsumner added 3 commits October 2, 2025 08:56

trying to test remaining lines

c9dc571

adding a message about excluding files if they are found

a056835

adding checkpointed results to removed_df for inspection.

b478a88

joshqsumner added the ready to review label and removed the work in progress label Oct 2, 2025

joshqsumner added 6 commits October 2, 2025 13:28

adding to docs to lessen the chance someone has an important checkpoint folder deleted

21b707f

using more obscure directory name for safer deletion

853d703

removing cross posting for pipeline_parallel.md

762c4e5

changing ordering of checkpoint and tmp_dir to be configurable

3bdb234

Merge branch 'jupyter-parallelization' into checkpointing

19a08d1

Merge branch 'v5.0' into checkpointing

15610be

joshqsumner added this to the PlantCV v5.0 milestone Oct 24, 2025

joshqsumner added 5 commits October 30, 2025 09:12

Merge branch 'jupyter-parallelization' into checkpointing

a381dda

Merge branch 'v5.0' into checkpointing

1989707

Merge branch 'jupyter-parallelization' into checkpointing

5f78904

Merge branch 'jupyter-parallelization' into checkpointing

6f2277d

Merge branch 'v5.0' into checkpointing

1b52e74

joshqsumner added 7 commits January 5, 2026 14:14

keep track of original tmp_dir for checkpointing

c934548

walk from original tmp_dir to find checkpointing status

dbc9cf8

forcing attribute to exist in helper function

68368ae

deepsource

559a44a

use chkpt_start_dir in processing results

9515830

making expected results directory structure

0051c43

pointing into the checkpoint folder

a46c7a9

stopping timestamp warnings

0a6a438
