@joshqsumner
Contributor

joshqsumner commented Oct 1, 2025

Describe your changes
Implements checkpointing for parallel runs through a new attribute, checkpoint, on WorkflowConfig and jupyterconfig. Most of the changes follow from converting parallel.workflow_inputs from a function into a class. When the class is initialized with the CLI arguments from a parallel job, it touches a dummy file to record that the workflow was attempted. When workflow_inputs.result is used (that is, when the getter is called), the checkpoint file is updated to "complete". These changes are used throughout parallel to support two use cases: (1) checkpointing, for when all of your jobs are killed by a server error or similar, and (2) continuous/interim analysis, where images are added over time and the same workflow is run on them without reanalyzing images that were already analyzed.
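
For readers skimming the thread, the attempted/complete mechanism described above can be illustrated with a minimal sketch. This is not the code in this PR: the class name `CheckpointedInputs`, the `job_id` argument, and the `_attempted` marker name are assumptions made for illustration. Only the idea of touching a marker on initialization and updating it when the `result` getter is called comes from the description.

```python
from pathlib import Path
import tempfile


class CheckpointedInputs:
    """Hypothetical stand-in for the class described above (not PlantCV's code)."""

    def __init__(self, result, checkpoint_dir, job_id):
        self._result = Path(result)
        self._dir = Path(checkpoint_dir)
        self._job_id = job_id  # hypothetical identifier for one workflow job
        self._dir.mkdir(parents=True, exist_ok=True)
        # Touch an empty marker so an interrupted job still leaves a trace.
        (self._dir / f"{job_id}_attempted").touch()

    @property
    def result(self):
        # Accessing the result path is treated as the signal that the
        # workflow reached its output step, so flip the marker to "complete".
        (self._dir / f"{self._job_id}_attempted").unlink(missing_ok=True)
        (self._dir / f"{self._job_id}_complete").touch()
        return self._result


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    inputs = CheckpointedInputs("out.json", tmp, "job-1")
    attempted_first = sorted(p.name for p in Path(tmp).iterdir())
    _ = inputs.result  # calling the getter marks the job complete
    after_result = sorted(p.name for p in Path(tmp).iterdir())
```

Tying the "complete" marker to the getter means a workflow that never asks for its result path is never marked complete, which is the behavior the description relies on.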

Maybe the checkpoint directory should be renamed to something less likely to exist, like _PCV_PARALLEL_CHECKPOINT_? I'd hate to accidentally delete someone's folder because they didn't read the docs.

Type of update
This is a new feature.

Associated issues
Closes #1807

Additional context
See #1807 for some comments and examples on the implementation. This builds on changes from the jupyter-parallelization branch from PR 1803.

Pretty sure the data frame creation on lines 71:84 is less than ideal; someone with more pandas/json.load experience might have a much better way to do that (i.e., reviewer, please help).

For the reviewer
See this page for instructions on how to review the pull request.

  • PR functionality reviewed in a Jupyter Notebook
  • All tests pass
  • Test coverage remains 100%
  • Documentation tested
  • New documentation pages added to plantcv/mkdocs.yml
  • Changes to function input/output signatures added to updating.md
  • Code reviewed
  • PR approved

joshqsumner added the new feature, work in progress, and merge in order labels Oct 1, 2025
@deepsource-io

deepsource-io bot commented Oct 1, 2025

Here's the code health analysis summary for commits 33f8429..0a6a438. View details on DeepSource ↗.

Analysis Summary

| Analyzer | Status | Summary | Link |
| --- | --- | --- | --- |
| Python | ✅ Success | | View Check ↗ |
| Test coverage | ✅ Success | | View Check ↗ |

Code Coverage Report

| Metric | Aggregate | Python |
| --- | --- | --- |
| Branch Coverage | 100% | 100% |
| Composite Coverage | 100% | 100% |
| Line Coverage | 100% | 100% |
| New Branch Coverage | 100% | 100% |
| New Composite Coverage | 100% | 100% |
| New Line Coverage | 100%, ✅ Above Threshold | 100%, ✅ Above Threshold |

💡 If you’re a repository administrator, you can configure the quality gates from the settings.

joshqsumner added the ready to review label and removed the work in progress label Oct 2, 2025
joshqsumner added this to the PlantCV v5.0 milestone Oct 24, 2025
@nfahlgren
Member

Checkpointing is not working for me (tested on macOS). To test it I created a new configuration file using plantcv-run-workflow --template v5-config.json:

{
    "input_dir": "./images",
    "json": "pcv5.output.json",
    "filename_metadata": ["imgtype", "camera", "rotation", "zoom", "lifter", "gain", "exposure", "id"],
    "workflow": "./vis_nir_sv_z1_L1_e82.py",
    "img_outdir": "./output_images",
    "include_all_subdirs": true,
    "tmp_dir": ".",
    "start_date": null,
    "end_date": null,
    "imgformat": "all",
    "delimiter": "_",
    "metadata_filters": {"camera": "SV", "zoom": "z1", "rotation": "0"},
    "metadata_regex": {},
    "timestampformat": "%Y-%m-%d %H:%M:%S.%f",
    "writeimg": false,
    "other_args": {},
    "groupby": [
        "timestamp"
    ],
    "group_name": "imgtype",
    "checkpoint": true,
    "cleanup": false,
    "append": false,
    "cluster": "LocalCluster",
    "cluster_config": {
        "n_workers": 5,
        "cores": 1,
        "memory": "1GB",
        "disk": "1GB",
        "log_directory": null,
        "local_directory": null,
        "job_extra_directives": null
    },
    "metadata_terms": {
        "timestamp": {
            "label": "datetime of image",
            "datatype": "<class 'datetime.datetime'>",
            "value": null
        }
    }
}

Then I ran the workflow on a dataset with 19 images: plantcv-run-workflow --config v5-config.json. Halfway through I killed the run, which produced the directory _PCV_PARALLEL_CHECKPOINT_ with a subdirectory 2025-12-23_11-00-36_2u14sqd3. I started plantcv-run-workflow again and it said it found 19 images to analyze again. I killed the run halfway through again and it added a second temporary directory inside _PCV_PARALLEL_CHECKPOINT_ rather than seeing the existing directory.

This is what I see inside the temp directories:

-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 009fbc6b-dc79-4002-b63e-3019a1d4a41a_complete
-rw-r--r--@ 1 nfahlgren  701685721    22K Dec 23 11:01 009fbc6b-dc79-4002-b63e-3019a1d4a41a.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 020737c4-8b8e-48bb-953f-7395d6d41d2a.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 2439c6e5-d49d-4f19-8bb5-69409836dbf8.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 2beadaf2-48de-4e26-9220-06118cda57aa_complete
-rw-r--r--@ 1 nfahlgren  701685721    22K Dec 23 11:01 2beadaf2-48de-4e26-9220-06118cda57aa.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 2e066c89-ef24-4f2a-a052-03072aaaab93_complete
-rw-r--r--@ 1 nfahlgren  701685721    23K Dec 23 11:01 2e066c89-ef24-4f2a-a052-03072aaaab93.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 32a961ed-499a-4fa6-b1ad-06d84e159148_complete
-rw-r--r--@ 1 nfahlgren  701685721    23K Dec 23 11:01 32a961ed-499a-4fa6-b1ad-06d84e159148.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 340a86eb-a14e-454f-8bd3-74974db21c1b_complete
-rw-r--r--@ 1 nfahlgren  701685721    23K Dec 23 11:01 340a86eb-a14e-454f-8bd3-74974db21c1b.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 3a929d41-e096-4bcb-8efd-beda043dd448.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 45e42853-df23-4e44-9a9f-ee5b20f3ac51.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 498859bd-d72c-45ba-9d8d-4caf314fbc15.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 5770b820-1b0c-4180-8865-e850c07dcac9.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 81f819ae-0521-4a5f-b12b-9f83f062ffd5_complete
-rw-r--r--@ 1 nfahlgren  701685721    23K Dec 23 11:01 81f819ae-0521-4a5f-b12b-9f83f062ffd5.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 a12b0191-a2c7-4c58-83d4-e8755a8c0249_complete
-rw-r--r--@ 1 nfahlgren  701685721    23K Dec 23 11:01 a12b0191-a2c7-4c58-83d4-e8755a8c0249.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 aac1ffd9-2010-418b-a53a-0dcdec15bcd1_complete
-rw-r--r--@ 1 nfahlgren  701685721    23K Dec 23 11:01 aac1ffd9-2010-418b-a53a-0dcdec15bcd1.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 c9f5ac38-dab2-46c8-a237-19af89e50f2f.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 da5c6ab5-ef28-4760-b953-c89ad97c3815_complete
-rw-r--r--@ 1 nfahlgren  701685721    23K Dec 23 11:01 da5c6ab5-ef28-4760-b953-c89ad97c3815.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 dd85539e-e6ee-4fd6-a953-fffeafa29fa8.json
-rw-r--r--@ 1 nfahlgren  701685721   2.1K Dec 23 11:01 e1ff0886-f78b-4c60-98ce-9d37500bed80.json
-rw-r--r--@ 1 nfahlgren  701685721     0B Dec 23 11:01 f6ac7f15-6153-4280-a4ef-a3f05a5c8a50_complete
-rw-r--r--@ 1 nfahlgren  701685721    22K Dec 23 11:01 f6ac7f15-6153-4280-a4ef-a3f05a5c8a50.json
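
The listing above contains 19 .json files and 10 zero-byte `_complete` markers, and the .json files with a marker are the larger ones (22K to 23K versus 2.1K). Assuming a .json file without a matching `_complete` marker represents a job that did not finish (an interpretation, not something confirmed in this thread), a rough sketch for splitting such a listing looks like this. The helper name `split_jobs` is hypothetical.

```python
def split_jobs(filenames):
    """Return (complete, incomplete) job IDs inferred from checkpoint file names."""
    complete_suffix, json_suffix = "_complete", ".json"
    # IDs that have a zero-byte "<id>_complete" marker.
    complete = {n[: -len(complete_suffix)] for n in filenames if n.endswith(complete_suffix)}
    # Every job writes "<id>.json", finished or not.
    all_jobs = {n[: -len(json_suffix)] for n in filenames if n.endswith(json_suffix)}
    return complete, all_jobs - complete


# Three names taken from the listing above.
listing = [
    "009fbc6b-dc79-4002-b63e-3019a1d4a41a_complete",
    "009fbc6b-dc79-4002-b63e-3019a1d4a41a.json",
    "020737c4-8b8e-48bb-953f-7395d6d41d2a.json",
]
done, pending = split_jobs(listing)
```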

@joshqsumner
Contributor Author

joshqsumner commented Jan 5, 2026

@nfahlgren Thank you for catching that. I think I broke it at some point around when I switched the order of config.tmp_dir and the checkpoint folder.
I have it working locally the way I would expect and will add it to the dev agenda for this week.

The fix that I went with is to define a new attribute, chkpt_start_dir, on the fly in run_parallel and in parsers._read_checkpoint_data if it is missing. That keeps the checkpoint reader from failing to find completed jobs when config.tmp_dir is overwritten with the start-time flagged directory in run_parallel.
This does mean that {tmp_dir}/_PCV_PARALLEL_CHECKPOINT_/ can contain many folders, one per attempted run. I think that should be fine, since process_results reads the same chkpt_start_dir attribute (defaulting to tmp_dir if called directly) and looks from there at _PCV_PARALLEL_CHECKPOINT_.
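
To make the fallback concrete, here is a minimal sketch of the lookup described above, not the actual change in this PR. The function `completed_jobs`, the use of `SimpleNamespace` as a stand-in config, and the assumption that markers sit exactly one folder below the checkpoint directory are all illustrative. Only the attribute names `chkpt_start_dir` and `tmp_dir` and the directory name come from the comment.

```python
from pathlib import Path
from types import SimpleNamespace
import tempfile

CHECKPOINT_DIRNAME = "_PCV_PARALLEL_CHECKPOINT_"


def completed_jobs(config):
    """List job IDs with a *_complete marker in any run folder (hypothetical helper)."""
    # Fall back to tmp_dir when chkpt_start_dir was never set on the config.
    start_dir = getattr(config, "chkpt_start_dir", config.tmp_dir)
    root = Path(start_dir) / CHECKPOINT_DIRNAME
    # One subfolder per attempted run, so search one level down.
    return sorted(p.name[: -len("_complete")] for p in root.glob("*/*_complete"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / CHECKPOINT_DIRNAME
    for run, job in [("run-a", "job-1"), ("run-b", "job-2")]:
        (root / run).mkdir(parents=True)
        (root / run / f"{job}_complete").touch()
    # tmp_dir overwritten with a run-specific folder, start dir remembered separately.
    overwritten = SimpleNamespace(tmp_dir=str(root / "run-b"), chkpt_start_dir=tmp)
    # Config used directly, without chkpt_start_dir.
    direct = SimpleNamespace(tmp_dir=tmp)
    found_overwritten = completed_jobs(overwritten)
    found_direct = completed_jobs(direct)
```

Because the search starts from the remembered start directory and not from the per-run folder, markers from earlier attempts remain visible to later runs.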

Testing with a workflow that should find 38 images:

(plantcv) josh@Precision-7550:~/scripts/fahlgren_lab/pcv5/checkpointing_tests$ plantcv-run-workflow --config parallel_local_no_regex.json 
  Starting run 2026-01-06_07-53-44

  Reading image metadata...
  Reading image metadata took 0.030014514923095703 seconds.
  Building job list... 
  Task list includes 38 workflows
  Building job list took 0.014407634735107422 seconds.
  Processing images... 
  [#############                           ] | 34% Completed | 21.1s^CTraceback (most recent call last):

  KeyboardInterrupt

  (plantcv) josh@Precision-7550:~/scripts/fahlgren_lab/pcv5/checkpointing_tests$ find _PCV_PARALLEL_CHECKPOINT_/ -iname "*_complete" | wc -l
  15
  (plantcv) josh@Precision-7550:~/scripts/fahlgren_lab/pcv5/checkpointing_tests$ plantcv-run-workflow --config parallel_local_no_regex.json 
  Starting run 2026-01-06_07-54-51

  Reading image metadata...
  Found 15 existing results in checkpoint directory, excluding those jobs.
  Reading image metadata took 0.040940046310424805 seconds.
  Building job list... 
  Task list includes 23 workflows
  Building job list took 0.008803367614746094 seconds.
  Processing images... 
  [##########################              ] | 65% Completed | 21.7s^CTraceback (most recent call last):

  KeyboardInterrupt

  (plantcv) josh@Precision-7550:~/scripts/fahlgren_lab/pcv5/checkpointing_tests$ find _PCV_PARALLEL_CHECKPOINT_/ -iname "*_complete" | wc -l
  30
  (plantcv) josh@Precision-7550:~/scripts/fahlgren_lab/pcv5/checkpointing_tests$ plantcv-run-workflow --config parallel_local_no_regex.json 
  Starting run 2026-01-06_07-55-24

  Reading image metadata...
  Found 30 existing results in checkpoint directory, excluding those jobs.
  Reading image metadata took 0.05278158187866211 seconds.
  Building job list... 
  Task list includes 8 workflows
  Building job list took 0.003099679946899414 seconds.
  Processing images... 
  Processing images took 14.32779598236084 seconds.
  Processing results... 
  Processing results took 0.06502532958984375 seconds.
  Converting json to csv... 
  Processing results took 0.0503840446472168 seconds.
  (plantcv) josh@Precision-7550:~/scripts/fahlgren_lab/pcv5/checkpointing_tests$ cat results-single-value-traits.csv | wc -l
  39
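
The task counts in the log above go from 38 to 23 to 8, which matches excluding the 15 and then 30 completed jobs. The final count of 39 lines would be consistent with 38 result rows plus one header line, although the CSV layout is not shown here. As a rough sketch of the exclusion step (the helper `remaining_jobs` and the job names are hypothetical):

```python
def remaining_jobs(all_jobs, completed):
    """Drop jobs that already have a completed checkpoint, preserving order."""
    done = set(completed)
    return [job for job in all_jobs if job not in done]


jobs = [f"job-{i}" for i in range(38)]          # 38 workflows in total
second_run = remaining_jobs(jobs, jobs[:15])    # 15 finished before the first interrupt
third_run = remaining_jobs(jobs, jobs[:30])     # 30 finished before the second interrupt
```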

Labels

merge in order, new feature, ready to review

3 participants