[FLINK-38990][Runtime/Checkpointing] Support configurable initial delay for first checkpoint trigger#27484

Open
Myracle wants to merge 1 commit into apache:master from Myracle:FLINK-38990-config-checkpoint-initial-delay

Myracle commented Jan 28, 2026

What is the purpose of the change

This pull request adds a new configuration option execution.checkpointing.initial-delay that allows users to configure the delay before the first checkpoint is triggered after job startup. This is particularly useful for jobs that need time to warm up or catch up with backlogs (e.g., consuming from Kafka with large lag) before performing the first checkpoint.
Currently, the initial delay before the first checkpoint is randomly chosen between minPauseBetweenCheckpoints and baseInterval. This behavior is not configurable and may not be suitable for scenarios where jobs need a longer warm-up period. With this change, users can explicitly configure the initial delay to avoid checkpoint overhead during the critical catch-up phase.

Brief change log

  • Added new configuration option execution.checkpointing.initial-delay in CheckpointingOptions
  • Extended CheckpointCoordinatorConfiguration to include initialCheckpointDelay field with builder support
  • Added getInitialCheckpointDelay() and setInitialCheckpointDelay() methods to CheckpointConfig
  • Modified getRandomInitDelay() method in CheckpointCoordinator to use configured initial delay with small random jitter
  • Updated StreamGraph to pass the new configuration when building CheckpointCoordinatorConfiguration
  • Added documentation for the new configuration option in both English and Chinese docs
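The delay selection described in this change log can be sketched roughly as follows. This is a standalone illustration under stated assumptions, not the code in this PR: the class `InitialDelaySketch`, the methods `chooseInitialDelay` and `uniform`, and the `maxJitter` parameter are hypothetical names. It assumes that a configured value of 0 means "not configured" and falls back to the existing random choice between minPauseBetweenCheckpoints and baseInterval, and that the jitter is a small non-negative amount added on top of the configured delay.

```java
import java.util.Random;

// Hypothetical sketch; names and the jitter bound are assumptions, not Flink APIs.
class InitialDelaySketch {

    /** Returns a value drawn uniformly from [lo, hi] (both inclusive), or lo if hi <= lo. */
    static long uniform(long lo, long hi, Random rnd) {
        if (hi <= lo) {
            return lo;
        }
        long span = hi - lo + 1;
        // Clamp to hi to guard against floating-point rounding at the upper edge.
        return Math.min(hi, lo + (long) (rnd.nextDouble() * span));
    }

    /**
     * Picks the delay in milliseconds before the first checkpoint.
     * configuredDelay == 0 is treated as "not configured" (assumed default).
     */
    static long chooseInitialDelay(
            long configuredDelay, long minPause, long baseInterval, long maxJitter, Random rnd) {
        if (configuredDelay > 0) {
            // Configured case: fixed delay plus a small jitter in [0, maxJitter].
            return configuredDelay + uniform(0L, maxJitter, rnd);
        }
        // Fallback: random delay between the minimum pause and the base interval.
        return uniform(minPause, baseInterval, rnd);
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        // Not configured: somewhere between 1 s and 10 s.
        System.out.println(chooseInitialDelay(0L, 1_000L, 10_000L, 500L, rnd));
        // Configured to 60 s: between 60 s and 60.5 s.
        System.out.println(chooseInitialDelay(60_000L, 1_000L, 10_000L, 500L, rnd));
    }
}
```

Passing the `Random` in as a parameter keeps the sketch deterministic under a fixed seed, which is convenient for unit tests; production code would more likely use a thread-local generator.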

Verifying this change

This change added tests and can be verified as follows:

  • Added CheckpointCoordinatorInitialDelayTest with comprehensive unit tests for the new initial delay feature
  • Extended CheckpointCoordinatorTriggeringTest
  • Extended CheckpointConfigFromConfigurationTest to verify the configuration can be loaded from file and set via API

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (docs / JavaDocs)

flinkbot commented Jan 28, 2026

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

- `execution.checkpointing.dir`: The directory to write checkpoints to. This takes a path URI like *s3://mybucket/flink-app/checkpoints* or *hdfs://namenode:port/flink/checkpoints*.
- `execution.checkpointing.savepoint-dir`: The default directory for savepoints. Takes a path URI, similar to `execution.checkpointing.dir`.
- `execution.checkpointing.interval`: The base interval setting. To enable checkpointing, you need to set this value larger than 0.
- `execution.checkpointing.initial-delay`: The initial delay before the first checkpoint is triggered. This is useful for jobs that need time to warm up or catch up with backlogs (e.g., consuming from Kafka with large lag).

I am curious:

  • Can we detect warm-up or backlog catch-up activity and dynamically wait as long as is appropriate?
  • I suggest documenting the impact of hitting a warm-up or backlog catch-up phase without this delay, along with some discussion of the trade-offs of using this option.

.text(
"The initial delay before the first checkpoint is triggered after the job starts. "
+ "This is useful for jobs that need time to warm up or catch up with backlogs. "
+ "If set to 0 (default), the initial delay will be randomly chosen between "

I wonder if it would be better to have this as a random jitter above the minimum pause. Otherwise we could randomly get a very long delay for the first checkpoint if CHECKPOINTING_INTERVAL is large.
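One possible reading of this suggestion, sketched as standalone Java for comparison. The class `BoundedJitterSketch`, both method names, and the `maxJitter` bound are hypothetical and do not come from the PR or from Flink; the sketch only contrasts a uniform draw over the whole interval with a draw limited to a small jitter above the minimum pause.

```java
import java.util.Random;

// Hypothetical comparison of two ways to pick the first-checkpoint delay (ms).
class BoundedJitterSketch {

    /** Uniform draw from [minPause, baseInterval]; can be long when the interval is large. */
    static long uniformDelay(long minPause, long baseInterval, Random rnd) {
        if (baseInterval <= minPause) {
            return minPause;
        }
        long span = baseInterval - minPause + 1;
        return Math.min(baseInterval, minPause + (long) (rnd.nextDouble() * span));
    }

    /** Minimum pause plus a jitter in [0, maxJitter]; bounded regardless of the interval. */
    static long boundedJitterDelay(long minPause, long maxJitter, Random rnd) {
        if (maxJitter <= 0) {
            return minPause;
        }
        return minPause + Math.min(maxJitter, (long) (rnd.nextDouble() * (maxJitter + 1)));
    }

    public static void main(String[] args) {
        Random rnd = new Random();
        long minPause = 1_000L;          // 1 second
        long baseInterval = 3_600_000L;  // 1 hour
        long maxJitter = 5_000L;         // 5 seconds (assumed bound)
        System.out.println("uniform:        " + uniformDelay(minPause, baseInterval, rnd));
        System.out.println("bounded jitter: " + boundedJitterDelay(minPause, maxJitter, rnd));
    }
}
```

With a one-hour interval, the uniform variant can return anything up to an hour, while the bounded variant never exceeds the minimum pause plus the jitter bound.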

github-actions bot added the community-reviewed label (PR has been reviewed by the community) Jan 28, 2026