From 262ec3396f39d3056a705b36cce9625d732b9c33 Mon Sep 17 00:00:00 2001
From: Aljoscha Krettek
Date: Fri, 19 Dec 2025 11:37:12 +0100
Subject: [PATCH] design: add More Zero-Downtime Upgrades design doc

---
 ...downtime_upgrades_physical_isolation_ha.md | 394 ++++++++++++++++++
 1 file changed, 394 insertions(+)
 create mode 100644 doc/developer/design/20251219_more_zero_downtime_upgrades_physical_isolation_ha.md

diff --git a/doc/developer/design/20251219_more_zero_downtime_upgrades_physical_isolation_ha.md b/doc/developer/design/20251219_more_zero_downtime_upgrades_physical_isolation_ha.md
new file mode 100644
index 0000000000000..b937483725ca5
--- /dev/null
+++ b/doc/developer/design/20251219_more_zero_downtime_upgrades_physical_isolation_ha.md
@@ -0,0 +1,394 @@
# More Zero-Downtime Upgrades, Physical Isolation, High Availability

We currently do zero-downtime upgrades, where we hydrate an `environmentd` and
its `clusterd` processes at a new version before cutting over to that version.
This ensures that the computation running on clusters is ready when we cut
over, so there is no downtime in processing and no lag in freshness. However,
due to an implementation detail of how cutting over from one version to the
next works (see the context below), we have an interval of 10–30 seconds where
the environment is not reachable by users.

This document proposes a design with which we can achieve true zero-downtime
upgrades for DML and DQL queries, with the caveat that we still have a window
(again on the order of tens of seconds) where users cannot issue DDL.

The long-term goal is to remove the caveats from zero-downtime upgrades and to
eliminate downtime at any other time as well. The latter is usually tracked
under the separate initiatives of _high availability_ and _physical isolation_.
However, the engineering work they all require overlaps, and they thematically
fit together.

The focus of this document is on improving zero-downtime upgrades in the short
term, but I will describe below what the other initiatives entail, what
engineering work is unique to each, what work overlaps, and sketch how we can
achieve the long-term goal.

## Goals

- True zero-downtime upgrades for DML and DQL

The lived experience for users will be: **no perceived downtime for DQL and
DML, and an error message about DDL not being allowed during version cutover,
which should last on the order of tens of seconds**.

## Non-Goals

- True zero-downtime upgrades for DDL
- High availability, that is, no downtime during upgrades or at any other time
- Physical isolation

## Context

### Current Zero-Downtime Upgrade Procedure

1. New `environmentd` starts with a higher `deploy_generation`/`version`
2. Boots in read-only mode: opens the catalog in read-only mode, spawns
   `clusterd` processes at the new version, hydrates dataflows, everything is
   kept in read-only mode
3. Signals readiness: once clusters report hydrated and caught up
4. Orchestrator triggers promotion: new `environmentd` opens the catalog in
   read/write mode, which writes its `deploy_generation`/`version` to the
   catalog, **then halts and is restarted in read/write mode by the
   orchestrator**
5. Old `environmentd` is fenced out: it notices the newer version in the
   catalog and **halts**
6. New `environmentd` re-establishes connections to clusters, brings them out
   of read-only mode
7. Cutover: network routes are updated to point at the new generation, and the
   orchestrator reaps the old deployment
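
To make the fencing behavior in steps 4 and 5 concrete, here is a minimal
sketch of today's halt-on-fence logic. The names (`DeployGeneration`,
`DurableCatalogState`, `fence_check`) are hypothetical and not the actual
implementation:

```rust
// Sketch of the current behavior: an `environmentd` that sees a newer deploy
// generation in the catalog halts outright. All names here are hypothetical,
// not the real Materialize types.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
struct DeployGeneration(u64);

struct DurableCatalogState {
    deploy_generation: DeployGeneration,
}

/// Step 5 of the current flow: the fenced-out `environmentd` halts
/// immediately, dropping all client connections. This hard halt is a large
/// part of the 10–30 second unavailability window.
fn fence_check(my_generation: DeployGeneration, catalog: &DurableCatalogState) {
    if catalog.deploy_generation > my_generation {
        eprintln!("fenced out by newer deploy generation; halting");
        std::process::exit(1);
    }
}
```

The proposal below replaces this hard halt of the old deployment with a
lame-duck mode.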

Why there's downtime:

- Old `environmentd` exits immediately when it is fenced
- New `environmentd` transitions from read-only to read/write using the
  **halt-and-restart** approach
- Network routes are updated to point at the new process
- During this window, the environment is unreachable

### Catalog Migrations

When a new version first establishes itself in the durable catalog by writing
down its version and deploy generation, it sometimes has to modify the schema
or contents of parts of the catalog. We call these _catalog migrations_. They
are a large part of the reason why an older version currently cannot read or
otherwise interact with the catalog once a new version has "touched" it.

### High Availability

Technically, high availability would include zero-downtime upgrades because it
means that the system is "always" available. We historically differentiate
between the two because they require slightly different engineering work,
although with overlap.

Zero-downtime upgrades mean that we don't want to impose downtime when doing
upgrades, while high availability means we want to protect against
unavailability that can happen at any other time, say from an `environmentd`
process crashing or a machine failing.

Thinking in terms of our current architecture, high availability requires
multiple `environmentd`-flavored processes, so that in the case of a failure
the load can be immediately moved to another `environmentd`.

### Physical Isolation

The goal of physical isolation is to isolate the query/control workload of one
use case from other use cases. Today, users can create separate clusters and
replicas, but all query processing and cluster control work happens on a
single `environmentd` instance.

Physical isolation would require sharding this workload across multiple
`environmentd` instances, which is why it is related to both high availability
and zero-downtime upgrades.

## Conceptual Framework: How These Initiatives Relate

Zero-downtime upgrades, high availability, and physical isolation all require
multiple `environmentd` instances to coexist and collaborate on shared durable
state (the catalog). One neat insight is:

**Zero-downtime upgrades are high availability across two versions.**

This framing clarifies what makes each initiative unique:

| Initiative | # of Versions | Duration | Primary Challenge |
|------------|---------------|----------|-------------------|
| zero-downtime upgrades | 2 (old + new) | temporary (cutover window) | version compatibility |
| high availability | 1 | permanent | failover coordination |
| physical isolation | 1 (typically) | permanent | workload routing |

All three share a common foundation: the ability to run multiple `environmentd`
instances that can read/write shared state concurrently.

## Proposal

To reduce downtime during upgrades further, I propose that we change from our
current upgrade procedure to this flow:

1. New `environmentd` starts with a higher `deploy_generation`/`version`
2. Boots in read-only mode: opens the catalog in read-only mode, spawns
   `clusterd` processes at the new version, hydrates dataflows, everything is
   kept in read-only mode
3. Signals readiness: once clusters report hydrated and caught up
4. Orchestrator triggers promotion: new `environmentd` opens the catalog in
   read/write mode, which writes its `deploy_generation`/`version` to the
   catalog, **then halts and is restarted in read/write mode**
5. Old `environmentd` notices the new version in the catalog and enters
   **lame-duck mode**: it does not halt, none of its cluster processes are
   reaped, and it can still serve queries; it can no longer apply writes to
   the catalog, so it cannot process DDL
6. New `environmentd` re-establishes connections to clusters, brings them out
   of read-only mode
7. Cutover: once orchestration determines that the new-version `environmentd`
   is ready to serve queries, we update network routes
8. Eventually: the resources of the old-version deployment are reaped
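
A minimal sketch of the phases a deployment generation moves through in this
flow, and of the cutover decision the orchestration has to make. The names
(`DeploymentPhase`, `Deployment`, `ready_to_cut_over`) are hypothetical, not
the actual orchestration code:

```rust
// Sketch only: the phases of a deployment generation in the proposed flow.
// These are hypothetical names, not the real orchestration types.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum DeploymentPhase {
    /// Booted at a new `deploy_generation`, hydrating clusters in read-only
    /// mode (steps 1–2).
    ReadOnly,
    /// Clusters report hydrated and caught up; safe to promote (step 3).
    ReadyToPromote,
    /// Owns the catalog in read/write mode and can serve all query types
    /// (steps 4 and 6).
    ReadWrite,
    /// Fenced by a newer generation: keeps serving DQL/DML, rejects DDL
    /// (step 5).
    LameDuck,
    /// Network routes point at the newer generation; processes can eventually
    /// be reaped (steps 7–8).
    Retired,
}

struct Deployment {
    generation: u64,
    phase: DeploymentPhase,
}

/// Step 7: only flip network routes once the new-version deployment is fully
/// ready; until then the lame-duck deployment keeps serving DQL/DML.
fn ready_to_cut_over(new: &Deployment) -> bool {
    new.phase == DeploymentPhase::ReadWrite
}
```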

The big difference from before is that the fenced deployment is not immediately
halted/reaped but can still serve queries. This includes DML and DQL (so
INSERTs and SELECTs), but not DDL.

We cut over network routes once the new deployment is fully ready, so any
residual downtime is the route change itself. Until that point, the old
deployment still accepts connections but rejects DDL with an error message.
When cutting over, we drop connections and rely on reconnects to reach the new
version.

This modified flow requires a number of changes in different components. I will
sketch these below, but each of the sub-sections will require a small-ish
design document of its own or at the very least a thorough GitHub issue.

The required work is sectioned into work that is unique to zero-downtime
upgrades and work that is shared with the other initiatives; lastly, I describe
the work that is unique to the future initiatives. The latter will be very
sparse, because the focus of this document is the more immediate improvements
to zero-downtime upgrades.

## Work required for Zero-Downtime Upgrades (for DML/DQL)

This section and the one about the shared foundation work describe the work
that is required for this proposal.

### Lame-Duck `environmentd` at Old Version

The observation that makes this proposal work is that neither the schema nor
the contents of user collections (tables, sources, etc.) change between
versions. So both the old version and the new version _should_ be able to
collaborate in writing down source data, accepting INSERTs for tables, and
serving SELECT queries.

DDL will not work because the new-version deployment will potentially have
migrated the catalog (applied catalog migrations), so the old version cannot be
allowed to write to it anymore. And it wouldn't currently be able to read the
newer changes either. Backward/forward compatibility for catalog changes is one
of the things we need in order to get true zero-downtime upgrades working for
all types of queries.

Once an `environmentd` notices that there is a newer version in the catalog it
enters lame-duck mode, where it does not allow writes to the catalog anymore
and serves the DQL/DML workload off of the catalog snapshot that it has. An
important detail to figure out here is what happens when the old-version
`environmentd` process crashes while we're in a lame-duck phase. If the since
of the catalog shard has advanced, it will not be able to restart and read the
catalog at the old version that it understands. This may require holding back
the since of the catalog shard during upgrades, or a similar mechanism.
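
A minimal sketch of what lame-duck mode could look like at the statement level.
`Statement`, `requires_catalog_write`, and `check_statement` are hypothetical
names; the real adapter has a much richer statement representation:

```rust
// Sketch only: classify statements in lame-duck mode. These names are
// hypothetical, not the actual adapter types.
enum Statement {
    Select(String),
    Insert(String),
    CreateMaterializedView(String),
    DropTable(String),
}

/// Whether a statement has to write to the durable catalog. Only such
/// statements need to be rejected in lame-duck mode.
fn requires_catalog_write(stmt: &Statement) -> bool {
    matches!(
        stmt,
        Statement::CreateMaterializedView(_) | Statement::DropTable(_)
    )
}

/// In lame-duck mode we keep serving DQL/DML off of the catalog snapshot we
/// already have and reject DDL with an error pointing at the ongoing cutover.
fn check_statement(in_lame_duck_mode: bool, stmt: &Statement) -> Result<(), String> {
    if in_lame_duck_mode && requires_catalog_write(stmt) {
        return Err(
            "DDL is not available while a version upgrade is in progress; \
             please retry in a few seconds"
                .to_string(),
        );
    }
    Ok(())
}
```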

TODO: It could even be the case that builtin tables are compatible for writing
between versions, because of how persist schema backward compatibility works.
We have to audit whether it would work for both the old and the new version to
write at the same time. This is important for builtin tables that are not
derived from catalog state, for example `mz_sessions`, the audit log, and
probably others.

TODO: Figure out if we want to allow background tasks to keep writing. This
includes, but is not limited to, storage usage collection.

### Change How Old-Version Processes are Reaped

Currently, the fenced-out `environmentd` halts itself when the new version
fences it via the catalog. Additionally, a newly establishing `environmentd`
will reap replica processes (`clusterd`) of any versions older than itself.

We can't keep this behavior because the lame-duck `environmentd` still needs
all of its processes alive to serve traffic.

Instead we need to change the orchestration logic to determine when the
new-version `environmentd` is ready to serve traffic, then cut over, and
eventually reap the processes of the old version.

### Get Orchestration Ready for Managing Version Cutover

We need to update the orchestration logic to use the new flow as outlined
above.

TODO: What is the latency incurred by cutting over between `environmentd`s once
everything is ready? That is how close to zero we will get with this approach.

## Foundation Work (Required for This Proposal)

The following changes are required for this proposal and form the foundation
for high availability and physical isolation:

### Get Builtin Tables Ready for Concurrent Writers

There are two types of builtin tables:

1. Tables that mirror catalog state
2. Other tables, which are largely append-only

We have to audit and find out which is which. For builtin tables that mirror
catalog state we can use the self-correcting approach that we also use for
materialized views and for a number of builtin storage-managed collections. For
the others, we have to see if concurrent append-only writers would work.

One of the thorny tables here is probably storage-usage data.

### Get Builtin Sources Ready for Concurrent Writers

Similar to tables, there are two types:

1. "Differential sources", where the controller knows what the desired state is
   and updates the content to match that when needed
2. Append-only sources

Here again, we have to audit what works with concurrent writers and whether
append-only sources remain correct.

NOTE: Builtin sources are also called storage-managed collections because
writing to them is mediated by the storage controller via async background
tasks.

### Concurrent User-Table Writes

We need to get user tables ready for concurrent writers. Blind writes are easy.
Read-then-write is hard, but we can do OCC (optimistic concurrency control),
with a loop that does:

1. Subscribe to changes
2. Read latest changes, derive writes
3. Get a write timestamp based on the latest read timestamp
4. Attempt the write; go back to 2 if that fails; succeed otherwise

We can even render that as a dataflow and attempt the read-then-write directly
on a cluster; if you squint, this is almost a "one-shot continual task" (if
you're familiar with that). But we could initially keep that loop in
`environmentd`, closer to how the current implementation works.
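
A minimal sketch of the OCC loop, written against a made-up `TableHandle` trait
rather than the real persist or adapter APIs; step 1 (subscribing to changes
instead of re-reading a snapshot on every attempt) is elided:

```rust
// Sketch only: a read-then-write OCC loop against a hypothetical `TableHandle`
// trait. None of these names are the real persist/adapter APIs.
#[derive(Clone, Copy)]
struct Timestamp(u64);

struct Update; // placeholder for a (row, diff) update

enum AppendError {
    /// Another writer appended at or beyond our chosen write timestamp.
    Conflict,
}

trait TableHandle {
    /// The latest readable timestamp (roughly, the table's upper frontier).
    fn latest(&self) -> Timestamp;
    /// Snapshot the table contents as of `ts`.
    fn read_at(&self, ts: Timestamp) -> Vec<Update>;
    /// Append `updates` at `write_ts`, failing if the upper has already moved
    /// past it.
    fn compare_and_append(
        &mut self,
        write_ts: Timestamp,
        updates: Vec<Update>,
    ) -> Result<(), AppendError>;
}

fn read_then_write<T, F>(table: &mut T, derive_writes: F)
where
    T: TableHandle,
    F: Fn(&[Update]) -> Vec<Update>,
{
    loop {
        // 2. Read the latest contents and derive the writes from them.
        let read_ts = table.latest();
        let snapshot = table.read_at(read_ts);
        let updates = derive_writes(&snapshot);

        // 3. Choose a write timestamp based on the read timestamp.
        let write_ts = Timestamp(read_ts.0 + 1);

        // 4. Attempt the write; on a conflict, go back to 2 and retry.
        match table.compare_and_append(write_ts, updates) {
            Ok(()) => return,
            Err(AppendError::Conflict) => continue,
        }
    }
}
```

In the variant that keeps the loop in `environmentd`, step 1 could be a
subscription on the table that wakes this loop whenever its upper advances.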

### Ensure Sources/Sinks are Ready for Concurrent Instances

I think they are ready, but sources will have a bit of a fight over who gets to
read, for example for Postgres sources. Kafka is fine because it already
supports concurrently running ingestion instances.

Kafka sinks use a transactional producer ID, so they would also fight over this
but settle down once the old-version cluster processes are eventually reaped.

TODO: Figure out how big the impact of the above-mentioned squabbling would be.

### Critical Persist Handles with Concurrent `environmentd` Instances

`StorageCollections`, the component that is responsible for managing the
critical since handles of storage collections, currently has a single critical
handle per collection. When a new version comes online it "takes ownership" of
that handle.

We have to figure out how that works in the lame-duck phase, where we have two
`environmentd` instances running concurrently, and hence also two instances of
`StorageCollections`. This is closely related (if not the same) to how multiple
instances of `environmentd` will have to hold handles for physical isolation
and high availability.

The proposed solution is to track critical handle IDs in the catalog rather
than pushing this complexity into persist. Each version or instance of
`environmentd` writes down its critical handle IDs in the catalog before
acquiring them. This allows other instances or versions to clean up handles
when they determine that a given version or instance can no longer be running.
For example, during version cutover, the new version can retire handles
belonging to the old version once the old version's processes have been reaped.

This approach unifies handle management across zero-downtime upgrades, high
availability, and physical isolation: in all cases, the catalog serves as the
coordination point for tracking which handles exist and which can be retired.

### Persist Pubsub

Currently, `environmentd` acts as the persist pub-sub server. It's the
component that keeps the latency between persist writes and readers noticing
those writes snappy.

When we have both the old and the new `environmentd` handling traffic, and both
writing at different times, we would see an increase in read latency because
writes from one `environmentd` (and its cluster processes) are not routed to
the other `environmentd` and vice versa.

We have to somehow address this, or accept the fact that we will have degraded
latency for the short period where both the old and the new `environmentd`
serve traffic (the lame-duck phase).

## Shared Work Required for all of Physical Isolation, High Availability, and Full Zero-Downtime Upgrades

The major work required for all of these efforts is enabling multiple actors to
interact with and modify the catalog. Multiple instances of `environmentd` need
to be able to:

- subscribe to catalog changes and apply their implications
- collaborate in writing down changes to the catalog

The basic ideas of this are explained in [platform v2
architecture](20231127_pv2_uci_logical_architecture.md), and there is ongoing
work towards allowing the adapter to subscribe to catalog changes and apply
their implications.

## Work required for Zero-Downtime Upgrades (including DDL)

Beyond the ability for multiple actors to work with the durable catalog, for
zero-downtime upgrades with DDL we need the ability for _multiple versions_ to
interact with the catalog.
This is currently not possible because of catalog migrations and because a new
version fences out old versions.

A jumping-off point is the recent work that allows multiple versions to work
with persist shards. There, we already get forward and backward compatibility
by tracking which versions are still "touching" a shard and deferring the
equivalent of migrations to a moment when no older versions are touching the
shard. For Materialize as a whole, this means that changes or new features need
to be gated by what version the catalog currently has, and that catalog version
can be different from the version of Materialize that is currently running. We
would then have gating both by regular feature flags and by catalog version.

## Work required for Physical Isolation

TBD

## Work required for High Availability

TBD

## Alternatives

### Read-Only `environmentd` Follows Catalog Changes

Instead of the halt-and-restart approach, the new-version `environmentd` would
listen to catalog changes, apply migrations live, and, when promoted, fence out
the old `environmentd` and bring all its in-memory structures, controllers, and
the like out of read-only mode.

This would still have the downside that bringing it out of read-only mode can
take on the order of seconds, but we would shorten the window during which DDL
is not available.

As described above, this part is hard.

### Fully Operational `environmentd` at Different Versions

This is an extension of the above, and it requires essentially all the
different work streams mentioned above (high availability, physical isolation,
following catalog changes and applying migrations).

The only observable downtime would be when we cut over how traffic gets routed
to the different `environmentd`s.

In the fullness of time we want this, but it's a decent amount of work,
especially the forward/backward compatibility across a range of versions, on
top of the other hard challenges.

## Open Questions

- How much downtime is incurred by rerouting network traffic?