design: builtin schema migration #33863
Conversation
SangJunBak
left a comment
Everything makes sense! Just some edge cases and an alternative
> To avoid data loss and other surprises caused by automatic builtin schema migrations, we introduce the concept of explicit migration instructions.
> A migration instruction instructs the process which builtin collection to migrate at which version, and which mechanism to use.
>
> Migration instructions are kept in a hard-coded list:
How do we handle the cases where we write Mechanism::Evolution for a schema and it's not backwards compatible?
> How do we handle the cases where we write `Mechanism::Evolution` for a schema and it's not backwards compatible?
That would be a bug. The migration by evolution mechanism checks the new schema for backward compatibility and panics if it is not backward compatible.
Thinking of this edge case where we upgrade from 0.148.1 to 0.149.
Hm, yeah this scheme depends on the MIGRATIONS list always being complete and not missing any migrations performed at previous versions. Inserting incompatible migrations retroactively isn't allowed. Inserting compatible migrations is fine, but maybe we still want to disallow them, to not muddy the waters too much. Given that patch releases should be reserved for critical bug fixes, I don't think disallowing builtin schema changes in them is unreasonable.
There is the question of whether we can prevent people from accidentally adding schema migrations retroactively. Your scenario is only possible because semver allows arbitrarily inserting new versions between existing versions. We could go the way of the protobuf migrations and have a BUILTIN_SCHEMA_VERSION: u64 that we increase every time we change a builtin schema. That would leave no way to insert new versions between existing versions and the additional complexity from tracking an additional version seems low enough.
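For illustration, the hard-coded list plus such a builtin schema version counter could look roughly like this (all names and values here are made up, not the design's actual types):

```rust
/// Hypothetical counter that is bumped every time a builtin schema changes,
/// analogous to the protobuf migrations' version counter.
pub const BUILTIN_SCHEMA_VERSION: u64 = 42;

/// Which mechanism to use to migrate a single builtin collection.
pub enum Mechanism {
    /// In-place schema evolution; requires the new schema to be backward
    /// compatible with every schema previously registered with the shard.
    Evolution,
    /// Replace the shard with a newly minted one.
    Replacement,
}

/// One explicit migration instruction, roughly as described in the design:
/// which builtin collection to migrate, at which version, with which mechanism.
pub struct Migration {
    /// Name of the builtin collection (placeholder for the real identifier type).
    pub name: &'static str,
    /// The builtin schema version at which this migration applies.
    pub at_version: u64,
    /// How to perform the migration.
    pub mechanism: Mechanism,
}

/// The hard-coded, append-only list of migrations.
pub static MIGRATIONS: &[Migration] = &[
    // Example entry (made up):
    // Migration { name: "mz_catalog.mz_objects", at_version: 42, mechanism: Mechanism::Evolution },
];
```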
> Inserting compatible migrations is fine, but maybe we still want to disallow them, to not muddy the waters too much. Given that patch releases should be reserved for critical bug fixes, I don't think disallowing builtin schema changes in them is unreasonable.
Among other things, it would be a semver break, since we'd end up removing API surface in a minor release.
> There is the question of whether we can prevent people from accidentally adding schema migrations retroactively.
I think we can do a decent amount of checking here in CI... verify that the list of migrations in the last release is consistent with the current one. I think we'll want similar CI to ensure that we don't prematurely drop migrations, for example.
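As a sketch of what that CI check could look like, assuming the hypothetical `MIGRATIONS` shape from the sketch above and a snapshot of the previous release's list checked in as test data:

```rust
#[cfg(test)]
mod migration_list_tests {
    use super::MIGRATIONS;

    /// Hypothetical snapshot of (name, version) pairs from the last release,
    /// e.g. generated when cutting that release and committed to the repo.
    const PREVIOUS_RELEASE_MIGRATIONS: &[(&str, u64)] = &[
        // ("mz_catalog.mz_objects", 42),
    ];

    /// The current list must start with exactly the previous release's entries,
    /// so nothing can be inserted retroactively. A real check would also have to
    /// allow dropping entries once their versions fall out of the support window.
    #[test]
    fn migrations_list_is_append_only() {
        assert!(MIGRATIONS.len() >= PREVIOUS_RELEASE_MIGRATIONS.len());
        for (current, (name, version)) in MIGRATIONS.iter().zip(PREVIOUS_RELEASE_MIGRATIONS) {
            assert_eq!(current.name, *name);
            assert_eq!(current.at_version, *version);
        }
    }
}
```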
Good point about CI! I think the tests don't even have to look at the migrations list specifically; they can just attempt all possible migration paths, and in Jun's example the migration from 0.148.1 to 0.149.0 would fail. The only wrinkle is that you'd only find the issue in the CI for 0.149.0, and only if you re-run it after cutting 0.148.1. So we might need to add a step when cutting a new patch release that also checks if upgrading from that patch release to all existing higher versions succeeds.
> So we might need to add a step when cutting a new patch release that also checks if upgrading from that patch release to all existing higher versions succeeds.
Good idea to add to the testing plan/suite, and it makes sense to me. Will make a note.
I think that totally works as well, but the cost is very different... testing all possible upgrade paths for N versions is O(2^N), while testing that each version's migration list is a superset of the previous is O(N). Maybe that's fine if we're paying that cost for other reasons!
Well, it's O(2^N) only if we test multi-step upgrades (i.e., when a test scenario can involve more than 2 versions). It's only O(N^2) if each test scenario just upgrades from one version to another. The latter is enough if upgrading from A to B can't be affected by what version we upgraded to A from.
In addition to what Gabor said, I think we can get away with only upgrading to and from the version we are currently testing. E.g. in the CI of version 0.148.1, we test upgrading each smaller supported version to 0.148.1 and test upgrading 0.148.1 to each larger existing version. The reasoning is that we don't need to test upgrading to/from other versions because we have already done that in the CI of those versions. In which case the effort would be O(N).
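A quick sketch of how the set of upgrade scenarios for the version under test could be derived under this scheme (the version numbers are just examples):

```rust
/// Upgrade pairs (from, to) exercised in the CI of `current`: every smaller
/// supported version upgrading into `current`, plus `current` upgrading into
/// every larger version that already exists. O(N) scenarios per version.
fn upgrade_pairs_for<'a>(current: &'a str, all_versions: &'a [&'a str]) -> Vec<(&'a str, &'a str)> {
    let pos = all_versions
        .iter()
        .position(|v| *v == current)
        .expect("current version must be in the list");
    let mut pairs = Vec::new();
    for older in &all_versions[..pos] {
        pairs.push((*older, current));
    }
    for newer in &all_versions[pos + 1..] {
        pairs.push((current, *newer));
    }
    pairs
}

fn main() {
    // Versions assumed sorted by semver; in the CI of 0.148.1 this yields
    // upgrades from 0.147.0 and 0.148.0 into 0.148.1, and from 0.148.1 into
    // the already-cut 0.149.0.
    let versions = ["0.147.0", "0.148.0", "0.148.1", "0.149.0"];
    println!("{:?}", upgrade_pairs_for("0.148.1", &versions));
}
```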
bkirwi
left a comment
Cool!!
> In the subsequent read-only bootstrap phase, the process creates persist read and write handles using the new schema.
> Read handles perform transparent migration of any data updates that flow through them, so dataflow hydration can proceed using the new schema.
> Write handles only require a matching registered shard schema when writing batches, which is something a read-only process doesn't do.
Two qualifiers:
- Right now we require a matching registered schema when creating the handle, though as discussed that can change.
- I seem to recall a case in txn-wal where we write empty batches to advance the frontier, and we'd need to make sure that those use the old schema or work differently. (That seemed sketchy to me at the time, so I feel good about changing it.)
Yep, I was pretending that we've already made the write handle change to not distract the reader unnecessarily, optimistically assuming that it's easy to implement and will be done before the design doc merges.
> I seem to recall a case in txn-wal where we write empty batches to advance the frontier, and we'd need to make sure that those use the old schema or work differently.
Interesting! So far I've only found that for replaced builtin table shards, we tick forward their frontiers. But we explicitly make sure that they don't get inserted into txn-wal while the environment is in read-only mode.
The relevant method is `DataSnapshot::unblock_read` - it's documented to do a CaA in some cases, and if you trace the callers you can see that quite a few methods which "look" read-only (snapshot-and-fetch, etc.) do end up calling it.
My memory is that those read-only-looking methods were sometimes called in read-only mode, but I haven't verified this anytime recently.
Personally I don't think `unblock_read` has to work that way. If this ends up being a real issue and not just a historical one, we can probably discuss options then...
> The general approach to shard replacement matches the existing implementation, but takes care to not needlessly interfere with processes at other versions.
>
> To support schema migration by shard replacement, we need a place to store new shard IDs across restarts, and we will keep using the migration shard for this.
> The migration shard contains entries of the form `(GlobalId, Version) -> ShardId`.
I wonder if we can use the catalog epoch or similar for this? Since being able to abort an upgrade and then upgrade to a smaller version is in scope, if we fail an upgrade from A to C, then successfully upgrade A to B and later B to C, it seems like we might end up with weird state that is partly from the failed early upgrade and partly from the later one.
I think it either needs to be the Mz version, or a new "builtin version" we introduce. In any case, the version needs to identify the schemas of the builtin collections. Otherwise how does a read-only process know if an entry it finds for, e.g. (catalog epoch + 1), was written by a different process with the same builtin schemas, or by one with different schemas?
> if we fail an upgrade from A to C, then successfully upgrade A to B and later B to C, it seems like we might end up with weird state that is partly from the failed early upgrade and partly from the later one.
I think that should work out fine:
1. C starts read-only, mints new shard IDs, populates the migration shard with entries `(<gid>, C, <shard-C>)`. Maybe starts to hydrate but then is aborted.
2. B starts read-only, mints new shard IDs, populates the migration shard with entries `(<gid>, B, <shard-B>)`. Hydrates successfully and promotes.
3. Upon B's promotion the `<shard-B>` shards become the "official" shards, the shards they replace get finalized.
4. C starts read-only again, finds existing `<shard-C>` entries in the migration shard, so doesn't mint new ones. C hydrates successfully and promotes.
5. Upon C's promotion the `<shard-C>` shards become the "official" shards, the `<shard-B>` shards get finalized.
The above assumes that in version C we have to replace the same shards as in version B (i.e. we made two incompatible schema changes). More likely, version C doesn't have to perform any migration at all, or migrates different shards than B.
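To make the read-only bootstrap step concrete, a sketch of the lookup-or-mint logic that makes the restart/abort scenario above work out (all types and names are hypothetical stand-ins, not the actual storage-controller API):

```rust
use std::collections::BTreeMap;

// Hypothetical stand-ins for the real identifier types.
type GlobalId = String;
type Version = String; // e.g. "0.149.0"
type ShardId = String;

/// The migration shard's contents, keyed by (GlobalId, Version) as in the design.
type MigrationShard = BTreeMap<(GlobalId, Version), ShardId>;

/// During read-only bootstrap, reuse an existing entry for this (id, version) if
/// one was written by an earlier (possibly aborted) run at the same version;
/// otherwise mint a fresh shard ID and record it in the migration shard.
fn replacement_shard_for(
    migration_shard: &mut MigrationShard,
    id: GlobalId,
    version: Version,
    mint_shard_id: impl FnOnce() -> ShardId,
) -> ShardId {
    migration_shard
        .entry((id, version))
        .or_insert_with(mint_shard_id)
        .clone()
}
```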
Poking through, the specific field I had in mind was deploy_generation, which increases for every unique deploy. My understanding is that a deploy happens at a specific version, so it should be impossible for the schemas to change while the deploy generation remains the same.
> I think that should work out fine [...]
To be clear: I'm not worried about the schemas, but rather the actual contents of the shard. It seems like during your step 1, I might start writing data to the shard, then stop writing to it until step 4. This means we may end up with a shard whose semantics are difficult to explain. Today, the semantics of a history shard are at least straightforward to explain - it includes all events from a particular deploy after some arbitrary cutoff time. With this approach we'd also include little chunks of data from previous deploys that were otherwise not visible to any client.
OTOH, if the shards belong to a particular deploy, then if that deploy fails they would be ignored and the data in them wouldn't be observed by any future successful deploy.
> To be clear: I'm not worried about the schemas, but rather the actual contents of the shard. It seems like during your step 1, I might start writing data to the shard, then stop writing to it until step 4.
I think this is also mostly fine, at least given how things are currently working, because:
- For builtin storage-collections (including all the histories) we don't write them in the read-only environment, just advance their frontiers. So no previous contents to worry about here.
- For builtin tables, we truncate them when we start up (or emit a correcting diff, not quite sure). So the code deals with previous contents that way.
Note that we have to be able to handle migration restarts at the same version because envd can restart in read-only mode (and does so every time it observes new DDL from the leader env) and when it does it needs to not become confused running the migration again.
> My understanding is that a deploy happens at a specific version, so it should be impossible for the schemas to change while the deploy generation remains the same.
I'm not sure that's true! At least according to this doc the deploy generation is only increased during a leader promotion. So nothing prevents a user from starting an upgrade with deploy generation N + 1, then aborting that, then starting another upgrade with a different Mz version but again using deploy generation N + 1.
All that said, it seems like a good idea to key the migration shard not by deploy generation, but by (Mz version, deploy generation). The reason is that in theory a self-managed user could decide to start two upgrades to the same Mz version but at different deploy generations at the same time. Not saying that this would be a reasonable thing to do, but it is possible. And I imagine in this case we'd want to prevent the two deploys from sharing migration state, for sanity reasons.
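Concretely, the suggested key could look like this (hypothetical names):

```rust
// Hypothetical stand-ins for the real identifier types.
type GlobalId = String;
type MzVersion = String; // e.g. "0.149.0"

/// Keying the migration shard by both the Mz version and the deploy generation
/// keeps two concurrent deploys that target the same Mz version from sharing
/// migration state.
struct MigrationShardKey {
    global_id: GlobalId,
    mz_version: MzVersion,
    deploy_generation: u64,
}
```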
> ## Alternatives
>
> As an alternative to schema evolution, we can consider a migration scheme that creates a new shard and copies over all existing data from the old shard, performing the migration in the process.
> Doing so would enable us to perform arbitrary rewrites of the data, as well as breaking schema changes without loss of historical data.
It seems to me like this would be a natural extension of the approach you propose above. (Just another type of entry in the MIGRATIONS slice.) Happy to leave it for future work.
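E.g., building on the hypothetical `Mechanism` sketch further up in this conversation (again, made-up names rather than the design's actual types):

```rust
/// The alternative could simply become another migration mechanism.
pub enum Mechanism {
    /// In-place schema evolution (backward-compatible changes only).
    Evolution,
    /// Replace the shard with a freshly minted, initially empty one.
    Replacement,
    /// Create a new shard and copy all existing data over from the old one,
    /// applying an arbitrary rewrite in the process (future work).
    RewriteAndCopy,
}
```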
> ### Shard Replacement
>
> The general approach to shard replacement matches the existing implementation, but takes care to not needlessly interfere with processes at other versions.
Something I don't understand:
- For Persist-level migrations, the resulting shard contents will be the contents of the previous leader up until the new version took leadership, then anything we add after that.
- For shard replacements... what's the desired behaviour? This section makes it sound like we'd create the shards even in read-only mode, so shard replacements will include data from the read-only replicas, which is inconsistent. (It will contain logs of data that was never visible to the user.) But if we only create the new shards when taking leadership, why do we even need the migration shard?
Yes, we have to create the replacements in read-only mode because everything in adapter assumes that the builtin storage collections exist. There is also the issue that the entire point of read-only mode is that we can let dataflows hydrate before the cutover. But that means that for builtin collections with replaced shards, you have to write to these shards in read-only mode, or at least tick their frontiers forward, so that dataflows reading from these collections can make progress.
We have special code to tick forward the replacement shards:
materialize/src/storage-controller/src/lib.rs, lines 1173 to 1181 in 7ed8ee0:

    // In read-only mode, we use a special read-only table worker
    // that allows writing to migrated tables and will continually
    // bump their shard upper so that it tracks the txn shard upper.
    // We do this, so that they remain readable at a recent
    // timestamp, which in turn allows dataflows that depend on them
    // to (re-)hydrate.
    //
    // We only want to register migrated tables, though, and leave
    // existing tables out/never write to them in read-only mode.
For builtin tables, we even write out their contents:
materialize/src/adapter/src/coord.rs, lines 2204 to 2208 in 7ed8ee0:

    // When 0dt is enabled, we create new shards for any migrated builtin storage collections.
    // In read-only mode, the migrated builtin tables (which are a subset of migrated builtin
    // storage collections) need to be back-filled so that any dependent dataflow can be
    // hydrated. Additionally, these shards are not registered with the txn-shard, and cannot
    // be registered while in read-only, so they are written to directly.
So, yes, the inconsistency you point out exists. It also exists, regardless of read-only mode, with indexes on retained-history collections. Suppose a new Mz version has an optimizer or rendering change that changes the results of dataflow computations. An index with retained history will show the new results for times before the upgrade.
Got it! I don't love the inconsistency, but it sounds like it's no worse than stuff we're already doing, so I will not worry about it.
bkirwi
left a comment
Some lingering questions about the semantics of various collections, but I think this is a pretty clear step in the right direction - thanks!
SangJunBak
left a comment
LGTM on my end!
ggevay
left a comment
Makes sense to me too!
> * For each object to migrate, perform the migration using the selected mechanism.
>
> Note that merging `Evolution` migrations like this is sound because persist requires that each shard schema is backward compatible with any previous schema registered with the shard.
> Which means schema evolution will succeed even if we skip intermediary schemas.
(In other words, compatibility of schema changes is transitive.)
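To spell out the transitivity with a toy example (this is not persist's actual compatibility check, just an illustration of the shape of the argument):

```rust
/// Toy schema: an ordered list of (column name, nullable) pairs.
struct Schema {
    columns: Vec<(&'static str, bool)>,
}

/// Toy compatibility rule: a new schema may only keep all existing columns
/// unchanged and append nullable columns at the end.
fn backward_compatible(old: &Schema, new: &Schema) -> bool {
    new.columns.len() >= old.columns.len()
        && new.columns[..old.columns.len()] == old.columns[..]
        && new.columns[old.columns.len()..].iter().all(|(_, nullable)| *nullable)
}

fn main() {
    let v1 = Schema { columns: vec![("id", false)] };
    let v2 = Schema { columns: vec![("id", false), ("comment", true)] };
    let v3 = Schema { columns: vec![("id", false), ("comment", true), ("owner", true)] };

    // Each single step is compatible, and so is the direct jump from v1 to v3,
    // i.e. skipping the intermediary v2 schema is fine.
    assert!(backward_compatible(&v1, &v2));
    assert!(backward_compatible(&v2, &v3));
    assert!(backward_compatible(&v1, &v3));
}
```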
Go away with your smart words :D
TFTRs!

Motivation
Proposes a design to fix:
Tips for reviewer
Checklist
If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.