
Conversation

@arthurpassos
Collaborator

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Accept a table function as the destination for part export, inheriting the schema if it is not explicitly provided. Example:

ALTER TABLE src_merge_tree EXPORT PART '2020_1_1_0' TO TABLE FUNCTION s3(s3_conn, filename='...', partition_strategy...) PARTITION BY ..

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)

@github-actions

github-actions bot commented Jan 19, 2026

Workflow [PR], commit [d613c80]


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0c7f7c7c90

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +72 to 74
const auto & destination_storage = manifest.destination_storage_ptr;
const auto destination_storage_id = destination_storage->getStorageID();


P2: Revalidate destination table before export task runs

The export task now uses the captured destination_storage_ptr directly without rechecking whether the destination table still exists or is dropped/detached. If the destination table is dropped after the ALTER EXPORT is queued but before this background task executes, destination_storage->import(...) will still run and can write data for a table that no longer exists (or has been recreated with a different definition). Previously the task re-resolved the destination from the catalog and aborted on UNKNOWN_TABLE; consider re-looking it up or at least checking is_dropped/locking before proceeding.

Useful? React with 👍 / 👎.
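
A minimal, self-contained sketch of the re-check suggested above. The StorageID/IStorage types and the in-memory catalog below are simplified stand-ins rather than the actual ClickHouse API; they only illustrate re-resolving the destination right before the background task writes anything:

#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <tuple>

// Simplified stand-ins; the real classes carry the same names but much richer APIs.
struct StorageID
{
    std::string database_name;
    std::string table_name;
    bool operator<(const StorageID & rhs) const
    {
        return std::tie(database_name, table_name) < std::tie(rhs.database_name, rhs.table_name);
    }
};

struct IStorage
{
    StorageID id;
    bool is_dropped = false;   // flipped once the table has been dropped
};
using StoragePtr = std::shared_ptr<IStorage>;

// Stand-in for the catalog lookup the task previously performed before exporting.
std::map<StorageID, StoragePtr> catalog;

void runExportTask(const StorageID & destination_id)
{
    // Re-resolve instead of trusting a StoragePtr captured when the ALTER was queued.
    auto it = catalog.find(destination_id);
    if (it == catalog.end() || it->second->is_dropped)
        throw std::runtime_error("Destination table no longer exists, aborting export");

    // ... only now stream the part into the (still valid) destination storage ...
}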

Collaborator Author


This is an interesting one, but it is kind of an existing problem, not exclusive to this PR.

Plain object storage tables don't have state, and there is no trigger / event for when these tables are dropped. At the same time, it is safe (in terms of thread and memory safety) to write to S3 even after the table has been dropped.

That being said, I think we have a few options:

  1. Document it and leave it as is, meaning that if a table gets deleted after the export process has started, the export will continue writing to the S3 bucket.
  2. Somehow trigger an event when the table gets deleted and cancel exports. This might leak abstractions.
  3. Constantly check in the pipeline whether the table is still alive.

Collaborator Author


In any case, I don't think this is a major issue. We probably have bigger fish to fry.

Comment on lines 80 to 84
bool operator<(const MergeTreePartExportManifest & rhs) const
{
const auto lhs_storage_id = destination_storage_ptr->getStorageID();
const auto rhs_storage_id = rhs.destination_storage_ptr->getStorageID();
// Lexicographic comparison: first compare destination storage, then part name


P2: Avoid set ordering on mutable StorageID

The manifest comparison now derives ordering from destination_storage_ptr->getStorageID(), but StorageID for a table can change on RENAME/DETACH while an export is in progress. Because export_manifests is a std::set, mutating the key of an element already in the set violates ordering invariants and can lead to missed lookups/erasures or undefined behavior. Consider storing a stable StorageID (or a hash of table-function arguments) inside the manifest for ordering, even if you keep the StoragePtr for execution.

Useful? React with 👍 / 👎.
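
A rough sketch of the suggestion above (a hypothetical struct, not the actual manifest): capture an immutable ordering key when the manifest is created, so a later RENAME/DETACH of the destination cannot change the relative order of elements already stored in the std::set:

#include <set>
#include <string>
#include <tuple>

struct PartExportManifest
{
    // Snapshot of the destination identity (or a hash of the table-function arguments),
    // taken at creation time and never mutated afterwards.
    std::string destination_key;
    std::string part_name;

    bool operator<(const PartExportManifest & rhs) const
    {
        // Only immutable members participate in ordering; the live StoragePtr used for
        // execution is stored separately and never affects the set's invariants.
        return std::tie(destination_key, part_name) < std::tie(rhs.destination_key, rhs.part_name);
    }
};

std::set<PartExportManifest> export_manifests;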

Collaborator Author


This is a fairly interesting one, and kind of related to #1320 (comment).

@arthurpassos
Collaborator Author

arthurpassos commented Jan 19, 2026

There is one interesting thing to validate on this one: exports are considered duplicates based on source table + part name + destination table name. I need to check what name is used for table functions.

Edit1:

arthur :) alter table replicated_source export part '2020_0_0_0' to table function s3(s3_conn, filename='function', structure='id UInt32, year UInt16', format=Parquet, partition_strategy='hive') PARTITION BY year;

ALTER TABLE replicated_source
    (EXPORT PART '2020_0_0_0' TO TABLE FUNCTION s3(s3_conn, filename = 'function', structure = 'id UInt32, year UInt16', format = Parquet, partition_strategy = 'hive') PARTITION BY year)

Query id: 5ca93a65-5ab3-45b3-97c0-b0f87d10e060

Ok.

0 rows in set. Elapsed: 50.131 sec. 

arthur :) alter table replicated_source export part '2020_0_0_0' to table function s3(s3_conn, filename='function2', structure='id UInt32, year UInt16', format=Parquet, partition_strategy='hive') PARTITION BY year;

ALTER TABLE replicated_source
    (EXPORT PART '2020_0_0_0' TO TABLE FUNCTION s3(s3_conn, filename = 'function2', structure = 'id UInt32, year UInt16', format = Parquet, partition_strategy = 'hive') PARTITION BY year)

Query id: 89710dcd-0e8c-48c7-9c3a-f003a82b1dcb


Elapsed: 2.795 sec. 

Received exception from server (version 25.8.12):
Code: 236. DB::Exception: Received from localhost:9000. DB::Exception: Data part '2020_0_0_0' is already being exported to table '_table_function.s3'. (ABORTED)


Edit2:

As observed above, table function names are all the same by default.

As stated earlier, the list of manifests is deduplicated based on part name + destination table name. But why? IIRC, the only reason is that we wanted to avoid colliding exports (i.e., exporting the same part to the same file path concurrently). According to the S3 docs (ChatGPT, actually), the last upload wins in non-versioned buckets: whoever sends the last PutObject or CompleteMultipartUpload wins.

This isn't a big problem, I suppose, but it sounds like a nice protection to have. At the same time, it is not a very good protection, as it is based on the table name rather than the destination storage. For instance, if you have two s3 tables with different names pointing to the same location, the export will still be allowed and the writes can collide.

The options off the top of my head:

  1. Drop this protection. Easiest; leaves it up to the user.
  2. Generate a unique id (query id) for each table function. This allows multiple exports to table functions, but does not prevent the user from exporting the same part to the same location.
  3. Use a key based on the s3 path (a rough sketch of this follows below).
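
A rough, self-contained sketch of option 3, assuming the resolved object-storage prefix (bucket + path) is available as a string by the time the export is registered; all names below are hypothetical and not part of the existing code:

#include <set>
#include <string>
#include <utility>

// Key: (resolved destination path, part name). Two tables or table functions with
// different names but the same location now collide, which a table-name-based key misses.
using ExportKey = std::pair<std::string, std::string>;

std::set<ExportKey> exports_in_progress;

// Returns false if the same part is already being exported to the same location.
bool tryRegisterExport(const std::string & destination_path, const std::string & part_name)
{
    return exports_in_progress.emplace(destination_path, part_name).second;
}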

@arthurpassos
Collaborator Author

arthurpassos commented Jan 19, 2026

Discussed with Misha and Dima, two options:

  1. Just use part name as the key, meaning a given part can't be exported to different locations at the same time.
  2. Use a key based on the s3 path (option 3 above).

For now, I implemented option 1.

@arthurpassos
Collaborator Author

@codex review


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d613c808e3


Comment on lines 80 to 83
bool operator<(const MergeTreePartExportManifest & rhs) const
{
// Lexicographic comparison: first compare destination storage, then part name
auto lhs_storage = destination_storage_id.getQualifiedName();
auto rhs_storage = rhs.destination_storage_id.getQualifiedName();

if (lhs_storage != rhs_storage)
return lhs_storage < rhs_storage;

return data_part->name < rhs.data_part->name;
}


P2: Include destination in export manifest ordering

Because MergeTreeData::exportPartToTable stores manifests in a std::set and relies on the comparator for uniqueness, comparing only data_part->name makes all exports of the same part equivalent even if the destination differs. This means concurrent exports of the same part to two different tables or table functions will now collide and the second will fail with “already being exported,” which is a regression from the previous per-destination behavior. Consider including the destination (e.g., storage ID or pointer) in operator< / operator== so the set only deduplicates identical part+destination pairs.

Useful? React with 👍 / 👎.
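
A small self-contained demonstration of the behaviour described above (a hypothetical Manifest type, not the real one): when operator< looks only at the part name, a std::set treats two exports of the same part to different destinations as duplicates, so the second registration fails:

#include <cassert>
#include <set>
#include <string>

struct Manifest
{
    std::string destination;
    std::string part_name;
    bool operator<(const Manifest & rhs) const { return part_name < rhs.part_name; }
};

int main()
{
    std::set<Manifest> manifests;
    assert(manifests.insert({"s3_table_a", "2020_0_0_0"}).second);
    // Same part, different destination: rejected because ordering ignores the destination.
    assert(!manifests.insert({"s3_table_b", "2020_0_0_0"}).second);
}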
