Skip to content

Conversation

@JasperHG90
Copy link
Contributor

This PR adds functionality to replace a table's sort order. Closes #1245

Some basic tests are implemented but need to be expanded.

Currently, a new sort order ID is assigned. This adds a sort order to the list of available sort orders rather than replacing it.

@Fokko do you mind taking a look at this and telling me what you think? Thanks in advance!

Copy link
Contributor

@kevinjqliu kevinjqliu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!
Generally lgtm, added a few nit comments

name_mapping=self.table_metadata.name_mapping(),
)

def replace_sort_order(self, case_sensitive: bool = True) -> ReplaceSortOrder:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: wdyt about update_sort_order instead?

from the spec,

A data or delete file is associated with a sort order by the sort order's id within [a manifest](https://iceberg.apache.org/spec/#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.

All sort orders are kept in the manifest in the sort-orders field.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Updated class & method names

from pyiceberg.table import Transaction


class SortOrderBuilder:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: do we need the builder here? Or just follow UpdateSchema's pattern

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need it but personally I like it because it's easier to read. I'll leave it up to you to decide if you want this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored this.

@JasperHG90
Copy link
Contributor Author

Thanks for the comments 🙌 @kevinjqliu . Will look at your comments this weekend.

Comment on lines +78 to +87
def _reuse_or_create_sort_order_id(self) -> int:
"""Return the last assigned sort order id or create a new one."""
new_sort_order_id = INITIAL_SORT_ORDER_ID
for sort_order in self._transaction.table_metadata.sort_orders:
new_sort_order_id = max(new_sort_order_id, sort_order.order_id)
if sort_order.fields == self._fields:
return sort_order.order_id
elif new_sort_order_id <= sort_order.order_id:
new_sort_order_id = sort_order.order_id + 1
return new_sort_order_id
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinjqliu added this to reuse an existing sort order instead of creating a new one

super().__init__(transaction)
self._fields: List[SortField] = []
self._case_sensitive: bool = case_sensitive
self._last_assigned_order_id: Optional[int] = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinjqliu this is now None by default

Comment on lines +124 to +129
if (
self._transaction.table_metadata.default_sort_order_id != new_sort_order.order_id
and self._transaction.table_metadata.sort_order_by_id(new_sort_order.order_id) is None
):
self._last_assigned_order_id = new_sort_order.order_id
updates = (AddSortOrderUpdate(sort_order=new_sort_order), SetDefaultSortOrderUpdate(sort_order_id=-1))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kevinjqliu last assigned sort order only set if new sort order is created.

@JasperHG90 JasperHG90 requested a review from kevinjqliu March 27, 2025 09:26
@JasperHG90
Copy link
Contributor Author

@kevinjqliu @Fokko do either one of you have time to review this PR in the coming weeks? I have some time on my hands and I’d love to wrap this PR up! Thanks in advance

@Fokko
Copy link
Contributor

Fokko commented May 16, 2025

Hey @JasperHG90 sorry for the late reply, the notification got buried in my mailbox. I'll definitely review this, it would be great to get this in 👍


def update_sort_order(self, case_sensitive: bool = True) -> UpdateSortOrder:
"""Create a new UpdateSortOrder to update the sort order of this table.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
Args:
case_sensitive: If field names are case-sensitive.

@JasperHG90
Copy link
Contributor Author

Hey @JasperHG90 sorry for the late reply, the notification got buried in my mailbox. I'll definitely review this, it would be great to get this in 👍

No apology necessary. Happy to receive your review!

@mwa28
Copy link

mwa28 commented Jul 8, 2025

Hello, any update on this PR?

@Fokko
Copy link
Contributor

Fokko commented Aug 4, 2025

@JasperHG90 Sorry for the late reply, and chance that you could revisit this PR?

@rambleraptor
Copy link
Contributor

@JasperHG90 are you still working on this? I'm happy to pick it up if you're not

Copy link
Contributor

@jayceslesar jayceslesar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we want to remove the sort order from a table, should there be an explicit API to do so?

Another general question I have is the distinction between modifying an existing sort order and entirely replacing it? Those can technically be expressed the same way but should we make it more obvious to users?

Comment on lines +89 to +109
def asc(
self, source_column_name: str, transform: Transform[Any, Any], null_order: NullOrder = NullOrder.NULLS_LAST
) -> UpdateSortOrder:
"""Add a sort field with ascending order."""
return self._add_sort_field(
source_id=self._column_name_to_id(source_column_name),
transform=transform,
direction=SortDirection.ASC,
null_order=null_order,
)

def desc(
self, source_column_name: str, transform: Transform[Any, Any], null_order: NullOrder = NullOrder.NULLS_LAST
) -> UpdateSortOrder:
"""Add a sort field with descending order."""
return self._add_sort_field(
source_id=self._column_name_to_id(source_column_name),
transform=transform,
direction=SortDirection.DESC,
null_order=null_order,
)
Copy link
Contributor

@jayceslesar jayceslesar Sep 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these separate functions instead of letting the user specify via constructing and passing in sort field objects? We already have an API for describing a sort field so why abstract it? If the sort order arguments change now we have to take care of it here too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    def _add_sort_field(
        self,
        source_id: int,
        transform: Transform[Any, Any],
        direction: SortDirection,
        null_order: NullOrder,
    ) -> UpdateSortOrder:
        """Add a sort field to the sort order list."""
        self._fields.append(
            SortField(
                source_id=source_id,
                transform=transform,
                direction=direction,
                null_order=null_order,
            )
        )
        return self

Why do we have this wrapper is my question, can we not acheive the same functionality by letting the user pass in a list of SortField objects, which would mimic the API for creating a table:

sort_order = SortOrder(SortField(source_id=2, transform='identity'))

catalog.create_table(
    identifier="docs_example.bids",
    schema=schema,
    partition_spec=partition_spec,
    sort_order=sort_order,
)

@jayceslesar
Copy link
Contributor

We could also cover testing against all catalogs via the test_catalog.py in the integration tests!

@JasperHG90
Copy link
Contributor Author

JasperHG90 commented Sep 22, 2025

Hi all. Apologies for my late response. TBH this has slipped from my mind in the past couple of months due to work. @jayceslesar and @rambleraptor thanks for your comments. I have some time on my hands the coming couple of weeks, so I'd be happy to colab on this to finish it up if you like. If you'd like to steamroll ahead and implement it ASAP because you need it, that's also fine by me.

@jayceslesar
Copy link
Contributor

jayceslesar commented Sep 22, 2025

Im certainly in no rush so happy to let you see this through! Would be epic to have this in the next minor release (0.11.0) !

@Fokko Fokko mentioned this pull request Sep 30, 2025
kevinjqliu pushed a commit that referenced this pull request Oct 3, 2025
# Rationale for this change

Fixed the tests in #1500,
but kept the commits so @JasperHG90 gets all the credits for the great
work!

Closes #1500 
Closes #1245

## Are these changes tested?

## Are there any user-facing changes?

<!-- In the case of user-facing changes, please add the changelog label.
-->

---------

Co-authored-by: Jasper Ginn <jasperginn@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[feat] Support update table's sort order

6 participants