-
Notifications
You must be signed in to change notification settings - Fork 414
Feat: replace sort order #1500
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: replace sort order #1500
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR!
Generally lgtm, added a few nit comments
pyiceberg/table/__init__.py
Outdated
| name_mapping=self.table_metadata.name_mapping(), | ||
| ) | ||
|
|
||
| def replace_sort_order(self, case_sensitive: bool = True) -> ReplaceSortOrder: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: wdyt about update_sort_order instead?
from the spec,
A data or delete file is associated with a sort order by the sort order's id within [a manifest](https://iceberg.apache.org/spec/#manifests). Therefore, the table must declare all the sort orders for lookup. A table could also be configured with a default sort order id, indicating how the new data should be sorted by default. Writers should use this default sort order to sort the data on write, but are not required to if the default order is prohibitively expensive, as it would be for streaming writes.
All sort orders are kept in the manifest in the sort-orders field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. Updated class & method names
pyiceberg/table/update/sorting.py
Outdated
| from pyiceberg.table import Transaction | ||
|
|
||
|
|
||
| class SortOrderBuilder: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: do we need the builder here? Or just follow UpdateSchema's pattern
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need it but personally I like it because it's easier to read. I'll leave it up to you to decide if you want this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactored this.
|
Thanks for the comments 🙌 @kevinjqliu . Will look at your comments this weekend. |
…assigned id, else set a previous sort order as the default
| def _reuse_or_create_sort_order_id(self) -> int: | ||
| """Return the last assigned sort order id or create a new one.""" | ||
| new_sort_order_id = INITIAL_SORT_ORDER_ID | ||
| for sort_order in self._transaction.table_metadata.sort_orders: | ||
| new_sort_order_id = max(new_sort_order_id, sort_order.order_id) | ||
| if sort_order.fields == self._fields: | ||
| return sort_order.order_id | ||
| elif new_sort_order_id <= sort_order.order_id: | ||
| new_sort_order_id = sort_order.order_id + 1 | ||
| return new_sort_order_id |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kevinjqliu added this to reuse an existing sort order instead of creating a new one
| super().__init__(transaction) | ||
| self._fields: List[SortField] = [] | ||
| self._case_sensitive: bool = case_sensitive | ||
| self._last_assigned_order_id: Optional[int] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kevinjqliu this is now None by default
| if ( | ||
| self._transaction.table_metadata.default_sort_order_id != new_sort_order.order_id | ||
| and self._transaction.table_metadata.sort_order_by_id(new_sort_order.order_id) is None | ||
| ): | ||
| self._last_assigned_order_id = new_sort_order.order_id | ||
| updates = (AddSortOrderUpdate(sort_order=new_sort_order), SetDefaultSortOrderUpdate(sort_order_id=-1)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kevinjqliu last assigned sort order only set if new sort order is created.
|
@kevinjqliu @Fokko do either one of you have time to review this PR in the coming weeks? I have some time on my hands and I’d love to wrap this PR up! Thanks in advance |
|
Hey @JasperHG90 sorry for the late reply, the notification got buried in my mailbox. I'll definitely review this, it would be great to get this in 👍 |
|
|
||
| def update_sort_order(self, case_sensitive: bool = True) -> UpdateSortOrder: | ||
| """Create a new UpdateSortOrder to update the sort order of this table. | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
| Args: | |
| case_sensitive: If field names are case-sensitive. | |
No apology necessary. Happy to receive your review! |
|
Hello, any update on this PR? |
|
@JasperHG90 Sorry for the late reply, and chance that you could revisit this PR? |
|
@JasperHG90 are you still working on this? I'm happy to pick it up if you're not |
jayceslesar
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if we want to remove the sort order from a table, should there be an explicit API to do so?
Another general question I have is the distinction between modifying an existing sort order and entirely replacing it? Those can technically be expressed the same way but should we make it more obvious to users?
| def asc( | ||
| self, source_column_name: str, transform: Transform[Any, Any], null_order: NullOrder = NullOrder.NULLS_LAST | ||
| ) -> UpdateSortOrder: | ||
| """Add a sort field with ascending order.""" | ||
| return self._add_sort_field( | ||
| source_id=self._column_name_to_id(source_column_name), | ||
| transform=transform, | ||
| direction=SortDirection.ASC, | ||
| null_order=null_order, | ||
| ) | ||
|
|
||
| def desc( | ||
| self, source_column_name: str, transform: Transform[Any, Any], null_order: NullOrder = NullOrder.NULLS_LAST | ||
| ) -> UpdateSortOrder: | ||
| """Add a sort field with descending order.""" | ||
| return self._add_sort_field( | ||
| source_id=self._column_name_to_id(source_column_name), | ||
| transform=transform, | ||
| direction=SortDirection.DESC, | ||
| null_order=null_order, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are these separate functions instead of letting the user specify via constructing and passing in sort field objects? We already have an API for describing a sort field so why abstract it? If the sort order arguments change now we have to take care of it here too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _add_sort_field(
self,
source_id: int,
transform: Transform[Any, Any],
direction: SortDirection,
null_order: NullOrder,
) -> UpdateSortOrder:
"""Add a sort field to the sort order list."""
self._fields.append(
SortField(
source_id=source_id,
transform=transform,
direction=direction,
null_order=null_order,
)
)
return selfWhy do we have this wrapper is my question, can we not acheive the same functionality by letting the user pass in a list of SortField objects, which would mimic the API for creating a table:
sort_order = SortOrder(SortField(source_id=2, transform='identity'))
catalog.create_table(
identifier="docs_example.bids",
schema=schema,
partition_spec=partition_spec,
sort_order=sort_order,
)|
We could also cover testing against all catalogs via the |
|
Hi all. Apologies for my late response. TBH this has slipped from my mind in the past couple of months due to work. @jayceslesar and @rambleraptor thanks for your comments. I have some time on my hands the coming couple of weeks, so I'd be happy to colab on this to finish it up if you like. If you'd like to steamroll ahead and implement it ASAP because you need it, that's also fine by me. |
|
Im certainly in no rush so happy to let you see this through! Would be epic to have this in the next minor release (0.11.0) ! |
# Rationale for this change Fixed the tests in #1500, but kept the commits so @JasperHG90 gets all the credits for the great work! Closes #1500 Closes #1245 ## Are these changes tested? ## Are there any user-facing changes? <!-- In the case of user-facing changes, please add the changelog label. --> --------- Co-authored-by: Jasper Ginn <jasperginn@gmail.com>
This PR adds functionality to replace a table's sort order. Closes #1245
Some basic tests are implemented but need to be expanded.
Currently, a new sort order ID is assigned. This adds a sort order to the list of available sort orders rather than replacing it.
@Fokko do you mind taking a look at this and telling me what you think? Thanks in advance!