Fix `add_files` with non-identity transforms #1925

Fokko · 2025-04-16T17:00:13Z

Rationale for this change

Found out I broke this myself after doing a git bisect:

36d383dcb676ae5ef59c34cc2910d16a8e30a80c is the first bad commit
commit 36d383dcb676ae5ef59c34cc2910d16a8e30a80c
Author: Fokko Driesprong <fokko@apache.org>
Date:   Thu Jan 23 07:50:54 2025 +0100

    PyArrow: Avoid buffer-overflow by avoid doing a sort (#1555)
    
    Second attempt of https://github.com/apache/iceberg-python/pull/1539
    
    This was already being discussed back here:
    https://github.com/apache/iceberg-python/issues/208#issuecomment-1889891973
    
    This PR changes from doing a sort, and then a single pass over the table
    to the approach where we determine the unique partition tuples filter on
    them individually.
    
    Fixes https://github.com/apache/iceberg-python/issues/1491
    
    Because the sort caused buffers to be joined where it would overflow in
    Arrow. I think this is an issue on the Arrow side, and it should
    automatically break up into smaller buffers. The `combine_chunks` method
    does this correctly.
    
    Now:
    
    ```
    0.42877754200890195
    Run 1 took: 0.2507691659993725
    Run 2 took: 0.24833179199777078
    Run 3 took: 0.24401691700040828
    Run 4 took: 0.2419595829996979
    Average runtime of 0.28 seconds
    ```
    
    Before:
    
    ```
    Run 0 took: 1.0768639159941813
    Run 1 took: 0.8784021250030492
    Run 2 took: 0.8486490420036716
    Run 3 took: 0.8614017910003895
    Run 4 took: 0.8497851670108503
    Average runtime of 0.9 seconds
    ```
    
    So it comes with a nice speedup as well :)
    
    ---------
    
    Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>

 pyiceberg/io/pyarrow.py                    |  129 ++-
 pyiceberg/partitioning.py                  |   39 +-
 pyiceberg/table/__init__.py                |    6 +-
 pyproject.toml                             |    1 +
 tests/benchmark/test_benchmark.py          |   72 ++
 tests/integration/test_partitioning_key.py | 1299 ++++++++++++++--------------
 tests/table/test_locations.py              |    2 +-
 7 files changed, 805 insertions(+), 743 deletions(-)
 create mode 100644 tests/benchmark/test_benchmark.py

Closes #1917

Are these changes tested?

Are there any user-facing changes?


kevinjqliu

LGTM!

kevinjqliu · 2025-04-16T17:46:01Z

pyiceberg/io/pyarrow.py

-        source_field = schema.find_field(partition_field.source_id)
-        transform = partition_field.transform.transform(source_field.field_type)
-        return transform(lower_value)


ah bug was introduced here

the values need to be transformed first before comparison

My mistake 🙈
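To illustrate the point made in the review comment above, here is a minimal sketch of why the order matters. It does not use PyIceberg; `month_transform` and `partition_value` are hypothetical stand-ins written for this example, and it assumes the partition value of an added file is derived from the lower and upper bounds of the source column.

```python
from datetime import date
from typing import Callable


def month_transform(d: date) -> int:
    """Hypothetical stand-in for a month transform: months since 1970-01."""
    return (d.year - 1970) * 12 + (d.month - 1)


def partition_value(lower: date, upper: date, transform: Callable[[date], int]) -> int:
    """Derive a single partition value from a file's column bounds.

    Both bounds are transformed first and only then compared, so a file
    whose raw bounds differ can still map to one partition.
    """
    lower_t, upper_t = transform(lower), transform(upper)
    if lower_t != upper_t:
        raise ValueError(f"File spans more than one partition: {lower_t} != {upper_t}")
    return lower_t


# Raw bounds differ (April 1 vs April 16), but both fall in the same month.
value = partition_value(date(2025, 4, 1), date(2025, 4, 16), month_transform)
```

Comparing the raw bounds before applying the transform would reject this file, even though every row lands in the same month partition. With an identity transform the two orderings give the same result, which would explain why only non-identity transforms were affected.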


Fix add_files with non-identity transforms

e5a7665

Closes apache#1917

Fokko added this to the PyIceberg 0.9.1 milestone Apr 16, 2025

kevinjqliu approved these changes Apr 16, 2025


Fokko merged commit 5f10bbc into apache:main Apr 16, 2025
7 checks passed

Fokko deleted the fd-fix-add-non-identity-files branch April 16, 2025 18:35
