Commit 5f10bbc
Fix
# Rationale for this change
I found out that I broke this myself after running a `git bisect`:
````
36d383d is the first bad commit
commit 36d383d
Author: Fokko Driesprong <fokko@apache.org>
Date: Thu Jan 23 07:50:54 2025 +0100

PyArrow: Avoid buffer-overflow by avoid doing a sort (#1555)

Second attempt of #1539

This was already being discussed back here:
#208 (comment)

This PR changes from doing a sort, and then a single pass over the table
to the approach where we determine the unique partition tuples filter on
them individually.

Fixes #1491

Because the sort caused buffers to be joined where it would overflow in
Arrow. I think this is an issue on the Arrow side, and it should
automatically break up into smaller buffers. The `combine_chunks` method
does this correctly.

Now:

```
0.42877754200890195
Run 1 took: 0.2507691659993725
Run 2 took: 0.24833179199777078
Run 3 took: 0.24401691700040828
Run 4 took: 0.2419595829996979
Average runtime of 0.28 seconds
```

Before:

```
Run 0 took: 1.0768639159941813
Run 1 took: 0.8784021250030492
Run 2 took: 0.8486490420036716
Run 3 took: 0.8614017910003895
Run 4 took: 0.8497851670108503
Average runtime of 0.9 seconds
```

So it comes with a nice speedup as well :)

---------

Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>

pyiceberg/io/pyarrow.py | 129 ++-
pyiceberg/partitioning.py | 39 +-
pyiceberg/table/__init__.py | 6 +-
pyproject.toml | 1 +
tests/benchmark/test_benchmark.py | 72 ++
tests/integration/test_partitioning_key.py | 1299 ++++++++++++++--------------
tests/table/test_locations.py | 2 +-
7 files changed, 805 insertions(+), 743 deletions(-)
create mode 100644 tests/benchmark/test_benchmark.py
````
Closes #1917
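
The commit message quoted above describes replacing "sort, then a single pass" with "determine the unique partition tuples, then filter on each one". As a rough illustration of that idea only, here is a pure-Python sketch. It is not PyArrow or PyIceberg code: the function name `split_by_partition`, the row-as-dict representation, and the sample columns are all hypothetical and chosen for this example.

```python
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]


def split_by_partition(
    rows: Sequence[Row], partition_cols: Sequence[str]
) -> Dict[Tuple[Any, ...], List[Row]]:
    """Group rows per partition tuple without sorting the input (hypothetical helper)."""

    def key_of(row: Row) -> Tuple[Any, ...]:
        return tuple(row[col] for col in partition_cols)

    # Step 1: collect the distinct partition tuples, in first-seen order.
    unique_keys = list(dict.fromkeys(key_of(row) for row in rows))

    # Step 2: filter the rows once per unique tuple. The input is never
    # sorted, so rows keep their original relative order in each group.
    return {key: [row for row in rows if key_of(row) == key] for key in unique_keys}


# Made-up sample data, partitioned on (city, year).
rows = [
    {"city": "Amsterdam", "year": 2024, "n": 1},
    {"city": "Paris", "year": 2024, "n": 2},
    {"city": "Amsterdam", "year": 2024, "n": 3},
    {"city": "Amsterdam", "year": 2025, "n": 4},
]
parts = split_by_partition(rows, ["city", "year"])
```

The sketch scans the rows once per unique tuple, so it only pays off when the number of distinct partition tuples is small relative to the row count. Whether that trade-off matches the real implementation is not something this page confirms.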
# Are these changes tested?
# Are there any user-facing changes?
Fix `add_files` with non-identity transforms (#1925)
1 parent eb8756a commit 5f10bbc
2 files changed: +49 -14 lines changed

[Diff content not preserved. First file: one hunk at new lines 2241-2276. Second file: hunks at new lines 33-53 and 899-928.]