-
Notifications
You must be signed in to change notification settings - Fork 412
Expire snapshot mutability issue #2430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expire snapshot mutability issue #2430
Conversation
|
Tried this change out in code where we are expiring snapshots from 2 iceberg tables in separate threads and all is working fine now. 👍 |
Thanks for testing it!!! Let me know if you bump into any other issues. |
pyiceberg/table/update/snapshot.py
Outdated
| _snapshot_ids_to_expire: Set[int] = set() | ||
| _updates: Tuple[TableUpdate, ...] = () | ||
| _requirements: Tuple[TableRequirement, ...] = () | ||
| def __init__(self, transaction: Transaction) -> None: | ||
| super().__init__(transaction) | ||
| # Initialize instance-level attributes to avoid sharing state between instances | ||
| self._snapshot_ids_to_expire: Set[int] = set() | ||
| self._updates: Tuple[TableUpdate, ...] = () | ||
| self._requirements: Tuple[TableRequirement, ...] = () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
Nit: I'd personally keep class-level annotations here (with assignment in the constructor, so state still shouldn't be shared), so the code would look similar to what we have for Transaction:
iceberg-python/pyiceberg/table/__init__.py
Lines 247 to 263 in 52d810e
| class Transaction: | |
| _table: Table | |
| _autocommit: bool | |
| _updates: Tuple[TableUpdate, ...] | |
| _requirements: Tuple[TableRequirement, ...] | |
| def __init__(self, table: Table, autocommit: bool = False): | |
| """Open a transaction to stage and commit changes to a table. | |
| Args: | |
| table: The table that will be altered. | |
| autocommit: Option to automatically commit the changes when they are staged. | |
| """ | |
| self._table = table | |
| self._autocommit = autocommit | |
| self._updates = () | |
| self._requirements = () |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@smaheshwar-pltr i applied the changes
rambleraptor
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This look sgreat. Ideally, we wouldn't have to have so many mocks in the tests (my understanding is that it's mostly to avoid boilerplate), but I think it'll be fine
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing this @ForeverAngry 🙌 I've checked the tests, and the majority do not really test the underlying issue since they are passing on main as well. How about removing those? Next to that, I left one small comment on the test that actually fails on the main branch. Apart from that, this looks good to go 🚀
| assert len(table_v2.metadata.snapshots) == 1 | ||
|
|
||
|
|
||
| def test_thread_safety_fix() -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test fails on the old code 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your saying this is the one, good test, right? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, indeed :)
| super().__init__(transaction) | ||
| self._updates = () | ||
| self._requirements = () | ||
| self._snapshot_ids_to_expire = set() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the smoking gun, since the set() is mutable, and the tuple() isn't 👍
Co-authored-by: Fokko Driesprong <fokko@apache.org>
…ead safety assertions
ForeverAngry
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Fokko, I applied your suggestions, however, since both tests are just manipulating the _snapshot_ids_to_expire directly and never call .commit(), deciced to just use Mock() for both tests as the alternative would require changing the core types (or atleast thats what it seemed like to me). Let me know what you think.
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good @ForeverAngry Thanks for adding this, and thanks @jayceslesar, @rambleraptor and @smaheshwar-pltr for the review!
<!--
Thanks for opening a pull request!
-->
<!-- In the case this PR will resolve an issue, please replace
${GITHUB_ISSUE_ID} below with the actual Github issue id. -->
Related to #2409, and partially closes #2427
# Rationale for this change
This PR fixes a thread safety issue in the `ManageSnapshots` class
similar to the one identified in `ExpireSnapshots` (#2409). While the
original issue specifically mentioned `ExpireSnapshots`, the same thread
safety vulnerability exists in `ManageSnapshots` due to identical
problematic design patterns. The same testing methodology used in #2430
was adapted for this.
**Root Cause:**
The `ManageSnapshots` class had class-level attributes (`_updates`,
`_requirements`) that were shared across all instances. When multiple
threads created different `ManageSnapshots` instances for concurrent
operations, they all shared the same underlying tuple objects for
tracking updates and requirements.
**Potential Impact:**
- Thread 1: `table1.manage_snapshots().create_tag(...)` adds updates to
shared tuple
- Thread 2: `table2.manage_snapshots().create_branch(...)` adds updates
to same shared tuple
- Result: Both threads would have mixed updates, potentially causing
incorrect operations or failures
**Solution:**
Applied the same fix as ExpireSnapshots - moved the shared class-level
attributes to instance-level attributes in the `__init__` method,
ensuring each `ManageSnapshots` instance has its own isolated state.
**Relationship to #2409:**
While #2409 specifically reported ExpireSnapshots thread safety issues,
this PR proactively addresses the same vulnerability pattern in
ManageSnapshots to prevent similar issues from occurring with snapshot
management operations (tags, branches, etc.).
## Are these changes tested?
> 📢 🔥 Big shout-out to @QlikFrederic, as the testing methodology was
largely derived from the testing and analysis done by the user! 🔥 📢
Yes, comprehensive test coverage has been added with a dedicated test
file `test_manage_snapshots_thread_safety.py`:
- **`test_manage_snapshots_thread_safety_fix()`** - Verifies that
different ManageSnapshots instances have separate update/requirement
tuples
- **`test_manage_snapshots_concurrent_operations()`** - Tests concurrent
operations don't contaminate each other
- **`test_manage_snapshots_concurrent_different_tables()`** - Tests
concurrent operations on different tables work correctly
- **`test_manage_snapshots_cross_table_isolation()`** - Validates no
cross-contamination of operations between tables
-
**`test_manage_snapshots_concurrent_same_table_different_operations()`**
- Tests concurrent operations on the same table
All tests demonstrate that the thread safety fix works correctly and
that concurrent ManageSnapshots operations maintain proper isolation.
## Are there any user-facing changes?
**No breaking changes.** The public API remains identical:
- All existing `ManageSnapshots` methods work the same way
(`create_tag`, `create_branch`, `delete_tag`, etc.)
- Method signatures are unchanged
- Behavior is identical except for the thread safety improvement
**Behavioral improvement:**
- Concurrent `manage_snapshots()` operations on different tables now
work correctly without interference
- No risk of mixed updates/requirements between different
ManageSnapshots instances in multi-threaded environments
- Improved reliability for applications using ManageSnapshots in
concurrent scenarios
This is a pure bug fix.
Closes #2409, and partially closes #2427
Rationale for this change
This PR fixes a critical thread safety issue in the
ExpireSnapshotsclass where concurrent snapshot expiration operations on different tables would share snapshot IDs, causing operations to fail with "snapshot does not exist" errors.Root Cause:
The
ExpireSnapshotsclass had class-level attributes (_snapshot_ids_to_expire,_updates,_requirements) that were shared across all instances. When multiple threads created differentExpireSnapshotsinstances, they all shared the same underlyingset()object for tracking snapshot IDs.Impact:
table1.expire_snapshots().by_id(1001)adds1001to shared settable2.expire_snapshots().by_id(2001)adds2001to same shared set{1001, 2001}and try to expire snapshot1001fromtable2, causing failureSolution:
Moved the shared class-level attributes to instance-level attributes in the
__init__method, ensuring eachExpireSnapshotsinstance has its own isolated state.Are these changes tested?
Yes, comprehensive test coverage has been added:
test_thread_safety_fix()- Verifies that different ExpireSnapshots instances have separate snapshot setstest_concurrent_operations()- Tests concurrent operations don't contaminate each othertest_concurrent_different_tables_expiration()- Reproduces the exact scenario from GitHub issue commit on expire_snapshot tries to remove snapshot from wrong table. #2409test_concurrent_same_table_different_snapshots()- Tests concurrent operations on the same tabletest_cross_table_snapshot_id_isolation()- Validates no cross-contamination of snapshot IDs between tablestest_batch_expire_snapshots()- Tests batch expiration operations in threaded environmentsAll existing tests continue to pass, ensuring no regression in functionality.
Are there any user-facing changes?
No breaking changes. The public API remains identical:
ExpireSnapshotsmethods work the same wayBehavioral improvement:
expire_snapshots()operations on different tables now work correctlyThis is a pure bug fix with no user-facing API changes.