@ForeverAngry
Contributor

@ForeverAngry ForeverAngry commented Sep 4, 2025

Closes #2409, and partially closes #2427

Rationale for this change

This PR fixes a critical thread safety issue in the ExpireSnapshots class where concurrent snapshot expiration operations on different tables would share snapshot IDs, causing operations to fail with "snapshot does not exist" errors.

Root Cause:
The ExpireSnapshots class had class-level attributes (_snapshot_ids_to_expire, _updates, _requirements) that were shared across all instances. When multiple threads created different ExpireSnapshots instances, they all shared the same underlying set() object for tracking snapshot IDs.

Impact:

  • Thread 1: table1.expire_snapshots().by_id(1001) adds 1001 to shared set
  • Thread 2: table2.expire_snapshots().by_id(2001) adds 2001 to same shared set
  • Result: Both instances see {1001, 2001}, so the table2 operation tries to expire snapshot 1001, which does not exist in table2, and fails

Solution:
Moved the shared class-level attributes to instance-level attributes in the __init__ method, ensuring each ExpireSnapshots instance has its own isolated state.
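
The root cause and fix described above can be reproduced without any threads. The following is a minimal sketch, not the PyIceberg implementation: `SharedStateExpire` and `IsolatedExpire` are hypothetical stand-ins that keep only the `_snapshot_ids_to_expire` attribute and a `by_id` method, and everything else about the real class is left out.

```python
from typing import Set


class SharedStateExpire:
    """Hypothetical stand-in for the pre-fix class: the set lives on the class."""

    _snapshot_ids_to_expire: Set[int] = set()  # one object for every instance

    def by_id(self, snapshot_id: int) -> "SharedStateExpire":
        self._snapshot_ids_to_expire.add(snapshot_id)
        return self


class IsolatedExpire:
    """Hypothetical stand-in for the fixed class: the set is created per instance."""

    _snapshot_ids_to_expire: Set[int]  # annotation only, no shared object

    def __init__(self) -> None:
        self._snapshot_ids_to_expire = set()

    def by_id(self, snapshot_id: int) -> "IsolatedExpire":
        self._snapshot_ids_to_expire.add(snapshot_id)
        return self


# Pre-fix behavior: IDs staged for "table1" leak into the instance for "table2".
table1_op = SharedStateExpire().by_id(1001)
table2_op = SharedStateExpire().by_id(2001)
leaked = table2_op._snapshot_ids_to_expire

# Post-fix behavior: each instance only holds the IDs it was given.
fixed1 = IsolatedExpire().by_id(1001)
fixed2 = IsolatedExpire().by_id(2001)
```

In this sketch the leak shows up even in single-threaded code, which suggests that concurrency only makes the shared state easier to hit rather than being the cause.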

Are these changes tested?

📢 🔥 Big shout-out to @QlikFrederic, as the testing methodology was largely derived from the testing and analysis done by the user! 🔥 📢

Yes, comprehensive test coverage has been added:

  • test_thread_safety_fix() - Verifies that different ExpireSnapshots instances have separate snapshot sets
  • test_concurrent_operations() - Tests concurrent operations don't contaminate each other
  • test_concurrent_different_tables_expiration() - Reproduces the exact scenario from GitHub issue #2409 ("commit on expire_snapshot tries to remove snapshot from wrong table")
  • test_concurrent_same_table_different_snapshots() - Tests concurrent operations on the same table
  • test_cross_table_snapshot_id_isolation() - Validates no cross-contamination of snapshot IDs between tables
  • test_batch_expire_snapshots() - Tests batch expiration operations in threaded environments

All existing tests continue to pass, ensuring no regression in functionality.
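
As a rough illustration of what a concurrent isolation check can look like, here is a sketch that stages snapshot IDs from two worker threads and compares the results. It is not taken from the PR's test file: `FakeExpireSnapshots`, `stage_ids`, and `run_concurrently` are hypothetical names, and the stand-in class only models per-instance state.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Set, Tuple


class FakeExpireSnapshots:
    """Hypothetical stand-in with per-instance state, as in the fix."""

    def __init__(self) -> None:
        self._snapshot_ids_to_expire: Set[int] = set()

    def by_id(self, snapshot_id: int) -> "FakeExpireSnapshots":
        self._snapshot_ids_to_expire.add(snapshot_id)
        return self


def stage_ids(start: int, count: int, barrier: threading.Barrier) -> Set[int]:
    op = FakeExpireSnapshots()
    barrier.wait()  # line both workers up so their by_id calls can interleave
    for snapshot_id in range(start, start + count):
        op.by_id(snapshot_id)
    return op._snapshot_ids_to_expire


def run_concurrently() -> Tuple[Set[int], Set[int]]:
    barrier = threading.Barrier(2)
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(stage_ids, 1000, 100, barrier)
        second = pool.submit(stage_ids, 2000, 100, barrier)
        return first.result(timeout=10), second.result(timeout=10)


table1_ids, table2_ids = run_concurrently()
```

A check like this only proves something if it fails against a class-level set, so it is worth running the same assertions against the old code before relying on them.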

Are there any user-facing changes?

No breaking changes. The public API remains identical:

  • All existing ExpireSnapshots methods work the same way
  • Method signatures are unchanged
  • Behavior is identical except for the thread safety fix

Behavioral improvement:

  • Concurrent expire_snapshots() operations on different tables now work correctly
  • No more "snapshot does not exist" errors when using ExpireSnapshots in multi-threaded environments

This is a pure bug fix with no user-facing API changes.

@QlikFrederic

Tried this change out in code where we are expiring snapshots from 2 iceberg tables in separate threads and all is working fine now. 👍

@ForeverAngry
Contributor Author

> Tried this change out in code where we are expiring snapshots from 2 iceberg tables in separate threads and all is working fine now. 👍

Thanks for testing it!!! Let me know if you bump into any other issues.

Comment on lines 927 to 932
    _snapshot_ids_to_expire: Set[int] = set()
    _updates: Tuple[TableUpdate, ...] = ()
    _requirements: Tuple[TableRequirement, ...] = ()
    def __init__(self, transaction: Transaction) -> None:
        super().__init__(transaction)
        # Initialize instance-level attributes to avoid sharing state between instances
        self._snapshot_ids_to_expire: Set[int] = set()
        self._updates: Tuple[TableUpdate, ...] = ()
        self._requirements: Tuple[TableRequirement, ...] = ()
Contributor

@smaheshwar-pltr smaheshwar-pltr Sep 7, 2025

Good catch!

Nit: I'd personally keep class-level annotations here (with assignment in the constructor, so state still shouldn't be shared), so the code would look similar to what we have for Transaction:

class Transaction:
    _table: Table
    _autocommit: bool
    _updates: Tuple[TableUpdate, ...]
    _requirements: Tuple[TableRequirement, ...]

    def __init__(self, table: Table, autocommit: bool = False):
        """Open a transaction to stage and commit changes to a table.

        Args:
            table: The table that will be altered.
            autocommit: Option to automatically commit the changes when they are staged.
        """
        self._table = table
        self._autocommit = autocommit
        self._updates = ()
        self._requirements = ()
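
A small sketch of why the annotation-only style is safe: a bare class-level annotation records a type but does not create a class attribute, so there is nothing for instances to share. `AnnotatedOnly` and `AnnotatedWithDefault` are hypothetical classes made up for this illustration, with `str` standing in for the real update types.

```python
from typing import Tuple


class AnnotatedOnly:
    """Hypothetical class following the Transaction pattern quoted above."""

    _updates: Tuple[str, ...]  # declares the type, creates no class attribute

    def __init__(self) -> None:
        self._updates = ()


class AnnotatedWithDefault:
    """Hypothetical class in the style of the pre-fix code."""

    _updates: Tuple[str, ...] = ()  # also creates a class attribute


has_class_attr_bare = "_updates" in vars(AnnotatedOnly)
has_class_attr_default = "_updates" in vars(AnnotatedWithDefault)
annotation_recorded = "_updates" in AnnotatedOnly.__annotations__
```

Here `has_class_attr_bare` is `False` while `has_class_attr_default` is `True`, and the annotation is still visible to type checkers and tooling in both cases.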

Contributor

💯

Contributor Author

Understood!

Contributor Author

@smaheshwar-pltr I applied the changes.

Contributor

@rambleraptor rambleraptor left a comment

This looks great. Ideally, we wouldn't have to have so many mocks in the tests (my understanding is that it's mostly to avoid boilerplate), but I think it'll be fine.

Contributor

@Fokko Fokko left a comment

Thanks for fixing this @ForeverAngry 🙌 I've checked the tests, and the majority do not really test the underlying issue since they are passing on main as well. How about removing those? Next to that, I left one small comment on the test that actually fails on the main branch. Apart from that, this looks good to go 🚀

assert len(table_v2.metadata.snapshots) == 1


def test_thread_safety_fix() -> None:
Contributor

This test fails on the old code 👍

Contributor Author

You're saying this is the one good test, right? :)

Contributor

Yes, indeed :)

super().__init__(transaction)
self._updates = ()
self._requirements = ()
self._snapshot_ids_to_expire = set()
Contributor

I think this is the smoking gun, since the set() is mutable, and the tuple() isn't 👍
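
To make the set-versus-tuple distinction concrete, here is a sketch assuming updates are accumulated by rebinding (for example with `+=`), which may not match how the real code stages updates. `Staging`, `add_id`, and `add_update` are hypothetical names used only for this illustration.

```python
from typing import Set, Tuple


class Staging:
    """Hypothetical class with both kinds of class-level defaults."""

    _ids: Set[int] = set()
    _updates: Tuple[str, ...] = ()

    def add_id(self, snapshot_id: int) -> None:
        # Mutates the single set object stored on the class.
        self._ids.add(snapshot_id)

    def add_update(self, update: str) -> None:
        # Builds a new tuple and binds it on this instance only.
        self._updates += (update,)


a, b = Staging(), Staging()
a.add_id(1001)
a.add_update("remove-snapshots")

ids_seen_by_b = b._ids          # the shared set now contains 1001
updates_seen_by_b = b._updates  # still the untouched class-level tuple
```

Under that assumption the tuple defaults never leak between instances, since a tuple cannot be changed in place, whereas any in-place call on the set is visible everywhere.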

ForeverAngry and others added 2 commits September 23, 2025 21:08
Contributor Author

@ForeverAngry ForeverAngry left a comment

@Fokko, I applied your suggestions. However, since both tests just manipulate _snapshot_ids_to_expire directly and never call .commit(), I decided to use Mock() for both tests, as the alternative would require changing the core types (or at least that's what it seemed like to me). Let me know what you think.

@ForeverAngry ForeverAngry requested a review from Fokko September 24, 2025 23:57
@Fokko Fokko changed the title from "Expire snapshot thread safety issue 2409" to "Expire snapshot mutability issue" Sep 26, 2025
Contributor

@Fokko Fokko left a comment

Looks good, @ForeverAngry. Thanks for adding this, and thanks @jayceslesar, @rambleraptor and @smaheshwar-pltr for the review!

@Fokko Fokko merged commit d5e039f into apache:main Sep 26, 2025
10 checks passed
Fokko pushed a commit that referenced this pull request Sep 26, 2025
Related to #2409, and partially closes #2427

# Rationale for this change

This PR fixes a thread safety issue in the `ManageSnapshots` class
similar to the one identified in `ExpireSnapshots` (#2409). While the
original issue specifically mentioned `ExpireSnapshots`, the same thread
safety vulnerability exists in `ManageSnapshots` due to identical
problematic design patterns. The same testing methodology used in #2430
was adapted for this.

**Root Cause:** 
The `ManageSnapshots` class had class-level attributes (`_updates`,
`_requirements`) that were shared across all instances. When multiple
threads created different `ManageSnapshots` instances for concurrent
operations, they all shared the same underlying tuple objects for
tracking updates and requirements.

**Potential Impact:**
- Thread 1: `table1.manage_snapshots().create_tag(...)` adds updates to
shared tuple
- Thread 2: `table2.manage_snapshots().create_branch(...)` adds updates
to same shared tuple
- Result: Both threads would have mixed updates, potentially causing
incorrect operations or failures

**Solution:**
Applied the same fix as ExpireSnapshots - moved the shared class-level
attributes to instance-level attributes in the `__init__` method,
ensuring each `ManageSnapshots` instance has its own isolated state.

**Relationship to #2409:**
While #2409 specifically reported ExpireSnapshots thread safety issues,
this PR proactively addresses the same vulnerability pattern in
ManageSnapshots to prevent similar issues from occurring with snapshot
management operations (tags, branches, etc.).

## Are these changes tested?

> 📢 🔥 Big shout-out to @QlikFrederic, as the testing methodology was
largely derived from the testing and analysis done by the user! 🔥 📢

Yes, comprehensive test coverage has been added with a dedicated test
file `test_manage_snapshots_thread_safety.py`:

- **`test_manage_snapshots_thread_safety_fix()`** - Verifies that
different ManageSnapshots instances have separate update/requirement
tuples
- **`test_manage_snapshots_concurrent_operations()`** - Tests concurrent
operations don't contaminate each other
- **`test_manage_snapshots_concurrent_different_tables()`** - Tests
concurrent operations on different tables work correctly
- **`test_manage_snapshots_cross_table_isolation()`** - Validates no
cross-contamination of operations between tables
- **`test_manage_snapshots_concurrent_same_table_different_operations()`** - Tests
concurrent operations on the same table

All tests demonstrate that the thread safety fix works correctly and
that concurrent ManageSnapshots operations maintain proper isolation.

## Are there any user-facing changes?

**No breaking changes.** The public API remains identical:
- All existing `ManageSnapshots` methods work the same way
(`create_tag`, `create_branch`, `delete_tag`, etc.)
- Method signatures are unchanged
- Behavior is identical except for the thread safety improvement

**Behavioral improvement:**
- Concurrent `manage_snapshots()` operations on different tables now
work correctly without interference
- No risk of mixed updates/requirements between different
ManageSnapshots instances in multi-threaded environments
- Improved reliability for applications using ManageSnapshots in
concurrent scenarios

This is a pure bug fix.
@ForeverAngry ForeverAngry deleted the expire-snapshot-thread-safety-issue-2409 branch November 7, 2025 14:50
Development

Successfully merging this pull request may close these issues.

  • check for class with mutable state as class attributes
  • commit on expire_snapshot tries to remove snapshot from wrong table.

6 participants