-
Notifications
You must be signed in to change notification settings - Fork 418
feat: validate_deleted_data_files
#1938
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 18 commits
beff92b
c369720
41bb8a4
763e9f4
f200beb
7f6bf9d
c63cc55
f2f3a88
74d5569
167f9e4
efe50b4
0793713
89120aa
a39abd2
cf5b061
645e8df
caf7e36
a52c422
d3df4a7
9eca29f
fc83d42
ee3959a
a51c2da
ec86695
0a6b781
a96da01
03b2913
aae22e6
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
|
|
@@ -14,11 +14,17 @@ | |||||||||
| # KIND, either express or implied. See the License for the | ||||||||||
| # specific language governing permissions and limitations | ||||||||||
| # under the License. | ||||||||||
| from typing import Iterator, Optional | ||||||||||
|
|
||||||||||
| from pyiceberg.exceptions import ValidationException | ||||||||||
| from pyiceberg.manifest import ManifestContent, ManifestFile | ||||||||||
| from pyiceberg.expressions import BooleanExpression | ||||||||||
| from pyiceberg.expressions.visitors import _StrictMetricsEvaluator | ||||||||||
| from pyiceberg.manifest import ManifestContent, ManifestEntry, ManifestEntryStatus, ManifestFile | ||||||||||
| from pyiceberg.table import Table | ||||||||||
| from pyiceberg.table.snapshots import Operation, Snapshot, ancestors_between | ||||||||||
| from pyiceberg.typedef import Record | ||||||||||
|
|
||||||||||
| VALIDATE_DATA_FILES_EXIST_OPERATIONS = {Operation.OVERWRITE, Operation.REPLACE, Operation.DELETE} | ||||||||||
|
|
||||||||||
|
|
||||||||||
| def validation_history( | ||||||||||
|
|
@@ -69,3 +75,74 @@ def validation_history( | |||||||||
| raise ValidationException("No matching snapshot found.") | ||||||||||
|
|
||||||||||
| return manifests_files, snapshots | ||||||||||
|
|
||||||||||
|
|
||||||||||
| def deleted_data_files( | ||||||||||
| table: Table, | ||||||||||
| starting_snapshot: Snapshot, | ||||||||||
| data_filter: Optional[BooleanExpression], | ||||||||||
| parent_snapshot: Optional[Snapshot], | ||||||||||
| partition_set: Optional[set[Record]], | ||||||||||
|
||||||||||
| parent_snapshot: Optional[Snapshot], | |
| partition_set: Optional[set[Record]], | |
| partition_set: Optional[set[Record]], | |
| parent_snapshot: Optional[Snapshot], |
It looks like the function call in validate_deleted_data_files is assuming that parent_snapshot is the last argument
conflicting_entries = deleted_data_files(table, starting_snapshot, data_filter, None, parent_snapshot)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs some more work. Next to @sungwy's comment, we in the code below:
(entry.data_file.spec_id, entry.data_file.partition) not in partition_setWhere it checks if a tuple is in a partition_set, but the partition_set only contains the Record according to the signature.
This triggered me, because if you do:
ALTER TABLE prod.db.taxis REPLACE PARTITION FIELD pickup_timestamp WITH day(pickup_timestamp);
-- and then
ALTER TABLE prod.db.taxis REPLACE PARTITION FIELD pickup_timestamp WITH day(dropoff_timestamp);Both of the partitioning strategies will produce a Record[int] because it will contain the number of days since epoch. But the meaning is completely different.
jayceslesar marked this conversation as resolved.
Outdated
Show resolved
Hide resolved
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks out of order to, because validation_history has to_snapshot as the second argument, and from_snapshot as the third argument.
| starting_snapshot, | |
| parent_snapshot, | |
| starting_snapshot, | |
| parent_snapshot, |
Do you think it would be better to update validation_history function to use the following function signature instead? I think it's a lot more expected to have from_snapshot then to_snapshot
def validation_history(
table: Table,
from_snapshot: Snapshot,
to_snapshot: Snapshot,
matching_operations: set[Operation],
manifest_content_filter: ManifestContent,
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you clarify a little more? Are you saying we should move to the terms starting_snapshot (which is from_snapshot) and parent_snapshot (to_snapshot)
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not too sure if this is correct, because ManifestGroup.entries seems to be using inclusive projection.
Should we be using inclusive_projection here instead?
Summoning @Fokko for a second review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree that this should be inclusive projection, since we want to know if there are any matches. Inclusive projection returns rows_might_match and rows_cannot_match. If they cannot be matched, then we can skip it :)
Uh oh!
There was an error while loading. Please reload this page.