-
Notifications
You must be signed in to change notification settings - Fork 414
feat: validate_no_new_added_delete_files
#2176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
sungwy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @gabeiglio - thank you for working on this PR, and sorry for the delayed review!
I've just started reviewing the proposed changes, and will continue to add some more comments this week.
I think this is a great start! I've added some comments regarding the PartitionMap class by referring to the Java implementation.
0c82650 to
c0c9236
Compare
|
@sungwy thanks for the review! applied the changes. |
sungwy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for adopting the feedback @gabeiglio ! I've taken a pass through the validate module and made some suggestions.
| if start >= len(self._entries): | ||
| return [] | ||
|
|
||
| if start == 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like the logic for filtering positionDeletes. Should we implement an equivalent of canContainEqDeletesForFile to ensure that we are filtering this correctly?
|
|
||
| file: DataFile = entry.data_file | ||
| content: DataFileContent = file.content | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We now have support for Deletion Vectors in PyIceberg. Could we add support for adding deletes from deletion vectors similar to the java reference implementation? Including this will be important for accuracy.
Here's an example of how a function in PyIceberg detects that a Delete File is stored in the form of a deletion vector.
iceberg-python/pyiceberg/io/pyarrow.py
Lines 981 to 999 in ad8263b
| def _read_deletes(io: FileIO, data_file: DataFile) -> Dict[str, pa.ChunkedArray]: | |
| if data_file.file_format == FileFormat.PARQUET: | |
| with io.new_input(data_file.file_path).open() as fi: | |
| delete_fragment = _get_file_format( | |
| data_file.file_format, dictionary_columns=("file_path",), pre_buffer=True, buffer_size=ONE_MEGABYTE | |
| ).make_fragment(fi) | |
| table = ds.Scanner.from_fragment(fragment=delete_fragment).to_table() | |
| table = table.unify_dictionaries() | |
| return { | |
| file.as_py(): table.filter(pc.field("file_path") == file).column("pos") | |
| for file in table.column("file_path").chunks[0].dictionary | |
| } | |
| elif data_file.file_format == FileFormat.PUFFIN: | |
| with io.new_input(data_file.file_path).open() as fi: | |
| payload = fi.read() | |
| return PuffinFile(payload).to_vector() | |
| else: | |
| raise ValueError(f"Delete file format not supported: {data_file.file_format}") |
|
|
||
| def __init__( | ||
| self, | ||
| delete_entries: list[ManifestEntry], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be helpful to have a more descriptive docstring explaining the inputs and outputs.
For example, I've noticed a discrepancy in the inputs for the DeleteFileIndex in PyIceberg compared to Java. Here, we are using a list of Manifest entries instead of a list of DeleteFiles. Hence, I think it would be important that we filter the entries to ensure that we are only looking at entries that have the file content type of a DELETE_FILE and ignore the DATA_FILE. If we are assuming that the list of ManifestEntry is already filtered when it is provided as an input into this init function, I think it would be important to make note of that in the docstring.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not want to make the PR much bigger (it is already big) so I tried to trim down some functionality that were not immediately needed by the validation. It would be free to add an extra check for DELETE_FILE inside the DeleteFileIndex so might as well add it. And I hear your feedback on the docstrings ill add them also. Thanks!
|
@sungwy Do you think its worth while to break this PR in multiple? I feel this PR got really big (on my local branch its arround 900 lines with the tests included). I was thinking in the following PRs: PR# 1 - Partition Map wdyt? |
|
@gabeiglio that sounds good to me, I think we can use the partition map here too?
Supporting concurrency in 0.11.0 would be awesome |
Closes #1930
Rationale for this change
Add
_validate_no_new_delete_filesand dependencies likeDeleteFileIndex,PartitionMapthat are needed for the validationAre these changes tested?
Added test for the validation, DeleteFileIndex, and PartitionMap
Are there any user-facing changes?
No