Skip to content

Conversation

@gabeiglio
Copy link
Contributor

@gabeiglio gabeiglio commented Jul 6, 2025

Closes #1930

Rationale for this change

Add _validate_no_new_delete_files and dependencies like DeleteFileIndex, PartitionMap that are needed for the validation

Are these changes tested?

Added test for the validation, DeleteFileIndex, and PartitionMap

Are there any user-facing changes?

No

@gabeiglio gabeiglio marked this pull request as ready for review July 8, 2025 20:14
Copy link
Collaborator

@sungwy sungwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @gabeiglio - thank you for working on this PR, and sorry for the delayed review!

I've just started reviewing the proposed changes, and will continue to add some more comments this week.

I think this is a great start! I've added some comments regarding the PartitionMap class by referring to the Java implementation.

@gabeiglio gabeiglio force-pushed the delete-file-index branch from 0c82650 to c0c9236 Compare July 17, 2025 16:58
@gabeiglio
Copy link
Contributor Author

gabeiglio commented Jul 17, 2025

@sungwy thanks for the review! applied the changes.

Copy link
Collaborator

@sungwy sungwy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adopting the feedback @gabeiglio ! I've taken a pass through the validate module and made some suggestions.

if start >= len(self._entries):
return []

if start == 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like the logic for filtering positionDeletes. Should we implement an equivalent of canContainEqDeletesForFile to ensure that we are filtering this correctly?


file: DataFile = entry.data_file
content: DataFileContent = file.content

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now have support for Deletion Vectors in PyIceberg. Could we add support for adding deletes from deletion vectors similar to the java reference implementation? Including this will be important for accuracy.

Here's an example of how a function in PyIceberg detects that a Delete File is stored in the form of a deletion vector.

def _read_deletes(io: FileIO, data_file: DataFile) -> Dict[str, pa.ChunkedArray]:
if data_file.file_format == FileFormat.PARQUET:
with io.new_input(data_file.file_path).open() as fi:
delete_fragment = _get_file_format(
data_file.file_format, dictionary_columns=("file_path",), pre_buffer=True, buffer_size=ONE_MEGABYTE
).make_fragment(fi)
table = ds.Scanner.from_fragment(fragment=delete_fragment).to_table()
table = table.unify_dictionaries()
return {
file.as_py(): table.filter(pc.field("file_path") == file).column("pos")
for file in table.column("file_path").chunks[0].dictionary
}
elif data_file.file_format == FileFormat.PUFFIN:
with io.new_input(data_file.file_path).open() as fi:
payload = fi.read()
return PuffinFile(payload).to_vector()
else:
raise ValueError(f"Delete file format not supported: {data_file.file_format}")


def __init__(
self,
delete_entries: list[ManifestEntry],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be helpful to have a more descriptive docstring explaining the inputs and outputs.

For example, I've noticed a discrepancy in the inputs for the DeleteFileIndex in PyIceberg compared to Java. Here, we are using a list of Manifest entries instead of a list of DeleteFiles. Hence, I think it would be important that we filter the entries to ensure that we are only looking at entries that have the file content type of a DELETE_FILE and ignore the DATA_FILE. If we are assuming that the list of ManifestEntry is already filtered when it is provided as an input into this init function, I think it would be important to make note of that in the docstring.

Copy link
Contributor Author

@gabeiglio gabeiglio Jul 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not want to make the PR much bigger (it is already big) so I tried to trim down some functionality that were not immediately needed by the validation. It would be free to add an extra check for DELETE_FILE inside the DeleteFileIndex so might as well add it. And I hear your feedback on the docstrings ill add them also. Thanks!

@gabeiglio
Copy link
Contributor Author

gabeiglio commented Jul 25, 2025

@sungwy Do you think its worth while to break this PR in multiple? I feel this PR got really big (on my local branch its arround 900 lines with the tests included). I was thinking in the following PRs:

PR# 1 - Partition Map
PR# 2 - EqualityDeleteFile, EqualityDeletes, PositionDelete classes
PR# 3 - DeleteFileIndex + validations

wdyt?

@jayceslesar
Copy link
Contributor

jayceslesar commented Sep 13, 2025

@gabeiglio that sounds good to me, I think we can use the partition map here too?

if partition_set is not None:
Unless I am misremembering but I do remember not implementing the java version in favor of doing the inline logic https://github.com/apache/iceberg/blob/05cd8b52e7d06d0d246475bd11f138d25947fd13/core/src/main/java/org/apache/iceberg/util/PartitionSet.java#L36.

Supporting concurrency in 0.11.0 would be awesome

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support Concurrency Safety Validation: Implement validateNoNewDeleteFiles

3 participants