-
Notifications
You must be signed in to change notification settings - Fork 413
feat: delete orphaned files #1958
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @jayceslesar, sorry for the late review.
I think this is a great start, I left some comments, let me know what you think!
smaheshwar-pltr
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR @jayceslesar, using InpsectTable to get orphaned files to submit to the executor pool is a nice idea! Just some concerns / suggestions / debugging help 😄
kevinjqliu
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I added a few comments. ptal :)
|
a meta question, wydt of moving the orphan file function to its own file/namespace, similar to how to use i like the idea of having all the table maintenance functions together, similar to delta table's optimize |
I think that makes sense -- would #1880 end up there too? Also ideally there is a CLI that exposes all the maintenance actions too right? I think moving things to a new |
That's a good point. However, I think we should be able to either run them separate as well. For example, delete orphan files won't affect the speed of the table, so it is more of a maintenance feature to reduce object storage costs. Delete orphan files can also be pretty costly because of the list operation, ideally you would delegate this to the catalog that uses, for example, s3 inventory. |
Anton-Tarazi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, left some minor comments. Looking forward to this feature :)
| executor = ExecutorFactory.get_or_create() | ||
| snapshot_ids = [snapshot.snapshot_id for snapshot in snapshots] | ||
| files_by_snapshots: Iterator[Set[str]] = executor.map( | ||
| lambda snapshot_id: set(self.files(snapshot_id)["file_path"].to_pylist()), snapshot_ids |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might be nice if InspectTable.files or InspectTable._files took an Optional[Union[int, Snapshot]] so we didn't have to get the id from a snapshot and then turn it back into a Snapshot inside InspectTable._files
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think there are a lot of places where we arbitrarily use one over the other and imo would be nice to standardize. Probably out of scope for this PR but I think would definitely clean things up
pyiceberg/table/maintenance.py
Outdated
| as_of = datetime.now(timezone.utc) - older_than | ||
| all_files = [ | ||
| f.path for f in fs.get_file_info(selector) if f.type == FileType.File and (as_of is None or (f.mtime < as_of)) | ||
| ] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when would as_of be None? Also can we construct a set directly here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, cleaner now
pyiceberg/table/maintenance.py
Outdated
| except ModuleNotFoundError as e: | ||
| raise ModuleNotFoundError("For metadata operations PyArrow needs to be installed") from e | ||
|
|
||
| def _orphaned_files(self, location: str, older_than: timedelta = timedelta(days=3)) -> Set[str]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: could we get rid of the default here since its in remove_orphan_files? could also make this default to None and update handling of as_of below to support None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be implemented
|
@Fokko we probably also want pyiceberg to have some idea about https://iceberg.apache.org/spec/#delete-formats right? Is it currently aware of those files? |
|
@jayceslesar I believe the merge-on-read delete files (positional deletes, equality deletes, and deletion vectors) are returned by the all-files. The only part that's missing is the partition statistics files. |
Sounds good, I will add the partition statistics files when that is merged! |
|
Once issue I've found with this PR is that the catalog properties need to propagate to |
pyiceberg/table/maintenance.py
Outdated
| flat_known_files: set[str] = reduce(set.union, all_known_files.values(), set()) | ||
|
|
||
| scheme, _, _ = PyArrowFileIO.parse_location(location) | ||
| pyarrow_io = PyArrowFileIO() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| pyarrow_io = PyArrowFileIO() | |
| pyarrow_io = PyArrowFileIO(properties=self.tbl.catalog.properties) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Id like to see if I can achieve this without pyarrow and will attempt to do so after working in #2146
pyiceberg/table/maintenance.py
Outdated
| if older_than is None: | ||
| older_than = timedelta(0) | ||
| as_of = datetime.now(timezone.utc) - older_than | ||
| all_files = [f.path for f in fs.get_file_info(selector) if f.type == FileType.File and f.mtime < as_of] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| all_files = [f.path for f in fs.get_file_info(selector) if f.type == FileType.File and f.mtime < as_of] | |
| all_files = [f"{scheme}://{f.path}" for f in fs.get_file_info(selector) if f.type == FileType.File and f.mtime < as_of] |
| try: | ||
| import pyarrow as pa # noqa: F401 | ||
| except ModuleNotFoundError as e: | ||
| raise ModuleNotFoundError( | ||
| "For deleting orphaned files with a PyArrowFileIO, PyArrow needs to be installed" | ||
| ) from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will this error ever happen? If the table's io is a PyArrowFileIo I think we've already verified that PyArrow is installed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We dont ask if its pyarrowfilio we ask if it isnt fsspecfilio
Co-authored-by: aammar5 <89264433+aammar5@users.noreply.github.com>
|
Going to get around adding tests for both types of FileIO... @Fokko @kevinjqliu anything else you think we need here? |
|
@jayceslesar how's this coming? Let me know if i can help with anything. Id like to use this in prod as well! |
Closes #1200
Rationale for this change
Ability to do more table maintenance from pyiceberg (iceberg-python?)
Are these changes tested?
Added a test!
Are there any user-facing changes?
Yes, this is a new method on the
Tableclass.