feat: delete orphaned files #1958

jayceslesar · 2025-04-29T22:42:05Z

Closes #1200

Rationale for this change

Ability to do more table maintenance from pyiceberg (iceberg-python?)

Are these changes tested?

Added a test!

Are there any user-facing changes?

Yes, this is a new method on the Table class.

pyiceberg/table/__init__.py

pyiceberg/table/inspect.py

pyiceberg/table/__init__.py

Fokko

Thanks for working on this @jayceslesar, sorry for the late review.

I think this is a great start, I left some comments, let me know what you think!

pyiceberg/table/__init__.py

pyiceberg/table/inspect.py

smaheshwar-pltr

Thanks for the PR @jayceslesar, using InspectTable to get orphaned files to submit to the executor pool is a nice idea! Just some concerns / suggestions / debugging help 😄

pyiceberg/table/inspect.py

kevinjqliu

Thanks for the PR! I added a few comments. ptal :)

pyiceberg/table/__init__.py

pyiceberg/table/inspect.py

pyiceberg/table/__init__.py

pyiceberg/table/inspect.py

kevinjqliu · 2025-05-04T01:21:41Z

a meta question, wdyt of moving the orphan file function to its own file/namespace, similar to how we use .inspect?

i like the idea of having all the table maintenance functions together, similar to delta table's optimize

jayceslesar · 2025-05-04T16:53:12Z

a meta question, wdyt of moving the orphan file function to its own file/namespace, similar to how we use .inspect?

i like the idea of having all the table maintenance functions together, similar to delta table's optimize

I think that makes sense -- would #1880 end up there too?

Also, ideally there would be a CLI that exposes all the maintenance actions too, right?

I think moving things to a new OptimizeTable class in a new optimize.py namespace makes a lot of sense. It can be modeled closely on InspectTable and generally makes things cleaner. I think it still makes sense to keep all_known_files inside inspect, though; the new OptimizeTable can still use it.
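A minimal sketch of the namespace layout discussed above, using toy stand-in classes rather than pyiceberg's real Table. The class name MaintenanceTable follows the later naming in this thread; the `known_files` and `storage_files` fields, the `maintenance` property, and the method bodies are assumptions made only so the example runs on its own.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Table:
    """Toy stand-in for a table; holds only what this sketch needs."""

    known_files: Set[str] = field(default_factory=set)    # files referenced by metadata
    storage_files: Set[str] = field(default_factory=set)  # files present in storage

    @property
    def inspect(self) -> "InspectTable":
        return InspectTable(self)

    @property
    def maintenance(self) -> "MaintenanceTable":
        return MaintenanceTable(self)


class InspectTable:
    """Read-only metadata queries live in this namespace."""

    def __init__(self, tbl: Table) -> None:
        self.tbl = tbl

    def all_known_files(self) -> Set[str]:
        return set(self.tbl.known_files)


class MaintenanceTable:
    """Mutating maintenance actions live here and reuse the inspect namespace."""

    def __init__(self, tbl: Table) -> None:
        self.tbl = tbl

    def remove_orphan_files(self, dry_run: bool = False) -> Set[str]:
        # Orphans are files in storage that no metadata references.
        orphans = self.tbl.storage_files - self.tbl.inspect.all_known_files()
        if not dry_run:
            self.tbl.storage_files -= orphans
        return orphans
```

With this shape, a caller would write `tbl.maintenance.remove_orphan_files(dry_run=True)` to preview deletions, while `all_known_files` stays reachable through `tbl.inspect`.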

Fokko · 2025-05-13T14:42:37Z

i like the idea of having all the table maintenance functions together, similar to delta table's optimize

That's a good point. However, I think we should be able to run them separately as well. For example, deleting orphan files won't affect the speed of the table, so it is more of a maintenance feature to reduce object storage costs. Deleting orphan files can also be pretty costly because of the list operation; ideally you would delegate this to a catalog that uses, for example, S3 Inventory.

pyiceberg/table/__init__.py

pyiceberg/table/inspect.py

Anton-Tarazi

Nice work, left some minor comments. Looking forward to this feature :)

Anton-Tarazi · 2025-06-16T03:49:58Z

pyiceberg/table/inspect.py

+        executor = ExecutorFactory.get_or_create()
+        snapshot_ids = [snapshot.snapshot_id for snapshot in snapshots]
+        files_by_snapshots: Iterator[Set[str]] = executor.map(
+            lambda snapshot_id: set(self.files(snapshot_id)["file_path"].to_pylist()), snapshot_ids


might be nice if InspectTable.files or InspectTable._files took an Optional[Union[int, Snapshot]] so we didn't have to get the id from a snapshot and then turn it back into a Snapshot inside InspectTable._files

Yeah, I think there are a lot of places where we arbitrarily use one over the other, and imo it would be nice to standardize. Probably out of scope for this PR, but I think it would definitely clean things up.
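One way the suggested `Optional[Union[int, Snapshot]]` parameter could be normalized is sketched below. The `Snapshot` dataclass is a toy stand-in, and `resolve_snapshot` with its `snapshots_by_id` and `current` arguments is a hypothetical helper, not an existing pyiceberg function.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass(frozen=True)
class Snapshot:
    """Toy stand-in for a snapshot; only the id matters here."""

    snapshot_id: int


def resolve_snapshot(
    snapshots_by_id: Dict[int, Snapshot],
    current: Optional[Snapshot],
    snapshot: Optional[Union[int, Snapshot]] = None,
) -> Optional[Snapshot]:
    """Accept an id, a Snapshot, or None and return a Snapshot (or None)."""
    if snapshot is None:
        # No argument given: fall back to the current snapshot, if any.
        return current
    if isinstance(snapshot, Snapshot):
        # Already a Snapshot: no id round trip needed.
        return snapshot
    try:
        return snapshots_by_id[snapshot]
    except KeyError:
        raise ValueError(f"Snapshot not found: {snapshot}") from None
```

A caller that already holds Snapshot objects can then pass them straight through instead of extracting `snapshot_id` first.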

Anton-Tarazi · 2025-06-16T03:51:52Z

pyiceberg/table/maintenance.py

+        as_of = datetime.now(timezone.utc) - older_than
+        all_files = [
+            f.path for f in fs.get_file_info(selector) if f.type == FileType.File and (as_of is None or (f.mtime < as_of))
+        ]


when would as_of be None? Also can we construct a set directly here?

Good catch, cleaner now

Anton-Tarazi · 2025-06-16T03:57:49Z

pyiceberg/table/maintenance.py

+        except ModuleNotFoundError as e:
+            raise ModuleNotFoundError("For metadata operations PyArrow needs to be installed") from e
+
+    def _orphaned_files(self, location: str, older_than: timedelta = timedelta(days=3)) -> Set[str]:


nit: could we get rid of the default here since it's in remove_orphan_files? could also make this default to None and update handling of as_of below to support None

This should be implemented
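The two review points above (building a set directly, and letting older_than be None) can be sketched without any storage backend. `ListedFile` is a hypothetical stand-in for a file-listing entry, and the `listing`, `known_files`, and `now` parameters are assumptions for illustration; the PR itself lists files through a filesystem object.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class ListedFile:
    """Hypothetical stand-in for one entry of a storage listing."""

    path: str
    mtime: datetime
    is_file: bool = True


def orphaned_files(
    listing: Iterable[ListedFile],
    known_files: Set[str],
    older_than: Optional[timedelta] = None,
    now: Optional[datetime] = None,
) -> Set[str]:
    """Return listed file paths older than the cutoff that no metadata references."""
    if now is None:
        now = datetime.now(timezone.utc)
    # None means "no grace period": every file modified before `now` is a candidate.
    as_of = now - (older_than if older_than is not None else timedelta(0))
    # Build the set directly instead of a list that is converted later.
    candidates = {f.path for f in listing if f.is_file and f.mtime < as_of}
    return candidates - known_files
```

Passing `now` explicitly keeps the function deterministic in tests. A non-zero older_than is the safer choice in practice, since a file written by an in-flight commit is not yet referenced by any metadata and would otherwise look orphaned.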

jayceslesar · 2025-06-24T12:35:57Z

@Fokko we probably also want pyiceberg to have some idea about https://iceberg.apache.org/spec/#delete-formats right? Is it currently aware of those files?

Fokko · 2025-06-24T14:44:08Z

@jayceslesar I believe the merge-on-read delete files (positional deletes, equality deletes, and deletion vectors) are returned by the all-files. The only part that's missing is the partition statistics files.

jayceslesar · 2025-06-24T15:35:22Z

@jayceslesar I believe the merge-on-read delete files (positional deletes, equality deletes, and deletion vectors) are returned by the all-files. The only part that's missing is the partition statistics files.

Sounds good, I will add the partition statistics files when that is merged!

aammar5 · 2025-07-10T15:30:08Z

One issue I've found with this PR is that the catalog properties need to propagate to PyArrowFileIO(properties=...); otherwise endpoint, authentication, and similar settings for things like S3 simply fail ...

aammar5 · 2025-07-10T15:41:36Z

pyiceberg/table/maintenance.py

+        flat_known_files: set[str] = reduce(set.union, all_known_files.values(), set())
+
+        scheme, _, _ = PyArrowFileIO.parse_location(location)
+        pyarrow_io = PyArrowFileIO()


Suggested change

-        pyarrow_io = PyArrowFileIO()
+        pyarrow_io = PyArrowFileIO(properties=self.tbl.catalog.properties)

I'd like to see if I can achieve this without PyArrow, and will attempt to do so after working on #2146.

aammar5 · 2025-07-10T16:16:51Z

pyiceberg/table/maintenance.py

+        if older_than is None:
+            older_than = timedelta(0)
+        as_of = datetime.now(timezone.utc) - older_than
+        all_files = [f.path for f in fs.get_file_info(selector) if f.type == FileType.File and f.mtime < as_of]


Suggested change

-        all_files = [f.path for f in fs.get_file_info(selector) if f.type == FileType.File and f.mtime < as_of]
+        all_files = [f"{scheme}://{f.path}" for f in fs.get_file_info(selector) if f.type == FileType.File and f.mtime < as_of]
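A small sketch of why the scheme prefix in this suggestion matters, assuming the listing returns paths without a scheme while table metadata stores full URIs. `with_scheme` is a hypothetical helper and the bucket paths are made-up examples.

```python
from urllib.parse import urlparse


def with_scheme(scheme: str, path: str) -> str:
    """Prefix a scheme-less listing path; leave full URIs untouched."""
    if urlparse(path).scheme:
        return path
    return f"{scheme}://{path}" if scheme else path


# Paths as metadata might record them (full URIs).
known_files = {"s3://bucket/warehouse/data/a.parquet"}

# Paths as a listing might return them (no scheme).
listed = ["bucket/warehouse/data/a.parquet", "bucket/warehouse/data/stale.parquet"]

# Without normalization nothing matches, so every listed file looks orphaned.
naive_orphans = set(listed) - known_files

# With the scheme restored, only the unreferenced file remains.
orphans = {with_scheme("s3", p) for p in listed} - known_files
```

Here `naive_orphans` contains both listed paths, including the live data file, whereas `orphans` contains only the stale one. Getting this comparison wrong would delete files the table still needs.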

Anton-Tarazi · 2025-08-09T23:16:23Z

pyiceberg/table/maintenance.py

+            try:
+                import pyarrow as pa  # noqa: F401
+            except ModuleNotFoundError as e:
+                raise ModuleNotFoundError(
+                    "For deleting orphaned files with a PyArrowFileIO, PyArrow needs to be installed"
+                ) from e


will this error ever happen? If the table's io is a PyArrowFileIO I think we've already verified that PyArrow is installed

We don't ask if it's a PyArrowFileIO; we ask if it isn't an FsspecFileIO.

pyiceberg/table/maintenance.py

pyiceberg/table/inspect.py

Co-authored-by: aammar5 <89264433+aammar5@users.noreply.github.com>

jayceslesar · 2025-09-22T21:01:29Z

Going to get around to adding tests for both types of FileIO... @Fokko @kevinjqliu anything else you think we need here?

ForeverAngry · 2025-11-10T15:53:59Z

@jayceslesar how's this coming? Let me know if I can help with anything. I'd like to use this in prod as well!

jayceslesar and others added 3 commits April 29, 2025 16:58

feat: delete orphaned files

9dcb580

simpler and a test

e43505c

remove

eed5ea8

jayceslesar commented Apr 29, 2025

pyiceberg/table/__init__.py (outdated)

jayceslesar commented Apr 29, 2025

pyiceberg/table/inspect.py

jayceslesar commented Apr 29, 2025

pyiceberg/table/__init__.py (outdated)

Fokko reviewed May 2, 2025

jayceslesar added 3 commits May 2, 2025 17:22

updates from review!

8cca600

include dry run and older than

75b1240

add case for dry run

6379480

smaheshwar-pltr suggested changes May 3, 2025

pyiceberg/table/inspect.py (outdated)

pyiceberg/table/inspect.py (outdated)

pyiceberg/table/inspect.py (outdated)

pyiceberg/table/inspect.py (outdated)

jayceslesar added 7 commits May 3, 2025 14:16

use .path so we get paths pack

0c2822e

actually pass in iterable

aaf8fc2

capture manifest_list files

b09641b

refactor into all_known_files

beec233

fix type in docstring

b888c56

mildly more readable

ff461ed

beef up tests

3b3b10e

kevinjqliu reviewed May 4, 2025

jayceslesar added 4 commits May 4, 2025 12:54

make older_than required

a62c8cf

move under optimize namespace

07cbf1b

add some better logging about what was/was not deleted

54e1e00

Merge branch 'main' into feat/orphan-files

7c780d3

Fokko mentioned this pull request May 13, 2025

Add all filles metadata tables #1626

Merged

Merge branch 'main' into feat/orphan-files

9b6c9ed

Fokko reviewed May 16, 2025

pyiceberg/table/__init__.py (outdated)

Fokko reviewed May 16, 2025

pyiceberg/table/inspect.py (outdated)

jayceslesar added 2 commits May 28, 2025 14:55

Merge branch 'main' into feat/orphan-files

85b4ab3

Merge branch 'main' into feat/orphan-files

c414df8

smaheshwar-pltr mentioned this pull request Jun 11, 2025

Added ExpireSnapshots Feature #1880

Merged

Anton-Tarazi reviewed Jun 16, 2025

jayceslesar added 3 commits June 21, 2025 11:40

Merge branch 'main' into feat/orphan-files

aa9d536

fix test

b4c14fc

allow older_than to be None

f4d98d2

aammar5 reviewed Jul 10, 2025

jayceslesar added 3 commits July 13, 2025 12:50

Merge branch 'main' into feat/orphan-files

acd8ed6

add partition statistics

2a9c607

safer

aae92bc

aammar5 mentioned this pull request Jul 16, 2025

refactor: consolidate snapshot expiration into MaintenanceTable #2143

Merged

jayceslesar mentioned this pull request Jul 30, 2025

feature partity: fsspec vs pyarrow #2259

Open

jayceslesar added 2 commits August 4, 2025 21:49

Merge branch 'main' into feat/orphan-files

756e199

work with both file IO's

ad5387a

Anton-Tarazi reviewed Aug 9, 2025

aammar5 reviewed Aug 15, 2025

pyiceberg/table/inspect.py

jayceslesar and others added 4 commits September 22, 2025 16:48

Merge branch 'main' into feat/orphan-files

654a51e

Update pyiceberg/table/inspect.py

f97611a

Co-authored-by: aammar5 <89264433+aammar5@users.noreply.github.com>

undo

12f6d44

helper dataclass

2223460

Anton-Tarazi mentioned this pull request Oct 12, 2025

Remove deleted data files with expire_snapshots #2604

Open

ForeverAngry mentioned this pull request Nov 10, 2025

Tracking issues of PyIceberg 0.11 release #2574

Open

