-
Notifications
You must be signed in to change notification settings - Fork 413
perf: optimize inspect.partitions
#2359
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Parallelizes manifest processing to improve performance for large tables with many manifest files. After parallel processing, merges the resulting partition maps to produce the final aggregated result.
|
Hey @jayceslesar could you please take a look on this PR? Thank you. I took this PR as ref and wanted to apply to |
Fokko
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @emilie-wang Thanks for speeding this up. While at it, I think we need to do a minor refactor as well to keep everything readable.
pyiceberg/table/inspect.py
Outdated
| partitions_map: Dict[Tuple[str, Any], Any] = {} | ||
| snapshot = self._get_snapshot(snapshot_id) | ||
| for manifest in snapshot.manifests(self.tbl.io): | ||
| def process_manifest(manifest: ManifestFile) -> Dict[Tuple[str, Any], Any]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're at it, I would suggest two things:
- Move the inline function to the class level, and add an underscore to the name, to indicate that it is considered private
_process_manifest. - Merge this function with
update_partitions_map, since that function isn't used anywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @Fokko, thank you for the review and updated with the code refactoring.
|
Thanks for fixing this @emilie-wang 🙌 |
Parallelizes manifest processing to improve performance for large tables with many manifest files. After parallel processing, merges the resulting partition maps to produce the final aggregated result.
Previous example ref: e937f6a
Rationale for this change
Perf improvement.
We experienced slowness with table.inspect.partitions() with large table.
Are these changes tested?
Yes.
Are there any user-facing changes?
No.