Add comprehensive ORC read support to PyArrow I/O #2432
Conversation
local example working for our iceberg table: `>>> table = get_patched_table()`
Thank you @mccormickt12, I really like this feature and find it useful for enabling data analysis over cold-storage historical data. I am keen on helping you release this soon. Feel free to ask for any help if you get stuck. 🚀 🎸 🦜
@tushar-choudhary-tc is there anything I should add/do to help get this in? Are you able to trigger tests? `make lint` and `make test` are both passing.
# The PARQUET: in front means that it is Parquet specific, in this case the field_id
PYARROW_PARQUET_FIELD_ID_KEY = b"PARQUET:field_id"
# ORC field ID key for Iceberg field IDs in ORC metadata
ORC_FIELD_ID_KEY = b"iceberg.id"
This is in line with the Java impl, should we also set the required attribute?
@Fokko I don't have a ton of context on this. Do you think this is required for this PR? Could it be a separate PR?
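For context, a minimal sketch of what carrying the required attribute could look like, assuming the Java-style ORC attribute names; the `iceberg.required` key and the example field are assumptions taken from the Java implementation, not something this PR adds:

```python
import pyarrow as pa

# Hypothetical example: mirroring the Java ORC attributes on a pyarrow field.
# "iceberg.required" is assumed from the Java implementation, not part of this PR.
field = pa.field(
    "event_id",
    pa.int64(),
    nullable=False,  # a required Iceberg field maps to a non-nullable Arrow field
    metadata={b"iceberg.id": b"1", b"iceberg.required": b"true"},
)
```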
Features implemented:
- Record batching and table reading via ArrowScan
- Column projection and row filtering with predicate pushdown
- Positional deletes support (with ORC-specific non-dictionary handling)
- Schema mapping for files without field IDs
- Streaming via Iterator[pa.RecordBatch] for memory efficiency
- Full integration with Iceberg metadata and partitioning
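As a rough usage sketch of the features above (the catalog and table names are hypothetical, and an ORC-backed table is assumed), projection, filtering, and batch streaming would surface through the normal scan API:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical names; assumes a configured catalog and a table whose data files are ORC.
catalog = load_catalog("default")
table = catalog.load_table("analytics.events")

scan = table.scan(
    row_filter="event_date >= '2024-01-01'",      # predicate pushdown
    selected_fields=("event_id", "event_date"),   # column projection
)

# Stream record batches instead of materializing the whole table in memory.
for batch in scan.to_arrow_batch_reader():
    print(batch.num_rows)
```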
- Add ORC_FIELD_ID_KEY constant for ORC field ID metadata
- Update _get_field_id function to support both Parquet and ORC field IDs
- Update schema_to_pyarrow to accept a file_format parameter for proper field ID handling
- Update _ConvertToArrowSchema to add correct field IDs based on file format
- Add comprehensive test coverage:
  * test_orc_field_id_extraction: tests field ID extraction from PyArrow metadata
  * test_orc_schema_with_field_ids: tests ORC reading with embedded field IDs (no name mapping needed)
  * test_orc_schema_conversion_with_field_ids: tests schema conversion with ORC field IDs

These changes fix the original error where ORC files with field IDs couldn't be read without name mapping, and provide comprehensive test coverage to prevent regression.
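A minimal sketch of the field-ID lookup described above (not the exact code in this PR): check the pyarrow field metadata for the Parquet key first and fall back to the ORC key.

```python
from typing import Optional

import pyarrow as pa

PYARROW_PARQUET_FIELD_ID_KEY = b"PARQUET:field_id"
ORC_FIELD_ID_KEY = b"iceberg.id"


def get_field_id(field: pa.Field) -> Optional[int]:
    """Return the Iceberg field ID from a pyarrow field's metadata, if present."""
    if field.metadata is None:
        return None
    # Parquet embeds the ID under "PARQUET:field_id"; ORC under "iceberg.id".
    for key in (PYARROW_PARQUET_FIELD_ID_KEY, ORC_FIELD_ID_KEY):
        raw = field.metadata.get(key)
        if raw is not None:
            return int(raw)
    return None


# Example: a field carrying an ORC-style field ID.
assert get_field_id(pa.field("event_id", pa.int64(), metadata={ORC_FIELD_ID_KEY: b"1"})) == 1
```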
… size, and compression interactions. Tests show ORC batching is based on stripes (like Parquet row groups), with near-perfect 1:1 mapping achievable using large stripe sizes (2-5MB) and hard-to-compress data, achieving 0.91-0.97 ratios between stripe size and actual file size.
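A small, self-contained sketch of that kind of experiment (the path and sizes are illustrative, and the stripe_size/compression keyword arguments to pyarrow.orc.write_table are assumed): write hard-to-compress data with a 2 MB stripe size, then compare stripe count and file size.

```python
import os

import numpy as np
import pyarrow as pa
import pyarrow.orc as orc

STRIPE_SIZE = 2 * 1024 * 1024  # 2 MB stripes

# Random bytes compress poorly, keeping the on-disk size close to the raw data size.
table = pa.table({"payload": [np.random.bytes(1024) for _ in range(4096)]})  # ~4 MB of data

path = "/tmp/stripe_experiment.orc"
orc.write_table(table, path, stripe_size=STRIPE_SIZE, compression="zlib")

orc_file = orc.ORCFile(path)
print("stripes:", orc_file.nstripes)  # expect roughly 2 stripes for ~4 MB of data
print("file size / (stripes * stripe size):",
      os.path.getsize(path) / (orc_file.nstripes * STRIPE_SIZE))
```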
ffcd6e9 to 3649411
Fokko left a comment
Let's move this forward for now, and follow up on https://github.com/apache/iceberg-python/pull/2432/files#r2353406023
Rationale for this change
Are these changes tested?
Are there any user-facing changes?