Add comprehensive ORC read support to PyArrow I/O #2432
Conversation
local example working for our iceberg table: `>>> table = get_patched_table()`
Thank you @mccormickt12, I really like this feature and find it useful for enabling data analysis over cold-storage historical data. I am keen on helping you release this soon. Feel free to ask for any help if you get stuck. 🚀 🎸 🦜
@tushar-choudhary-tc is there anything I should add/do to help get this in? Are you able to trigger tests? `make lint` and `make test` are both passing.
# The PARQUET: in front means that it is Parquet specific, in this case the field_id
PYARROW_PARQUET_FIELD_ID_KEY = b"PARQUET:field_id"
# ORC field ID key for Iceberg field IDs in ORC metadata
ORC_FIELD_ID_KEY = b"iceberg.id"
This is in line with the Java impl, should we also set the required attribute?
@Fokko I don't have a ton of context on this. Do you think this is required for this PR? Could it be a separate PR?
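For context, a minimal sketch of what carrying the required attribute could look like, assuming the Java-style ORC attribute names; the `iceberg.required` key and the example field are assumptions taken from the Java implementation, not something this PR adds:

```python
import pyarrow as pa

# Hypothetical example: mirroring the Java ORC attributes on a pyarrow field.
# "iceberg.required" is assumed from the Java implementation, not part of this PR.
field = pa.field(
    "event_id",
    pa.int64(),
    nullable=False,  # a required Iceberg field maps to a non-nullable Arrow field
    metadata={b"iceberg.id": b"1", b"iceberg.required": b"true"},
)
```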
Features implemented:
- Record batching and table reading via ArrowScan
- Column projection and row filtering with predicate pushdown
- Positional deletes support (with ORC-specific non-dictionary handling)
- Schema mapping for files without field IDs
- Streaming via Iterator[pa.RecordBatch] for memory efficiency
- Full integration with Iceberg metadata and partitioning
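As a rough usage sketch of the features above (the catalog and table names are hypothetical, and an ORC-backed table is assumed), projection, filtering, and batch streaming would surface through the normal scan API:

```python
from pyiceberg.catalog import load_catalog

# Hypothetical names; assumes a configured catalog and a table whose data files are ORC.
catalog = load_catalog("default")
table = catalog.load_table("analytics.events")

scan = table.scan(
    row_filter="event_date >= '2024-01-01'",      # predicate pushdown
    selected_fields=("event_id", "event_date"),   # column projection
)

# Stream record batches instead of materializing the whole table in memory.
for batch in scan.to_arrow_batch_reader():
    print(batch.num_rows)
```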
- Add ORC_FIELD_ID_KEY constant for ORC field ID metadata
- Update _get_field_id function to support both Parquet and ORC field IDs
- Update schema_to_pyarrow to accept a file_format parameter for proper field ID handling
- Update _ConvertToArrowSchema to add correct field IDs based on file format
- Add comprehensive test coverage:
  * test_orc_field_id_extraction: tests field ID extraction from PyArrow metadata
  * test_orc_schema_with_field_ids: tests ORC reading with embedded field IDs (no name mapping needed)
  * test_orc_schema_conversion_with_field_ids: tests schema conversion with ORC field IDs

These changes fix the original error where ORC files with field IDs couldn't be read without name mapping, and provide comprehensive test coverage to prevent regression.
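A minimal sketch of the field-ID lookup described above (not the exact code in this PR): check the pyarrow field metadata for the Parquet key first and fall back to the ORC key.

```python
from typing import Optional

import pyarrow as pa

PYARROW_PARQUET_FIELD_ID_KEY = b"PARQUET:field_id"
ORC_FIELD_ID_KEY = b"iceberg.id"


def get_field_id(field: pa.Field) -> Optional[int]:
    """Return the Iceberg field ID from a pyarrow field's metadata, if present."""
    if field.metadata is None:
        return None
    # Parquet embeds the ID under "PARQUET:field_id"; ORC under "iceberg.id".
    for key in (PYARROW_PARQUET_FIELD_ID_KEY, ORC_FIELD_ID_KEY):
        raw = field.metadata.get(key)
        if raw is not None:
            return int(raw)
    return None


# Example: a field carrying an ORC-style field ID.
assert get_field_id(pa.field("event_id", pa.int64(), metadata={ORC_FIELD_ID_KEY: b"1"})) == 1
```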
… size, and compression interactions. Tests show ORC batching is based on stripes (like Parquet row groups), with near-perfect 1:1 mapping achievable using large stripe sizes (2-5MB) and hard-to-compress data, achieving 0.91-0.97 ratios between stripe size and actual file size.
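A small, self-contained sketch of that kind of experiment (the path and sizes are illustrative, and the stripe_size/compression keyword arguments to pyarrow.orc.write_table are assumed): write hard-to-compress data with a 2 MB stripe size, then compare stripe count and file size.

```python
import os

import numpy as np
import pyarrow as pa
import pyarrow.orc as orc

STRIPE_SIZE = 2 * 1024 * 1024  # 2 MB stripes

# Random bytes compress poorly, keeping the on-disk size close to the raw data size.
table = pa.table({"payload": [np.random.bytes(1024) for _ in range(4096)]})  # ~4 MB of data

path = "/tmp/stripe_experiment.orc"
orc.write_table(table, path, stripe_size=STRIPE_SIZE, compression="zlib")

orc_file = orc.ORCFile(path)
print("stripes:", orc_file.nstripes)  # expect roughly 2 stripes for ~4 MB of data
print("file size / (stripes * stripe size):",
      os.path.getsize(path) / (orc_file.nstripes * STRIPE_SIZE))
```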
ffcd6e9 to 3649411
Fokko left a comment
Let's move this forward for now, and follow up on https://github.com/apache/iceberg-python/pull/2432/files#r2353406023
Rationale for this change
Are these changes tested?
Are there any user-facing changes?