
Conversation

@mccormickt12
Contributor

Features implemented (see the usage sketch after this list):

  • Record batching and table reading via ArrowScan
  • Column projection and row filtering with predicate pushdown
  • Positional deletes support (with ORC-specific non-dictionary handling)
  • Schema mapping for files without field IDs
  • Streaming via Iterator[pa.RecordBatch] for memory efficiency
  • Full integration with Iceberg metadata and partitioning
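
For context, a minimal sketch of how these capabilities surface through pyiceberg's public scan API. The catalog name, table identifier, filter values, and the `process` consumer are placeholders, not code from this PR:

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.expressions import GreaterThanOrEqual

# Placeholder catalog and table names; any configured catalog works the same way.
catalog = load_catalog("default")
table = catalog.load_table("db.orc_backed_table")

# Column projection plus a predicate that can be pushed down to the data files.
scan = table.scan(
    row_filter=GreaterThanOrEqual("ts", "2025-06-15T00:00:00+00:00"),
    selected_fields=("company", "ticker", "ts"),
)

# Materialize the whole result at once ...
arrow_table = scan.to_arrow()

# ... or stream pa.RecordBatch objects to keep memory bounded.
for batch in scan.to_arrow_batch_reader():
    process(batch)  # hypothetical consumer
```

The streaming path is what matters for large tables: batches are yielded as they are decoded rather than collected up front.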

Rationale for this change

Are these changes tested?

Are there any user-facing changes?

@mccormickt12
Contributor Author

Local example working against our Iceberg table:

```
>>> table = get_patched_table()
25/09/05 20:01:38 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/09/05 20:01:39 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Found metadata files: ['/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/00000-3ec53886-ceae-46f2-a926-050afb7f95b9.metadata.json', '/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/00001-fc1f6c92-0449-4deb-8908-097db5f6589a.metadata.json']
Using latest: hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/00001-fc1f6c92-0449-4deb-8908-097db5f6589a.metadata.json
hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/00001-fc1f6c92-0449-4deb-8908-097db5f6589a.metadata.json

>>> at = table.scan().to_arrow()
/export/home/tmccormi/venv_310/lib/python3.10/site-packages/pyiceberg/avro/decoder.py:185: UserWarning: Falling back to pure Python Avro decoder, missing Cython implementation
warnings.warn("Falling back to pure Python Avro decoder, missing Cython implementation")
Iceberg does not yet support 'ns' timestamp precision. Downcasting to 'us'.
(previous warning repeated 10 times in total)
>>> at.num_rows
10000
>>> df = at.to_pandas()
>>> df.head(20)
company ticker ts
0 Company_3344 SYM44 2025-06-15 00:00:00+00:00
1 Company_3358 SYM58 2025-06-15 00:00:00+00:00
2 Company_3371 SYM71 2025-06-15 00:00:00+00:00
3 Company_3373 SYM73 2025-06-15 00:00:00+00:00
4 Company_3374 SYM74 2025-06-15 00:00:00+00:00
5 Company_3396 SYM96 2025-06-15 00:00:00+00:00
6 Company_3400 SYM0 2025-06-15 00:00:00+00:00
7 Company_3415 SYM15 2025-06-15 00:00:00+00:00
8 Company_3427 SYM27 2025-06-15 00:00:00+00:00
9 Company_3430 SYM30 2025-06-15 00:00:00+00:00
10 Company_3437 SYM37 2025-06-15 00:00:00+00:00
11 Company_3439 SYM39 2025-06-15 00:00:00+00:00
12 Company_3445 SYM45 2025-06-15 00:00:00+00:00
13 Company_3447 SYM47 2025-06-15 00:00:00+00:00
14 Company_3450 SYM50 2025-06-15 00:00:00+00:00
15 Company_3471 SYM71 2025-06-15 00:00:00+00:00
16 Company_3483 SYM83 2025-06-15 00:00:00+00:00
17 Company_3497 SYM97 2025-06-15 00:00:00+00:00
18 Company_3500 SYM0 2025-06-15 00:00:00+00:00
19 Company_3511 SYM11 2025-06-15 00:00:00+00:00
>>> exit()
tmccormi@ltx1-hcl14866 [ ~/python ]$ hdfs dfs -ls hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/
Found 4 items
-rw-r--r-- 3 openhouse openhouse 2900 2025-06-24 22:56 hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/00000-3ec53886-ceae-46f2-a926-050afb7f95b9.metadata.json
-rw-r--r-- 3 openhouse openhouse 4366 2025-06-24 22:56 hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/00001-fc1f6c92-0449-4deb-8908-097db5f6589a.metadata.json
drwxr-xr-x - openhouse openhouse 0 2025-06-24 22:56 hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/data
drwxr-xr-x - openhouse openhouse 0 2025-06-24 22:56 hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/metadata
tmccormi@ltx1-hcl14866 [ ~/python ]$ hdfs dfs -ls hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/metadata
Found 2 items
-rw-r--r-- 3 openhouse openhouse 7498 2025-06-24 22:56 hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/metadata/326f2c33-8423-4ed4-b3cf-8c5a08613705-m0.avro
-rw-r--r-- 3 openhouse openhouse 4330 2025-06-24 22:56 hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/metadata/snap-6430134116377485926-1-326f2c33-8423-4ed4-b3cf-8c5a08613705.avro
tmccormi@ltx1-hcl14866 [ ~/python ]$ hdfs dfs -ls hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/data
Found 10 items
drwxr-xr-x - openhouse openhouse 0 2025-06-24 22:56 hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/data/ts_day=2025-06-15
drwxr-xr-x - openhouse openhouse 0 2025-06-24 22:56 hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/data/ts_day=2025-06-16
drwxr-xr-x - openhouse openhouse 0 2025-06-24 22:56 hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/data/ts_day=2025-06-17
drwxr-xr-x - openhouse openhouse 0 2025-06-24 22:56 hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/data/ts_day=2025-06-18
drwxr-xr-x - openhouse openhouse 0 2025-06-24 22:56 hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/data/ts_day=2025-06-19
drwxr-xr-x - openhouse openhouse 0 2025-06-24 22:56 hdfs://ltx1-yugioh-cluster01.linkfs.prod-ltx1.atd.prod.linkedin.com:9000/jobs/openhouse/cutover_zdt_testing_db/cutover_zdt_testing_table_partitioned_one-f814050d-6416-4fa8-ae85-c63ac74b4567/data/ts_day=2025-06-20
```

@tushar-choudhary-tc

tushar-choudhary-tc commented Sep 9, 2025

Thank you @mccormickt12, I really like this feature and find it useful for enabling data analysis over cold-storage historical data. I am keen on helping you release this soon. Feel free to ask for help if you get stuck. 🚀 🎸 🦜

@mccormickt12
Contributor Author

> Thank you @mccormickt12, I really like this feature and find it useful for enabling data analysis over cold-storage historical data. I am keen on helping you release this soon. Feel free to ask for help if you get stuck. 🚀 🎸 🦜

@tushar-choudhary-tc Is there anything I should add or do to help get this in? Are you able to trigger tests?

`make lint` and `make test` are both passing.

# The PARQUET: prefix marks the key as Parquet-specific; here it carries the field_id
PYARROW_PARQUET_FIELD_ID_KEY = b"PARQUET:field_id"
# Key under which Iceberg field IDs are stored in ORC metadata
ORC_FIELD_ID_KEY = b"iceberg.id"
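
As a rough sketch (not the PR's actual implementation), a reader that honors both keys could recover Iceberg field IDs from PyArrow field metadata like this; `get_field_id` is an illustrative stand-in for the `_get_field_id` helper mentioned in the commit notes:

```python
from typing import Optional

import pyarrow as pa

PYARROW_PARQUET_FIELD_ID_KEY = b"PARQUET:field_id"
ORC_FIELD_ID_KEY = b"iceberg.id"

def get_field_id(field: pa.Field) -> Optional[int]:
    """Return the Iceberg field ID stored in the field's metadata, if any."""
    if field.metadata is None:
        return None
    # Check the Parquet key first, then fall back to the ORC key.
    for key in (PYARROW_PARQUET_FIELD_ID_KEY, ORC_FIELD_ID_KEY):
        value = field.metadata.get(key)
        if value is not None:
            return int(value)
    return None

# Attaching an ID works the same way in reverse: store it as field metadata.
field = pa.field("ticker", pa.string(), metadata={ORC_FIELD_ID_KEY: b"2"})
assert get_field_id(field) == 2
```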
@mccormickt12
Contributor Author
@Fokko I don't have a ton of context on this. Do you think this is required for this PR? Could it be a separate PR?

Tom McCormick added 7 commits September 22, 2025 12:07
Features implemented:
- Record batching and table reading via ArrowScan
- Column projection and row filtering with predicate pushdown
- Positional deletes support (with ORC-specific non-dictionary handling)
- Schema mapping for files without field IDs
- Streaming via Iterator[pa.RecordBatch] for memory efficiency
- Full integration with Iceberg metadata and partitioning
- Add ORC_FIELD_ID_KEY constant for ORC field ID metadata
- Update _get_field_id function to support both Parquet and ORC field IDs
- Update schema_to_pyarrow to accept file_format parameter for proper field ID handling
- Update _ConvertToArrowSchema to add correct field IDs based on file format
- Add comprehensive test coverage:
  * test_orc_field_id_extraction: Tests field ID extraction from PyArrow metadata
  * test_orc_schema_with_field_ids: Tests ORC reading with embedded field IDs (no name mapping needed)
  * test_orc_schema_conversion_with_field_ids: Tests schema conversion with ORC field IDs

These changes fix the original error where ORC files with field IDs couldn't be read
without name mapping, and provide comprehensive test coverage to prevent regression.
… size, and compression interactions.

Tests show ORC batching is based on stripes (like Parquet row groups), with near-perfect 1:1 mapping achievable using large stripe sizes (2-5MB) and hard-to-compress data, achieving 0.91-0.97 ratios between stripe size and actual file size.
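
The stripe behavior described in that commit can be reproduced with plain pyarrow; the file name, stripe size, and generated data below are illustrative, not taken from the PR's tests:

```python
import pyarrow as pa
import pyarrow.orc as orc

# Write an ORC file with a deliberately small stripe size so a 10k-row
# table spans several stripes (values chosen only for illustration).
table = pa.table({
    "company": [f"Company_{i}" for i in range(10_000)],
    "ticker": [f"SYM{i % 100}" for i in range(10_000)],
})
orc.write_table(table, "example.orc", stripe_size=64 * 1024, compression="zstd")

# Each stripe reads back as an independent unit, analogous to a Parquet
# row group; batched scans follow these stripe boundaries.
orc_file = orc.ORCFile("example.orc")
for i in range(orc_file.nstripes):
    stripe = orc_file.read_stripe(i)  # returns a pa.RecordBatch
    print(i, stripe.num_rows)
```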
@Fokko
Contributor

Fokko left a comment

Let's move this forward for now, and follow up on https://github.com/apache/iceberg-python/pull/2432/files#r2353406023

@Fokko merged commit e5e7453 into apache:main on Sep 24, 2025
19 of 26 checks passed