@sundy-li
Contributor

@sundy-li sundy-li commented Jan 6, 2026

Which issue does this PR close?

  • Closes #.

What changes are included in this PR?

This PR enables projection of nested fields within struct columns when reading parquet files. Previously, selecting a field nested inside a struct would result in a FeatureUnsupported error.

Problem

Consider a user selecting a nested field such as person.name from a schema like this:

id: Int (field_id=1)
person: Struct (field_id=2)
  name: String (field_id=3)
  age: Int (field_id=4)

The scan fails with the error "Projecting nested field is not supported now", which blocks access to nested column data.

Solution

1. crates/iceberg/src/arrow/reader.rs

  • Add RecordBatchProjector integration to detect and handle nested field projections
  • After parquet projection, detect if any requested field IDs are nested (not direct children of the schema's top-level struct)
  • Create a RecordBatchProjector to extract nested fields from their parent structs, flattening them into the output record batch
  • Exclude metadata fields (like _file) from nested field detection
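The detection step described in the bullets above can be sketched in isolation. This is a minimal, standard-library-only illustration and not the code from reader.rs: the function name `nested_field_ids` and the two ID sets passed in are hypothetical, and how the PR actually identifies metadata fields is not shown here.

```rust
use std::collections::HashSet;

/// Returns the requested field IDs that are neither top-level columns nor
/// metadata columns, i.e. the IDs that would have to be extracted from a
/// parent struct. `top_level_ids` and `metadata_ids` are hypothetical inputs
/// for this sketch; the real reader derives them from the schema.
fn nested_field_ids(
    requested: &[i32],
    top_level_ids: &HashSet<i32>,
    metadata_ids: &HashSet<i32>,
) -> Vec<i32> {
    requested
        .iter()
        .copied()
        // Keep only IDs that are not direct children and not metadata fields.
        .filter(|id| !top_level_ids.contains(id) && !metadata_ids.contains(id))
        .collect()
}
```

With top-level IDs {1, 2} from the example schema and a metadata ID standing in for _file, a request for id, person.name and _file would yield only the ID of person.name, so a projector is needed only for that field.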

2. crates/iceberg/src/arrow/record_batch_transformer.rs

  • Extend build_field_id_to_arrow_schema_map to recursively index nested struct fields
  • Add helper function collect_field_ids_recursive to traverse the field hierarchy
  • This allows the transformer to find field IDs that are nested within structs
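To illustrate the idea of recursively indexing nested struct fields, here is a small sketch over a simplified schema model. The `Field` type and the functions `index_fields_recursive` and `build_field_id_index` are assumptions made for this example; they are not the signatures of build_field_id_to_arrow_schema_map or collect_field_ids_recursive in the PR, which operate on Arrow schemas.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a schema field that carries an Iceberg field ID.
struct Field {
    id: i32,
    /// Child fields; empty for primitive fields.
    children: Vec<Field>,
}

/// Records the index path of every field, descending into struct children.
/// `prefix` holds the path to the parent of the fields being visited.
fn index_fields_recursive(
    fields: &[Field],
    prefix: &mut Vec<usize>,
    out: &mut HashMap<i32, Vec<usize>>,
) {
    for (idx, field) in fields.iter().enumerate() {
        prefix.push(idx);
        out.insert(field.id, prefix.clone());
        index_fields_recursive(&field.children, prefix, out);
        prefix.pop();
    }
}

/// Builds a field ID -> index path map for a whole schema.
fn build_field_id_index(fields: &[Field]) -> HashMap<i32, Vec<usize>> {
    let mut out = HashMap::new();
    index_fields_recursive(fields, &mut Vec::new(), &mut out);
    out
}
```

For the example schema in the Problem section, this maps field ID 3 (name) to the path [1, 0] and field ID 4 (age) to [1, 1], while the top-level fields keep single-element paths.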

3. crates/iceberg/src/scan/mod.rs

  • Remove the restriction that blocked nested field selection (the FeatureUnsupported error)

How it works

  1. When processing a FileScanTask, detect if any requested field IDs are nested by checking if schema.as_struct().field_by_id(id) returns None
  2. If nested fields are detected, create a RecordBatchProjector with the projected arrow schema
  3. The projector builds index paths to locate nested fields (e.g., [1, 0] means column 1, inner field 0)
  4. After parquet reads the data, the projector extracts nested fields from their parent structs
  5. The transformer then processes the flattened batch normally
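Steps 3 and 4 can be pictured with a toy column model. The sketch below is only an illustration under stated assumptions: `Column`, `extract_by_path`, and `project` are hypothetical names, and the real RecordBatchProjector works on Arrow arrays rather than on this enum.

```rust
/// Toy column model: a primitive column or a struct of child columns.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int(Vec<i64>),
    Str(Vec<String>),
    Struct(Vec<Column>),
}

/// Follows an index path such as [1, 0] (column 1, inner field 0).
/// Returns None if the path is empty, out of range, or tries to descend
/// into a column that is not a struct.
fn extract_by_path<'a>(columns: &'a [Column], path: &[usize]) -> Option<&'a Column> {
    let (first, rest) = path.split_first()?;
    let mut current = columns.get(*first)?;
    for idx in rest {
        match current {
            Column::Struct(children) => current = children.get(*idx)?,
            _ => return None,
        }
    }
    Some(current)
}

/// Flattens the requested paths into a new list of top-level columns.
/// Fails as a whole (None) if any single path cannot be resolved.
fn project(columns: &[Column], paths: &[Vec<usize>]) -> Option<Vec<Column>> {
    paths
        .iter()
        .map(|p| extract_by_path(columns, p).cloned())
        .collect()
}
```

Projecting the paths [0] and [1, 0] over a batch shaped like (id, person { name, age }) yields two flat columns, id and name. Note that this sketch ignores nullability: real Arrow struct arrays carry their own validity bitmap, which an actual projector has to take into account when lifting a child column out of its parent.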

Are these changes tested?

Yes. This PR adds a test_read_nested_parquet_column test that:

  • Creates a parquet file with nested struct data (id, person { name, age })
  • Reads with projection [1, 3] (selecting id and nested name)
  • Verifies both the top-level field and nested field are correctly extracted

All 1051 existing tests continue to pass.

Enable projection of nested fields within struct columns when reading
parquet files. Previously, selecting a field nested inside a struct
would result in a FeatureUnsupported error.

Changes:
- Add RecordBatchProjector integration to extract nested fields from
  struct columns after parquet projection
- Extend RecordBatchTransformer's field ID mapping to recursively
  index nested struct fields
- Remove the nested field restriction in scan module that blocked
  nested field selection

The implementation detects when requested field IDs are nested (not
direct children of the schema) and creates a RecordBatchProjector to
extract those fields from their parent structs, flattening them into
the output record batch.
@mbutrovich
Collaborator

I'll try to take a look at this in the next few days. It's interesting to me since Comet already generates FileScanTasks that manage to read nested types with the current reader, so I want to understand the scope of changes there.

@mbutrovich mbutrovich self-requested a review January 8, 2026 15:32