@kosiew
Contributor

@kosiew kosiew commented Dec 29, 2025

Which issue does this PR close?

Rationale for this change

DataFusion’s Parquet row-level filter pushdown previously rejected all nested Arrow types (lists/structs), which prevented common and performance-sensitive filters on list columns (for example array_has, array_has_all, array_has_any) from being evaluated during Parquet decoding.

Enabling safe pushdown for a small, well-defined set of list-aware predicates allows Parquet decoding to apply these filters earlier, reducing materialization work and improving scan performance, while still keeping unsupported nested projections (notably structs) evaluated after batches are materialized.

What changes are included in this PR?

  • Allow a registry of list-aware predicates to be considered pushdown-compatible:

    • array_has, array_has_all, array_has_any
    • IS NULL / IS NOT NULL
  • Introduce a supported_predicates module that detects whether an expression tree contains supported list predicates.

  • Update Parquet filter candidate selection to:

    • Accept list columns only when the predicate semantics are supported.
    • Continue rejecting struct columns (and other unsupported nested types).
  • Switch Parquet projection mask construction from root indices to leaf indices (ProjectionMask::leaves) so nested list filters project the correct leaf columns for decoding-time evaluation.

  • Expand root column indices to leaf indices for nested columns using the Parquet SchemaDescriptor.

  • Add unit tests verifying:

    • List columns are accepted for pushdown when used by supported predicates.
    • Struct columns (and mixed struct+primitive predicates) prevent pushdown.
    • array_has, array_has_all, array_has_any actually filter rows during decoding using a temp Parquet file.
  • Add sqllogictest coverage proving both correctness and plan behavior:

    • Queries return expected results.
    • EXPLAIN shows predicates pushed into DataSourceExec for Parquet.
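
To make the root-to-leaf expansion above concrete, here is a minimal, dependency-free sketch. It does not use the `parquet` crate: the `leaf_to_root` slice is a hypothetical stand-in for the leaf-to-root mapping that the Parquet `SchemaDescriptor` exposes, and `expand_roots_to_leaves` is an illustrative name rather than the function added in this PR.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a Parquet schema: entry `i` is the root
/// (top-level) column index that leaf column `i` belongs to.
///
/// Example schema: id: Int32, point: Struct<x, y>, tags: List<Utf8>
///   leaf 0 -> root 0 (id)
///   leaf 1 -> root 1 (point.x)
///   leaf 2 -> root 1 (point.y)
///   leaf 3 -> root 2 (tags element)
fn example_leaf_to_root() -> Vec<usize> {
    vec![0, 1, 1, 2]
}

/// Expand a set of root column indices into the leaf indices that belong
/// to those roots. The result is in ascending leaf order.
fn expand_roots_to_leaves(roots: &BTreeSet<usize>, leaf_to_root: &[usize]) -> Vec<usize> {
    leaf_to_root
        .iter()
        .enumerate()
        .filter(|(_, root)| roots.contains(*root))
        .map(|(leaf, _)| leaf)
        .collect()
}
```

In this toy schema the list column `tags` has root index 2 but leaf index 3, so a filter on `tags` would pass `[3]` to a leaf-based projection mask. A struct root expands to several leaves (`point` becomes `[1, 2]`).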

Are these changes tested?

Yes.

  • Rust unit tests in datafusion/datasource-parquet/src/row_filter.rs:

    • Validate pushdown eligibility for list vs struct predicates.
    • Create a temp Parquet file and confirm list predicates prune/match the expected rows via Parquet decoding row filtering.
  • SQL logic tests in datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt:

    • Add end-to-end coverage for array_has, array_has_all, array_has_any and combinations (OR / AND with other predicates).
    • Confirm pushdown appears in the physical plan (DataSourceExec ... predicate=...).

Are there any user-facing changes?

Yes.

  • Parquet filter pushdown now supports list columns for the following predicates:

    • array_has, array_has_all, array_has_any
    • IS NULL, IS NOT NULL

This can improve query performance for workloads that filter on array/list columns.

No breaking changes are introduced; unsupported nested types (for example structs) continue to be evaluated after decoding.

LLM-generated code disclosure

This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.

Document supported nested pushdown semantics and update
row-level predicate construction to utilize leaf-based
projection masks. Enable list-aware predicates like
array_has_all while maintaining unsupported nested
structures on the fallback path.

Expand filter candidate building for root and leaf
projections of nested columns, facilitating cost
estimation and mask creation aligned with Parquet leaf
layouts. Include struct/list pushdown checks and add a
new integration test to validate array_has_all
pushdown behavior against Parquet row filters.
Introduce dev dependencies for nested function helpers
and temporary file creation used in the new tests.

Extract supports_list_predicates() into its own module and
create a SUPPORTED_ARRAY_FUNCTIONS constant registry for
improved management. Add is_supported_list_predicate()
helper function for easier extensibility, along with
comprehensive documentation and unit tests.

Refactor check_single_column() using intermediate variables
to clarify logic for handling structs and unsupported lists.
Introduce a new test case for mixed primitive and struct
predicates to ensure proper functionality and validation
of pushable predicates.

Extract common test logic into test_array_predicate_pushdown helper
function to reduce duplication and ensure parity across all three
supported array functions (array_has, array_has_all, array_has_any).

This makes it easier to maintain and extend test coverage for new
array functions in the future.

Benefits:
- Reduces code duplication from ~70 lines × 3 to ~10 lines × 3
- Ensures consistent test methodology across all array functions
- Clear documentation of expected behavior for each function
- Easier to add tests for new supported functions

Add detailed rustdoc examples to can_expr_be_pushed_down_with_schemas()
showing three key scenarios:

1. Primitive column filters (allowed) - e.g., age > 30
2. Struct column filters (blocked) - e.g., person IS NOT NULL
3. List column filters with supported predicates (allowed) -
   e.g., array_has_all(tags, ['rust'])

These examples help users understand when filter pushdown to the
Parquet decoder is available and guide them in writing efficient
queries.

Benefits:
- Clear documentation of supported and unsupported cases
- Helps users optimize query performance
- Provides copy-paste examples for common patterns
- Updated to reflect new list column support

- Replace 'while let Some(batch) = reader.next()' with idiomatic 'for batch in reader'
- Remove unnecessary mut from reader variable
- Addresses clippy::while_let_on_iterator warning

- Document function name detection assumption in supported_predicates
  - Note reliance on exact string matching
  - Suggest trait-based approach for future robustness
- Explain ProjectionMask::leaves() choice for nested columns
  - Clarify why leaf indices are needed for nested structures
  - Helps reviewers understand Parquet schema descriptor usage

These comments address Low Priority suggestions from code review,
improving maintainability and onboarding for future contributors.

Remove SUPPORTED_ARRAY_FUNCTIONS array. Introduce dedicated predicate
functions for NULL checks and scalar function support. Utilize
pattern matching with matches! macro instead of array lookups.
Enhance code clarity and idiomatic Rust usage with is_some_and()
for condition checks and simplify recursion using a single
expression.
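
As a rough illustration of the detection logic these commits describe (name matching with `matches!`, separate handling of NULL checks, and a single-expression recursion), here is a sketch over a toy expression tree. The `Expr` enum and `children` helper are hypothetical stand-ins for DataFusion's physical expressions, and the function bodies are assumptions rather than the code in this PR. As the commits note, the check relies on exact function-name matching.

```rust
/// Toy expression tree; a hypothetical stand-in for physical expressions.
enum Expr {
    Column(String),
    Literal(String),
    IsNull(Box<Expr>),
    IsNotNull(Box<Expr>),
    ScalarFunction { name: String, args: Vec<Expr> },
    BinaryOp { left: Box<Expr>, right: Box<Expr> }, // e.g. AND / OR
}

/// Exact-name match against the three supported array functions.
fn is_supported_list_function(name: &str) -> bool {
    matches!(name, "array_has" | "array_has_all" | "array_has_any")
}

/// Does this single node count as a supported list predicate?
fn is_supported_list_predicate(expr: &Expr) -> bool {
    match expr {
        Expr::IsNull(_) | Expr::IsNotNull(_) => true,
        Expr::ScalarFunction { name, .. } => is_supported_list_function(name),
        _ => false,
    }
}

/// Direct children of a node, used to walk the tree.
fn children(expr: &Expr) -> Vec<&Expr> {
    match expr {
        Expr::Column(_) | Expr::Literal(_) => vec![],
        Expr::IsNull(inner) | Expr::IsNotNull(inner) => vec![inner.as_ref()],
        Expr::ScalarFunction { args, .. } => args.iter().collect(),
        Expr::BinaryOp { left, right } => vec![left.as_ref(), right.as_ref()],
    }
}

/// True if the node itself, or any descendant, is a supported list predicate.
fn supports_list_predicates(expr: &Expr) -> bool {
    is_supported_list_predicate(expr)
        || children(expr)
            .into_iter()
            .any(|child| supports_list_predicates(child))
}
```

With this sketch, `array_has(tags, 'rust') OR flag` is detected because one branch contains a supported function, while an unlisted function name is not.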
Extract helper functions to reduce code duplication in
array pushdown and physical plan tests. Consolidate similar
assertions and checks, simplifying tests from ~50 to ~30
lines. Transform display tests into a single parameterized
test, maintaining coverage while eliminating repeated code.

…monstrations"

This reverts commit 94f1a99cee4e44e5176450156a684a2316af78e1.

Extract handle_nested_type() to encapsulate logic for
determining if a nested type prevents pushdown. Introduce
is_nested_type_supported() to isolate type checking for
List/LargeList/FixedSizeList and predicate support.
Simplify check_single_column() by reducing nesting depth
and delegating nested type logic to helper methods.
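
The nested-type decision described in the commits above can be sketched as follows. This is a simplified model, not the PR's code: `ColumnType` is a hypothetical stand-in for Arrow's `DataType`, and the bodies of `is_nested_type_supported`, `nested_type_prevents_pushdown`, and `can_push_down` are assumptions that only mirror the documented behavior (lists pass only with a supported predicate, structs never pass).

```rust
/// Simplified stand-in for Arrow's `DataType`; only the variants relevant here.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ColumnType {
    Primitive,
    List,
    LargeList,
    FixedSizeList,
    Struct,
}

/// A nested type is supported only if it is one of the list types AND the
/// predicate was recognized as list-aware.
fn is_nested_type_supported(ty: ColumnType, predicate_supports_lists: bool) -> bool {
    let is_list = matches!(
        ty,
        ColumnType::List | ColumnType::LargeList | ColumnType::FixedSizeList
    );
    is_list && predicate_supports_lists
}

/// Primitive columns never block pushdown; nested columns block it unless supported.
fn nested_type_prevents_pushdown(ty: ColumnType, predicate_supports_lists: bool) -> bool {
    match ty {
        ColumnType::Primitive => false,
        nested => !is_nested_type_supported(nested, predicate_supports_lists),
    }
}

/// A predicate is pushable only if none of its referenced columns blocks it.
fn can_push_down(column_types: &[ColumnType], predicate_supports_lists: bool) -> bool {
    !column_types
        .iter()
        .any(|ty| nested_type_prevents_pushdown(*ty, predicate_supports_lists))
}
```

Under this model the three documented scenarios come out as expected: `age > 30` on a primitive column is pushable, `person IS NOT NULL` on a struct column is not, and `array_has_all(tags, ['rust'])` on a list column is pushable. A predicate mixing a primitive and a struct column is rejected as a whole.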
@github-actions github-actions bot added sqllogictest SQL Logic Tests (.slt) datasource Changes to the datasource crate labels Dec 29, 2025
@kosiew kosiew marked this pull request as ready for review December 29, 2025 14:15
@kosiew kosiew requested a review from zhuqi-lucas December 29, 2025 14:53
@zhuqi-lucas
Contributor

Thank you @kosiew, I will review it soon!

@zhuqi-lucas
Contributor

Great work on implementing list type pushdown!

I suggest adding a benchmark to demonstrate the performance improvement more clearly. Here's a proposed test scenario:

Benchmark Setup:

  1. Create a dataset with a List<String> column sorted by list values (lexicographic order)
  2. Use a row group size that results in multiple row groups (e.g., 10K rows per group, 100K total rows)
  3. Apply a selective filter like array_has(list_col, 'target_value') that matches only ~10% of row groups

Expected Results:

  • Without pushdown: All row groups must be decoded and filtered → baseline time
  • With pushdown: ~90% of row groups skipped based on min/max statistics → faster execution
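
The expectation in this proposal can be modeled with a small sketch of min/max-based row-group skipping. It only mirrors the scenario as the reviewer describes it (sorted values, 10K rows per group, 100K rows total). The data layout, the `RowGroupStats` type, and all function names are hypothetical, and nothing here is taken from the PR's benchmark.

```rust
/// Hypothetical per-row-group statistics over the list element values.
struct RowGroupStats {
    min: String,
    max: String,
}

/// Build statistics for a dataset whose values are sorted lexicographically.
/// Row `i` holds the zero-padded value `v{i:06}`, so string order equals row order.
fn build_row_group_stats(total_rows: usize, rows_per_group: usize) -> Vec<RowGroupStats> {
    (0..total_rows)
        .step_by(rows_per_group)
        .map(|start| {
            let end = (start + rows_per_group).min(total_rows) - 1;
            RowGroupStats {
                min: format!("v{:06}", start),
                max: format!("v{:06}", end),
            }
        })
        .collect()
}

/// A row group can only contain `target` if it lies within [min, max].
fn may_contain(stats: &RowGroupStats, target: &str) -> bool {
    stats.min.as_str() <= target && target <= stats.max.as_str()
}

/// Indices of the row groups that still have to be decoded.
fn groups_to_decode(stats: &[RowGroupStats], target: &str) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| may_contain(s, target))
        .map(|(idx, _)| idx)
        .collect()
}
```

With 100,000 rows in groups of 10,000 there are 10 row groups, and a target that falls in one of them leaves 9 of 10 groups skippable, which matches the ~90% figure in the proposal.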

@kosiew kosiew marked this pull request as draft December 31, 2025 10:46
kosiew added 6 commits January 6, 2026 20:01
Implement a new Criterion benchmark to test row-level
pushdown for array_has predicates on a 100K-row dataset.
Compare pushdown versus baseline scans and assert
90% row pruning. Enable the benchmark target and update
the proposed dependencies in the Parquet datasource crate.

Extend list-predicate detection to include function names from
scalar UDF expressions, ensuring array_has, array_has_all,
and array_has_any qualify for Parquet list pushdown.
Add row filter tests to verify support for list predicates
and confirm correct row filters for list columns.

Introduce an additional large binary payload column in the Parquet
nested filter pushdown benchmark. Update dataset generation and
batch construction to populate the new column while maintaining
existing pruning assertions. Benchmark results show improved
performance with pushdown, averaging ~9.4 ms compared to ~37.7 ms
without it (roughly 4x faster).

…low criterion best practices

Move schema and predicate setup outside the benchmark loop to measure
only execution time, not plan creation. This follows the pattern used
in topk_aggregate benchmarks where:

- setup_reader() creates the schema once per case
- create_predicate() builds the filter expression once per case
- scan_with_predicate() performs the actual file scan and filtering
  inside the loop

This ensures consistent benchmark measurements focused on filter
pushdown effectiveness rather than setup overhead.
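
The structure described in this commit (build inputs once, time only the scan) can be sketched without Criterion using `std::time::Instant`. This is an assumption-laden model: `Dataset`, `setup_dataset`, and `time_scans` are hypothetical, and `create_predicate` and `scan_with_predicate` borrow the names from the commit message while their bodies are invented for illustration.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical in-memory stand-in for the benchmark dataset:
/// one single-element list of strings per row.
struct Dataset {
    rows: Vec<Vec<String>>,
}

/// Setup step, run once per case, outside the timed loop.
fn setup_dataset(num_rows: usize) -> Dataset {
    let rows = (0..num_rows)
        .map(|i| vec![format!("v{}", i % 10)])
        .collect();
    Dataset { rows }
}

/// Setup step, run once per case: here the "predicate" is just the target value.
fn create_predicate() -> String {
    "v0".to_string()
}

/// The only work that belongs inside the timed loop: scan and count matches.
fn scan_with_predicate(dataset: &Dataset, target: &str) -> usize {
    dataset
        .rows
        .iter()
        .filter(|list| list.iter().any(|v| v.as_str() == target))
        .count()
}

/// Time `iterations` scans; setup cost is excluded because it happened earlier.
fn time_scans(dataset: &Dataset, target: &str, iterations: u32) -> (usize, Duration) {
    let start = Instant::now();
    let mut matched = 0;
    for _ in 0..iterations {
        // black_box keeps the optimizer from eliding the repeated scans.
        matched = black_box(scan_with_predicate(black_box(dataset), target));
    }
    (matched, start.elapsed())
}
```

A caller would run `setup_dataset` and `create_predicate` once, then pass both to `time_scans`. In this toy data one value in ten matches, so 90% of rows are filtered out.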
@kosiew
Contributor Author

kosiew commented Jan 6, 2026

[image]

The issue of cargo audit failing is being addressed in #19657

@kosiew kosiew marked this pull request as ready for review January 6, 2026 12:37
@kosiew
Contributor Author

kosiew commented Jan 6, 2026

hi @zhuqi-lucas

I added a benchmark.
It shows improvement:

[image]

Contributor

@zhuqi-lucas zhuqi-lucas left a comment

Nice work, looks good to me. I left some comments.


// `ScalarUDFExpr` is currently an alias of `ScalarFunctionExpr` in this crate,
// but keep a separate type to support potential future divergence.
type ScalarUDFExpr = ScalarFunctionExpr;
Contributor

I can't see a difference here, why not just use ScalarFunctionExpr?

Contributor Author

I'll remove it.

/// Does the expression reference any columns not present in the file schema?
projected_columns: bool,
/// Indices into the file schema of columns required to evaluate the expression.
required_columns: BTreeSet<usize>,
Contributor

Not related to this PR, but could we optimize required_columns to a Vec for small column sets?

Contributor Author

Tracking issue - #19673

Contributor

Thank you @kosiew !

@zhuqi-lucas
Contributor

run benchmarks

@alamb-ghbot

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing nested-filter-18560 (36d6744) to e5ca510 diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

Comparing HEAD and nested-filter-18560
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ nested-filter-18560 ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2592.00 ms │          2569.55 ms │ no change │
│ QQuery 1     │  1084.26 ms │          1044.28 ms │ no change │
│ QQuery 2     │  2013.47 ms │          2023.78 ms │ no change │
│ QQuery 3     │  1172.18 ms │          1166.87 ms │ no change │
│ QQuery 4     │  2294.75 ms │          2197.30 ms │ no change │
│ QQuery 5     │ 28039.43 ms │         28188.66 ms │ no change │
│ QQuery 6     │  4012.31 ms │          3986.32 ms │ no change │
│ QQuery 7     │  3504.40 ms │          3605.80 ms │ no change │
└──────────────┴─────────────┴─────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                  ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                  │ 44712.80ms │
│ Total Time (nested-filter-18560)   │ 44782.57ms │
│ Average Time (HEAD)                │  5589.10ms │
│ Average Time (nested-filter-18560) │  5597.82ms │
│ Queries Faster                     │          0 │
│ Queries Slower                     │          0 │
│ Queries with No Change             │          8 │
│ Queries with Failure               │          0 │
└────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ nested-filter-18560 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.42 ms │             2.39 ms │     no change │
│ QQuery 1     │    49.32 ms │            50.56 ms │     no change │
│ QQuery 2     │   136.25 ms │           137.09 ms │     no change │
│ QQuery 3     │   152.90 ms │           154.51 ms │     no change │
│ QQuery 4     │  1058.61 ms │          1082.12 ms │     no change │
│ QQuery 5     │  1443.69 ms │          1455.30 ms │     no change │
│ QQuery 6     │     2.01 ms │             2.17 ms │  1.08x slower │
│ QQuery 7     │    57.26 ms │            54.99 ms │     no change │
│ QQuery 8     │  1402.61 ms │          1401.61 ms │     no change │
│ QQuery 9     │  1925.83 ms │          2210.80 ms │  1.15x slower │
│ QQuery 10    │   360.79 ms │           378.94 ms │  1.05x slower │
│ QQuery 11    │   397.86 ms │           428.24 ms │  1.08x slower │
│ QQuery 12    │  1316.67 ms │          1480.30 ms │  1.12x slower │
│ QQuery 13    │  1985.60 ms │          2028.86 ms │     no change │
│ QQuery 14    │  1264.05 ms │          1246.81 ms │     no change │
│ QQuery 15    │  1234.70 ms │          1216.91 ms │     no change │
│ QQuery 16    │  2492.38 ms │          2514.18 ms │     no change │
│ QQuery 17    │  2434.74 ms │          2473.13 ms │     no change │
│ QQuery 18    │  4891.00 ms │          4729.53 ms │     no change │
│ QQuery 19    │   121.75 ms │           121.00 ms │     no change │
│ QQuery 20    │  1927.84 ms │          1861.93 ms │     no change │
│ QQuery 21    │  2174.96 ms │          2181.10 ms │     no change │
│ QQuery 22    │  3762.11 ms │          3776.72 ms │     no change │
│ QQuery 23    │ 12341.81 ms │         12232.61 ms │     no change │
│ QQuery 24    │   209.28 ms │           214.19 ms │     no change │
│ QQuery 25    │   478.00 ms │           469.52 ms │     no change │
│ QQuery 26    │   210.59 ms │           218.79 ms │     no change │
│ QQuery 27    │  2761.96 ms │          2708.81 ms │     no change │
│ QQuery 28    │ 21773.16 ms │         21827.73 ms │     no change │
│ QQuery 29    │   949.23 ms │           973.36 ms │     no change │
│ QQuery 30    │  1319.08 ms │          1312.27 ms │     no change │
│ QQuery 31    │  1308.35 ms │          1324.14 ms │     no change │
│ QQuery 32    │  5235.46 ms │          4845.12 ms │ +1.08x faster │
│ QQuery 33    │  5826.09 ms │          5688.50 ms │     no change │
│ QQuery 34    │  5956.92 ms │          6534.39 ms │  1.10x slower │
│ QQuery 35    │  1889.21 ms │          1899.08 ms │     no change │
│ QQuery 36    │    67.55 ms │            67.95 ms │     no change │
│ QQuery 37    │    44.95 ms │            45.81 ms │     no change │
│ QQuery 38    │    65.51 ms │            64.55 ms │     no change │
│ QQuery 39    │    99.70 ms │           104.43 ms │     no change │
│ QQuery 40    │    26.68 ms │            26.62 ms │     no change │
│ QQuery 41    │    23.47 ms │            25.22 ms │  1.07x slower │
│ QQuery 42    │    19.27 ms │            19.72 ms │     no change │
└──────────────┴─────────────┴─────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                  ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                  │ 91201.61ms │
│ Total Time (nested-filter-18560)   │ 91591.99ms │
│ Average Time (HEAD)                │  2120.97ms │
│ Average Time (nested-filter-18560) │  2130.05ms │
│ Queries Faster                     │          1 │
│ Queries Slower                     │          7 │
│ Queries with No Change             │         35 │
│ Queries with Failure               │          0 │
└────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ nested-filter-18560 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 115.79 ms │           112.89 ms │     no change │
│ QQuery 2     │  29.44 ms │            29.23 ms │     no change │
│ QQuery 3     │  35.13 ms │            37.66 ms │  1.07x slower │
│ QQuery 4     │  28.20 ms │            28.44 ms │     no change │
│ QQuery 5     │  88.22 ms │            87.32 ms │     no change │
│ QQuery 6     │  19.39 ms │            19.64 ms │     no change │
│ QQuery 7     │ 228.69 ms │           223.18 ms │     no change │
│ QQuery 8     │  37.71 ms │            35.92 ms │     no change │
│ QQuery 9     │ 101.84 ms │           102.78 ms │     no change │
│ QQuery 10    │  62.53 ms │            62.21 ms │     no change │
│ QQuery 11    │  18.17 ms │            17.72 ms │     no change │
│ QQuery 12    │  51.28 ms │            51.18 ms │     no change │
│ QQuery 13    │  47.52 ms │            45.64 ms │     no change │
│ QQuery 14    │  13.78 ms │            13.40 ms │     no change │
│ QQuery 15    │  23.87 ms │            24.04 ms │     no change │
│ QQuery 16    │  24.24 ms │            24.29 ms │     no change │
│ QQuery 17    │ 148.57 ms │           148.91 ms │     no change │
│ QQuery 18    │ 277.97 ms │           278.02 ms │     no change │
│ QQuery 19    │  39.14 ms │            37.86 ms │     no change │
│ QQuery 20    │  49.96 ms │            48.98 ms │     no change │
│ QQuery 21    │ 310.91 ms │           293.12 ms │ +1.06x faster │
│ QQuery 22    │  16.86 ms │            17.07 ms │     no change │
└──────────────┴───────────┴─────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                  ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                  │ 1769.20ms │
│ Total Time (nested-filter-18560)   │ 1739.50ms │
│ Average Time (HEAD)                │   80.42ms │
│ Average Time (nested-filter-18560) │   79.07ms │
│ Queries Faster                     │         1 │
│ Queries Slower                     │         1 │
│ Queries with No Change             │        20 │
│ Queries with Failure               │         0 │
└────────────────────────────────────┴───────────┘

@alamb
Contributor

alamb commented Jan 6, 2026

run benchmarks

@alamb
Contributor

alamb commented Jan 6, 2026

I restarted the benchmarks to see if the results are reproducible. I also merged up to get the CI green.

@alamb-ghbot

🤖 ./gh_compare_branch.sh gh_compare_branch.sh Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing nested-filter-18560 (a8bbe5e) to 924037e diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb-ghbot

🤖: Benchmark completed

Details

Comparing HEAD and nested-filter-18560
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ nested-filter-18560 ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 0     │  2338.85 ms │          2359.00 ms │ no change │
│ QQuery 1     │   946.34 ms │           921.54 ms │ no change │
│ QQuery 2     │  1907.90 ms │          1851.10 ms │ no change │
│ QQuery 3     │  1157.24 ms │          1161.32 ms │ no change │
│ QQuery 4     │  2250.03 ms │          2304.65 ms │ no change │
│ QQuery 5     │ 28272.65 ms │         28168.38 ms │ no change │
│ QQuery 6     │  3820.10 ms │          3880.60 ms │ no change │
│ QQuery 7     │  3868.64 ms │          3868.18 ms │ no change │
└──────────────┴─────────────┴─────────────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                  ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                  │ 44561.75ms │
│ Total Time (nested-filter-18560)   │ 44514.77ms │
│ Average Time (HEAD)                │  5570.22ms │
│ Average Time (nested-filter-18560) │  5564.35ms │
│ Queries Faster                     │          0 │
│ Queries Slower                     │          0 │
│ Queries with No Change             │          8 │
│ Queries with Failure               │          0 │
└────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ nested-filter-18560 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     1.47 ms │             1.49 ms │     no change │
│ QQuery 1     │    48.44 ms │            49.69 ms │     no change │
│ QQuery 2     │   132.76 ms │           133.51 ms │     no change │
│ QQuery 3     │   149.45 ms │           153.84 ms │     no change │
│ QQuery 4     │  1082.10 ms │          1119.39 ms │     no change │
│ QQuery 5     │  1360.37 ms │          1394.71 ms │     no change │
│ QQuery 6     │     1.44 ms │             1.46 ms │     no change │
│ QQuery 7     │    56.55 ms │            53.85 ms │     no change │
│ QQuery 8     │  1456.57 ms │          1450.22 ms │     no change │
│ QQuery 9     │  1794.12 ms │          1800.57 ms │     no change │
│ QQuery 10    │   341.74 ms │           336.89 ms │     no change │
│ QQuery 11    │   386.84 ms │           389.11 ms │     no change │
│ QQuery 12    │  1256.41 ms │          1260.86 ms │     no change │
│ QQuery 13    │  1930.10 ms │          1994.84 ms │     no change │
│ QQuery 14    │  1221.47 ms │          1241.62 ms │     no change │
│ QQuery 15    │  1250.66 ms │          1230.80 ms │     no change │
│ QQuery 16    │  2545.96 ms │          2611.28 ms │     no change │
│ QQuery 17    │  2516.84 ms │          2538.54 ms │     no change │
│ QQuery 18    │  5933.85 ms │          4961.34 ms │ +1.20x faster │
│ QQuery 19    │   119.89 ms │           119.84 ms │     no change │
│ QQuery 20    │  1895.11 ms │          1867.54 ms │     no change │
│ QQuery 21    │  2211.35 ms │          2150.36 ms │     no change │
│ QQuery 22    │  3947.37 ms │          3726.14 ms │ +1.06x faster │
│ QQuery 23    │ 18077.39 ms │         12247.91 ms │ +1.48x faster │
│ QQuery 24    │   213.27 ms │           190.93 ms │ +1.12x faster │
│ QQuery 25    │   446.55 ms │           457.85 ms │     no change │
│ QQuery 26    │   218.95 ms │           213.79 ms │     no change │
│ QQuery 27    │  2756.77 ms │          2686.02 ms │     no change │
│ QQuery 28    │ 23487.27 ms │         23361.00 ms │     no change │
│ QQuery 29    │   983.73 ms │           966.78 ms │     no change │
│ QQuery 30    │  1353.56 ms │          1334.45 ms │     no change │
│ QQuery 31    │  1336.10 ms │          1342.71 ms │     no change │
│ QQuery 32    │  5332.98 ms │          4969.01 ms │ +1.07x faster │
│ QQuery 33    │  5942.27 ms │          5346.96 ms │ +1.11x faster │
│ QQuery 34    │  5703.68 ms │          5826.95 ms │     no change │
│ QQuery 35    │  1949.60 ms │          1864.80 ms │     no change │
│ QQuery 36    │    65.46 ms │            64.82 ms │     no change │
│ QQuery 37    │    43.72 ms │            44.48 ms │     no change │
│ QQuery 38    │    66.55 ms │            64.70 ms │     no change │
│ QQuery 39    │   102.88 ms │           100.36 ms │     no change │
│ QQuery 40    │    24.35 ms │            25.39 ms │     no change │
│ QQuery 41    │    22.91 ms │            22.66 ms │     no change │
│ QQuery 42    │    19.00 ms │            17.92 ms │ +1.06x faster │
└──────────────┴─────────────┴─────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                  ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                  │ 99787.89ms │
│ Total Time (nested-filter-18560)   │ 91737.38ms │
│ Average Time (HEAD)                │  2320.65ms │
│ Average Time (nested-filter-18560) │  2133.43ms │
│ Queries Faster                     │          7 │
│ Queries Slower                     │          0 │
│ Queries with No Change             │         36 │
│ Queries with Failure               │          0 │
└────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ nested-filter-18560 ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │ 147.45 ms │           118.18 ms │ +1.25x faster │
│ QQuery 2     │  33.21 ms │            27.23 ms │ +1.22x faster │
│ QQuery 3     │  41.38 ms │            34.69 ms │ +1.19x faster │
│ QQuery 4     │  36.02 ms │            28.86 ms │ +1.25x faster │
│ QQuery 5     │  89.52 ms │            86.81 ms │     no change │
│ QQuery 6     │  20.31 ms │            19.57 ms │     no change │
│ QQuery 7     │ 223.00 ms │           232.95 ms │     no change │
│ QQuery 8     │  35.03 ms │            33.23 ms │ +1.05x faster │
│ QQuery 9     │ 104.94 ms │           109.16 ms │     no change │
│ QQuery 10    │  61.21 ms │            63.87 ms │     no change │
│ QQuery 11    │  16.61 ms │            16.41 ms │     no change │
│ QQuery 12    │  49.65 ms │            50.95 ms │     no change │
│ QQuery 13    │  45.40 ms │            46.84 ms │     no change │
│ QQuery 14    │  13.37 ms │            13.05 ms │     no change │
│ QQuery 15    │  23.99 ms │            23.99 ms │     no change │
│ QQuery 16    │  24.29 ms │            24.33 ms │     no change │
│ QQuery 17    │ 151.57 ms │           153.49 ms │     no change │
│ QQuery 18    │ 276.33 ms │           276.35 ms │     no change │
│ QQuery 19    │  36.14 ms │            36.87 ms │     no change │
│ QQuery 20    │  49.73 ms │            49.98 ms │     no change │
│ QQuery 21    │ 330.25 ms │           306.24 ms │ +1.08x faster │
│ QQuery 22    │  17.73 ms │            18.02 ms │     no change │
└──────────────┴───────────┴─────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                  ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                  │ 1827.13ms │
│ Total Time (nested-filter-18560)   │ 1771.08ms │
│ Average Time (HEAD)                │   83.05ms │
│ Average Time (nested-filter-18560) │   80.50ms │
│ Queries Faster                     │         6 │
│ Queries Slower                     │         0 │
│ Queries with No Change             │        16 │
│ Queries with Failure               │         0 │
└────────────────────────────────────┴───────────┘
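
For reading the "Change" column in the tables above, the following sketch reproduces the labels from a pair of timings. The rule is inferred from the rows shown, not taken from the benchmark script: a relative change beyond 5% of the HEAD time is reported as a ratio, anything smaller as "no change". Both `classify_change` and the 0.05 threshold are assumptions.

```rust
/// Classify a (HEAD, branch) timing pair the way the "Change" column reads.
/// `threshold` is an inferred assumption; 0.05 is consistent with the tables above.
fn classify_change(head_ms: f64, branch_ms: f64, threshold: f64) -> String {
    // Relative change of the branch time with respect to HEAD.
    let relative = (branch_ms - head_ms) / head_ms;
    if relative > threshold {
        // Branch is slower: report branch / HEAD.
        format!("{:.2}x slower", branch_ms / head_ms)
    } else if relative < -threshold {
        // Branch is faster: report HEAD / branch.
        format!("+{:.2}x faster", head_ms / branch_ms)
    } else {
        "no change".to_string()
    }
}
```

For example, the first run's clickbench_partitioned QQuery 9 (1925.83 ms vs 2210.80 ms) classifies as "1.15x slower", and the second run's QQuery 23 (18077.39 ms vs 12247.91 ms) as "+1.48x faster". Several queries flip between slower, faster, and no change across the two runs, which suggests that run-to-run noise is of the same order as the reported differences.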

@kosiew
Contributor Author

kosiew commented Jan 7, 2026

@zhuqi-lucas
Thanks for your review and feedback.

@zhuqi-lucas
Contributor

Thank you @kosiew, great work, LGTM now.

@kosiew kosiew added this pull request to the merge queue Jan 7, 2026
Merged via the queue into apache:main with commit 566bcde Jan 7, 2026
31 checks passed
Contributor

@adriangb adriangb left a comment

Can you elaborate on why some predicates are "safe" and others are not "safe"? What is unsafe about reading a single field of a struct column? Won't you also run into #9066?

/// // Can safely push down to Parquet decoder
/// }
/// ```
pub trait SupportsListPushdown {
Contributor

I don't understand why this is a trait. Do any concrete types implement this trait? Why not just write fn supports_list_pushdown(expr: &dyn PhysicalExpr)?

Contributor Author

The rationale is discoverability at the call site:

// Current (trait-based) - natural and discoverable
if expr.supports_list_pushdown() { ... }

versus

// Alternative (function-only) - requires searching for a function
if supports_list_pushdown(&expr) { ... }
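
For readers following this thread, here is a self-contained sketch of the two shapes being compared. Everything in it is hypothetical: `PhysicalExpr` is a one-method toy trait rather than DataFusion's, and implementing the extension trait for the trait object is only one plausible way to get the method-call syntax, not necessarily what the PR does.

```rust
/// Toy stand-in for a physical expression trait (hypothetical).
trait PhysicalExpr {
    fn name(&self) -> &str;
}

/// Toy concrete expression (hypothetical).
struct ScalarFunctionExpr {
    name: String,
}

impl PhysicalExpr for ScalarFunctionExpr {
    fn name(&self) -> &str {
        &self.name
    }
}

/// Alternative (function-only): a plain free function over the trait object.
fn supports_list_pushdown(expr: &dyn PhysicalExpr) -> bool {
    matches!(expr.name(), "array_has" | "array_has_all" | "array_has_any")
}

/// Current (trait-based): an extension trait that adds method-call syntax.
trait SupportsListPushdown {
    fn supports_list_pushdown(&self) -> bool;
}

/// Implemented once for the trait object; it delegates to the free function,
/// so both call styles share the same logic.
impl SupportsListPushdown for dyn PhysicalExpr {
    fn supports_list_pushdown(&self) -> bool {
        supports_list_pushdown(self)
    }
}
```

Given `let expr: Box<dyn PhysicalExpr> = ...`, both `expr.supports_list_pushdown()` and `supports_list_pushdown(expr.as_ref())` return the same answer. The difference is purely ergonomic, which fits the reviewer's point that a trait adds little when the check is used in only one module.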

Contributor

But it's only used within this one module. Either way I think this may all be changed soon by #19556

Labels

datasource Changes to the datasource crate sqllogictest SQL Logic Tests (.slt)

Development

Successfully merging this pull request may close these issues.

Support nested datatype filter pushdown to parquet

5 participants