Parquet: Push down supported list predicates (array_has/any/all) during decoding #19545
Conversation
Document supported nested pushdown semantics and update row-level predicate construction to utilize leaf-based projection masks. Enable list-aware predicates like array_has_all while maintaining unsupported nested structures on the fallback path. Expand filter candidate building for root and leaf projections of nested columns, facilitating cost estimation and mask creation aligned with Parquet leaf layouts. Include struct/list pushdown checks and add a new integration test to validate array_has_all pushdown behavior against Parquet row filters. Introduce dev dependencies for nested function helpers and temporary file creation used in the new tests.
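As a rough illustration of the root-vs-leaf distinction mentioned above, here is a minimal sketch using the parquet crate's ProjectionMask API; it is not the PR's actual code, and the column indices are hypothetical.

```rust
use parquet::arrow::ProjectionMask;
use parquet::schema::types::SchemaDescriptor;

// Build both mask flavors for comparison; the indices below are made up.
fn example_masks(schema_descr: &SchemaDescriptor) -> (ProjectionMask, ProjectionMask) {
    // Root-based mask: index 1 selects an entire top-level column,
    // e.g. a whole List<Utf8> tree.
    let root_mask = ProjectionMask::roots(schema_descr, [1]);

    // Leaf-based mask: indices address individual Parquet leaf columns, which is
    // what a nested list column's decoding-time filter actually needs to project.
    let leaf_mask = ProjectionMask::leaves(schema_descr, [2, 3]);

    (root_mask, leaf_mask)
}
```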
Extract supports_list_predicates() into its own module and create a SUPPORTED_ARRAY_FUNCTIONS constant registry for improved management. Add is_supported_list_predicate() helper function for easier extensibility, along with comprehensive documentation and unit tests. Refactor check_single_column() using intermediate variables to clarify logic for handling structs and unsupported lists. Introduce a new test case for mixed primitive and struct predicates to ensure proper functionality and validation of pushable predicates.
Extract common test logic into a test_array_predicate_pushdown helper function to reduce duplication and ensure parity across all three supported array functions (array_has, array_has_all, array_has_any). This makes it easier to maintain and extend test coverage for new array functions in the future. Benefits:
- Reduces code duplication from ~70 lines × 3 to ~10 lines × 3
- Ensures consistent test methodology across all array functions
- Clear documentation of expected behavior for each function
- Easier to add tests for new supported functions
Add detailed rustdoc examples to can_expr_be_pushed_down_with_schemas() showing three key scenarios:
1. Primitive column filters (allowed) - e.g., age > 30
2. Struct column filters (blocked) - e.g., person IS NOT NULL
3. List column filters with supported predicates (allowed) - e.g., array_has_all(tags, ['rust'])
These examples help users understand when filter pushdown to the Parquet decoder is available and guide them in writing efficient queries. Benefits:
- Clear documentation of supported and unsupported cases
- Helps users optimize query performance
- Provides copy-paste examples for common patterns
- Updated to reflect new list column support
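A hedged sketch of what such examples can look like (not the actual rustdoc); it uses DataFusion's logical expression builders, and the column names age, person, and tags are illustrative.

```rust
use datafusion::prelude::{col, lit};

fn example_filters() {
    // 1. Primitive column filter (allowed): age > 30 can be evaluated while decoding.
    let _primitive = col("age").gt(lit(30));

    // 2. Struct column filter (blocked): person IS NOT NULL is evaluated only after
    //    batches are materialized.
    let _struct_filter = col("person").is_not_null();

    // 3. List column filter with a supported predicate (allowed), e.g. in SQL:
    //    SELECT * FROM t WHERE array_has_all(tags, ['rust']);
}
```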
- Replace 'while let Some(batch) = reader.next()' with idiomatic 'for batch in reader'
- Remove unnecessary mut from reader variable
- Addresses clippy::while_let_on_iterator warning
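For reference, a small sketch of the pattern change, generic over any fallible batch iterator; the function and names are illustrative, not the PR's code.

```rust
// Old shape: `let mut reader = ...; while let Some(batch) = reader.next() { ... }`
// triggers clippy::while_let_on_iterator. The `for` loop below is equivalent and
// needs no explicit `mut` binding.
fn drain_reader<I, B, E>(reader: I) -> Result<usize, E>
where
    I: IntoIterator<Item = Result<B, E>>,
{
    let mut rows = 0;
    for batch in reader {
        let _batch = batch?; // each item is a Result, so propagate errors
        rows += 1;
    }
    Ok(rows)
}
```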
- Document function name detection assumption in supported_predicates
- Note reliance on exact string matching
- Suggest trait-based approach for future robustness
- Explain ProjectionMask::leaves() choice for nested columns
- Clarify why leaf indices are needed for nested structures
- Helps reviewers understand Parquet schema descriptor usage
These comments address Low Priority suggestions from code review, improving maintainability and onboarding for future contributors.
Remove SUPPORTED_ARRAY_FUNCTIONS array. Introduce dedicated predicate functions for NULL checks and scalar function support. Utilize pattern matching with matches! macro instead of array lookups. Enhance code clarity and idiomatic Rust usage with is_some_and() for condition checks and simplify recursion using a single expression.
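A minimal sketch of the matches!-based detection described above, with hypothetical helper names (the PR's actual helpers live in its supported_predicates module):

```rust
// Name-based check for the list-aware functions that are safe to push down.
fn is_supported_array_function(name: &str) -> bool {
    matches!(name, "array_has" | "array_has_all" | "array_has_any")
}

// `is_some_and` keeps the Option handling concise: None (no function name) means
// the expression is not a supported scalar-function predicate.
fn is_supported_scalar_function(fn_name: Option<&str>) -> bool {
    fn_name.is_some_and(is_supported_array_function)
}
```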
Extract helper functions to reduce code duplication in array pushdown and physical plan tests. Consolidate similar assertions and checks, simplifying tests from ~50 to ~30 lines. Transform display tests into a single parameterized test, maintaining coverage while eliminating repeated code.
…monstrations" This reverts commit 94f1a99cee4e44e5176450156a684a2316af78e1.
Extract handle_nested_type() to encapsulate logic for determining if a nested type prevents pushdown. Introduce is_nested_type_supported() to isolate type checking for List/LargeList/FixedSizeList and predicate support. Simplify check_single_column() by reducing nesting depth and delegating nested type logic to helper methods.
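A hedged sketch of the type-level check that split introduces; the function name mirrors the commit message, but the body here is illustrative rather than the PR's code.

```rust
use arrow::datatypes::DataType;

// A list-like column only allows pushdown when the predicate over it is one of the
// supported list-aware functions; all other nested types stay on the fallback path.
fn is_nested_type_supported(data_type: &DataType, predicate_is_supported: bool) -> bool {
    match data_type {
        DataType::List(_) | DataType::LargeList(_) | DataType::FixedSizeList(_, _) => {
            predicate_is_supported
        }
        _ => false,
    }
}
```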
Thank you @kosiew, I will review it soon!
Great work on implementing list type pushdown! I suggest adding a benchmark to demonstrate the performance improvement more clearly. Here's a proposed test scenario: Benchmark Setup:
Expected Results:
Implement a new Criterion benchmark to test row-level pushdown for array_has predicates on a 100K-row dataset. Compare pushdown versus baseline scans and assert 90% row pruning. Enable the benchmark target and update the proposed dependencies in the Parquet datasource crate.
Extend list-predicate detection to include function names from scalar UDF expressions, ensuring array_has, array_has_all, and array_has_any qualify for Parquet list pushdown. Add row filter tests to verify support for list predicates and confirm correct row filters for list columns.
Introduce an additional large binary payload column in the Parquet nested filter pushdown benchmark. Update dataset generation and batch construction to populate the new column while maintaining existing pruning assertions. Benchmark results show improved performance with pushdown, averaging ~9.4 ms compared to ~37.7 ms without it.
…low criterion best practices Move schema and predicate setup outside the benchmark loop to measure only execution time, not plan creation. This follows the pattern used in topk_aggregate benchmarks where: - setup_reader() creates the schema once per case - create_predicate() builds the filter expression once per case - scan_with_predicate() performs the actual file scan and filtering inside the loop This ensures consistent benchmark measurements focused on filter pushdown effectiveness rather than setup overhead.
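A structural sketch of that benchmark layout, assuming the helper names from the message above; their bodies here are trivial stand-ins, not the benchmark's real code.

```rust
use criterion::{criterion_group, criterion_main, Criterion};

// Stand-ins for the real helpers; in the benchmark they build the Arrow schema,
// the array_has predicate, and perform the Parquet scan respectively.
fn setup_reader() -> String { String::from("schema") }
fn create_predicate(_schema: &str) -> String { String::from("array_has(tags, 'x')") }
fn scan_with_predicate(_schema: &str, _predicate: &str) -> usize { 0 }

fn bench_array_has_pushdown(c: &mut Criterion) {
    // Setup runs once per case, outside the timed loop.
    let schema = setup_reader();
    let predicate = create_predicate(&schema);

    c.bench_function("array_has pushdown", |b| {
        // Only the scan + filtering work is measured.
        b.iter(|| scan_with_predicate(&schema, &predicate));
    });
}

criterion_group!(benches, bench_array_has_pushdown);
criterion_main!(benches);
```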
The issue of cargo audit failing is being addressed in #19657.

Hi @zhuqi-lucas, I added a benchmark.
zhuqi-lucas
left a comment
Nice work, looks good to me, left some comments.
// `ScalarUDFExpr` is currently an alias of `ScalarFunctionExpr` in this crate,
// but keep a separate type to support potential future divergence.
type ScalarUDFExpr = ScalarFunctionExpr;
I can't see a difference here, why not just use ScalarFunctionExpr?
I'll remove it.
/// Does the expression reference any columns not present in the file schema?
projected_columns: bool,
/// Indices into the file schema of columns required to evaluate the expression.
required_columns: BTreeSet<usize>,
Not related to this PR, but could we optimize required_columns to a Vec for small column sets?
Tracking issue - #19673
Thank you @kosiew !
run benchmarks

🤖

🤖: Benchmark completed
run benchmarks
I restarted the benchmarks to see if the results are reproducible -- I also merged up to get the CI green

🤖

🤖: Benchmark completed
Co-authored-by: Qi Zhu <821684824@qq.com>
@zhuqi-lucas

Thank you @kosiew, great work, LGTM now.
adriangb
left a comment
Can you elaborate on why some predicates are "safe" and others are not "safe"? What is unsafe about reading a single field of a struct column? Won't you also run into #9066?
/// // Can safely push down to Parquet decoder
/// }
/// ```
pub trait SupportsListPushdown {
I don't understand why this is a trait. Do any concrete types implement this trait? Why not just write fn supports_list_pushdown(expr: &dyn PhysicalExpr)?
The rationale is that the trait-based form reads more naturally:

// Current (trait-based) - natural and discoverable
if expr.supports_list_pushdown() { ... }

// Alternative (function-only) - requires searching for a function
if supports_list_pushdown(&expr) { ... }
But it's only used within this one module. Either way I think this may all be changed soon by #19556

Which issue does this PR close?
Rationale for this change
DataFusion’s Parquet row-level filter pushdown previously rejected all nested Arrow types (lists/structs), which prevented common and performance-sensitive filters on list columns (for example array_has, array_has_all, array_has_any) from being evaluated during Parquet decoding.

Enabling safe pushdown for a small, well-defined set of list-aware predicates allows Parquet decoding to apply these filters earlier, reducing materialization work and improving scan performance, while still keeping unsupported nested projections (notably structs) evaluated after batches are materialized.
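For illustration, a minimal sketch of the kind of query this lets the decoder filter; the table name, file path, and tags column are hypothetical.

```rust
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    ctx.register_parquet("t", "data/t.parquet", ParquetReadOptions::default())
        .await?;

    // With list-predicate pushdown, this filter can be applied while decoding the
    // Parquet file instead of on fully materialized batches.
    let df = ctx
        .sql("SELECT * FROM t WHERE array_has_all(tags, ['rust', 'parquet'])")
        .await?;
    df.show().await?;
    Ok(())
}
```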
What changes are included in this PR?
Allow a registry of list-aware predicates to be considered pushdown-compatible:
- array_has, array_has_all, array_has_any
- IS NULL / IS NOT NULL

Introduce a supported_predicates module to detect whether an expression tree contains supported list predicates.

Update Parquet filter candidate selection to:
- Switch Parquet projection mask construction from root indices to leaf indices (ProjectionMask::leaves) so nested list filters project the correct leaf columns for decoding-time evaluation.
- Expand root column indices to leaf indices for nested columns using the Parquet SchemaDescriptor.

Add unit tests verifying:
- array_has, array_has_all, array_has_any actually filter rows during decoding using a temp Parquet file.

Add sqllogictest coverage proving both correctness and plan behavior:
- EXPLAIN shows predicates pushed into DataSourceExec for Parquet.

Are these changes tested?
Yes.
Rust unit tests in datafusion/datasource-parquet/src/row_filter.rs.

SQL logic tests in datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt:
- array_has, array_has_all, array_has_any and combinations (OR / AND with other predicates).
- EXPLAIN output includes the pushed predicate (DataSourceExec ... predicate=...).

Are there any user-facing changes?
Yes.
Parquet filter pushdown now supports list columns for the following predicates:
- array_has, array_has_all, array_has_any
- IS NULL, IS NOT NULL

This can improve query performance for workloads that filter on array/list columns.
No breaking changes are introduced; unsupported nested types (for example structs) continue to be evaluated after decoding.
LLM-generated code disclosure
This PR includes LLM-generated code and comments. All LLM-generated content has been manually reviewed and tested.