Skip to content

Conversation

@xudong963
Copy link
Member

@xudong963 xudong963 commented Nov 21, 2025

Which issue does this PR close?

Rationale for this change

See #18860 (comment)

What changes are included in this PR?

  1. How to decide if we can do limit pruning without messing up the sql semantics.
  2. Add logic to decide if a row group is fully matched, all rows in the row group are matched the predicated.
  3. Use the fully matched row groups to return limit rows.

Are these changes tested?

Yes

Are there any user-facing changes?

No, no new configs, or API change

@xudong963 xudong963 marked this pull request as draft November 21, 2025 09:43
@github-actions github-actions bot added physical-expr Changes to the physical-expr crates core Core DataFusion crate execution Related to the execution crate datasource Changes to the datasource crate physical-plan Changes to the physical-plan crate labels Nov 21, 2025
@github-actions github-actions bot added logical-expr Logical plan and expressions optimizer Optimizer rules catalog Related to the catalog crate proto Related to proto crate labels Nov 25, 2025
@xudong963 xudong963 force-pushed the row_group_limit_pruning branch from 52f012f to d075c86 Compare November 25, 2025 05:26
@xudong963 xudong963 marked this pull request as ready for review November 25, 2025 05:31
@xudong963
Copy link
Member Author

I also like to construct some good-fit cases to show some benchmark results.

@alamb
Copy link
Contributor

alamb commented Nov 25, 2025

I plan to review this PR carefully tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this PR @xudong963

The idea is very cool, and a nicely done PR; I have several suggestions, but I think it is very close.

Suggestions:

  1. Rename the flag to preserve_order
  2. Make this new flag on TableScan visible in EXPLAIN plans so we have a better chance of understanding when it is being applied/not applied
  3. Implement an end to end test for the behavior you want to see (specifically, a test that shows a query on a parquet file and demonstrates that only a single row group is fetched with this flag, and more than one is fetched without the flag)

In my mind, 3 is the most important

@alamb
Copy link
Contributor

alamb commented Nov 27, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.14.0-1018-gcp #19~24.04.1-Ubuntu SMP Wed Sep 24 23:23:09 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing row_group_limit_pruning (24ae747) to f1ecacc diff using: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Copy link
Contributor

alamb commented Nov 27, 2025

🤖: Benchmark completed

Details

Comparing HEAD and row_group_limit_pruning
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ row_group_limit_pruning ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  2564.84 ms │              2711.60 ms │ 1.06x slower │
│ QQuery 1     │  1177.86 ms │              1306.25 ms │ 1.11x slower │
│ QQuery 2     │  2412.87 ms │              2423.96 ms │    no change │
│ QQuery 3     │  1156.85 ms │              1145.44 ms │    no change │
│ QQuery 4     │  2318.20 ms │              2308.57 ms │    no change │
│ QQuery 5     │ 28372.67 ms │             27912.28 ms │    no change │
│ QQuery 6     │  4041.22 ms │              4170.72 ms │    no change │
│ QQuery 7     │  3812.31 ms │              3670.54 ms │    no change │
└──────────────┴─────────────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 45856.83ms │
│ Total Time (row_group_limit_pruning)   │ 45649.37ms │
│ Average Time (HEAD)                    │  5732.10ms │
│ Average Time (row_group_limit_pruning) │  5706.17ms │
│ Queries Faster                         │          0 │
│ Queries Slower                         │          2 │
│ Queries with No Change                 │          6 │
│ Queries with Failure                   │          0 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ row_group_limit_pruning ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.11 ms │                 2.20 ms │     no change │
│ QQuery 1     │    49.14 ms │                49.42 ms │     no change │
│ QQuery 2     │   134.43 ms │               137.18 ms │     no change │
│ QQuery 3     │   165.90 ms │               166.30 ms │     no change │
│ QQuery 4     │  1082.62 ms │              1093.44 ms │     no change │
│ QQuery 5     │  1509.20 ms │              1523.29 ms │     no change │
│ QQuery 6     │     2.23 ms │                 2.12 ms │     no change │
│ QQuery 7     │    53.46 ms │                53.16 ms │     no change │
│ QQuery 8     │  1464.07 ms │              1443.88 ms │     no change │
│ QQuery 9     │  1920.32 ms │              1861.46 ms │     no change │
│ QQuery 10    │   368.92 ms │               369.59 ms │     no change │
│ QQuery 11    │   425.35 ms │               415.20 ms │     no change │
│ QQuery 12    │  1360.27 ms │              1366.53 ms │     no change │
│ QQuery 13    │  2088.62 ms │              2099.38 ms │     no change │
│ QQuery 14    │  1291.80 ms │              1276.05 ms │     no change │
│ QQuery 15    │  1287.95 ms │              1249.38 ms │     no change │
│ QQuery 16    │  2687.28 ms │              2705.34 ms │     no change │
│ QQuery 17    │  2666.35 ms │              2685.82 ms │     no change │
│ QQuery 18    │  5334.10 ms │              5051.38 ms │ +1.06x faster │
│ QQuery 19    │   128.58 ms │               129.60 ms │     no change │
│ QQuery 20    │  2025.19 ms │              1974.90 ms │     no change │
│ QQuery 21    │  2326.74 ms │              2256.83 ms │     no change │
│ QQuery 22    │  3910.56 ms │              3883.88 ms │     no change │
│ QQuery 23    │ 16847.81 ms │             12664.80 ms │ +1.33x faster │
│ QQuery 24    │   218.91 ms │               218.97 ms │     no change │
│ QQuery 25    │   470.06 ms │               470.62 ms │     no change │
│ QQuery 26    │   211.18 ms │               219.75 ms │     no change │
│ QQuery 27    │  2828.80 ms │              2821.30 ms │     no change │
│ QQuery 28    │ 23333.65 ms │             24002.05 ms │     no change │
│ QQuery 29    │  1010.28 ms │               990.00 ms │     no change │
│ QQuery 30    │  1361.32 ms │              1340.64 ms │     no change │
│ QQuery 31    │  1401.38 ms │              1386.63 ms │     no change │
│ QQuery 32    │  4833.23 ms │              4590.26 ms │ +1.05x faster │
│ QQuery 33    │  5922.93 ms │              5861.58 ms │     no change │
│ QQuery 34    │  6218.31 ms │              6093.69 ms │     no change │
│ QQuery 35    │  1945.21 ms │              1892.18 ms │     no change │
│ QQuery 36    │   118.71 ms │               122.83 ms │     no change │
│ QQuery 37    │    52.92 ms │                53.12 ms │     no change │
│ QQuery 38    │   119.74 ms │               120.43 ms │     no change │
│ QQuery 39    │   195.76 ms │               198.67 ms │     no change │
│ QQuery 40    │    40.90 ms │                45.00 ms │  1.10x slower │
│ QQuery 41    │    39.19 ms │                40.69 ms │     no change │
│ QQuery 42    │    31.96 ms │                33.45 ms │     no change │
└──────────────┴─────────────┴─────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 99487.42ms │
│ Total Time (row_group_limit_pruning)   │ 94963.00ms │
│ Average Time (HEAD)                    │  2313.66ms │
│ Average Time (row_group_limit_pruning) │  2208.44ms │
│ Queries Faster                         │          3 │
│ Queries Slower                         │          1 │
│ Queries with No Change                 │         39 │
│ Queries with Failure                   │          0 │
└────────────────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ row_group_limit_pruning ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 1     │ 130.90 ms │               136.37 ms │    no change │
│ QQuery 2     │  28.85 ms │                29.40 ms │    no change │
│ QQuery 3     │  35.92 ms │                38.73 ms │ 1.08x slower │
│ QQuery 4     │  29.91 ms │                29.59 ms │    no change │
│ QQuery 5     │  87.85 ms │                88.68 ms │    no change │
│ QQuery 6     │  19.75 ms │                19.33 ms │    no change │
│ QQuery 7     │ 224.01 ms │               217.44 ms │    no change │
│ QQuery 8     │  33.21 ms │                32.74 ms │    no change │
│ QQuery 9     │ 103.44 ms │               100.86 ms │    no change │
│ QQuery 10    │  64.34 ms │                64.25 ms │    no change │
│ QQuery 11    │  17.90 ms │                19.12 ms │ 1.07x slower │
│ QQuery 12    │  52.62 ms │                53.57 ms │    no change │
│ QQuery 13    │  47.65 ms │                47.74 ms │    no change │
│ QQuery 14    │  13.71 ms │                13.79 ms │    no change │
│ QQuery 15    │  24.87 ms │                24.60 ms │    no change │
│ QQuery 16    │  25.17 ms │                25.16 ms │    no change │
│ QQuery 17    │ 149.35 ms │               151.78 ms │    no change │
│ QQuery 18    │ 278.08 ms │               276.50 ms │    no change │
│ QQuery 19    │  37.01 ms │                38.28 ms │    no change │
│ QQuery 20    │  49.78 ms │                48.67 ms │    no change │
│ QQuery 21    │ 317.59 ms │               310.12 ms │    no change │
│ QQuery 22    │  18.42 ms │                18.49 ms │    no change │
└──────────────┴───────────┴─────────────────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary                      ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)                      │ 1790.33ms │
│ Total Time (row_group_limit_pruning)   │ 1785.22ms │
│ Average Time (HEAD)                    │   81.38ms │
│ Average Time (row_group_limit_pruning) │   81.15ms │
│ Queries Faster                         │         0 │
│ Queries Slower                         │         2 │
│ Queries with No Change                 │        20 │
│ Queries with Failure                   │         0 │
└────────────────────────────────────────┴───────────┘

@xudong963 xudong963 force-pushed the row_group_limit_pruning branch from fdef903 to 7e1fc4f Compare November 27, 2025 13:31
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Nov 28, 2025
@xudong963 xudong963 force-pushed the row_group_limit_pruning branch from 90295f9 to 63cd878 Compare November 28, 2025 01:58
@xudong963
Copy link
Member Author

Thank you very much for this PR @xudong963

The idea is very cool, and a nicely done PR; I have several suggestions, but I think it is very close.

Suggestions:

  1. Rename the flag to preserve_order
  2. Make this new flag on TableScan visible in EXPLAIN plans so we have a better chance of understanding when it is being applied/not applied
  3. Implement an end to end test for the behavior you want to see (specifically, a test that shows a query on a parquet file and demonstrates that only a single row group is fetched with this flag, and more than one is fetched without the flag)

In my mind, 3 is the most important

Thanks for the review @alamb . I addressed some suggestions:

  1. Renamed the flag to preseerve_order ecbfff7
  2. Avoided the second recursive call in pushdown limit and maintained a state to decide if set preserve_order as true. e919879
  3. Extracted the logic that finds fully matched rg as a function: 647618a
  4. Added the end-to-end sqllogictests for limit pruning 63cd878
  5. Extracted the not simplifier as a new PR: Support simplify not for physical expr #18970

Things later the PR:

  1. Make a TableScanBuilder or something (as a follow on PR) and deprecate TableScan::try_new
  2. Add the preserve_order to explain. The reason I plan to do it in another PR is that I guess it'll make lots of changes for tests, which will expand the PR.

@alamb
Copy link
Contributor

alamb commented Dec 1, 2025

I was mentioning this PR to @crepererum and he also noted there is another interesting potential optimization when we know the predicate is true for the entire row group: we can skip evaluating the predicate for the row group entirely (as the filter can often be quite expensive itself)

@xudong963 is this something else you are contemplating? This PR likely lays the foundation for such an optimization. If you think it is worthwhile I'll file a ticket

@alamb
Copy link
Contributor

alamb commented Dec 1, 2025

Extracted the not simplifier as a new PR: #18970

I just reviewed this -- it looks nice 👍

@xudong963 xudong963 force-pushed the row_group_limit_pruning branch from 60fed57 to 8c08ac9 Compare January 7, 2026 05:01
@github-actions github-actions bot added physical-expr Changes to the physical-expr crates documentation Improvements or additions to documentation and removed execution Related to the execution crate labels Jan 7, 2026
@xudong963
Copy link
Member Author

👏 this is a great PR -- thank you @xudong963 I went over the code and tests very carefully and I think it is really nice

I would personally suggest we merge this one into main and not into 52 to give it maximum "bake time" (testing before release) but I defer to you in that regard

Thank you @alamb , I addressed your newest comments in the commit.

I also like to just merge into main and give it some "bake time".

@xudong963
Copy link
Member Author

Hi @xudong963 I'm sorry for the delay. I have also been on holiday and struggling to keep up with things. I did read the Snowflake paper... great stuff! I do plan to review this PR some time this week, it's on my shortlist. Thank you for the great work and for your patience.

Thank you @adriangb , no worries, looking forward to your thoughts!

Comment on lines +79 to +80
/// If should keep the output rows in order
pub preserve_order: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be pub? Is it pub only for tests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, we need the pub like other fields, because in create_file_opener, the opener is created by Structure way:

ParquetOpener {
            partition_index: partition,
            projection: self.projection.clone(),
            batch_size: self
                .batch_size
                .expect("Batch size must set before creating ParquetOpener"),
            limit: base_config.limit,
            preserve_order: base_config.preserve_order,
            predicate: self.predicate.clone(),
            table_schema: self.table_schema.clone(),
            metadata_size_hint: self.metadata_size_hint,
            metrics: self.metrics().clone(),
            parquet_file_reader_factory,
            pushdown_filters: self.pushdown_filters(),
            reorder_filters: self.reorder_filters(),
            force_filter_selections: self.force_filter_selections(),
            enable_page_index: self.enable_page_index(),
            enable_bloom_filter: self.bloom_filter_on_read(),
            enable_row_group_stats_pruning: self.table_parquet_options.global.pruning,
            coerce_int96,
            #[cfg(feature = "parquet_encryption")]
            file_decryption_properties,
            expr_adapter_factory,
            #[cfg(feature = "parquet_encryption")]
            encryption_factory: self.get_encryption_factory_with_config(),
            max_predicate_cache_size: self.max_predicate_cache_size(),
            reverse_row_groups: self.reverse_row_groups,
        });

We could add a new method later, then see if some fields don't need the pub

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't it be pub(crate) then?


// Simplify the NOT expression (e.g., NOT(c1 = 0) -> c1 != 0)
// before building the pruning predicate
let simplifier = PhysicalExprSimplifier::new(arrow_schema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could short circuit by adding a not() to PruningPredicate or something like that to take an existing PruningPredicate and negate it by cloning the innner expression Arc or something.

Comment on lines 155 to 156
/// Whether the scan's limit is order sensitive
pub preserve_order: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seeing this in isolation is a bit confusing. I think adding 2-3 lines of docstring summarizing the explanation from the PR description or something would be helpful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done fd95aaf

Comment on lines +374 to +376
let Some(data_source_exec) = plan.as_any().downcast_ref::<DataSourceExec>() else {
return plan;
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It saddens me every time I see a downcast. Maybe you decided not to because it's more code churn or something but could we record in the FileScanConfig if an order was pushed down into it and if so set the preserve_order flag? As opposed to the case where the scan has an ordering but it isn't required by upstream operators.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could set the preserve_order flag to true when an order was pushed down into FileScanConfig.

But here, I'm trying to solve the case described in #18868 (comment)

However, a complication arises: if the data distribution and physical ordering of table t already satisfy ORDER BY a, the EnforceSorting phase in the Physical Optimizer will remove the Sort node. Subsequently, the LIMIT is pushed down to the DataSource during the LimitPushdown phase.

In this scenario, we cannot rely solely on the presence of the LIMIT at the Parquet level to decide whether to prune. If we prune based on a limit that was originally associated with a removed Sort, we might violate the required global ordering.

Proposed Solution: To address this fundamentally, when a Sort node is removed because the distribution already matches the requirements, we should mark the resulting Limit node as "Order-Sensitive." During the subsequent LimitPushdown, we can detect this flag and set the preserve_order attribute to true in the ParquetOpener. This ensures that Limit Pruning is bypassed when the global order must be maintained.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see yeah. I wonder if with the new sort pushdown rule we can tweak this?

My thought is that the scan can have a sort order that was discovered or user provided but not promise to upstream nodes that it's going to output sorted results until the upstream nodes ask it to do so. At that point it would set this internal flag so when we get a limit we can remember that we promised our parent nodes we'd be outputting sorted results.

Put another way, at the same time that a file scan node sets it's equivalence properties to output sorted results it should set this flag. So maybe all we need to do is in ListingTable set this flag if we also give the scan node a specified ordering? Then it gets set one of two ways:

  1. In the listing step, if we discover a sort order from Parquet metadata or if the user provided a sort order.
  2. Via the new sort pushdown rule.
    So either way, if the scan node is outputting sorted data the flag is set?

I just dislike this sort of downcast matching in a generic engine like DataFusion. It means that custom scan nodes can't benefit from this feature, etc.

@xudong963 xudong963 force-pushed the row_group_limit_pruning branch from 8a39aa1 to 682b68b Compare January 8, 2026 05:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate datasource Changes to the datasource crate documentation Improvements or additions to documentation optimizer Optimizer rules performance Make DataFusion faster physical-expr Changes to the physical-expr crates physical-plan Changes to the physical-plan crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants