Skip to content

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Aug 13, 2025

Which issue does this PR close?

Closes #2131
Closes #1836

Rationale for this change

  • The logic for determining when to inject a CopyExec::UnpackOrDeepCopy into a plan was complex to maintain and is only needed for the soon to be legacy native_comet scan. It is not needed for the native_datafusion or native_iceberg_compat scans. This PR simplifies the implementation so that the deep copy logic happens directly in ScanExec. This is likely slightly less performance than before for native_comet scans.
  • Needed for Iceberg integration. This is a more robust fix than the quick hack we included in fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel #1987

What changes are included in this PR?

  • Add has_buffer_reuse flag to ScanExec protobuf
  • Perform deep copy of arrays directly inside ScanExec if this flag is enabled
  • Populate this flag for native_comet scan and Iceberg scans (which currently wrap native_comet)
  • Populate this flag for Comet sinks that wrap native_comet scans
  • Remove Comet's FilterExec, which was a slightly modified fork of DataFusion's FilterExec. This is no longer needed now that we copy the incoming batches before the filter.

How are these changes tested?

I will wait for #1987 to be merged and then make sure that this PR does not introduce any regressions.

@codecov-commenter
Copy link

codecov-commenter commented Aug 13, 2025

Codecov Report

❌ Patch coverage is 84.21053% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.47%. Comparing base (f09f8af) to head (310ef0c).
⚠️ Report is 383 commits behind head on main.

Files with missing lines Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 70.00% 1 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2135      +/-   ##
============================================
+ Coverage     56.12%   58.47%   +2.34%     
- Complexity      976     1253     +277     
============================================
  Files           119      143      +24     
  Lines         11743    13202    +1459     
  Branches       2251     2371     +120     
============================================
+ Hits           6591     7720    +1129     
- Misses         4012     4257     +245     
- Partials       1140     1225      +85     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andygrove andygrove changed the title fix: Iceberg scan buffer reuse [WIP] fix: Improve logic for adding CopyExec operators to prevent memory corruption from mutable buffer reuse Aug 13, 2025
@andygrove andygrove closed this Aug 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Avoid unnecessary uses of CopyExec in native plan [iceberg] Memory corruption due to unsafe reuse of mutable buffers

2 participants