Releases: apache/datafusion-comet
0.12.0
DataFusion Comet 0.12.0 Changelog
This release consists of 105 commits from 13 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: Fix `None.get` in `stringDecode` when `bin` child cannot be converted #2606 (cfmcgrady)
- fix: Update FuzzDataGenerator to produce dictionary-encoded string arrays & fix bugs that this exposes #2635 (andygrove)
- fix: Fallback to Spark for lpad/rpad for unsupported arguments & fix negative length handling #2630 (andygrove)
- fix: Mark SortOrder with floating-point as incompatible #2650 (andygrove)
- fix: Fall back to Spark for `trunc`/`date_trunc` functions when format string is unsupported, or is not a literal value #2634 (andygrove)
- fix: [native_datafusion] only pass single partition of PartitionedFiles into DataSourceExec #2675 (mbutrovich)
- fix: Fix subcommands options in fuzz-testing #2684 (manuzhang)
- fix: Do not replace SMJ with HJ for `LeftSemi` #2687 (comphead)
- fix: Apply spotless on Iceberg 1.8.1 diff [iceberg] #2700 (hsiang-c)
- fix: Fix generate-user-guide-reference-docs failure when mvn command is not executed at root #2691 (manuzhang)
- fix: Fix missing SortOrder fallback reason in range partitioning #2716 (andygrove)
- fix: CometLiteral class cast exception with arrays #2718 (andygrove)
- fix: NormalizeNaNAndZero::children() returns child's child #2732 (mbutrovich)
- fix: checkSparkMaybeThrows should compare Spark and Comet results in success case #2728 (andygrove)
- fix: Mark `WindowsExec` as incompatible #2748 (andygrove)
- fix: Add strict floating point mode and fallback to Spark for min/max/sort on floating point inputs when enabled #2747 (andygrove)
- fix: Implement producedAttributes for CometWindowExec #2789 (rahulbabarwal89)
- fix: Pass all Comet configs to native plan #2801 (andygrove)
Implemented enhancements:
- feat: Add option to write benchmark results to file #2640 (andygrove)
- feat: Implement metrics for iceberg compat #2615 (EmilyMatt)
- feat: Define function signatures in CometFuzz #2614 (andygrove)
- feat: cherry-pick UUID conversion logic from #2528 #2648 (mbutrovich)
- feat: support `concat` for strings #2604 (comphead)
- feat: Add support for `abs` #2689 (andygrove)
- feat: Support variadic function in CometFuzz #2682 (manuzhang)
- feat: CometExecRule refactor: Unify CometNativeExec creation with Serde in CometOperatorSerde trait #2768 (andygrove)
- feat: support cot #2755 (psvri)
- feat: Add bash script to build and run fuzz testing #2686 (manuzhang)
- feat: Add getSupportLevel to CometAggregateExpressionSerde trait #2777 (andygrove)
- feat: Add CI check to ensure generated docs are in sync with code #2779 (andygrove)
- feat: Add prettier enforcement #2783 (andygrove)
- feat: hyperbolic trig functions #2784 (psvri)
- feat: [iceberg] Native scan by serializing FileScanTasks to iceberg-rust #2528 (mbutrovich)
Documentation updates:
- docs: Add changelog for 0.11.0 release #2585 (mbutrovich)
- docs: Improve documentation layout #2587 (andygrove)
- docs: Publish 0.11.0 user guide #2589 (andygrove)
- docs: Put Comet logo in top nav bar, respect light/dark mode #2591 (andygrove)
- docs: Improve main landing page #2593 (andygrove)
- docs: Improve site navigation #2597 (andygrove)
- docs: Update benchmark results #2596 (andygrove)
- docs: Upgrade pydata-sphinx-theme to 0.16.1 #2602 (andygrove)
- docs: Fix redirect #2603 (andygrove)
- docs: Fix broken image link #2613 (andygrove)
- docs: Add FFI docs to contributor guide #2668 (andygrove)
- docs: Various documentation updates #2674 (andygrove)
- docs: Add supported SortOrder expressions and fix a typo #2694 (andygrove)
- docs: Minor docs update for running Spark SQL tests #2712 (andygrove)
- docs: Update contributor guide for adding a new expression #2704 (andygrove)
- docs: Documentation updates for `LocalTableScan` and `WindowExec` #2742 (andygrove)
- docs: Typo fix #2752 (wForget)
- docs: Categorize some configs as `testing` and add notes about known time zone issues #2740 (andygrove)
- docs: Run prettier on all markdown files #2782 (andygrove)
- docs: Ignore prettier formatting for generated tables #2790 (andygrove)
- docs: Add new section to contributor guide, explaining how to add a new operator #2758 (andygrove)
Other:
- chore: Start 0.12.0 development #2584 (mbutrovich)
- chore: Bump Spark from 3.5.6 to 3.5.7 #2574 (cfmcgrady)
- chore(deps): bump parquet from 56.0.0 to 56.2.0 in /native #2608 (dependabot[bot])
- chore(deps): bump tikv-jemallocator from 0.6.0 to 0.6.1 in /native #2609 (dependabot[bot])
- chore(deps): bump tikv-jemalloc-ctl from 0.6.0 to 0.6.1 in /native #2610 (dependabot[bot])
- tests: FuzzDataGenerator instead of Parquet-specific generator #2616 (mbutrovich)
- chore: Simplify on-heap memory configuration #2599 (andygrove)
- Feat: Add sha1 function impl #2471 (kazantsev-maksim)
- chore: Refactor Parquet/DataFrame fuzz data generators #2629 (andygrove)
- chore: Remove needless from_raw calls #2638 (EmilyMatt)
- chore: support DataFusion 50.3.0 #2605 (comphead)
- chore(deps): bump actions/upload-artifact from 4 to 5 #2654 (dependabot[bot])
- chore(deps): bump cc from 1.2.42 to 1.2.43 in /native #2653 (dependabot[bot])
- chore(deps): bump actions/download-artifact from 5 to 6 #2652 (dependabot[bot])
- chore: extract c...
0.11.0
DataFusion Comet 0.11.0 Changelog
This release consists of 131 commits from 15 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: temporarily ignore test for hdfs file systems #2359 (parthchandra)
- fix: Check reused broadcast plan in non-AQE and make setNumPartitions thread safe #2398 (wForget)
- fix: correct `missingInput` for `CometHashAggregateExec` #2409 (comphead)
- fix:clippy errros rust 1.9.0 update #2419 (coderfender)
- fix: Avoid spark plan execution cache preventing CometBatchRDD numPartitions change #2420 (wForget)
- fix: regressions in `CometToPrettyStringSuite` #2384 (hsiang-c)
- fix: Byte array Literals failed on cast #2432 (comphead)
- fix: Do not push down subquery filters on native_datafusion scan #2438 (wForget)
- fix: Improve error handling when resolving S3 bucket region #2440 (andygrove)
- fix: [iceberg] additional parquet independent api for iceberg integration #2442 (parthchandra)
- fix: Specify reqwest crate features #2446 (andygrove)
- fix: distributed RangePartitioning bounds calculation with native shuffle #2258 (mbutrovich)
- fix: fix regression in tpcbench.py #2512 (andygrove)
- fix: [iceberg] Close reader instance in ReadConf #2510 (hsiang-c)
- fix: Enable plan stability tests for `auto` scan #2516 (andygrove)
- fix: Capture unexpected output when retrieving JVM 17 args in Makefile #2566 (zuston)
Performance related:
- perf: New Configuration from shared conf to avoid high costs #2402 (wForget)
- perf: Use DataFusion's `count_udaf` instead of `SUM(IF(expr IS NOT NULL, 1, 0))` #2407 (andygrove)
- perf: Improve BroadcastExchangeExec conversion #2417 (wForget)
Implemented enhancements:
- feat: Add dynamic `enabled` and `allowIncompat` configs for all supported expressions #2329 (andygrove)
- feat: feature specific tests #2372 (parthchandra)
- feat: Support more date part expressions #2316 (wForget)
- feat: rpad support column for second arg instead of just literal #2099 (coderfender)
- feat: Support comet native log level conf #2379 (wForget)
- feat: Enable WeekDay function #2411 (wForget)
- feat: Add nested Array literal support #2181 (comphead)
- feat:add_additional_char_support_rpad #2436 (coderfender)
- feat: do not fallback to Spark for `COUNT(distinct)` #2429 (comphead)
- feat: implement_ansi_eval_mode_arithmetic #2136 (coderfender)
- feat: Add plan conversion statistics to extended explain info #2412 (andygrove)
- feat: implement_comet_native_lpad_expr #2102 (coderfender)
- feat: Add `backtrace` feature to simplify enabling native backtraces in `CometNativeException` #2515 (andygrove)
- feat: Support reverse function with ArrayType input #2481 (cfmcgrady)
- feat: Change default off-heap memory pool from `greedy_unified` to `fair_unified` #2526 (andygrove)
- feat: Make DiskManager `max_temp_directory_size` configurable #2479 (manuzhang)
- feat: Parquet Modular Encryption with Spark KMS for native readers #2447 (mbutrovich)
- feat: Add support for Spark-compatible cast from integral to decimal #2472 (coderfender)
- feat:Support ANSI mode integral divide #2421 (coderfender)
- feat: Add config to enable running Comet in onheap mode #2554 (andygrove)
- feat:support ansi mode rounding function #2542 (coderfender)
- feat:support ansi mode remainder function #2556 (coderfender)
- feat: Implement array-to-string cast support #2425 (cfmcgrady)
- feat: Various improvements to memory pool configuration, logging, and documentation #2538 (andygrove)
- feat: Enable complex types for columnar shuffle #2573 (mbutrovich)
- feat: support_decimal_types_bool_cast_native_impl #2490 (coderfender)
- feat: Use buf write to reduce system call on index write #2579 (zuston)
Documentation updates:
- doc: Document usage IcebergCometBatchReader.java #2347 (comphead)
- docs: Add changelog for 0.10.0 release #2361 (andygrove)
- docs: Fix error in docs #2373 (andygrove)
- docs: Fix more comet versions in docs #2374 (andygrove)
- docs: Publish 0.10.0 user guide #2394 (andygrove)
- doc: macos benches doc clarifications #2418 (comphead)
- docs: update configs.md after #2422 #2428 (mbutrovich)
- docs: update docs and tuning guide related to native shuffle #2487 (mbutrovich)
- docs: Improve EC2 benchmarking guide #2474 (andygrove)
- docs: docs_update_ansi_support #2496 (coderfender)
- docs:support lpad expression documentation update #2517 (coderfender)
- docs: doc changes to support ANSI mode integral divide #2570 (coderfender)
- docs: Split configuration guide into different sections (scan, exec, shuffle, etc) #2568 (andygrove)
- docs: doc update to support ANSI mode remainder function #2576 (coderfender)
- docs: Documentation updates #2581 (andygrove)
Other:
- chore(deps): bump uuid from 1.18.0 to 1.18.1 in /native #2336 (dependabot[bot])
- build: Check that all Scala test suites run in PR builds #2304 (andygrove)
- chore: Start 0.11.0 development #2365 (andygrove)
- chore: Split expression serde hash map into separate categories #2322 (andygrove)
- chore: exclude Iceberg diffs from rat checks #2376 (hsiang-c)
- chore: Refactor UnaryMinus serde #2378 (andygrove)
- chore: Revert "chore: [1941-Part1]: Introduce `map_sort` scalar function (#2… #2381 (comphead)
- chore: Refactor Literal serde [#2377](https://github.com/apache/datafusion-comet/pull/...
0.10.1
DataFusion Comet 0.10.1 Changelog
This release consists of 7 commits from 1 contributor. See credits at the end of this changelog for more information.
Documentation updates:
- docs: [branch-0.10] Update version number in branch-0.10 user guide #2395 (andygrove)
Other:
- chore: [branch-0.10] Support Spark 4.0.1 instead of 4.0.0 (#2414) #2497 (andygrove)
- build: [branch-0.10] Stop caching libcomet in CI (#2498) #2502 (andygrove)
- chore: [branch-0.10] perf: Improve BroadcastExchangeExec conversion #2501 (andygrove)
- chore: [branch-0.10] [iceberg] additional parquet independent api for iceberg integration (#2442) #2499 (andygrove)
- fix: [branch-0.10] Avoid spark plan execution cache preventing CometBatchRDD numPartitions change (#2420) #2503 (andygrove)
- build: [branch-0.10] Bump version to 0.10.1 #2508 (andygrove)
Credits
Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
7 Andy Grove
Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.
0.10.0
DataFusion Comet 0.10.0 Changelog
This release consists of 183 commits from 26 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: [Iceberg] Fix decimal corruption #1985 (andygrove)
- fix: broken link in development.md #2024 (petern48)
- fix: [iceberg] Add LogicalTypeAnnotation in ParquetColumnSpec #2000 (huaxingao)
- fix: hdfs read into buffer fully #2031 (parthchandra)
- fix: Refactor arithmetic serde and fix correctness issues with EvalMode::TRY #2018 (andygrove)
- fix: clean up [iceberg] integration APIs #2032 (huaxingao)
- fix: zero Arrow Array offset before sending across FFI #2052 (mbutrovich)
- fix: [iceberg] more fixes for Iceberg integration APIs. #2078 (parthchandra)
- fix: Add support for StringDecode in Spark 4.0.0 #2075 (peter-toth)
- fix: Avoid double free in CometUnifiedShuffleMemoryAllocator #2122 (andygrove)
- fix: Remove duplicate serde code #2098 (andygrove)
- fix: Improve logic for determining when an UnpackOrDeepCopy is needed #2142 (andygrove)
- fix: Add CopyExec to inputs to SortMergeJoinExec #2155 (andygrove)
- fix: Fix repeatedly url-decode path when reading parquet from s3 using native parquet reader #2138 (Kontinuation)
- fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel #1987 (hsiang-c)
- fix: [iceberg] Fall back to spark for schemas with empty structs #2204 (andygrove)
- fix: Fix failing TPC-DS workflow in PR CI runs #2207 (andygrove)
- fix: [iceberg] order query result deterministically #2208 (hsiang-c)
- fix: use `spark.comet.batchSize` instead of `conf.arrowMaxRecordsPerBatch` for data that is coming from Java #2196 (rluvaton)
- fix: if expr nullable #2217 (Asura7969)
- fix: Support `auto` scan mode with Spark 4.0.0 #1975 (andygrove)
- fix: Make Sha2 fallback message more user-friendly #2213 (rishvin)
- fix: separate type checking for CometExchange and CometColumnarExchange #2241 (mbutrovich)
- fix: Fix potential resource leak in native shuffle block reader #2247 (andygrove)
- fix: Remove unreachable code in `CometScanRule` #2252 (andygrove)
- fix: Fall back to `native_comet` for encrypted Parquet scans #2250 (andygrove)
- fix: Fall back to `native_comet` when object store not supported by `native_iceberg_compat` #2251 (andygrove)
- fix: split expr.proto file (new) #2267 (kination)
- fix: handle cast to dictionary vector introduced by case when #2044 (parthchandra)
- fix: Remove check for custom S3 endpoints #2288 (andygrove)
- fix: implement lazy evaluation in Coalesce function #2270 (coderfender)
- fix: Update benchmarking scripts #2293 (andygrove)
- fix: Fix regression in NativeConfigSuite #2299 (andygrove)
- fix: Validating object store configs should not throw exception #2308 (andygrove)
- fix: TakeOrderedAndProjectExec is not reporting all fallback reasons #2323 (kazuyukitanimura)
- fix: Fallback length function with binary input #2349 (wForget)
Performance related:
- perf: Optimize `AvgDecimalGroupsAccumulator` #1893 (leung-ming)
- perf: Optimize `SumDecimalGroupsAccumulator::update_single` #2069 (leung-ming)
- perf: Avoid FFI copy in `ScanExec` when reading data from exchanges #2268 (andygrove)
Implemented enhancements:
- feat: Add from_unixtime support #1943 (kazuyukitanimura)
- feat: randn expression support #2010 (akupchinskiy)
- feat: monotonically_increasing_id and spark_partition_id implementation #2037 (akupchinskiy)
- feat: support `map_entries` #2059 (comphead)
- feat: Support Array Literal #2057 (comphead)
- feat: Add new trait for operator serde #2115 (andygrove)
- feat: limit with offset support #2070 (akupchinskiy)
- feat: Include scan implementation name in CometScan nodeName #2141 (andygrove)
- feat: Add config option to log fallback reasons #2154 (andygrove)
- feat: [iceberg] Enable Comet shuffle in Iceberg diff #2205 (andygrove)
- feat: Improve shuffle fallback reporting #2194 (andygrove)
- feat: Reset data buf of NativeBatchDecoderIterator on close #2235 (wForget)
- feat: Improve fallback mechanism for ANSI mode #2211 (andygrove)
- feat: Support hdfs with OpenDAL #2244 (wForget)
- feat: Ignore fallback info for command execs #2297 (wForget)
- feat: Improve some confusing fallback reasons #2301 (wForget)
- feat: Make supported hadoop filesystem schemes configurable #2272 (wForget)
- feat: [1941-Part1]: Introduce map-sort scalar function #2262 (rishvin)
- feat: [iceberg] delete rows support using selection vectors #2346 (parthchandra)
Documentation updates:
- docs: Update benchmark results for 0.9.0 #1959 (andygrove)
- doc: Add comment about local clippy run before submitting a pull request #1961 (akupchinskiy)
- docs: Minor improvements to Spark SQL test docs #1980 (andygrove)
- docs: Update Maven links for 0.9.0 release #1988 (andygrove)
- docs: Documentation updates for 0.9.0 release #1981 (andygrove)
- docs: Add guide showing comparison between Comet and Gluten #2012 (andygrove)
- docs: Remove legacy comment in docs #2022 (andygrove)
- docs: Update Gluten comparision to clarify that Velox is open-source #2043 (andygrove)
- docs: Improve Gluten comparison based on feedback from the community #2048 (andygrove)
- docs: added a missing export into the plan stability section #2071 (akupchinskiy)
- doc: Added documentation for supported map functions #2074 (codetyri0n)
- doc: Alternative way to start Spark Master to run benchmarks #2072 (comphead)
- docs: Update to support try arithmetic functions #2143 (coderfender)
- doc: update macos standalone spark start instructions #2103 (comphead)
- docs: Update confs to bypass Iceberg Spark issues #2166 (hsiang-c)
- docs: Add Roadmap #2191 (andygrove)
- docs: Update installation guide for 0.9.1 #2230 (andygrov...
0.9.1
DataFusion Comet 0.9.1 Changelog
This release consists of 2 commits from 1 contributor. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: [branch-0.9] Backport FFI fix #2164 (andygrove)
- fix: [branch-0.9] Avoid double free in CometUnifiedShuffleMemoryAllocator #2201 (andygrove)
Credits
Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
2 Andy Grove
Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.
0.9.0
DataFusion Comet 0.9.0 Changelog
This release consists of 139 commits from 24 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: typo for `instr` in fuzz testing #1686 (mbutrovich)
- fix: Bucketed scan fallback for native_datafusion Parquet scan #1720 (mbutrovich)
- fix: Skip row index Spark SQL tests for native_datafusion Parquet scan #1724 (mbutrovich)
- fix: Check acquired memory when CometMemoryPool grows #1732 (wForget)
- fix: Fix data race in memory profiling #1727 (andygrove)
- fix: Enable some DPP Spark SQL tests #1734 (andygrove)
- fix: support literal null list and map #1742 (kazuyukitanimura)
- fix: get_struct field is incorrect when struct in array #1687 (comphead)
- fix: cast map types correctly in schema adapter #1771 (parthchandra)
- fix: correct schema type checking in native_iceberg_compat #1755 (parthchandra)
- fix: default values for native_datafusion scan #1756 (mbutrovich)
- fix: [native_scans] Support `CASE_SENSITIVE` when reading Parquet #1782 (andygrove)
- fix: cargo install tpchgen-cli in benchmark doc #1797 (zhuqi-lucas)
- fix: support `map_keys` #1788 (comphead)
- fix: fall back on nested types for default values #1799 (mbutrovich)
- fix: Re-enable Spark 4 tests on Linux #1806 (andygrove)
- fix: fallback to Spark scan if encryption is enabled (native_datafusion/native_iceberg_compat) #1785 (parthchandra)
- fix: native_iceberg_compat: move checking parquet types above fetching batch #1809 (mbutrovich)
- fix: translate missing or corrupt file exceptions, fall back if asked to ignore #1765 (mbutrovich)
- fix: Fix Spark SQL AQE exchange reuse test failures #1811 (coderfender)
- fix: Enable more Spark SQL tests #1834 (andygrove)
- fix: support `map_values` #1835 (comphead)
- fix: Handle case where num_cols == 0 in native execution #1840 (andygrove)
- fix: Fix shuffle writing rows containing null struct fields #1845 (Kontinuation)
- fix: Fall back to Spark for `RANGE BETWEEN` window expressions #1848 (andygrove)
- fix: Remove COMET_SHUFFLE_FALLBACK_TO_COLUMNAR hack #1865 (andygrove)
- fix: support read Struct by user schema #1860 (comphead)
- fix: map parquet field_id correctly (native_iceberg_compat) #1815 (parthchandra)
- fix: cast_struct_to_struct aligns to Spark behavior #1879 (mbutrovich)
- fix: correctly handle schemas with nested array of struct (native_iceberg_compat) #1883 (parthchandra)
- fix: set RangePartitioning for native shuffle default to false #1907 (mbutrovich)
- fix: conflict between #1905 and #1892. #1919 (mbutrovich)
- fix: Add overflow check to evaluate of sum decimal accumulator #1922 (leung-ming)
- fix: Fix overflow handling when casting float to decimal #1914 (leung-ming)
- fix: Ignore a test case fails on Miri #1951 (leung-ming)
Performance related:
- perf: Add memory profiling #1702 (andygrove)
- perf: Add performance tracing capability #1706 (andygrove)
- perf: Add `COMET_RESPECT_PARQUET_FILTER_PUSHDOWN` config #1936 (andygrove)
Implemented enhancements:
- feat: add jemalloc as optional custom allocator #1679 (mbutrovich)
- feat: support `array_repeat` #1680 (comphead)
- feat: More warning info for users #1667 (hsiang-c)
- feat: decode() expression when using 'utf-8' encoding #1697 (mbutrovich)
- feat: regexp_replace() expression with no starting offset #1700 (mbutrovich)
- feat: Improve performance tracing feature #1730 (andygrove)
- feat: Set/cancel with job tag and make max broadcast table size configurable #1693 (wForget)
- feat: Add support for `expm1` expression from `datafusion-spark` crate #1711 (andygrove)
- feat: Add config option for showing all Comet plan transformations #1780 (andygrove)
- feat: Support Type widening: byte → short/int/long, short → int/long #1770 (huaxingao)
- feat: Translate Hadoop S3A configurations to object_store configurations #1817 (Kontinuation)
- feat: Upgrade to official DataFusion 48.0.0 release #1877 (andygrove)
- feat: Add experimental auto mode for `COMET_PARQUET_SCAN_IMPL` #1747 (andygrove)
- feat: support RangePartitioning with native shuffle #1862 (mbutrovich)
- feat: Add support for signum expression #1889 (andygrove)
- feat: Add support to lookup map by key #1898 (comphead)
- feat: support array_max #1892 (drexler-sky)
- feat: pass ignore_nulls flag to first and last #1866 (rluvaton)
- feat: Implement ToPrettyString #1921 (andygrove)
- feat: Support hadoop s3a config in native_iceberg_compat #1925 (parthchandra)
- feat: rand expression support #1199 (akupchinskiy)
- feat: supports array_distinct #1923 (drexler-sky)
- feat: `auto` scan mode should check for supported file location #1930 (andygrove)
- feat: Encapsulate Parquet objects #1920 (huaxingao)
- feat: Change default value of `COMET_NATIVE_SCAN_IMPL` to `auto` #1933 (andygrove)
- feat: Supports array_union #1945 (drexler-sky)
Documentation updates:
- docs: Add changelog for 0.8.0 #1675 (andygrove)
- docs: Add instructions on running TPC-H on macOS #1647 (andygrove)
- docs: Add documentation for accelerating Iceberg Parquet scans with Comet #1683 (andygrove)
- docs: Add note on setting `core.abbrev` when generating diffs #1735 (andygrove)
- docs: Remove outdated param in macos bench guide #1748 (ding-young)
- docs: Add instructions for running i...
0.8.0
DataFusion Comet 0.8.0 Changelog
This release consists of 81 commits from 11 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: remove code duplication in native_datafusion and native_iceberg_compat implementations #1443 (parthchandra)
- fix: Refactor CometScanRule and fix bugs #1483 (andygrove)
- fix: check if handle has been initialized before closing #1554 (wForget)
- fix: Taking slicing into account when writing BooleanBuffers as fast-encoding format #1522 (Kontinuation)
- fix: isCometEnabled name conflict #1569 (kazuyukitanimura)
- fix: make register_object_store use same session_env as file scan #1555 (wForget)
- fix: adjust CometNativeScan's doCanonicalize and hashCode for AQE, use DataSourceScanExec trait #1578 (mbutrovich)
- fix: corrected the logic of eliminating CometSparkToColumnarExec #1597 (wForget)
- fix: avoid panic caused by close null handle of parquet reader #1604 (wForget)
- fix: Make AQE capable of converting Comet shuffled joins to Comet broadcast hash joins #1605 (Kontinuation)
- fix: Making shuffle files generated in native shuffle mode reclaimable #1568 (Kontinuation)
- fix: Support per-task shuffle write rows and shuffle write time metrics #1617 (Kontinuation)
- fix: Modify Spark SQL core 2 tests for `native_datafusion` reader, change 3.5.5 diff hash length to 11 #1641 (mbutrovich)
- fix: fix spark/sql test failures in native_iceberg_compat #1593 (parthchandra)
- fix: handle missing field correctly in native_iceberg_compat #1656 (parthchandra)
- fix: better int96 support for experimental native scans #1652 (mbutrovich)
- fix: respect `ignoreNulls` flag in `first_value` and `last_value` #1626 (andygrove)
- fix: update row groups count in internal metrics accumulator #1658 (parthchandra)
- fix: Shuffle should maintain insertion order #1660 (EmilyMatt)
Performance related:
- perf: Use a global tokio runtime #1614 (andygrove)
- perf: Respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED config #1619 (andygrove)
- perf: Experimental fix to avoid join strategy regression #1674 (andygrove)
Implemented enhancements:
- feat: add read array support #1456 (comphead)
- feat: introduce hadoop mini cluster to test native scan on hdfs #1556 (wForget)
- feat: make parquet native scan schema case insensitive #1575 (wForget)
- feat: enable iceberg compat tests, more tests for complex types #1550 (comphead)
- feat: pushdown filter for native_iceberg_compat #1566 (wForget)
- feat: Fix struct of arrays schema issue #1592 (comphead)
- feat: adding more struct/arrays tests #1594 (comphead)
- feat: respect `batchSize`/`workerThreads`/`blockingThreads` configurations for native_iceberg_compat scan #1587 (wForget)
- feat: add MAP type support for first level #1603 (comphead)
- feat: Add more tests for nested types combinations for `native_datafusion` #1632 (comphead)
- feat: Override MapBuilder values field with expected schema #1643 (comphead)
- feat: track unified memory pool #1651 (wForget)
- feat: Add support for complex types in native shuffle #1655 (andygrove)
Documentation updates:
- docs: Update configuration guide to show optional configs #1524 (andygrove)
- docs: Add changelog for 0.7.0 release #1527 (andygrove)
- docs: Use a shallow clone for Spark SQL test instructions #1547 (mbutrovich)
- docs: Update benchmark results for 0.7.0 release #1548 (andygrove)
- doc: Renew `kubernetes.md` #1549 (comphead)
- docs: various improvements to tuning guide #1525 (andygrove)
- docs: Update supported Spark versions #1580 (andygrove)
- docs: change OSX/OS X to macOS #1584 (mbutrovich)
- docs: docs for benchmarking in aws ec2 #1601 (andygrove)
- docs: Update compatibility docs for new native scans #1657 (andygrove)
- doc: Document local HDFS setup #1673 (comphead)
Other:
- chore: fix issue in release process #1528 (andygrove)
- chore: Remove all subdependencies #1514 (EmilyMatt)
- chore: Drop support for Spark 3.3 (EOL) #1529 (andygrove)
- chore: Prepare for 0.8.0 development #1530 (andygrove)
- chore: Re-enable GitHub discussions #1535 (andygrove)
- chore: [FOLLOWUP] Drop support for Spark 3.3 (EOL) #1534 (kazuyukitanimura)
- build: Use unique name for surefire artifacts #1544 (andygrove)
- chore: Update links for released version #1540 (andygrove)
- chore: Enable Comet explicitly in `CometTPCDSQueryTestSuite` #1559 (andygrove)
- chore: Fix some inconsistencies in memory pool configuration #1561 (andygrove)
- upgraded spark 3.5.4 to 3.5.5 #1565 (YanivKunda)
- minor: fix typo #1570 (wForget)
- Chore: simplify array related functions impl #1490 (kazantsev-maksim)
- added fallback using reflection for backward-compatibility #1573 (YanivKunda)
- chore: Override node name for CometSparkToColumnar #1577 (l0kr)
- chore: Reimplement ShuffleWriterExec using interleave_record_batch #1511 (Kontinuation)
- chore: Run Comet tests for more Spark versions #1582 (andygrove)
- Feat: support array_except function #1343 (kazantsev-maksim)
- minor: Fix clippy warnings #1606 (Kontinuation)
- chore: Remove some unwraps in hashing code #1600 (andygrove)
- chore: Remove redundant shims for getFailOnError #1608 (andygrove)
- chore: Making comet native operators write spill files to spark local dir #1581 (Kontinuation)
- chore: Refactor QueryPlanSerde to use idiomatic Scala and red...
0.7.0
DataFusion Comet 0.7.0 Changelog
This release consists of 46 commits from 11 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: Change default value of COMET_SCAN_ALLOW_INCOMPATIBLE and add documentation #1398 (andygrove)
- fix: Reduce cast.rs and utils.rs logic from parquet_support.rs for experimental native scans #1387 (mbutrovich)
- fix: Remove more cast.rs logic from parquet_support.rs for experimental native scans #1413 (mbutrovich)
- fix: fix various unit test failures in native_datafusion and native_iceberg_compat readers #1415 (parthchandra)
- fix: metrics tests for native_datafusion experimental native scan #1445 (mbutrovich)
- fix: Reduce number of shuffle spill files, fix spilled_bytes metric, add some unit tests #1440 (andygrove)
- fix: Executor memory overhead overriding #1462 (LukMRVC)
- fix: Stop copying rust-toolchain to docker file #1475 (andygrove)
- fix: PartitionBuffers should not have their own MemoryConsumer #1496 (EmilyMatt)
- fix: enable full decimal to decimal support #1385 (himadripal)
- fix: use common implementation of handling object store and hdfs urls for native_datafusion and native_iceberg_compat #1494 (parthchandra)
- fix: Simplify CometShuffleMemoryAllocator logic, rename classes, remove config #1485 (mbutrovich)
- fix: check overflow for decimal integral division #1512 (wForget)
Performance related:
- perf: Update RewriteJoin logic to choose optimal build side #1424 (andygrove)
- perf: Reduce native shuffle memory overhead by 50% #1452 (andygrove)
Implemented enhancements:
- feat: CometNativeScan metrics from ParquetFileMetrics and FileStreamMetrics #1172 (mbutrovich)
- feat: add experimental remote HDFS support for native DataFusion reader #1359 (comphead)
- feat: add Win-amd64 profile #1410 (wForget)
- feat: Support IntegralDivide function #1428 (wForget)
- feat: Add div operator for fuzz testing and update expression doc #1464 (wForget)
- feat: Upgrade to DataFusion 46.0.0-rc2 #1423 (andygrove)
- feat: Add support for rpad #1470 (andygrove)
- feat: Use official DataFusion 46.0.0 release #1484 (andygrove)
Documentation updates:
- docs: Add changelog for 0.6.0 release #1402 (andygrove)
- docs: Improve documentation for running stability plan tests #1469 (andygrove)
Other:
- test: Add experimental native scans to CometReadBenchmark #1150 (mbutrovich)
- chore: Prepare for 0.7.0 development #1404 (andygrove)
- chore: Update released version in documentation #1418 (andygrove)
- chore: Update protobuf to 3.25.5 #1434 (kazuyukitanimura)
- chore: Update guava to 33.2.1-jre #1435 (kazuyukitanimura)
- test: Register Spark-compatible expressions with a DataFusion context #1432 (viczsaurav)
- chore: fixes for kube build #1421 (comphead)
- build: pin machete to version 0.7.0 #1444 (andygrove)
- chore: Re-organize shuffle writer code #1439 (andygrove)
- chore: faster maven mirror #1447 (comphead)
- build: Use stable channel in rust-toolchain #1465 (andygrove)
- Feat: support array_compact function #1321 (kazantsev-maksim)
- chore: Upgrade to Spark 3.5.4 #1471 (andygrove)
- chore: Enable CI checks for `native_datafusion` scan #1479 (andygrove)
- chore: Add `native_iceberg_compat` CI checks #1487 (andygrove)
- chore: Stop disabling readside padding in TPC stability suite #1491 (andygrove)
- chore: Remove num partitions from repartitioner #1498 (EmilyMatt)
- test: fix Spark 3.5 tests #1482 (kazuyukitanimura)
- minor: Remove hard-coded config default #1503 (andygrove)
- chore: Use Datafusion's existing empty stream #1517 (EmilyMatt)
Credits
Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
20 Andy Grove
6 Matt Butrovich
4 Zhen Wang
3 Emily Matheys
3 KAZUYUKI TANIMURA
3 Oleks V
2 Himadri Pal
2 Parth Chandra
1 Kazantsev Maksim
1 Lukas Moravec
1 Saurav Verma
Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.
0.6.0
What's Changed
- fix: cast timestamp to decimal is unsupported by @wForget in #1281
- chore: Start 0.6.0 development by @andygrove in #1286
- docs: Fix links and provide complete benchmarking scripts by @andygrove in #1284
- feat: Add HasRowIdMapping interface by @viirya in #1288
- minor: update compatibility by @kazuyukitanimura in #1303
- chore: extract conversion_funcs, conditional_funcs, bitwise_funcs and array_funcs expressions to folders based on spark grouping by @rluvaton in #1223
- fix: partially fix consistency issue of hash functions with decimal input by @wForget in #1295
- chore: extract math_funcs expressions to folders based on spark grouping by @rluvaton in #1219
- chore: merge comet-parquet-exec branch into main by @andygrove in #1318
- Feat: Support array_intersect function by @erenavsarogullari in #1271
- build(deps): bump pprof from 0.13.0 to 0.14.0 in /native by @dependabot in #1319
- chore: Fix merge conflicts from merging comet-parquet-exec into main by @andygrove in #1320
- fix: Improve testing for array_remove and fallback to Spark for unsupported types by @andygrove in #1308
- chore: Revert accidental re-introduction of off-heap memory requirement by @andygrove in #1326
- fix: address post merge comet-parquet-exec review comments by @parthchandra in #1327
- chore: Fix merge conflicts from merging comet-parquet-exec into main by @mbutrovich in #1323
- Feat: Support array_join function by @erenavsarogullari in #1290
- Fix missing slash in spark script by @xleoken in #1334
- chore: Refactor QueryPlanSerde to allow logic to be moved to individual classes per expression by @andygrove in #1331
- build: re-enable upload-test-reports for macos-13 runner by @viirya in #1335
- chore: Upgrade to Arrow 53.4.0 by @andygrove in #1338
- fix: memory pool error type by @kazuyukitanimura in #1346
- Feat: Support arrays_overlap function by @erenavsarogullari in #1312
- fix: Fall back to Spark when hashing decimals with precision > 18 by @andygrove in #1325
- chore: Move all array_* serde to new framework, use correct INCOMPAT config by @andygrove in #1349
- chore: Prepare for DataFusion 45 (bump to DataFusion rev 5592834 + Arrow 54.0.0) by @andygrove in #1332
- fix: expressions doc for ArrayRemove by @kazuyukitanimura in #1356
- minor: commit compatibility doc by @kazuyukitanimura in #1358
- minor: update fuzz dependency by @kazuyukitanimura in #1357
- chore: Remove redundant processing from exprToProtoInternal by @andygrove in #1351
- fix: pass scale to DF round in spark_round by @cht42 in #1341
- feat: Upgrade to DataFusion 45 by @andygrove in #1364
- fix: Mark cast from float/double to decimal as incompatible by @andygrove in #1372
- perf: improve performance of update metrics by @wForget in #1329
- feat: Add fair unified memory pool by @kazuyukitanimura in #1369
- feat: Add unbounded memory pool by @kazuyukitanimura in #1386
- fix: Passthrough condition in StaticInvoke case block by @EmilyMatt in #1392
- chore: Adding an optional hdfs crate by @comphead in #1377
- fix: disable checking for uint_8 and uint_16 if complex type readers are enabled by @parthchandra in #1376
- perf: Use DataFusion FilterExec for experimental native scans by @mbutrovich in #1395
- doc: update memory tuning guide by @kazuyukitanimura in #1394
- chore: Refactor aggregate expression serde by @andygrove in #1380
- feat: make random seed configurable in fuzz-testing by @wForget in #1401
- feat: override executor overhead memory only when comet unified memory manager is disabled by @wForget in #1379
New Contributors
- @xleoken made their first contribution in #1334
- @cht42 made their first contribution in #1341
- @EmilyMatt made their first contribution in #1392
Full Changelog: 0.5.0...0.6.0
0.5.0
DataFusion Comet 0.5.0 Changelog
This release consists of 69 commits from 15 contributors. See credits at the end of this changelog for more information.
Fixed bugs:
- fix: Unsigned type related bugs #1095 (kazuyukitanimura)
- fix: Use RDD partition index #1112 (viirya)
- fix: Various metrics bug fixes and improvements #1111 (andygrove)
- fix: Don't create CometScanExec for subclasses of ParquetFileFormat #1129 (Kimahriman)
- fix: Fix metrics regressions #1132 (andygrove)
- fix: Enable scenarios accidentally commented out in CometExecBenchmark #1151 (mbutrovich)
- fix: Spark 4.0-preview1 SPARK-47120 #1156 (kazuyukitanimura)
- fix: Document enabling comet explain plan usage in Spark (4.0) #1176 (parthchandra)
- fix: stddev_pop should not directly return 0.0 when count is 1.0 #1184 (viirya)
- fix: fix missing explanation for then branch in case when #1200 (rluvaton)
- fix: Fall back to Spark for unsupported partition or sort expressions in window aggregates #1253 (andygrove)
- fix: Fall back to Spark for distinct aggregates #1262 (andygrove)
- fix: disable initCap by default #1276 (kazuyukitanimura)
Performance related:
- perf: Stop passing Java config map into native createPlan #1101 (andygrove)
- feat: Make native shuffle compression configurable and respect spark.shuffle.compress #1185 (andygrove)
- perf: Improve query planning to more reliably fall back to columnar shuffle when native shuffle is not supported #1209 (andygrove)
- feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support #1192 (andygrove)
- feat: Implement custom RecordBatch serde for shuffle for improved performance #1190 (andygrove)
Implemented enhancements:
- feat: support array_insert #1073 (SemyonSinchenko)
- feat: enable decimal to decimal cast of different precision and scale #1086 (himadripal)
- feat: Improve ScanExec native metrics #1133 (andygrove)
- feat: Add Spark-compatible implementation of SchemaAdapterFactory #1169 (andygrove)
- feat: Improve shuffle metrics (second attempt) #1175 (andygrove)
- feat: Add a spark.comet.exec.memoryPool configuration for experimenting with various datafusion memory pool setups. #1021 (Kontinuation)
- feat: Reenable tests for filtered SMJ anti join #1211 (comphead)
- feat: add support for array_remove expression #1179 (jatin510)
Documentation updates:
- docs: Update documentation for 0.4.0 release #1096 (andygrove)
- docs: Fix readme typo FGPA -> FPGA #1117 (gstvg)
- docs: Add more technical detail and new diagram to Comet plugin overview #1119 (andygrove)
- docs: Add some documentation explaining how shuffle works #1148 (andygrove)
- docs: Update TPC-H benchmark results #1257 (andygrove)
Other:
- chore: Add changelog for 0.4.0 #1089 (andygrove)
- chore: Prepare for 0.5.0 development #1090 (andygrove)
- build: Skip installation of spark-integration and fuzz testing modules #1091 (parthchandra)
- minor: Add hint for finding the GPG key to use when publishing to maven #1093 (andygrove)
- chore: Include first ScanExec batch in metrics #1105 (andygrove)
- chore: Improve CometScan metrics #1100 (andygrove)
- chore: Add custom metric for native shuffle fetching batches from JVM #1108 (andygrove)
- chore: Remove unused StringView struct #1143 (andygrove)
- test: enable more Spark 4.0 tests #1145 (kazuyukitanimura)
- chore: Refactor cast to use SparkCastOptions param #1146 (andygrove)
- chore: Move more expressions from core crate to spark-expr crate #1152 (andygrove)
- chore: Remove dead code #1155 (andygrove)
- chore: Move string kernels and expressions to spark-expr crate #1164 (andygrove)
- chore: Move remaining expressions to spark-expr crate + some minor refactoring #1165 (andygrove)
- chore: Add ignored tests for reading complex types from Parquet #1167 (andygrove)
- test: enabling Spark tests with offHeap requirement #1177 (kazuyukitanimura)
- minor: move shuffle classes from common to spark #1193 (andygrove)
- minor: refactor to move decodeBatches to broadcast exchange code as private function #1195 (andygrove)
- minor: refactor prepare_output so that it does not require an ExecutionContext #1194 (andygrove)
- minor: remove unused source files #1202 (andygrove)
- chore: Upgrade to DataFusion 44.0.0-rc2 #1154 (andygrove)
- chore: Add safety check to CometBuffer #1050 (viirya)
- chore: Remove unreachable code #1213 (andygrove)
- test: Enable Comet by default except some tests in SparkSessionExtensionSuite #1201 (kazuyukitanimura)
- chore: extract struct expressions to folders based on spark grouping #1216 (rluvaton)
- chore: extract static invoke expressions to folders based on spark grouping #1217 (rluvaton)
- chore: Follow-on PR to fully enable onheap memory usage #1210 (andygrove)
- chore: extract agg_funcs expressions to folders based on spark grouping #1224 (rluvaton)
- chore: extract datetime_funcs expressions to folders based on spark grouping #1222 (rluvaton)
- chore: Upgrade to DataFusion 44.0.0 from 44.0.0 RC2 #1232 (rluvaton)
- chore: extract strings file to strings_func like in spark grouping #1215 (rluvaton)
- chore: extract predicate_functions expressions to folders based on spark grouping #1218 (rluvaton)
- build(deps): bump protobuf version to 3.21.12 #1234 (wForget)
- chore: extract json_funcs expressions to folders based on spark grouping #1220 (rluvaton)
- test: Enable shuffle by default in Spark tests #1240 (kazuyukitanimura)
- chore: extract hash_funcs expressions to folders based on spark grouping #1221 (rluvaton)
- build: Fix test failure caused by merging conflicting PRs #1259 (andygrove)
Credits
Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.
37 Andy Grove
10 Raz Luvaton
7 KAZUYUKI TANIMURA
3 Liang-Chi Hsieh
2 Parth Chandra
1 Adam Binford
1 Dharan Aditya
1 Himadri Pal
1 Jagdish Parihar
1 Kristin Cowalcijk
1 Matt Butrovich
1 Oleks V
1 Sem
1 Zhen Wang
1 gstvg
Thank you also to everyone who contributed ...