01 Dec 16:23

6086438

0.12.0 Pre-release

DataFusion Comet 0.12.0 Changelog

This release consists of 105 commits from 13 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

fix: Fix None.get in stringDecode when bin child cannot be converted #2606 (cfmcgrady)
fix: Update FuzzDataGenerator to produce dictionary-encoded string arrays & fix bugs that this exposes #2635 (andygrove)
fix: Fallback to Spark for lpad/rpad for unsupported arguments & fix negative length handling #2630 (andygrove)
fix: Mark SortOrder with floating-point as incompatible #2650 (andygrove)
fix: Fall back to Spark for trunc / date_trunc functions when format string is unsupported, or is not a literal value #2634 (andygrove)
fix: [native_datafusion] only pass single partition of PartitionedFiles into DataSourceExec #2675 (mbutrovich)
fix: Fix subcommands options in fuzz-testing #2684 (manuzhang)
fix: Do not replace SMJ with HJ for LeftSemi #2687 (comphead)
fix: Apply spotless on Iceberg 1.8.1 diff [iceberg] #2700 (hsiang-c)
fix: Fix generate-user-guide-reference-docs failure when mvn command is not executed at root #2691 (manuzhang)
fix: Fix missing SortOrder fallback reason in range partitioning #2716 (andygrove)
fix: CometLiteral class cast exception with arrays #2718 (andygrove)
fix: NormalizeNaNAndZero::children() returns child's child #2732 (mbutrovich)
fix: checkSparkMaybeThrows should compare Spark and Comet results in success case #2728 (andygrove)
fix: Mark WindowExec as incompatible #2748 (andygrove)
fix: Add strict floating point mode and fallback to Spark for min/max/sort on floating point inputs when enabled #2747 (andygrove)
fix: Implement producedAttributes for CometWindowExec #2789 (rahulbabarwal89)
fix: Pass all Comet configs to native plan #2801 (andygrove)

Implemented enhancements:

feat: Add option to write benchmark results to file #2640 (andygrove)
feat: Implement metrics for iceberg compat #2615 (EmilyMatt)
feat: Define function signatures in CometFuzz #2614 (andygrove)
feat: cherry-pick UUID conversion logic from #2528 #2648 (mbutrovich)
feat: support concat for strings #2604 (comphead)
feat: Add support for abs #2689 (andygrove)
feat: Support variadic function in CometFuzz #2682 (manuzhang)
feat: CometExecRule refactor: Unify CometNativeExec creation with Serde in CometOperatorSerde trait #2768 (andygrove)
feat: support cot #2755 (psvri)
feat: Add bash script to build and run fuzz testing #2686 (manuzhang)
feat: Add getSupportLevel to CometAggregateExpressionSerde trait #2777 (andygrove)
feat: Add CI check to ensure generated docs are in sync with code #2779 (andygrove)
feat: Add prettier enforcement #2783 (andygrove)
feat: hyperbolic trig functions #2784 (psvri)
feat: [iceberg] Native scan by serializing FileScanTasks to iceberg-rust #2528 (mbutrovich)

Documentation updates:

docs: Add changelog for 0.11.0 release #2585 (mbutrovich)
docs: Improve documentation layout #2587 (andygrove)
docs: Publish 0.11.0 user guide #2589 (andygrove)
docs: Put Comet logo in top nav bar, respect light/dark mode #2591 (andygrove)
docs: Improve main landing page #2593 (andygrove)
docs: Improve site navigation #2597 (andygrove)
docs: Update benchmark results #2596 (andygrove)
docs: Upgrade pydata-sphinx-theme to 0.16.1 #2602 (andygrove)
docs: Fix redirect #2603 (andygrove)
docs: Fix broken image link #2613 (andygrove)
docs: Add FFI docs to contributor guide #2668 (andygrove)
docs: Various documentation updates #2674 (andygrove)
docs: Add supported SortOrder expressions and fix a typo #2694 (andygrove)
docs: Minor docs update for running Spark SQL tests #2712 (andygrove)
docs: Update contributor guide for adding a new expression #2704 (andygrove)
docs: Documentation updates for LocalTableScan and WindowExec #2742 (andygrove)
docs: Typo fix #2752 (wForget)
docs: Categorize some configs as testing and add notes about known time zone issues #2740 (andygrove)
docs: Run prettier on all markdown files #2782 (andygrove)
docs: Ignore prettier formatting for generated tables #2790 (andygrove)
docs: Add new section to contributor guide, explaining how to add a new operator #2758 (andygrove)

Other:

chore: Start 0.12.0 development #2584 (mbutrovich)
chore: Bump Spark from 3.5.6 to 3.5.7 #2574 (cfmcgrady)
chore(deps): bump parquet from 56.0.0 to 56.2.0 in /native #2608 (dependabot[bot])
chore(deps): bump tikv-jemallocator from 0.6.0 to 0.6.1 in /native #2609 (dependabot[bot])
chore(deps): bump tikv-jemalloc-ctl from 0.6.0 to 0.6.1 in /native #2610 (dependabot[bot])
tests: FuzzDataGenerator instead of Parquet-specific generator #2616 (mbutrovich)
chore: Simplify on-heap memory configuration #2599 (andygrove)
Feat: Add sha1 function impl #2471 (kazantsev-maksim)
chore: Refactor Parquet/DataFrame fuzz data generators #2629 (andygrove)
chore: Remove needless from_raw calls #2638 (EmilyMatt)
chore: support DataFusion 50.3.0 #2605 (comphead)
chore(deps): bump actions/upload-artifact from 4 to 5 #2654 (dependabot[bot])
chore(deps): bump cc from 1.2.42 to 1.2.43 in /native #2653 (dependabot[bot])
chore(deps): bump actions/download-artifact from 5 to 6 #2652 (dependabot[bot])
chore: extract c...

19 Oct 18:00

andygrove

0.11.0

57668af

0.11.0 Pre-release

DataFusion Comet 0.11.0 Changelog

This release consists of 131 commits from 15 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

fix: temporarily ignore test for hdfs file systems #2359 (parthchandra)
fix: Check reused broadcast plan in non-AQE and make setNumPartitions thread safe #2398 (wForget)
fix: correct missingInput for CometHashAggregateExec #2409 (comphead)
fix: clippy errors rust 1.9.0 update #2419 (coderfender)
fix: Avoid spark plan execution cache preventing CometBatchRDD numPartitions change #2420 (wForget)
fix: regressions in CometToPrettyStringSuite #2384 (hsiang-c)
fix: Byte array Literals failed on cast #2432 (comphead)
fix: Do not push down subquery filters on native_datafusion scan #2438 (wForget)
fix: Improve error handling when resolving S3 bucket region #2440 (andygrove)
fix: [iceberg] additional parquet independent api for iceberg integration #2442 (parthchandra)
fix: Specify reqwest crate features #2446 (andygrove)
fix: distributed RangePartitioning bounds calculation with native shuffle #2258 (mbutrovich)
fix: fix regression in tpcbench.py #2512 (andygrove)
fix: [iceberg] Close reader instance in ReadConf #2510 (hsiang-c)
fix: Enable plan stability tests for auto scan #2516 (andygrove)
fix: Capture unexpected output when retrieving JVM 17 args in Makefile #2566 (zuston)

Performance related:

perf: New Configuration from shared conf to avoid high costs #2402 (wForget)
perf: Use DataFusion's count_udaf instead of SUM(IF(expr IS NOT NULL, 1, 0)) #2407 (andygrove)
perf: Improve BroadcastExchangeExec conversion #2417 (wForget)
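
As a rough illustration of the count_udaf change (#2407): counting the non-null values of a column and summing a 1/0 null indicator give the same number for any non-empty input. The sketch below is plain Python, not Comet or DataFusion code, and both function names are hypothetical. It does not model the empty-input case, where SQL's SUM returns NULL while COUNT returns 0.

```python
from typing import Iterable, Optional


def count_via_sum_if(values: Iterable[Optional[int]]) -> int:
    """Mimics SUM(IF(expr IS NOT NULL, 1, 0)): add 1 per non-null value, 0 otherwise."""
    return sum(1 if v is not None else 0 for v in values)


def count_non_null(values: Iterable[Optional[int]]) -> int:
    """Mimics COUNT(expr): count only the non-null values."""
    return sum(1 for v in values if v is not None)


# Hypothetical column with two nulls; 0 is a real value and must be counted.
column = [3, None, 7, None, 0]
sum_if_result = count_via_sum_if(column)
count_result = count_non_null(column)
print(sum_if_result, count_result)
```

Both functions agree on this input, which is why a dedicated count aggregate can stand in for the longer expression.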

Implemented enhancements:

feat: Add dynamic enabled and allowIncompat configs for all supported expressions #2329 (andygrove)
feat: feature specific tests #2372 (parthchandra)
feat: Support more date part expressions #2316 (wForget)
feat: rpad support column for second arg instead of just literal #2099 (coderfender)
feat: Support comet native log level conf #2379 (wForget)
feat: Enable WeekDay function #2411 (wForget)
feat: Add nested Array literal support #2181 (comphead)
feat: add_additional_char_support_rpad #2436 (coderfender)
feat: do not fallback to Spark for COUNT(distinct) #2429 (comphead)
feat: implement_ansi_eval_mode_arithmetic #2136 (coderfender)
feat: Add plan conversion statistics to extended explain info #2412 (andygrove)
feat: implement_comet_native_lpad_expr #2102 (coderfender)
feat: Add backtrace feature to simplify enabling native backtraces in CometNativeException #2515 (andygrove)
feat: Support reverse function with ArrayType input #2481 (cfmcgrady)
feat: Change default off-heap memory pool from greedy_unified to fair_unified #2526 (andygrove)
feat: Make DiskManager max_temp_directory_size configurable #2479 (manuzhang)
feat: Parquet Modular Encryption with Spark KMS for native readers #2447 (mbutrovich)
feat: Add support for Spark-compatible cast from integral to decimal #2472 (coderfender)
feat: Support ANSI mode integral divide #2421 (coderfender)
feat: Add config to enable running Comet in onheap mode #2554 (andygrove)
feat: support ansi mode rounding function #2542 (coderfender)
feat: support ansi mode remainder function #2556 (coderfender)
feat: Implement array-to-string cast support #2425 (cfmcgrady)
feat: Various improvements to memory pool configuration, logging, and documentation #2538 (andygrove)
feat: Enable complex types for columnar shuffle #2573 (mbutrovich)
feat: support_decimal_types_bool_cast_native_impl #2490 (coderfender)
feat: Use buf write to reduce system call on index write #2579 (zuston)

Documentation updates:

doc: Document usage IcebergCometBatchReader.java #2347 (comphead)
docs: Add changelog for 0.10.0 release #2361 (andygrove)
docs: Fix error in docs #2373 (andygrove)
docs: Fix more comet versions in docs #2374 (andygrove)
docs: Publish 0.10.0 user guide #2394 (andygrove)
doc: macos benches doc clarifications #2418 (comphead)
docs: update configs.md after #2422 #2428 (mbutrovich)
docs: update docs and tuning guide related to native shuffle #2487 (mbutrovich)
docs: Improve EC2 benchmarking guide #2474 (andygrove)
docs: docs_update_ansi_support #2496 (coderfender)
docs: support lpad expression documentation update #2517 (coderfender)
docs: doc changes to support ANSI mode integral divide #2570 (coderfender)
docs: Split configuration guide into different sections (scan, exec, shuffle, etc) #2568 (andygrove)
docs: doc update to support ANSI mode remainder function #2576 (coderfender)
docs: Documentation updates #2581 (andygrove)

Other:

chore(deps): bump uuid from 1.18.0 to 1.18.1 in /native #2336 (dependabot[bot])
build: Check that all Scala test suites run in PR builds #2304 (andygrove)
chore: Start 0.11.0 development #2365 (andygrove)
chore: Split expression serde hash map into separate categories #2322 (andygrove)
chore: exclude Iceberg diffs from rat checks #2376 (hsiang-c)
chore: Refactor UnaryMinus serde #2378 (andygrove)
chore: Revert "chore: [1941-Part1]: Introduce map_sort scalar function (#2… #2381 (comphead)
chore: Refactor Literal serde [#2377](https://github.com/apache/datafusion-comet/pull/...

06 Oct 18:44

andygrove

0.10.1

42f6774

0.10.1 Pre-release

DataFusion Comet 0.10.1 Changelog

This release consists of 7 commits from 1 contributor. See credits at the end of this changelog for more information.

Documentation updates:

docs: [branch-0.10] Update version number in branch-0.10 user guide #2395 (andygrove)

Other:

chore: [branch-0.10] Support Spark 4.0.1 instead of 4.0.0 (#2414) #2497 (andygrove)
build: [branch-0.10] Stop caching libcomet in CI (#2498) #2502 (andygrove)
chore: [branch-0.10] perf: Improve BroadcastExchangeExec conversion #2501 (andygrove)
chore: [branch-0.10] [iceberg] additional parquet independent api for iceberg integration (#2442) #2499 (andygrove)
fix: [branch-0.10] Avoid spark plan execution cache preventing CometBatchRDD numPartitions change (#2420) #2503 (andygrove)
build: [branch-0.10] Bump version to 0.10.1 #2508 (andygrove)

Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

     7	Andy Grove

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.

16 Sep 17:21

andygrove

0.10.0

9cb0cc4

0.10.0 Pre-release

DataFusion Comet 0.10.0 Changelog

This release consists of 183 commits from 26 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

fix: [Iceberg] Fix decimal corruption #1985 (andygrove)
fix: broken link in development.md #2024 (petern48)
fix: [iceberg] Add LogicalTypeAnnotation in ParquetColumnSpec #2000 (huaxingao)
fix: hdfs read into buffer fully #2031 (parthchandra)
fix: Refactor arithmetic serde and fix correctness issues with EvalMode::TRY #2018 (andygrove)
fix: clean up [iceberg] integration APIs #2032 (huaxingao)
fix: zero Arrow Array offset before sending across FFI #2052 (mbutrovich)
fix: [iceberg] more fixes for Iceberg integration APIs. #2078 (parthchandra)
fix: Add support for StringDecode in Spark 4.0.0 #2075 (peter-toth)
fix: Avoid double free in CometUnifiedShuffleMemoryAllocator #2122 (andygrove)
fix: Remove duplicate serde code #2098 (andygrove)
fix: Improve logic for determining when an UnpackOrDeepCopy is needed #2142 (andygrove)
fix: Add CopyExec to inputs to SortMergeJoinExec #2155 (andygrove)
fix: Fix repeatedly url-decode path when reading parquet from s3 using native parquet reader #2138 (Kontinuation)
fix: [iceberg] Switch to OSS Spark and run Iceberg Spark tests in parallel #1987 (hsiang-c)
fix: [iceberg] Fall back to spark for schemas with empty structs #2204 (andygrove)
fix: Fix failing TPC-DS workflow in PR CI runs #2207 (andygrove)
fix: [iceberg] order query result deterministically #2208 (hsiang-c)
fix: use spark.comet.batchSize instead of conf.arrowMaxRecordsPerBatch for data that is coming from Java #2196 (rluvaton)
fix: if expr nullable #2217 (Asura7969)
fix: Support auto scan mode with Spark 4.0.0 #1975 (andygrove)
fix: Make Sha2 fallback message more user-friendly #2213 (rishvin)
fix: separate type checking for CometExchange and CometColumnarExchange #2241 (mbutrovich)
fix: Fix potential resource leak in native shuffle block reader #2247 (andygrove)
fix: Remove unreachable code in CometScanRule #2252 (andygrove)
fix: Fall back to native_comet for encrypted Parquet scans #2250 (andygrove)
fix: Fall back to native_comet when object store not supported by native_iceberg_compat #2251 (andygrove)
fix: split expr.proto file (new) #2267 (kination)
fix: handle cast to dictionary vector introduced by case when #2044 (parthchandra)
fix: Remove check for custom S3 endpoints #2288 (andygrove)
fix: implement lazy evaluation in Coalesce function #2270 (coderfender)
fix: Update benchmarking scripts #2293 (andygrove)
fix: Fix regression in NativeConfigSuite #2299 (andygrove)
fix: Validating object store configs should not throw exception #2308 (andygrove)
fix: TakeOrderedAndProjectExec is not reporting all fallback reasons #2323 (kazuyukitanimura)
fix: Fallback length function with binary input #2349 (wForget)
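
To show the class of bug behind the repeated URL-decoding fix (#2138): if a path is percent-decoded more than once, any key that legitimately contains an encoded-looking sequence is silently changed. The sketch below uses Python's standard `urllib.parse` and a made-up object key. It is not Comet's reader code, only an assumption-level illustration of why decoding must happen exactly once.

```python
from urllib.parse import quote, unquote

# Hypothetical object key whose name literally contains the characters "%20".
original_key = "data/a%20b.parquet"

# Encoding once for transport escapes the "%" itself as "%25".
encoded = quote(original_key)

# Decoding once recovers the original key.
decoded_once = unquote(encoded)

# Decoding a second time turns the literal "%20" into a space: a different key.
decoded_twice = unquote(decoded_once)

print(encoded)
print(decoded_once)
print(decoded_twice)
```

The second decode yields a key that does not exist in the bucket, so a read against it would fail or hit the wrong object.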

Performance related:

perf: Optimize AvgDecimalGroupsAccumulator #1893 (leung-ming)
perf: Optimize SumDecimalGroupsAccumulator::update_single #2069 (leung-ming)
perf: Avoid FFI copy in ScanExec when reading data from exchanges #2268 (andygrove)

Implemented enhancements:

feat: Add from_unixtime support #1943 (kazuyukitanimura)
feat: randn expression support #2010 (akupchinskiy)
feat: monotonically_increasing_id and spark_partition_id implementation #2037 (akupchinskiy)
feat: support map_entries #2059 (comphead)
feat: Support Array Literal #2057 (comphead)
feat: Add new trait for operator serde #2115 (andygrove)
feat: limit with offset support #2070 (akupchinskiy)
feat: Include scan implementation name in CometScan nodeName #2141 (andygrove)
feat: Add config option to log fallback reasons #2154 (andygrove)
feat: [iceberg] Enable Comet shuffle in Iceberg diff #2205 (andygrove)
feat: Improve shuffle fallback reporting #2194 (andygrove)
feat: Reset data buf of NativeBatchDecoderIterator on close #2235 (wForget)
feat: Improve fallback mechanism for ANSI mode #2211 (andygrove)
feat: Support hdfs with OpenDAL #2244 (wForget)
feat: Ignore fallback info for command execs #2297 (wForget)
feat: Improve some confusing fallback reasons #2301 (wForget)
feat: Make supported hadoop filesystem schemes configurable #2272 (wForget)
feat: [1941-Part1]: Introduce map-sort scalar function #2262 (rishvin)
feat: [iceberg] delete rows support using selection vectors #2346 (parthchandra)

Documentation updates:

docs: Update benchmark results for 0.9.0 #1959 (andygrove)
doc: Add comment about local clippy run before submitting a pull request #1961 (akupchinskiy)
docs: Minor improvements to Spark SQL test docs #1980 (andygrove)
docs: Update Maven links for 0.9.0 release #1988 (andygrove)
docs: Documentation updates for 0.9.0 release #1981 (andygrove)
docs: Add guide showing comparison between Comet and Gluten #2012 (andygrove)
docs: Remove legacy comment in docs #2022 (andygrove)
docs: Update Gluten comparison to clarify that Velox is open-source #2043 (andygrove)
docs: Improve Gluten comparison based on feedback from the community #2048 (andygrove)
docs: added a missing export into the plan stability section #2071 (akupchinskiy)
doc: Added documentation for supported map functions #2074 (codetyri0n)
doc: Alternative way to start Spark Master to run benchmarks #2072 (comphead)
docs: Update to support try arithmetic functions #2143 (coderfender)
doc: update macos standalone spark start instructions #2103 (comphead)
docs: Update confs to bypass Iceberg Spark issues #2166 (hsiang-c)
docs: Add Roadmap #2191 (andygrove)
docs: Update installation guide for 0.9.1 #2230 (andygrov...

25 Aug 16:49

andygrove

0.9.1

a168c9a

0.9.1 Pre-release

DataFusion Comet 0.9.1 Changelog

This release consists of 2 commits from 1 contributor. See credits at the end of this changelog for more information.

Fixed bugs:

fix: [branch-0.9] Backport FFI fix #2164 (andygrove)
fix: [branch-0.9] Avoid double free in CometUnifiedShuffleMemoryAllocator #2201 (andygrove)

Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

     2	Andy Grove

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.

04 Jul 17:01

andygrove

0.9.0

1c462bc

0.9.0 Pre-release

DataFusion Comet 0.9.0 Changelog

This release consists of 139 commits from 24 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

fix: typo for instr in fuzz testing #1686 (mbutrovich)
fix: Bucketed scan fallback for native_datafusion Parquet scan #1720 (mbutrovich)
fix: Skip row index Spark SQL tests for native_datafusion Parquet scan #1724 (mbutrovich)
fix: Check acquired memory when CometMemoryPool grows #1732 (wForget)
fix: Fix data race in memory profiling #1727 (andygrove)
fix: Enable some DPP Spark SQL tests #1734 (andygrove)
fix: support literal null list and map #1742 (kazuyukitanimura)
fix: get_struct field is incorrect when struct in array #1687 (comphead)
fix: cast map types correctly in schema adapter #1771 (parthchandra)
fix: correct schema type checking in native_iceberg_compat #1755 (parthchandra)
fix: default values for native_datafusion scan #1756 (mbutrovich)
fix: [native_scans] Support CASE_SENSITIVE when reading Parquet #1782 (andygrove)
fix: cargo install tpchgen-cli in benchmark doc #1797 (zhuqi-lucas)
fix: support map_keys #1788 (comphead)
fix: fall back on nested types for default values #1799 (mbutrovich)
fix: Re-enable Spark 4 tests on Linux #1806 (andygrove)
fix: fallback to Spark scan if encryption is enabled (native_datafusion/native_iceberg_compat) #1785 (parthchandra)
fix: native_iceberg_compat: move checking parquet types above fetching batch #1809 (mbutrovich)
fix: translate missing or corrupt file exceptions, fall back if asked to ignore #1765 (mbutrovich)
fix: Fix Spark SQL AQE exchange reuse test failures #1811 (coderfender)
fix: Enable more Spark SQL tests #1834 (andygrove)
fix: support map_values #1835 (comphead)
fix: Handle case where num_cols == 0 in native execution #1840 (andygrove)
fix: Fix shuffle writing rows containing null struct fields #1845 (Kontinuation)
fix: Fall back to Spark for RANGE BETWEEN window expressions #1848 (andygrove)
fix: Remove COMET_SHUFFLE_FALLBACK_TO_COLUMNAR hack #1865 (andygrove)
fix: support read Struct by user schema #1860 (comphead)
fix: map parquet field_id correctly (native_iceberg_compat) #1815 (parthchandra)
fix: cast_struct_to_struct aligns to Spark behavior #1879 (mbutrovich)
fix: correctly handle schemas with nested array of struct (native_iceberg_compat) #1883 (parthchandra)
fix: set RangePartitioning for native shuffle default to false #1907 (mbutrovich)
fix: conflict between #1905 and #1892. #1919 (mbutrovich)
fix: Add overflow check to evaluate of sum decimal accumulator #1922 (leung-ming)
fix: Fix overflow handling when casting float to decimal #1914 (leung-ming)
fix: Ignore a test case that fails on Miri #1951 (leung-ming)

Performance related:

perf: Add memory profiling #1702 (andygrove)
perf: Add performance tracing capability #1706 (andygrove)
perf: Add COMET_RESPECT_PARQUET_FILTER_PUSHDOWN config #1936 (andygrove)

Implemented enhancements:

feat: add jemalloc as optional custom allocator #1679 (mbutrovich)
feat: support array_repeat #1680 (comphead)
feat: More warning info for users #1667 (hsiang-c)
feat: decode() expression when using 'utf-8' encoding #1697 (mbutrovich)
feat: regexp_replace() expression with no starting offset #1700 (mbutrovich)
feat: Improve performance tracing feature #1730 (andygrove)
feat: Set/cancel with job tag and make max broadcast table size configurable #1693 (wForget)
feat: Add support for expm1 expression from datafusion-spark crate #1711 (andygrove)
feat: Add config option for showing all Comet plan transformations #1780 (andygrove)
feat: Support Type widening: byte → short/int/long, short → int/long #1770 (huaxingao)
feat: Translate Hadoop S3A configurations to object_store configurations #1817 (Kontinuation)
feat: Upgrade to official DataFusion 48.0.0 release #1877 (andygrove)
feat: Add experimental auto mode for COMET_PARQUET_SCAN_IMPL #1747 (andygrove)
feat: support RangePartitioning with native shuffle #1862 (mbutrovich)
feat: Add support for signum expression #1889 (andygrove)
feat: Add support to lookup map by key #1898 (comphead)
feat: support array_max #1892 (drexler-sky)
feat: pass ignore_nulls flag to first and last #1866 (rluvaton)
feat: Implement ToPrettyString #1921 (andygrove)
feat: Support hadoop s3a config in native_iceberg_compat #1925 (parthchandra)
feat: rand expression support #1199 (akupchinskiy)
feat: supports array_distinct #1923 (drexler-sky)
feat: auto scan mode should check for supported file location #1930 (andygrove)
feat: Encapsulate Parquet objects #1920 (huaxingao)
feat: Change default value of COMET_NATIVE_SCAN_IMPL to auto #1933 (andygrove)
feat: Supports array_union #1945 (drexler-sky)

Documentation updates:

docs: Add changelog for 0.8.0 #1675 (andygrove)
docs: Add instructions on running TPC-H on macOS #1647 (andygrove)
docs: Add documentation for accelerating Iceberg Parquet scans with Comet #1683 (andygrove)
docs: Add note on setting core.abbrev when generating diffs #1735 (andygrove)
docs: Remove outdated param in macos bench guide #1748 (ding-young)
docs: Add instructions for running i...

30 Apr 16:27

andygrove

0.8.0

64b6252

0.8.0 Pre-release

DataFusion Comet 0.8.0 Changelog

This release consists of 81 commits from 11 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

fix: remove code duplication in native_datafusion and native_iceberg_compat implementations #1443 (parthchandra)
fix: Refactor CometScanRule and fix bugs #1483 (andygrove)
fix: check if handle has been initialized before closing #1554 (wForget)
fix: Taking slicing into account when writing BooleanBuffers as fast-encoding format #1522 (Kontinuation)
fix: isCometEnabled name conflict #1569 (kazuyukitanimura)
fix: make register_object_store use same session_env as file scan #1555 (wForget)
fix: adjust CometNativeScan's doCanonicalize and hashCode for AQE, use DataSourceScanExec trait #1578 (mbutrovich)
fix: corrected the logic of eliminating CometSparkToColumnarExec #1597 (wForget)
fix: avoid panic caused by close null handle of parquet reader #1604 (wForget)
fix: Make AQE capable of converting Comet shuffled joins to Comet broadcast hash joins #1605 (Kontinuation)
fix: Making shuffle files generated in native shuffle mode reclaimable #1568 (Kontinuation)
fix: Support per-task shuffle write rows and shuffle write time metrics #1617 (Kontinuation)
fix: Modify Spark SQL core 2 tests for native_datafusion reader, change 3.5.5 diff hash length to 11 #1641 (mbutrovich)
fix: fix spark/sql test failures in native_iceberg_compat #1593 (parthchandra)
fix: handle missing field correctly in native_iceberg_compat #1656 (parthchandra)
fix: better int96 support for experimental native scans #1652 (mbutrovich)
fix: respect ignoreNulls flag in first_value and last_value #1626 (andygrove)
fix: update row groups count in internal metrics accumulator #1658 (parthchandra)
fix: Shuffle should maintain insertion order #1660 (EmilyMatt)

Performance related:

perf: Use a global tokio runtime #1614 (andygrove)
perf: Respect Spark's PARQUET_FILTER_PUSHDOWN_ENABLED config #1619 (andygrove)
perf: Experimental fix to avoid join strategy regression #1674 (andygrove)

Implemented enhancements:

feat: add read array support #1456 (comphead)
feat: introduce hadoop mini cluster to test native scan on hdfs #1556 (wForget)
feat: make parquet native scan schema case insensitive #1575 (wForget)
feat: enable iceberg compat tests, more tests for complex types #1550 (comphead)
feat: pushdown filter for native_iceberg_compat #1566 (wForget)
feat: Fix struct of arrays schema issue #1592 (comphead)
feat: adding more struct/arrays tests #1594 (comphead)
feat: respect batchSize/workerThreads/blockingThreads configurations for native_iceberg_compat scan #1587 (wForget)
feat: add MAP type support for first level #1603 (comphead)
feat: Add more tests for nested types combinations for native_datafusion #1632 (comphead)
feat: Override MapBuilder values field with expected schema #1643 (comphead)
feat: track unified memory pool #1651 (wForget)
feat: Add support for complex types in native shuffle #1655 (andygrove)

Documentation updates:

docs: Update configuration guide to show optional configs #1524 (andygrove)
docs: Add changelog for 0.7.0 release #1527 (andygrove)
docs: Use a shallow clone for Spark SQL test instructions #1547 (mbutrovich)
docs: Update benchmark results for 0.7.0 release #1548 (andygrove)
doc: Renew kubernetes.md #1549 (comphead)
docs: various improvements to tuning guide #1525 (andygrove)
docs: Update supported Spark versions #1580 (andygrove)
docs: change OSX/OS X to macOS #1584 (mbutrovich)
docs: docs for benchmarking in aws ec2 #1601 (andygrove)
docs: Update compatibility docs for new native scans #1657 (andygrove)
doc: Document local HDFS setup #1673 (comphead)

Other:

chore: fix issue in release process #1528 (andygrove)
chore: Remove all subdependencies #1514 (EmilyMatt)
chore: Drop support for Spark 3.3 (EOL) #1529 (andygrove)
chore: Prepare for 0.8.0 development #1530 (andygrove)
chore: Re-enable GitHub discussions #1535 (andygrove)
chore: [FOLLOWUP] Drop support for Spark 3.3 (EOL) #1534 (kazuyukitanimura)
build: Use unique name for surefire artifacts #1544 (andygrove)
chore: Update links for released version #1540 (andygrove)
chore: Enable Comet explicitly in CometTPCDSQueryTestSuite #1559 (andygrove)
chore: Fix some inconsistencies in memory pool configuration #1561 (andygrove)
upgraded spark 3.5.4 to 3.5.5 #1565 (YanivKunda)
minor: fix typo #1570 (wForget)
Chore: simplify array related functions impl #1490 (kazantsev-maksim)
added fallback using reflection for backward-compatibility #1573 (YanivKunda)
chore: Override node name for CometSparkToColumnar #1577 (l0kr)
chore: Reimplement ShuffleWriterExec using interleave_record_batch #1511 (Kontinuation)
chore: Run Comet tests for more Spark versions #1582 (andygrove)
Feat: support array_except function #1343 (kazantsev-maksim)
minor: Fix clippy warnings #1606 (Kontinuation)
chore: Remove some unwraps in hashing code #1600 (andygrove)
chore: Remove redundant shims for getFailOnError #1608 (andygrove)
chore: Making comet native operators write spill files to spark local dir #1581 (Kontinuation)
chore: Refactor QueryPlanSerde to use idiomatic Scala and red...

30 Apr 16:26

andygrove

0.7.0

664e681

0.7.0 Pre-release

DataFusion Comet 0.7.0 Changelog

This release consists of 46 commits from 11 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

fix: Change default value of COMET_SCAN_ALLOW_INCOMPATIBLE and add documentation #1398 (andygrove)
fix: Reduce cast.rs and utils.rs logic from parquet_support.rs for experimental native scans #1387 (mbutrovich)
fix: Remove more cast.rs logic from parquet_support.rs for experimental native scans #1413 (mbutrovich)
fix: fix various unit test failures in native_datafusion and native_iceberg_compat readers #1415 (parthchandra)
fix: metrics tests for native_datafusion experimental native scan #1445 (mbutrovich)
fix: Reduce number of shuffle spill files, fix spilled_bytes metric, add some unit tests #1440 (andygrove)
fix: Executor memory overhead overriding #1462 (LukMRVC)
fix: Stop copying rust-toolchain to docker file #1475 (andygrove)
fix: PartitionBuffers should not have their own MemoryConsumer #1496 (EmilyMatt)
fix: enable full decimal to decimal support #1385 (himadripal)
fix: use common implementation of handling object store and hdfs urls for native_datafusion and native_iceberg_compat #1494 (parthchandra)
fix: Simplify CometShuffleMemoryAllocator logic, rename classes, remove config #1485 (mbutrovich)
fix: check overflow for decimal integral division #1512 (wForget)

Performance related:

perf: Update RewriteJoin logic to choose optimal build side #1424 (andygrove)
perf: Reduce native shuffle memory overhead by 50% #1452 (andygrove)

Implemented enhancements:

feat: CometNativeScan metrics from ParquetFileMetrics and FileStreamMetrics #1172 (mbutrovich)
feat: add experimental remote HDFS support for native DataFusion reader #1359 (comphead)
feat: add Win-amd64 profile #1410 (wForget)
feat: Support IntegralDivide function #1428 (wForget)
feat: Add div operator for fuzz testing and update expression doc #1464 (wForget)
feat: Upgrade to DataFusion 46.0.0-rc2 #1423 (andygrove)
feat: Add support for rpad #1470 (andygrove)
feat: Use official DataFusion 46.0.0 release #1484 (andygrove)

Documentation updates:

docs: Add changelog for 0.6.0 release #1402 (andygrove)
docs: Improve documentation for running stability plan tests #1469 (andygrove)

Other:

test: Add experimental native scans to CometReadBenchmark #1150 (mbutrovich)
chore: Prepare for 0.7.0 development #1404 (andygrove)
chore: Update released version in documentation #1418 (andygrove)
chore: Update protobuf to 3.25.5 #1434 (kazuyukitanimura)
chore: Update guava to 33.2.1-jre #1435 (kazuyukitanimura)
test: Register Spark-compatible expressions with a DataFusion context #1432 (viczsaurav)
chore: fixes for kube build #1421 (comphead)
build: pin machete to version 0.7.0 #1444 (andygrove)
chore: Re-organize shuffle writer code #1439 (andygrove)
chore: faster maven mirror #1447 (comphead)
build: Use stable channel in rust-toolchain #1465 (andygrove)
Feat: support array_compact function #1321 (kazantsev-maksim)
chore: Upgrade to Spark 3.5.4 #1471 (andygrove)
chore: Enable CI checks for native_datafusion scan #1479 (andygrove)
chore: Add native_iceberg_compat CI checks #1487 (andygrove)
chore: Stop disabling readside padding in TPC stability suite #1491 (andygrove)
chore: Remove num partitions from repartitioner #1498 (EmilyMatt)
test: fix Spark 3.5 tests #1482 (kazuyukitanimura)
minor: Remove hard-coded config default #1503 (andygrove)
chore: Use Datafusion's existing empty stream #1517 (EmilyMatt)

Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

    20	Andy Grove
     6	Matt Butrovich
     4	Zhen Wang
     3	Emily Matheys
     3	KAZUYUKI TANIMURA
     3	Oleks V
     2	Himadri Pal
     2	Parth Chandra
     1	Kazantsev Maksim
     1	Lukas Moravec
     1	Saurav Verma

Thank you also to everyone who contributed in other ways such as filing issues, reviewing PRs, and providing feedback on this release.

17 Feb 19:25

andygrove

0.6.0

b735f2d

0.6.0 Pre-release

What's Changed

fix: cast timestamp to decimal is unsupported by @wForget in #1281
chore: Start 0.6.0 development by @andygrove in #1286
docs: Fix links and provide complete benchmarking scripts by @andygrove in #1284
feat: Add HasRowIdMapping interface by @viirya in #1288
minor: update compatibility by @kazuyukitanimura in #1303
chore: extract conversion_funcs, conditional_funcs, bitwise_funcs and array_funcs expressions to folders based on spark grouping by @rluvaton in #1223
fix: partially fix consistency issue of hash functions with decimal input by @wForget in #1295
chore: extract math_funcs expressions to folders based on spark grouping by @rluvaton in #1219
chore: merge comet-parquet-exec branch into main by @andygrove in #1318
Feat: Support array_intersect function by @erenavsarogullari in #1271
build(deps): bump pprof from 0.13.0 to 0.14.0 in /native by @dependabot in #1319
chore: Fix merge conflicts from merging comet-parquet-exec into main by @andygrove in #1320
fix: Improve testing for array_remove and fallback to Spark for unsupported types by @andygrove in #1308
chore: Revert accidental re-introduction of off-heap memory requirement by @andygrove in #1326
fix: address post merge comet-parquet-exec review comments by @parthchandra in #1327
chore: Fix merge conflicts from merging comet-parquet-exec into main by @mbutrovich in #1323
Feat: Support array_join function by @erenavsarogullari in #1290
Fix missing slash in spark script by @xleoken in #1334
chore: Refactor QueryPlanSerde to allow logic to be moved to individual classes per expression by @andygrove in #1331
build: re-enable upload-test-reports for macos-13 runner by @viirya in #1335
chore: Upgrade to Arrow 53.4.0 by @andygrove in #1338
fix: memory pool error type by @kazuyukitanimura in #1346
Feat: Support arrays_overlap function by @erenavsarogullari in #1312
fix: Fall back to Spark when hashing decimals with precision > 18 by @andygrove in #1325
chore: Move all array_* serde to new framework, use correct INCOMPAT config by @andygrove in #1349
chore: Prepare for DataFusion 45 (bump to DataFusion rev 5592834 + Arrow 54.0.0) by @andygrove in #1332
fix: expressions doc for ArrayRemove by @kazuyukitanimura in #1356
minor: commit compatibility doc by @kazuyukitanimura in #1358
minor: update fuzz dependency by @kazuyukitanimura in #1357
chore: Remove redundant processing from exprToProtoInternal by @andygrove in #1351
fix: pass scale to DF round in spark_round by @cht42 in #1341
feat: Upgrade to DataFusion 45 by @andygrove in #1364
fix: Mark cast from float/double to decimal as incompatible by @andygrove in #1372
perf: improve performance of update metrics by @wForget in #1329
feat: Add fair unified memory pool by @kazuyukitanimura in #1369
feat: Add unbounded memory pool by @kazuyukitanimura in #1386
fix: Passthrough condition in StaticInvoke case block by @EmilyMatt in #1392
chore: Adding an optional hdfs crate by @comphead in #1377
fix: disable checking for uint_8 and uint_16 if complex type readers are enabled by @parthchandra in #1376
perf: Use DataFusion FilterExec for experimental native scans by @mbutrovich in #1395
doc: update memory tuning guide by @kazuyukitanimura in #1394
chore: Refactor aggregate expression serde by @andygrove in #1380
feat: make random seed configurable in fuzz-testing by @wForget in #1401
feat: override executor overhead memory only when comet unified memory manager is disabled by @wForget in #1379

New Contributors

@xleoken made their first contribution in #1334
@cht42 made their first contribution in #1341
@EmilyMatt made their first contribution in #1392

Full Changelog: 0.5.0...0.6.0

Contributors

viirya, kazuyukitanimura, and 11 other contributors

17 Jan 19:06

andygrove

0.5.0

698c6e5

0.5.0 Pre-release

DataFusion Comet 0.5.0 Changelog

This release consists of 69 commits from 15 contributors. See credits at the end of this changelog for more information.

Fixed bugs:

fix: Unsigned type related bugs #1095 (kazuyukitanimura)
fix: Use RDD partition index #1112 (viirya)
fix: Various metrics bug fixes and improvements #1111 (andygrove)
fix: Don't create CometScanExec for subclasses of ParquetFileFormat #1129 (Kimahriman)
fix: Fix metrics regressions #1132 (andygrove)
fix: Enable scenarios accidentally commented out in CometExecBenchmark #1151 (mbutrovich)
fix: Spark 4.0-preview1 SPARK-47120 #1156 (kazuyukitanimura)
fix: Document enabling comet explain plan usage in Spark (4.0) #1176 (parthchandra)
fix: stddev_pop should not directly return 0.0 when count is 1.0 #1184 (viirya)
fix: fix missing explanation for then branch in case when #1200 (rluvaton)
fix: Fall back to Spark for unsupported partition or sort expressions in window aggregates #1253 (andygrove)
fix: Fall back to Spark for distinct aggregates #1262 (andygrove)
fix: disable initCap by default #1276 (kazuyukitanimura)

Performance related:

perf: Stop passing Java config map into native createPlan #1101 (andygrove)
feat: Make native shuffle compression configurable and respect spark.shuffle.compress #1185 (andygrove)
perf: Improve query planning to more reliably fall back to columnar shuffle when native shuffle is not supported #1209 (andygrove)
feat: Move shuffle block decompression and decoding to native code and add LZ4 & Snappy support #1192 (andygrove)
feat: Implement custom RecordBatch serde for shuffle for improved performance #1190 (andygrove)

Implemented enhancements:

feat: support array_insert #1073 (SemyonSinchenko)
feat: enable decimal to decimal cast of different precision and scale #1086 (himadripal)
feat: Improve ScanExec native metrics #1133 (andygrove)
feat: Add Spark-compatible implementation of SchemaAdapterFactory #1169 (andygrove)
feat: Improve shuffle metrics (second attempt) #1175 (andygrove)
feat: Add a spark.comet.exec.memoryPool configuration for experimenting with various datafusion memory pool setups. #1021 (Kontinuation)
feat: Reenable tests for filtered SMJ anti join #1211 (comphead)
feat: add support for array_remove expression #1179 (jatin510)

Documentation updates:

docs: Update documentation for 0.4.0 release #1096 (andygrove)
docs: Fix readme typo FGPA -> FPGA #1117 (gstvg)
docs: Add more technical detail and new diagram to Comet plugin overview #1119 (andygrove)
docs: Add some documentation explaining how shuffle works #1148 (andygrove)
docs: Update TPC-H benchmark results #1257 (andygrove)

Other:

chore: Add changelog for 0.4.0 #1089 (andygrove)
chore: Prepare for 0.5.0 development #1090 (andygrove)
build: Skip installation of spark-integration and fuzz testing modules #1091 (parthchandra)
minor: Add hint for finding the GPG key to use when publishing to maven #1093 (andygrove)
chore: Include first ScanExec batch in metrics #1105 (andygrove)
chore: Improve CometScan metrics #1100 (andygrove)
chore: Add custom metric for native shuffle fetching batches from JVM #1108 (andygrove)
chore: Remove unused StringView struct #1143 (andygrove)
test: enable more Spark 4.0 tests #1145 (kazuyukitanimura)
chore: Refactor cast to use SparkCastOptions param #1146 (andygrove)
chore: Move more expressions from core crate to spark-expr crate #1152 (andygrove)
chore: Remove dead code #1155 (andygrove)
chore: Move string kernels and expressions to spark-expr crate #1164 (andygrove)
chore: Move remaining expressions to spark-expr crate + some minor refactoring #1165 (andygrove)
chore: Add ignored tests for reading complex types from Parquet #1167 (andygrove)
test: enabling Spark tests with offHeap requirement #1177 (kazuyukitanimura)
minor: move shuffle classes from common to spark #1193 (andygrove)
minor: refactor to move decodeBatches to broadcast exchange code as private function #1195 (andygrove)
minor: refactor prepare_output so that it does not require an ExecutionContext #1194 (andygrove)
minor: remove unused source files #1202 (andygrove)
chore: Upgrade to DataFusion 44.0.0-rc2 #1154 (andygrove)
chore: Add safety check to CometBuffer #1050 (viirya)
chore: Remove unreachable code #1213 (andygrove)
test: Enable Comet by default except some tests in SparkSessionExtensionSuite #1201 (kazuyukitanimura)
chore: extract struct expressions to folders based on spark grouping #1216 (rluvaton)
chore: extract static invoke expressions to folders based on spark grouping #1217 (rluvaton)
chore: Follow-on PR to fully enable onheap memory usage #1210 (andygrove)
chore: extract agg_funcs expressions to folders based on spark grouping #1224 (rluvaton)
chore: extract datetime_funcs expressions to folders based on spark grouping #1222 (rluvaton)
chore: Upgrade to DataFusion 44.0.0 from 44.0.0 RC2 #1232 (rluvaton)
chore: extract strings file to strings_func like in spark grouping #1215 (rluvaton)
chore: extract predicate_functions expressions to folders based on spark grouping #1218 (rluvaton)
build(deps): bump protobuf version to 3.21.12 #1234 (wForget)
chore: extract json_funcs expressions to folders based on spark grouping #1220 (rluvaton)
test: Enable shuffle by default in Spark tests #1240 (kazuyukitanimura)
chore: extract hash_funcs expressions to folders based on spark grouping #1221 (rluvaton)
build: Fix test failure caused by merging conflicting PRs #1259 (andygrove)

Credits

Thank you to everyone who contributed to this release. Here is a breakdown of commits (PRs merged) per contributor.

    37	Andy Grove
    10	Raz Luvaton
     7	KAZUYUKI TANIMURA
     3	Liang-Chi Hsieh
     2	Parth Chandra
     1	Adam Binford
     1	Dharan Aditya
     1	Himadri Pal
     1	Jagdish Parihar
     1	Kristin Cowalcijk
     1	Matt Butrovich
     1	Oleks V
     1	Sem
     1	Zhen Wang
     1	gstvg

Thank you also to everyone who contributed ...
