Skip to content

Conversation

@coderfender
Copy link
Contributor

@coderfender coderfender commented Dec 17, 2025

Which issue does this PR close?

Similar issue as #326

Closes #2926

Rationale for this change

What changes are included in this PR?

Decimal Types
(This is the tricky one)

  1. I added two functions (is_valid_decimal and parse_string_to_decimal). is_valid_decimal to parse the input string and return True if the input is a valid decimal. This follows similar pattern of Spark's parsing logic which indeed relies on BigDecimal implementation
  2. parse_string_to_decimal does the heavy lifting of parsing the input string into a valid 'i128' which is essentially how the value is stored (along with scale and precision information). This method also checks if the scale is too high / low and fails early.
  3. Once the above method returns a valid i128 integer, we use the decimal builder and return the decimal array

How are these changes tested?

if string_array.is_null(i) {
decimal_builder.append_null();
} else {
let str_value = string_array.value(i).trim();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to avoid trim since this could be expensive (creating a new string). It would be more efficient for the parser to ignore whitespace at the end of this string.

mantissa_str
};

let split_by_dot: Vec<&str> = mantissa_str.split('.').collect();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling split seems expensive as well.

@andygrove
Copy link
Member

Thanks for moving this to a separate PR @coderfender. I left some initial comments. Could you run some benchmarks? I'd like to get a feel for performance.

@coderfender
Copy link
Contributor Author

Sure @andygrove . Let me run some benchmarks on the cast op and update this thread

@andygrove
Copy link
Member

@coderfender, the functionality looks good overall. There are likely some edge cases we do not support that we need to handle or document. I have only found a few so far, and here is a suggested test to add.

  test("cast StringType to DecimalType non-ASCII characters") {
    // TODO fix for Spark 4.0.0
    assume(!isSpark40Plus)
    val values = Seq(
      "\uff11\uff12\uff13", // fullwidth digits 123
      "123\u0000", // null byte at end
      "12\u00003", // null byte in middle
      "\u0000123", // null byte at start
      null).toDF("a")
    Seq(true, false).foreach(k =>
      castTest(values, DataTypes.createDecimalType(10, 2), testAnsi = k))
  }
!== Correct Answer - 5 ==           == Spark Answer - 5 ==
 struct<a:string,a:decimal(10,2)>   struct<a:string,a:decimal(10,2)>
 [null,null]                        [null,null]
![ 123,123.00]                      [ 123,null]
 [12 3,null]                        [12 3,null]
![123 ,123.00]                      [123 ,null]
![123,123.00]                       [123,null]

I think it is fine if we do not support everything, but we should mark the cast as Incompatible and add documentation explaining the current limitations.

@coderfender
Copy link
Contributor Author

Great catch @andygrove . Let me try and see if I can resolve the issue or move forward with these changes and mark it compatible

@coderfender
Copy link
Contributor Author

Pushed commit to not have expensive 'trim' and split by '.' operations. Working on the rather difficult part of merging decimal validation and parsing in a single pass

@coderfender
Copy link
Contributor Author

Addressed double iteration and we now validate and parse in a single iteration

@coderfender
Copy link
Contributor Author

Benchmarks after adding further optimizations

================================================================================================
Running benchmark DecimalType(2,2)
================================================================================================

OpenJDK 64-Bit Server VM 17.0.16+8-LTS on Mac OS X 16.0
Apple M2 Max
Cast function to : DecimalType(2,2) , ansi mode enabled : false:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark Cast expr from STRING to : DECIMAL(2,2) , ansi mode enabled : false            106            130          14          9.4         106.3       1.0X
SQL Parquet - Comet Cast expr from STRING to : DECIMAL(2,2) , ansi mode enabled : false             94            123          36         10.7          93.5       1.1X


================================================================================================
Running benchmark DecimalType(2,2)
================================================================================================

OpenJDK 64-Bit Server VM 17.0.16+8-LTS on Mac OS X 16.0
Apple M2 Max
Cast function to : DecimalType(2,2) , ansi mode enabled : true:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark Cast expr from STRING to : DECIMAL(2,2) , ansi mode enabled : true             88            109          18         11.3          88.3       1.0X
SQL Parquet - Comet Cast expr from STRING to : DECIMAL(2,2) , ansi mode enabled : true             78             85           9         12.8          78.2       1.1X


================================================================================================
Running benchmark DecimalType(10,2)
================================================================================================

OpenJDK 64-Bit Server VM 17.0.16+8-LTS on Mac OS X 16.0
Apple M2 Max
Cast function to : DecimalType(10,2) , ansi mode enabled : false:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark Cast expr from STRING to : DECIMAL(10,2) , ansi mode enabled : false            102            132          22          9.8         101.7       1.0X
SQL Parquet - Comet Cast expr from STRING to : DECIMAL(10,2) , ansi mode enabled : false             86            104          18         11.7          85.8       1.2X


================================================================================================
Running benchmark DecimalType(10,2)
================================================================================================

OpenJDK 64-Bit Server VM 17.0.16+8-LTS on Mac OS X 16.0
Apple M2 Max
Cast function to : DecimalType(10,2) , ansi mode enabled : true:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark Cast expr from STRING to : DECIMAL(10,2) , ansi mode enabled : true            118            150          34          8.5         118.1       1.0X
SQL Parquet - Comet Cast expr from STRING to : DECIMAL(10,2) , ansi mode enabled : true             86            138          36         11.6          86.1       1.4X


================================================================================================
Running benchmark DecimalType(20,2)
================================================================================================

OpenJDK 64-Bit Server VM 17.0.16+8-LTS on Mac OS X 16.0
Apple M2 Max
Cast function to : DecimalType(20,2) , ansi mode enabled : false:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark Cast expr from STRING to : DECIMAL(20,2) , ansi mode enabled : false            158            198          40          6.3         157.7       1.0X
SQL Parquet - Comet Cast expr from STRING to : DECIMAL(20,2) , ansi mode enabled : false            146            150           4          6.8         146.2       1.1X


================================================================================================
Running benchmark DecimalType(20,2)
================================================================================================

OpenJDK 64-Bit Server VM 17.0.16+8-LTS on Mac OS X 16.0
Apple M2 Max
Cast function to : DecimalType(20,2) , ansi mode enabled : true:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark Cast expr from STRING to : DECIMAL(20,2) , ansi mode enabled : true            160            210          39          6.3         159.6       1.0X
SQL Parquet - Comet Cast expr from STRING to : DECIMAL(20,2) , ansi mode enabled : true            145            150           6          6.9         145.0       1.1X


================================================================================================
Running benchmark DecimalType(34,10)
================================================================================================

OpenJDK 64-Bit Server VM 17.0.16+8-LTS on Mac OS X 16.0
Apple M2 Max
Cast function to : DecimalType(34,10) , ansi mode enabled : false:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark Cast expr from STRING to : DECIMAL(34,10) , ansi mode enabled : false            350            365          27          2.9         350.0       1.0X
SQL Parquet - Comet Cast expr from STRING to : DECIMAL(34,10) , ansi mode enabled : false            191            203          12          5.2         191.0       1.8X


================================================================================================
Running benchmark DecimalType(34,10)
================================================================================================

OpenJDK 64-Bit Server VM 17.0.16+8-LTS on Mac OS X 16.0
Apple M2 Max
Cast function to : DecimalType(34,10) , ansi mode enabled : true:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
SQL Parquet - Spark Cast expr from STRING to : DECIMAL(34,10) , ansi mode enabled : true            347            354          12          2.9         346.5       1.0X
SQL Parquet - Comet Cast expr from STRING to : DECIMAL(34,10) , ansi mode enabled : true            192            224          38          5.2         191.6       1.8X

@codecov-commenter
Copy link

codecov-commenter commented Dec 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.59%. Comparing base (f09f8af) to head (6a6c80d).
⚠️ Report is 790 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #2925      +/-   ##
============================================
+ Coverage     56.12%   59.59%   +3.47%     
- Complexity      976     1381     +405     
============================================
  Files           119      167      +48     
  Lines         11743    15492    +3749     
  Branches       2251     2568     +317     
============================================
+ Hits           6591     9233    +2642     
- Misses         4012     4959     +947     
- Partials       1140     1300     +160     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@coderfender
Copy link
Contributor Author

@andygrove please take a look whenever you get a chance. Thank you

}

// This is to pass the first `all cast combinations are covered`
ignore("cast StringType to DecimalType(10,2)") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this test also be enabled now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@andygrove this test cant be enabled given that we are still marking the casts as incompatible . However I added the same tests (with a different name to make sure they pass) . Enabling this test will fail 'all cast combinations are covered'

@andygrove
Copy link
Member

Thanks @coderfender. This LGTM but could you also update the compatibility guide. It currently states Does not support inputs ending with 'd' or 'f'. Does not support 'inf'. Does not support ANSI mode. Returns 0.0 instead of null if input contains no digits and this is not entirely correct now.

With the docs update, I'll be happy to approve.

@coderfender
Copy link
Contributor Author

@andygrove the docs are already updated AFAIK

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @coderfender

@andygrove andygrove merged commit a6cfadb into apache:main Dec 22, 2025
150 of 151 checks passed
@coderfender
Copy link
Contributor Author

Thank you @andygrove

@coderfender coderfender deleted the support_string_decimal_cast branch December 23, 2025 01:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support exhaustive String to Decimal casting

3 participants