-
Notifications
You must be signed in to change notification settings - Fork 260
feat: Support string decimal cast #2925
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Support string decimal cast #2925
Conversation
| if string_array.is_null(i) { | ||
| decimal_builder.append_null(); | ||
| } else { | ||
| let str_value = string_array.value(i).trim(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be better to avoid trim since this could be expensive (creating a new string). It would be more efficient for the parser to ignore whitespace at the end of this string.
| mantissa_str | ||
| }; | ||
|
|
||
| let split_by_dot: Vec<&str> = mantissa_str.split('.').collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling split seems expensive as well.
|
Thanks for moving this to a separate PR @coderfender. I left some initial comments. Could you run some benchmarks? I'd like to get a feel for performance. |
|
Sure @andygrove . Let me run some benchmarks on the cast op and update this thread |
|
@coderfender, the functionality looks good overall. There are likely some edge cases we do not support that we need to handle or document. I have only found a few so far, and here is a suggested test to add. test("cast StringType to DecimalType non-ASCII characters") {
// TODO fix for Spark 4.0.0
assume(!isSpark40Plus)
val values = Seq(
"\uff11\uff12\uff13", // fullwidth digits 123
"123\u0000", // null byte at end
"12\u00003", // null byte in middle
"\u0000123", // null byte at start
null).toDF("a")
Seq(true, false).foreach(k =>
castTest(values, DataTypes.createDecimalType(10, 2), testAnsi = k))
}I think it is fine if we do not support everything, but we should mark the cast as |
|
Great catch @andygrove . Let me try and see if I can resolve the issue or move forward with these changes and mark it compatible |
|
Pushed commit to not have expensive 'trim' and split by '.' operations. Working on the rather difficult part of merging decimal validation and parsing in a single pass |
|
Addressed double iteration and we now validate and parse in a single iteration |
|
Benchmarks after adding further optimizations |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #2925 +/- ##
============================================
+ Coverage 56.12% 59.59% +3.47%
- Complexity 976 1381 +405
============================================
Files 119 167 +48
Lines 11743 15492 +3749
Branches 2251 2568 +317
============================================
+ Hits 6591 9233 +2642
- Misses 4012 4959 +947
- Partials 1140 1300 +160 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@andygrove please take a look whenever you get a chance. Thank you |
| } | ||
|
|
||
| // This is to pass the first `all cast combinations are covered` | ||
| ignore("cast StringType to DecimalType(10,2)") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can this test also be enabled now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andygrove this test cant be enabled given that we are still marking the casts as incompatible . However I added the same tests (with a different name to make sure they pass) . Enabling this test will fail 'all cast combinations are covered'
|
Thanks @coderfender. This LGTM but could you also update the compatibility guide. It currently states With the docs update, I'll be happy to approve. |
|
@andygrove the docs are already updated AFAIK |
andygrove
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @coderfender
|
Thank you @andygrove |
Which issue does this PR close?
Similar issue as #326
Closes #2926
Rationale for this change
What changes are included in this PR?
Decimal Types
(This is the tricky one)
is_valid_decimalandparse_string_to_decimal).is_valid_decimalto parse the input string and return True if the input is a valid decimal. This follows similar pattern of Spark's parsing logic which indeed relies on BigDecimal implementationparse_string_to_decimaldoes the heavy lifting of parsing the input string into a valid 'i128' which is essentially how the value is stored (along with scale and precision information). This method also checks if the scale is too high / low and fails early.How are these changes tested?