[SPARK-55363][PS][TESTS] Make ops tests with "decimal_nan" columns ignore NaN vs. None #54146
ueshin wants to merge 1 commit into apache:master from ueshin/issues/SPARK-55363/ignore_null
Conversation
JIRA Issue Information: Sub-task SPARK-55363. (This comment was automatically generated by GitHub Actions.)
```py
        self.assert_eq(left, right)

    def ignore_null(self, col):
        return LooseVersion(pd.__version__) >= LooseVersion("3.0") and col == "decimal_nan"
```
Is decimal_nan the only case where this happens? I think this could matter any time some calculation results in a null-like value.
Yes, as far as I've observed so far, it only happens with decimal_nan.
We can use subTest to check the parameters in the test loops. WDYT?
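For illustration, a rough sketch of what that could look like; the loop body and variable names here are assumptions, not the actual test code:

```py
# Hypothetical test loop: with subTest, a failure reports which column
# (and pandas version) was being exercised instead of an anonymous assert.
for col in self.df_cols:
    with self.subTest(col=col, pandas_version=pd.__version__):
        self.assert_eq(self.psdf[col] + self.psdf[col], self.pdf[col] + self.pdf[col])
```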
I think most of the issues happened because decimal_nan actually has null-like values; some of the columns don't. We probably shouldn't treat the column name as a given. A better way is probably to check whether the column has any null-like value, with `col.isna().to_numpy().any()`?
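A minimal sketch of that alternative, assuming the helper receives the pandas column itself rather than its name (a hypothetical signature, not the actual patch):

```py
from distutils.version import LooseVersion

import pandas as pd


def ignore_null(col: pd.Series) -> bool:
    # Hypothetical variant: relax the NaN-vs-None check only when the
    # column actually contains null-like values, rather than matching
    # the "decimal_nan" column name.
    return (
        LooseVersion(pd.__version__) >= LooseVersion("3.0")
        and col.isna().to_numpy().any()
    )
```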
The actual issue should be how `decimal.Decimal(np.nan)` is handled?
For the other numeric types, `None` becomes `NaN` when converting to pandas, which is handled well.
For the other types, `None` stays `None` anyway.
- The string type is a bit different; previously it was `object` dtype with `None`, but now it's `StringDtype` with `NaN` for null, which was fixed in [SPARK-55244][PYTHON][PS] Use np.nan as default value for pandas string types #54015

But `decimal.Decimal(np.nan)` is a kind of special value that Spark can't handle well anyway?
It will be `None` in pandas API on Spark, as Spark doesn't have a concept of `NaN` for the decimal type.
```py
>>> pdf = pd.DataFrame([decimal.Decimal(np.nan), None])
>>> pdf
      0
0   NaN
1  None
>>>
>>> psdf = ps.from_pandas(pdf)
>>> psdf
      0
0  None
1  None
```
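For context, `decimal.Decimal(np.nan)` itself is a perfectly well-defined NaN on the Python side; it's only Spark that has no NaN for decimals. A quick plain-Python check:

```py
>>> import decimal
>>> import numpy as np
>>> decimal.Decimal(np.nan)   # a float NaN converts to a decimal NaN
Decimal('NaN')
>>> decimal.Decimal(np.nan).is_nan()
True
```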
Okay, so for these tests we are doing operations on a "decimal" column, which is really just `object` in pandas because pandas does not have a decimal dtype. The psdf output, unfortunately, also has dtype `object`, because `object` + anything is `object`. So there is no way for us to know that we should convert this `None` to `np.nan`.
Then I guess this change is fine: we should ignore the null differences from operations on "decimal" data, which is just `object`.
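For illustration, one way a comparison can ignore the NaN-vs-None difference is to normalize null-like values on both sides before asserting. This is a sketch of the idea, not the code in this PR; the helper name is made up:

```py
import numpy as np
import pandas as pd
from pandas.testing import assert_series_equal


def assert_eq_ignore_null(left: pd.Series, right: pd.Series) -> None:
    # Replace every null-like value (None, NaN, NaT) with np.nan on
    # both sides, so the strict NaN-vs-None check in pandas 3 is moot.
    left = left.mask(left.isna(), np.nan)
    right = right.mask(right.isna(), np.nan)
    assert_series_equal(left, right)
```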
The link to the build in the build status is missing for some reason.
Actually, I'm handling this in #54100. I think I've seen other cases caused by null comparison, but maybe I'm wrong?
thanks, merged to master
…nore NaN vs. None
### What changes were proposed in this pull request?
This is a follow-up of #54146. Also fixes `test_num_mul_div` to ignore `NaN` vs. `None`.
### Why are the changes needed?
The fix was not included in #54146.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated the related tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #54165 from ueshin/issues/SPARK-55363/test_num_mul_div.
Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
[SPARK-55363][PS][TESTS] Make ops tests with "decimal_nan" columns ignore NaN vs. None
### What changes were proposed in this pull request?
Makes ops tests with "decimal_nan" columns ignore `NaN` vs. `None`.
### Why are the changes needed?
pandas 3 made `assert_frame_equal` strictly check `NaN` vs. `None`.
```py
>>> pdf = pd.DataFrame([decimal.Decimal(np.nan)])
>>> psdf = ps.from_pandas(pdf)
>>>
>>> pdf
     0
0  NaN
>>> psdf
      0
0  None
```
- pandas < 3
```py
>>> pd.__version__
'2.3.3'
>>> assert_frame_equal(pdf, psdf.to_pandas())
<stdin>:1: FutureWarning: Mismatched null-like values NaN and None found. In a future version, pandas equality-testing functions (e.g. assert_frame_equal) will consider these not-matching and raise.
```
- pandas == 3
```py
>>> pd.__version__
'3.0.0'
>>> assert_frame_equal(pdf, psdf.to_pandas())
Traceback (most recent call last):
...
AssertionError: DataFrame.iloc[:, 0] (column name="0") are different
DataFrame.iloc[:, 0] (column name="0") values are different (100.0 %)
[index]: [0]
[left]: [NaN]
[right]: [None]
At positional index 0, first diff: NaN != None
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated the related tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#54146 from ueshin/issues/SPARK-55363/ignore_null.
Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>