[SPARK-55363][PS][TESTS] Make ops tests with "decimal_nan" columns ignore NaN vs. None #54146
ueshin wants to merge 1 commit into apache:master from ueshin/issues/SPARK-55363/ignore_null
Conversation
JIRA Issue Information: Sub-task SPARK-55363. (This comment was automatically generated by GitHub Actions.)
```py
        self.assert_eq(left, right)

    def ignore_null(self, col):
        return LooseVersion(pd.__version__) >= LooseVersion("3.0") and col == "decimal_nan"
```
Is decimal_nan the only case where this happens? I think this could matter any time some calculation results in a null-like value.
Yes, as far as I've observed so far, it only happens with decimal_nan.
We can use subTest to check the parameters in the test loops. WDYT?
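For illustration, a rough sketch of what that could look like; the loop body and variable names here are assumptions, not the actual test code:

```py
# Hypothetical test loop: with subTest, a failure reports which column
# (and pandas version) was being exercised instead of an anonymous assert.
for col in self.df_cols:
    with self.subTest(col=col, pandas_version=pd.__version__):
        self.assert_eq(self.psdf[col] + self.psdf[col], self.pdf[col] + self.pdf[col])
```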
I think most of the issues happened because decimal_nan actually has null-like values; some of the columns don't. We probably shouldn't treat the column name as a given. A better way is probably to check whether the column has any null-like value, with `col.isna().to_numpy().any()`?
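A minimal sketch of that alternative, assuming the helper receives the pandas column itself rather than its name (a hypothetical signature, not the actual patch):

```py
from distutils.version import LooseVersion

import pandas as pd


def ignore_null(col: pd.Series) -> bool:
    # Hypothetical variant: relax the NaN-vs-None check only when the
    # column actually contains null-like values, rather than matching
    # the "decimal_nan" column name.
    return (
        LooseVersion(pd.__version__) >= LooseVersion("3.0")
        and col.isna().to_numpy().any()
    )
```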
The actual issue should be how `decimal.Decimal(np.nan)` is handled?
For the other numeric types, `None` becomes `NaN` when converting to pandas, which is handled well.
For the other types, `None` stays `None` anyway.
- The string type is a bit different; previously it was `object` dtype with `None`, but now it's `StringDtype` with `NaN` for null, which was fixed in [SPARK-55244][PYTHON][PS] Use np.nan as default value for pandas string types #54015

But `decimal.Decimal(np.nan)` is a kind of special value that Spark can't handle well anyway?
It will be `None` in pandas API on Spark, as Spark doesn't have a concept of `NaN` for the decimal type.
```py
>>> pdf = pd.DataFrame([decimal.Decimal(np.nan), None])
>>> pdf
      0
0   NaN
1  None
>>>
>>> psdf = ps.from_pandas(pdf)
>>> psdf
      0
0  None
1  None
```
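For context, `decimal.Decimal(np.nan)` itself is a perfectly well-defined NaN on the Python side; it's only Spark that has no NaN for decimals. A quick plain-Python check:

```py
>>> import decimal
>>> import numpy as np
>>> decimal.Decimal(np.nan)   # a float NaN converts to a decimal NaN
Decimal('NaN')
>>> decimal.Decimal(np.nan).is_nan()
True
```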
Okay, so for these tests we are doing operations on a "decimal" column, which is really just `object` in pandas because pandas does not have a decimal dtype. The psdf output, unfortunately, also has dtype `object`, because `object` + anything is `object`. So there is no way for us to know that we should convert this `None` to `np.nan`.
Then I guess this change is fine: we should ignore the null differences from operations on "decimal" data, which is just `object`.
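For illustration, one way a comparison can ignore the NaN-vs-None difference is to normalize null-like values on both sides before asserting. This is a sketch of the idea, not the code in this PR; the helper name is made up:

```py
import numpy as np
import pandas as pd
from pandas.testing import assert_series_equal


def assert_eq_ignore_null(left: pd.Series, right: pd.Series) -> None:
    # Replace every null-like value (None, NaN, NaT) with np.nan on
    # both sides, so the strict NaN-vs-None check in pandas 3 is moot.
    left = left.mask(left.isna(), np.nan)
    right = right.mask(right.isna(), np.nan)
    assert_series_equal(left, right)
```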
The link to the build in the build status is missing for some reason.
Actually, I'm handling this in #54100. I think I've seen other cases caused by null comparison, but maybe I'm wrong?
thanks, merged to master
…nore NaN vs. None
### What changes were proposed in this pull request?
This is a follow-up of #54146. Also fixes `test_num_mul_div` to ignore `NaN` vs. `None`.
### Why are the changes needed?
The fix was not included in #54146.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated the related tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #54165 from ueshin/issues/SPARK-55363/test_num_mul_div.
Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
[SPARK-55363][PS][TESTS] Make ops tests with "decimal_nan" columns ignore NaN vs. None
### What changes were proposed in this pull request?
Makes ops tests with "decimal_nan" columns ignore `NaN` vs. `None`.
### Why are the changes needed?
pandas 3 made `assert_frame_equal` strictly check `NaN` vs. `None`.
```py
>>> pdf = pd.DataFrame([decimal.Decimal(np.nan)])
>>> psdf = ps.from_pandas(pdf)
>>>
>>> pdf
     0
0  NaN
>>> psdf
      0
0  None
```
- pandas < 3
```py
>>> pd.__version__
'2.3.3'
>>> assert_frame_equal(pdf, psdf.to_pandas())
<stdin>:1: FutureWarning: Mismatched null-like values NaN and None found. In a future version, pandas equality-testing functions (e.g. assert_frame_equal) will consider these not-matching and raise.
```
- pandas == 3
```py
>>> pd.__version__
'3.0.0'
>>> assert_frame_equal(pdf, psdf.to_pandas())
Traceback (most recent call last):
...
AssertionError: DataFrame.iloc[:, 0] (column name="0") are different
DataFrame.iloc[:, 0] (column name="0") values are different (100.0 %)
[index]: [0]
[left]: [NaN]
[right]: [None]
At positional index 0, first diff: NaN != None
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated the related tests.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes apache#54146 from ueshin/issues/SPARK-55363/ignore_null.
Authored-by: Takuya Ueshin <ueshin@databricks.com>
Signed-off-by: Ruifeng Zheng <ruifengz@apache.org>