
[SPARK-55303][PYTHON][TESTS] Extract GoldenFileTestMixin for type coercion golden file tests #54084

Closed
Yicong-Huang wants to merge 10 commits into apache:master from Yicong-Huang:SPARK-55303/refactor/extract-golden-file-test-util

Conversation

Yicong-Huang (Contributor) commented Feb 1, 2026

What changes were proposed in this pull request?

Extract common golden file testing utilities into GoldenFileTestMixin in python/pyspark/testing/goldenutils.py, and simplify the four type coercion test files to use this mixin.

Why are the changes needed?

Reduce duplicated code across four test files and provide a reusable framework for future golden file tests.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Regenerated all golden files with SPARK_GENERATE_GOLDEN_FILES=1 and verified tests pass.

Was this patch authored or co-authored using generative AI tooling?

No.

Extract common golden file testing utilities into GoldenFileTestMixin:
- _run_golden_tests(): unified test execution framework
- _golden_path(): compute golden file path from prefix
- _compare_or_generate_golden(): compare or generate golden files
- repr_value(), repr_spark_type(): value/type formatting
- save_golden(), load_golden_csv(): file I/O utilities

Simplify test files to only define test data and run_test callback.
Update golden files to match refactored test output format.
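As a rough sketch of what such a mixin could look like (simplified; the golden_dir default, file extension, and method bodies here are illustrative assumptions, not the exact code in goldenutils.py):

import os


class GoldenFileTestMixin:
    # Directory holding the golden files; illustrative default, the real
    # mixin would derive this from the test module location.
    golden_dir = "python/test_support/golden"

    def _golden_path(self, prefix):
        # Compute the golden file path from a test-specific prefix.
        return os.path.join(self.golden_dir, f"{prefix}.csv")

    def repr_value(self, value):
        # Render a value deterministically so golden files stay stable.
        return repr(value)

    def save_golden(self, path, text):
        # Overwrite the golden file with freshly generated output.
        with open(path, "w") as f:
            f.write(text)

    def _compare_or_generate_golden(self, prefix, actual):
        # Regenerate when SPARK_GENERATE_GOLDEN_FILES=1, otherwise compare
        # against the checked-in golden file (assertEqual comes from the
        # unittest.TestCase the mixin is combined with).
        path = self._golden_path(prefix)
        if os.environ.get("SPARK_GENERATE_GOLDEN_FILES") == "1":
            self.save_golden(path, actual)
        else:
            with open(path) as f:
                self.assertEqual(f.read(), actual)

A test class would mix this into a unittest.TestCase, build its output string, and call _compare_or_generate_golden with its own prefix.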
github-actions bot commented Feb 1, 2026

JIRA Issue Information

=== Sub-task SPARK-55303 ===
Summary: Create golden file test framework
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

Yicong-Huang (Contributor, Author):

cc @zhengruifeng

zhengruifeng (Contributor):

Input type tests: 5-column row format → 2-column matrix (Spark Type, Python Type)

please don't do this

@Yicong-Huang Yicong-Huang changed the title [SPARK-55303][PYTHON][TEST] Extract GoldenFileTestMixin for type coercion golden file tests [SPARK-55303][PYTHON][TESTS] Extract GoldenFileTestMixin for type coercion golden file tests Feb 2, 2026
float_null float [None, 3.140000104904175] float32 [None, 3.140000104904175]
double_values double [0.0, 1.0, 0.3333333333333333] float64 [0.0, 1.0, 0.3333333333333333]
double_null double [None, 2.71] float64 [None, 2.71]
decimal_values decimal(3,2) [Decimal('5.35'), Decimal('1.23')] Decimal [Decimal('5.35'), Decimal('1.23')]
zhengruifeng (Contributor):

@Yicong-Huang is the change ['object', 'object'] -> Decimal expected?

I think it should be the dtypes of pdf here?

Yicong-Huang (Contributor, Author):

It is expected! I find object alone to be too little information. The object dtype is pandas' fallback for anything that doesn't have native array support; it stores an array of Python object pointers. If we use it in the golden file test, we won't notice when the actual element type changes. For example,

from decimal import Decimal

from pandas import DataFrame

# Correct UDF result: Decimal values
result1 = DataFrame({'value': [Decimal('1.23')], 'name': ['a']})

# Buggy UDF result: str instead of Decimal
result2 = DataFrame({'value': ['1.23'], 'name': ['a']})

# pandas dtype for both: [object, object] — identical!
# Python element types: [Decimal, str] vs [str, str] — different!

So in this PR I went ahead and updated repr_type to print out the actual Python object type when it detects a generic object pandas dtype. You can see it prints out Decimal for this case:

[screenshot: regenerated golden file output showing Decimal]
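For reference, the kind of element-type fallback being discussed could look roughly like the snippet below (repr_dtype is a hypothetical name, not necessarily the code that was added and later reverted in this PR):

from decimal import Decimal

import pandas as pd


def repr_dtype(series: pd.Series) -> str:
    # Native dtypes (int64, float64, ...) are already informative.
    if series.dtype != object:
        return str(series.dtype)
    # For the generic object dtype, report the Python element types instead,
    # so a change from Decimal to str shows up in the golden file.
    names = sorted({type(v).__name__ for v in series if v is not None})
    return ", ".join(names) if names else "object"


print(repr_dtype(pd.Series([Decimal("1.23")])))  # Decimal
print(repr_dtype(pd.Series(["1.23"])))           # str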

zhengruifeng (Contributor):

I think this is a separate topic from the extraction; if we want to do this, we should do it in separate PRs.

The test class should be able to override the default string expr.

Yicong-Huang (Contributor, Author):

Got it, reverted the change. But I think we should consider changing this in future PRs.

"""
if elem is None:
return "NoneType"
elif have_pandas and isinstance(elem, pd.DataFrame):
zhengruifeng (Contributor):

I think we don't need to check have_pandas? it should always be true

Yicong-Huang (Contributor, Author):

I was hoping to reuse this for future golden file tests, for example pyarrow related tests. So I made the method general.

zhengruifeng (Contributor):

I see, but the files are still written by pandas?

Yicong-Huang (Contributor, Author):

I see your point, you are right. I removed the pandas check.

parallel=True,
)

def _run_golden_tests(
zhengruifeng (Contributor):

I feel it is kind of less flexible.
I think this mixin should only provide helper functions to generate string exprs, and it is up to the test class to determine how the cases are tested, e.g. the error handling.

Yicong-Huang (Contributor, Author):

Yes, the framework here is pretty general. Test classes that use this mixin define the properties test_cases and column_names, and also define the method run_single_test to control how each case is executed and how errors are checked.

The framework logic here is simple (see the sketch below):

  1. run the provided test_cases by calling the provided run_single_test;
  2. execute in parallel, if permitted;
  3. collect the results and serialize them to a string;
  4. compare against, or generate, the golden file.
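As a rough illustration of that flow (simplified; the tab-separated serialization and the ThreadPoolExecutor are assumptions, and this method was later slimmed down as discussed below):

from concurrent.futures import ThreadPoolExecutor


def _run_golden_tests(self, prefix, parallel=True):
    # Method of the mixin; relies on test_cases, column_names and
    # run_single_test being defined by the concrete test class, and on the
    # _compare_or_generate_golden helper sketched earlier.
    if parallel:
        # Execute the provided test cases in parallel, if permitted.
        with ThreadPoolExecutor() as pool:
            rows = list(pool.map(self.run_single_test, self.test_cases))
    else:
        rows = [self.run_single_test(case) for case in self.test_cases]

    # Collect results and serialize them into a single golden-file string.
    header = "\t".join(self.column_names)
    body = "\n".join("\t".join(self.repr_value(v) for v in row) for row in rows)

    # Compare against the golden file, or regenerate it.
    self._compare_or_generate_golden(prefix, header + "\n" + body + "\n")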

zhengruifeng (Contributor):

I am wondering how to modify the golden file according to different envs, how to check only a subset of the golden file, and how to forbid golden file regeneration in an unexpected env?

I think it should be just a helper class providing some helper functions:
1. save/load the golden file based on pandas;
2. a default string expr for variant instances, which the subclass should be able to override.
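A helper-only shape along those lines might look like this (class and method names are illustrative, not from the PR):

import pandas as pd


class GoldenFileHelpers:
    def save_golden(self, path, df: pd.DataFrame) -> None:
        # Persist generated results with pandas so loading stays symmetric.
        df.to_csv(path, index=False)

    def load_golden_csv(self, path) -> pd.DataFrame:
        return pd.read_csv(path)

    def repr_value(self, value) -> str:
        # Default string expr for a result value; a test class that needs a
        # different rendering (e.g. for Decimal or None) overrides this.
        return repr(value)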

Yicong-Huang (Contributor, Author):

OK, reverted so that each test decides how to run its tests.

zhengruifeng (Contributor):

thanks, merged to master

