
[SPARK-55303][PYTHON][TESTS] Extract GoldenFileTestMixin for type coercion golden file tests #54084

Closed
Yicong-Huang wants to merge 10 commits into apache:master from Yicong-Huang:SPARK-55303/refactor/extract-golden-file-test-util

Conversation

Yicong-Huang (Contributor) commented Feb 1, 2026

What changes were proposed in this pull request?

Extract common golden file testing utilities into GoldenFileTestMixin in python/pyspark/testing/goldenutils.py, and simplify the four type coercion test files to use this mixin.

Why are the changes needed?

Reduce duplicated code across four test files and provide a reusable framework for future golden file tests.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Regenerated all golden files with SPARK_GENERATE_GOLDEN_FILES=1 and verified tests pass.

Was this patch authored or co-authored using generative AI tooling?

No.

Extract common golden file testing utilities into GoldenFileTestMixin:
- _run_golden_tests(): unified test execution framework
- _golden_path(): compute golden file path from prefix
- _compare_or_generate_golden(): compare or generate golden files
- repr_value(), repr_spark_type(): value/type formatting
- save_golden(), load_golden_csv(): file I/O utilities

Simplify test files to only define test data and run_test callback.
Update golden files to match refactored test output format.
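As a rough sketch of what such a mixin could look like (simplified; the golden_dir default, file extension, and method bodies here are illustrative assumptions, not the exact code in goldenutils.py):

import os


class GoldenFileTestMixin:
    # Directory holding the golden files; illustrative default, the real
    # mixin would derive this from the test module location.
    golden_dir = "python/test_support/golden"

    def _golden_path(self, prefix):
        # Compute the golden file path from a test-specific prefix.
        return os.path.join(self.golden_dir, f"{prefix}.csv")

    def repr_value(self, value):
        # Render a value deterministically so golden files stay stable.
        return repr(value)

    def save_golden(self, path, text):
        # Overwrite the golden file with freshly generated output.
        with open(path, "w") as f:
            f.write(text)

    def _compare_or_generate_golden(self, prefix, actual):
        # Regenerate when SPARK_GENERATE_GOLDEN_FILES=1, otherwise compare
        # against the checked-in golden file (assertEqual comes from the
        # unittest.TestCase the mixin is combined with).
        path = self._golden_path(prefix)
        if os.environ.get("SPARK_GENERATE_GOLDEN_FILES") == "1":
            self.save_golden(path, actual)
        else:
            with open(path) as f:
                self.assertEqual(f.read(), actual)

A test class would mix this into a unittest.TestCase, build its output string, and call _compare_or_generate_golden with its own prefix.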
github-actions bot commented Feb 1, 2026

JIRA Issue Information

=== Sub-task SPARK-55303 ===
Summary: Create golden file test framework
Assignee: None
Status: Open
Affected: ["4.2.0"]


This comment was automatically generated by GitHub Actions

Yicong-Huang (Contributor, Author):

cc @zhengruifeng

zhengruifeng (Contributor):

Input type tests: 5-column row format → 2-column matrix (Spark Type, Python Type)

please don't do this

@Yicong-Huang Yicong-Huang changed the title [SPARK-55303][PYTHON][TEST] Extract GoldenFileTestMixin for type coercion golden file tests [SPARK-55303][PYTHON][TESTS] Extract GoldenFileTestMixin for type coercion golden file tests Feb 2, 2026
float_null float [None, 3.140000104904175] float32 [None, 3.140000104904175]
double_values double [0.0, 1.0, 0.3333333333333333] float64 [0.0, 1.0, 0.3333333333333333]
double_null double [None, 2.71] float64 [None, 2.71]
decimal_values decimal(3,2) [Decimal('5.35'), Decimal('1.23')] Decimal [Decimal('5.35'), Decimal('1.23')]
zhengruifeng (Contributor):

@Yicong-Huang is the change ['object', 'object'] -> Decimal expected?

I think it should be the dtypes of pdf here?

Yicong-Huang (Contributor, Author):

It is expected! I find object alone to be too little information. The object dtype is pandas' fallback for anything that doesn't have native array support; it stores an array of Python object pointers. If we use it in the golden file test, we won't notice when the actual element type changes. For example,

from decimal import Decimal

from pandas import DataFrame

# Correct UDF result: Decimal values
result1 = DataFrame({'value': [Decimal('1.23')], 'name': ['a']})

# Buggy UDF result: str instead of Decimal
result2 = DataFrame({'value': ['1.23'], 'name': ['a']})

# pandas dtype for both: [object, object] — identical!
# Python element types: [Decimal, str] vs [str, str] — different!

So in this PR I went ahead and updated repr_type to print out the actual Python object type when it detects a generic object pandas dtype. You can see it prints out Decimal for this case:

[screenshot: regenerated golden file output showing Decimal]
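For reference, the kind of element-type fallback being discussed could look roughly like the snippet below (repr_dtype is a hypothetical name, not necessarily the code that was added and later reverted in this PR):

from decimal import Decimal

import pandas as pd


def repr_dtype(series: pd.Series) -> str:
    # Native dtypes (int64, float64, ...) are already informative.
    if series.dtype != object:
        return str(series.dtype)
    # For the generic object dtype, report the Python element types instead,
    # so a change from Decimal to str shows up in the golden file.
    names = sorted({type(v).__name__ for v in series if v is not None})
    return ", ".join(names) if names else "object"


print(repr_dtype(pd.Series([Decimal("1.23")])))  # Decimal
print(repr_dtype(pd.Series(["1.23"])))           # str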

zhengruifeng (Contributor):

I think this is a separate topic from the extraction; if we want to do this, we should do it in separate PRs.

The test class should be able to override the default string expr.

Yicong-Huang (Contributor, Author):

Got it, reverted the change. But I think we should consider changing this in future PRs.

"""
if elem is None:
return "NoneType"
elif have_pandas and isinstance(elem, pd.DataFrame):
zhengruifeng (Contributor):

I think we don't need to check have_pandas? it should always be true

Yicong-Huang (Contributor, Author):

I was hoping to reuse this for future golden file tests, for example pyarrow related tests. So I made the method general.

zhengruifeng (Contributor):

I see, but the files are still written by pandas?

Yicong-Huang (Contributor, Author):

I see your point, you are right. I removed the pandas check.

parallel=True,
)

def _run_golden_tests(
zhengruifeng (Contributor):

I feel it is kind of less flexible.
I think this mixin should only provide helper functions to generate string exprs, and it is up to the test class to determine how the cases are tested, e.g. the error handling.

Yicong-Huang (Contributor, Author):

Yes, the framework here is pretty general. Test classes that use this mixin define the properties test_cases and column_names, and also define the method run_single_test to control how each case is executed and how errors are checked.

The framework logic here is simple (see the sketch below):

  1. run the provided test_cases by calling the provided run_single_test;
  2. execute in parallel, if permitted;
  3. collect the results and serialize them to a string;
  4. compare against, or generate, the golden file.
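As a rough illustration of that flow (simplified; the tab-separated serialization and the ThreadPoolExecutor are assumptions, and this method was later slimmed down as discussed below):

from concurrent.futures import ThreadPoolExecutor


def _run_golden_tests(self, prefix, parallel=True):
    # Method of the mixin; relies on test_cases, column_names and
    # run_single_test being defined by the concrete test class, and on the
    # _compare_or_generate_golden helper sketched earlier.
    if parallel:
        # Execute the provided test cases in parallel, if permitted.
        with ThreadPoolExecutor() as pool:
            rows = list(pool.map(self.run_single_test, self.test_cases))
    else:
        rows = [self.run_single_test(case) for case in self.test_cases]

    # Collect results and serialize them into a single golden-file string.
    header = "\t".join(self.column_names)
    body = "\n".join("\t".join(self.repr_value(v) for v in row) for row in rows)

    # Compare against the golden file, or regenerate it.
    self._compare_or_generate_golden(prefix, header + "\n" + body + "\n")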

zhengruifeng (Contributor):

I am wondering how to modify the golden file according to different envs, how to check only a subset of the golden file, and how to forbid golden file regeneration in an unexpected env?

I think it should be just a helper class providing some helper functions:
1. save/load the golden file based on pandas;
2. a default string expr for variant instances, which the subclass should be able to override.
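A helper-only shape along those lines might look like this (class and method names are illustrative, not from the PR):

import pandas as pd


class GoldenFileHelpers:
    def save_golden(self, path, df: pd.DataFrame) -> None:
        # Persist generated results with pandas so loading stays symmetric.
        df.to_csv(path, index=False)

    def load_golden_csv(self, path) -> pd.DataFrame:
        return pd.read_csv(path)

    def repr_value(self, value) -> str:
        # Default string expr for a result value; a test class that needs a
        # different rendering (e.g. for Decimal or None) overrides this.
        return repr(value)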

Yicong-Huang (Contributor, Author):

OK, reverted so that each test decides how to run its tests.

zhengruifeng (Contributor):

thanks, merged to master

