[SPARK-55303][PYTHON][TESTS] Extract GoldenFileTestMixin for type coercion golden file tests#54084
Conversation
Extract common golden file testing utilities into GoldenFileTestMixin: - _run_golden_tests(): unified test execution framework - _golden_path(): compute golden file path from prefix - _compare_or_generate_golden(): compare or generate golden files - repr_value(), repr_spark_type(): value/type formatting - save_golden(), load_golden_csv(): file I/O utilities Simplify test files to only define test data and run_test callback.
Update golden files to match refactored test output format.
JIRA Issue Information=== Sub-task SPARK-55303 === This comment was automatically generated by GitHub Actions |
please don't do this |
| float_null float [None, 3.140000104904175] float32 [None, 3.140000104904175] | ||
| double_values double [0.0, 1.0, 0.3333333333333333] float64 [0.0, 1.0, 0.3333333333333333] | ||
| double_null double [None, 2.71] float64 [None, 2.71] | ||
| decimal_values decimal(3,2) [Decimal('5.35'), Decimal('1.23')] Decimal [Decimal('5.35'), Decimal('1.23')] |
There was a problem hiding this comment.
@Yicong-Huang is the change ['object', 'object'] -> Decimal expected?
I think it should be the dtypes of pdf here?
There was a problem hiding this comment.
It is expected! I find object being too less of information.object dtype is pandas' fallback for anything that doesn't have native array support — it stores an array of Python object pointers. And if we use this in golden file test, we won't be able to notice the actual type has changed. For example,
# Correct UDF result: Decimal values
result1 = DataFrame({'value': [Decimal('1.23')], 'name': ['a']})
# Buggy UDF result: str instead of Decimal
result2 = DataFrame({'value': ['1.23'], 'name': ['a']})
# pandas dtype for both: [object, object] — identical!
# Python element types: [Decimal, str] vs [str, str] — different!
So in this PR I went ahead updated the repr_type to print out the actual python object type when detected a general object pandas dtype. And you can see it prints out Decimal for this case:
There was a problem hiding this comment.
I think it is a topic other than Extract, if we want to do this we should do it in separate PRs.
The test class should be able to override the default string expr.
There was a problem hiding this comment.
got it. reverted the change. but I think we should consider change this in future PRs.
| """ | ||
| if elem is None: | ||
| return "NoneType" | ||
| elif have_pandas and isinstance(elem, pd.DataFrame): |
There was a problem hiding this comment.
I think we don't need to check have_pandas? it should always be true
There was a problem hiding this comment.
I was hoping to reuse this for future golden file tests, for example pyarrow related tests. So I made the method general.
There was a problem hiding this comment.
I see, but the files are still writen by pandas?
There was a problem hiding this comment.
I see your point. you are right. I removed pandas check
| parallel=True, | ||
| ) | ||
|
|
||
| def _run_golden_tests( |
There was a problem hiding this comment.
I feel it is kind of less flexible.
I think this mixin should only provide helper functions to generate string exprs, and it is up to the test class to determine how the cases are tested, e.g. the error hanlding.
There was a problem hiding this comment.
yes. here the framework is pretty general. the test classes who uses this mixin can define properties
test_cases, column_names, and also define the method run_single_test to handle how to execute or check error.
The framework logic here is simple:
- run the provided
test_casesby calling the providedrun_single_test. - execute in parallel, if permitted.
- collect results, serialize to string
- compare or generate golden file.
There was a problem hiding this comment.
I am wondering how to modify the golden file according to different envs? and how to only check a subset of the golden file? and how to forbidden the golden file regeneration in unexpected env?
I think it should be just a helper class providing some helper functions:
1, save/load goldenfile based on pandas;
2, default string expr of variant instances, subclass should be able to override this;
There was a problem hiding this comment.
ok. reverted so that each test decides how to run tests.
|
thanks, merged to master |
What changes were proposed in this pull request?
Extract common golden file testing utilities into
GoldenFileTestMixininpython/pyspark/testing/goldenutils.py, and simplify the four type coercion test files to use this mixin.Why are the changes needed?
Reduce duplicated code across four test files and provide a reusable framework for future golden file tests.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Regenerated all golden files with
SPARK_GENERATE_GOLDEN_FILES=1and verified tests pass.Was this patch authored or co-authored using generative AI tooling?
No.