-
-
Notifications
You must be signed in to change notification settings - Fork 333
Add ArcSinhTransformer, TextFeatures, and GeoDistanceTransformer #875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add ArcSinhTransformer, TextFeatures, and GeoDistanceTransformer #875
Conversation
New transformers: - ArcSinhTransformer: Pseudo-log transformation for positive and negative values (addresses feature-engine#707) - TextFeatures: Extract 19 text features from string columns (new text module) - GeoDistanceTransformer: Calculate geographic distances using Haversine/Euclidean/Manhattan (addresses feature-engine#688) Includes: - Full test coverage (43 tests) - RST documentation for all transformers - Code style compliance (flake8, black, isort)
solegalli
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @ankitlade12
Thank you so much for these fantastic additions! They were much needed and long overdue :)
I only reviewed the files relevant for the GeoDistance transformer. The transformer is almost ready to be merged. I have a couple of nitpicks, a question regarding whether we should expose the optionality of a sanity check, and some more nitpicks on the documentation.
Would you be able to take a look at that?
In the ideal world, it would be great to have 1 PR per transformer. Any chance you could distribute the files relevant to each new transformer in different PRs?
If not, we take it from here. Thanks a lot!
In the meantime, I'll try and fix the tests in a separate PR. They are probably caused by the new version of sklearn :_(
- Add validate_ranges parameter to control coordinate validation - Update docstrings to match library patterns - Refactor tests to standalone functions (remove class wrapper) - Add tests for validate_ranges parameter - Rewrite user guide with proper headings, explanatory text, and outputs - Create API documentation file - Add GeoDistanceTransformer to API doc index
|
|
Hi @solegalli, Thank you for the detailed review! I've addressed all the feedback for the GeoDistanceTransformer: geo_features.py: Added Refactored tests from class-based to standalone functions Fixed heading levels (- instead of ~ for main sections) Created Regarding separate PRs per transformer - I'll keep them in this single PR for now if that's okay. Let me know if there's anything else that needs adjustment! |
| drop_original=True | ||
| )), | ||
| ('scaler', StandardScaler()), | ||
| ('regressor', RandomForestRegressor(n_estimators=10, random_state=42)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Random forests do not need feature scaling. Could we change for linear regression?
| assert 2400 < X_tr["geo_distance"].iloc[0] < 2500 | ||
|
|
||
|
|
||
| def test_same_location_zero_distance(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should add a pytest line to test this for all distances and all metrics. Something like these 2 lines here:
| @pytest.mark.parametrize("_varnames", [None, ["var1", "var2"]]) |
| ) | ||
| X_tr = transformer.fit_transform(X) | ||
|
|
||
| assert X_tr["geo_distance"].iloc[0] > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the expected value here? A note for the future contributor would help as a reminder. In addition, testing that the calculation is bigger than 0 seems like any value that we enter in the df would pass this test. Can we make it more specific?
| ) | ||
| X_tr = transformer.fit_transform(X) | ||
|
|
||
| assert X_tr["geo_distance"].iloc[0] > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as for the previous test. Maybe I am not understanding the test. Are we testing that the functions return the expected values? I think so.
|
|
||
| assert len(X_tr["geo_distance"]) == 3 | ||
| # All distances should be positive | ||
| assert all(X_tr["geo_distance"] > 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit vague. We normally add the expected df and test that the returned df matches the expected one. I would be also happy to test the values, but it would be great to be more specific regarding what the output should be, one way or another.
| assert all(X_tr["geo_distance"] > 0) | ||
|
|
||
|
|
||
| def test_invalid_method_raises_error(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we normally add a pytest command here to test more than one value, for example "invalid", but then also a boolean or a number to make sure that those fail as well.
| ) | ||
|
|
||
|
|
||
| def test_invalid_output_unit_raises_error(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as for previous test.
| transformer.fit(X) | ||
|
|
||
|
|
||
| def test_invalid_latitude_range_raises_error(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add a pytest command to test a value greater than 90 and smaller than -90.
| transformer.fit(X) | ||
|
|
||
|
|
||
| def test_invalid_longitude_range_raises_error(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as for previous test
| assert "geo_distance" in X_tr.columns | ||
|
|
||
|
|
||
| def test_validate_ranges_parameter_validation(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we'd add a pytest command to test more than one value, for Example "True" and a number or something else. 1 and 0 will pass though, because they are interpreted as booleans
| ) | ||
| transformer.fit(X) | ||
|
|
||
| assert hasattr(transformer, "variables_") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
more than the attribute is there, I would test that the attribute contains the expected values.
|
|
||
| def test_get_feature_names_out(): | ||
| """Test get_feature_names_out returns correct names.""" | ||
| X = pd.DataFrame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are using this df twice, maybe instead of hardcoding we could add it as a fixture at the top of the file
|
|
||
| feature_names = transformer.get_feature_names_out() | ||
| assert "geo_distance" in feature_names | ||
| assert len(feature_names) == 6 # 5 original + 1 new |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it would be better to test that the feature names are the expected ones.
|
Hey @ankitlade12 Thank you so much for the quick turnaround! I added some comments with a few nitpick requests and more specificity for the tests. In addition, we should add the Make sure to set Could you please also remove the other transformers and make different PRs for them? There is to much going on here and it makes the review process harder. After this, we would be ready to merge the GeoDistanceTransformer. Thank you! |
- Replace RandomForestRegressor with LinearRegression in docs (forests don't need scaling) - Add pytest parametrize for test_same_location_zero_distance (all methods and units) - Make test_euclidean_method and test_manhattan_method more specific with expected values - Make test_multiple_rows use pd.testing.assert_frame_equal with expected DataFrame - Parametrize all validation tests (invalid_method, invalid_output_unit, lat/lon range, validate_ranges) - Add fixtures for commonly used DataFrames - Test exact feature names in get_feature_names_out - Add test_geo_distance_transformer_in_pipeline to test_check_estimator_creation.py
|
|
Hi @solegalli , Thank you for the detailed review! I've addressed all your feedback: Changes Made
New Branches and PRs
I have created PRs for all three branches. Please ignore the old branch add-arcsinh-text-geodistance-transformers - it contains all three transformers mixed together. The new add-geodistance-transformer branch contains only the GeoDistanceTransformer with all your requested changes. Thank you! |
Summary
This PR adds three new transformers to Feature-engine, expanding the library's capabilities for numerical transformation, text feature extraction, and geospatial data processing.
New Transformers
1. ArcSinhTransformer (transformation module)
Addresses Issue #707
Applies the inverse hyperbolic sine (arcsinh) transformation to numerical variables. Also known as the pseudo-logarithm, this transformation is useful for data that contains both positive and negative values.
Key Features:
Use Cases: Financial data (profit/loss), sensor readings, any metric that can be positive or negative
2. TextFeatures (NEW text module)
Extracts 19 numerical features from text/string columns. Unlike scikit-learn's CountVectorizer or TfidfVectorizer which create sparse matrices, TextFeatures extracts metadata features that remain in DataFrame format.
Available Features:
char_count,word_count, sentence_count,avg_word_lengthdigit_count,uppercase_count,lowercase_count,special_char_countwhitespace_count,whitespace_ratio,digit_ratio,uppercase_ratiohas_digits,has_uppercase,is_emptystarts_with_uppercase,ends_with_punctuationunique_word_count,unique_word_ratioUse Cases: NLP preprocessing, sentiment analysis feature engineering, text classification
3. GeoDistanceTransformer (creation module)
Addresses Issue #688
Calculates the distance between two geographical coordinate pairs (latitude/longitude) and adds the result as a new feature.
Distance Methods:
Output Units: km, miles, meters, feet
Validation: Checks latitude range (-90 to 90) and longitude range (-180 to 180)
Use Cases: Real estate pricing, delivery optimization, ride-sharing, location-based ML
What's Included
Files Added/Modified
New Files:
Modified Files:
CI Status Note
The test failures in
test_feature_engine_py312andpy313are pre-existing issues in the main branch, not caused by this PR.I verified this by:
mainbranch (without my changes)pytest tests/test_base_transformers/test_get_feature_names_out_mixin.pyAll 27 tests for the new transformers pass:
The
stylechecks,test_type, andtest_docschecks should all pass.