Add ArcSinhTransformer, TextFeatures, and GeoDistanceTransformer #875

ankitlade12 · 2025-12-26T19:01:26Z

Summary

This PR adds three new transformers to Feature-engine, expanding the library's capabilities for numerical transformation, text feature extraction, and geospatial data processing.

New Transformers

1. ArcSinhTransformer (transformation module)

Addresses Issue #707
Applies the inverse hyperbolic sine (arcsinh) transformation to numerical variables. Also known as the pseudo-logarithm, this transformation is useful for data that contains both positive and negative values.
Key Features:

Handles zero and negative values (unlike LogTransformer)
Supports loc and scale parameters for centering and normalization
Full inverse_transform support
For large |x|, behaves like ln(|x|) + ln(2), providing similar variance-stabilizing properties as log
Use Cases: Financial data (profit/loss), sensor readings, any metric that can be positive or negative

2. TextFeatures (NEW text module)

Extracts 19 numerical features from text/string columns. Unlike scikit-learn's CountVectorizer or TfidfVectorizer which create sparse matrices, TextFeatures extracts metadata features that remain in DataFrame format.
Available Features:

char_count, word_count, sentence_count, avg_word_length
digit_count, uppercase_count, lowercase_count, special_char_count
whitespace_count, whitespace_ratio, digit_ratio, uppercase_ratio
has_digits, has_uppercase, is_empty
starts_with_uppercase, ends_with_punctuation
unique_word_count, unique_word_ratio
Use Cases: NLP preprocessing, sentiment analysis feature engineering, text classification

3. GeoDistanceTransformer (creation module)

Addresses Issue #688
Calculates the distance between two geographical coordinate pairs (latitude/longitude) and adds the result as a new feature.
Distance Methods:

Haversine: Great-circle distance (default, most accurate)
Euclidean: Simple coordinate distance (fast)
Manhattan: Taxicab distance (grid approximation)
Output Units: km, miles, meters, feet
Validation: Checks latitude range (-90 to 90) and longitude range (-180 to 180)
Use Cases: Real estate pricing, delivery optimization, ride-sharing, location-based ML

What's Included

Full test coverage with 43 passing tests
RST documentation for all three transformers
Code style compliance verified with flake8, black, and isort
NumPy-style docstrings with usage examples
All transformers follow Feature-engine's existing patterns and conventions

Files Added/Modified

New Files:

feature_engine/transformation/arcsinh.py
feature_engine/text/init.py
feature_engine/text/text_features.py
feature_engine/creation/geo_features.py
tests/test_transformation/test_arcsinh.py
tests/test_text/init.py
tests/test_creation/test_geo_features.py
docs/user_guide/transformation/ArcSinhTransformer.rst
docs/user_guide/text/index.rst
docs/user_guide/text/TextFeatures.rst
docs/user_guide/creation/GeoDistanceTransformer.rst

Modified Files:

feature_engine/transformation/init.py
feature_engine/creation/init.py
docs/user_guide/transformation/index.rst
docs/user_guide/creation/index.rst
docs/user_guide/index.rst

CI Status Note

The test failures in test_feature_engine_py312 and py313 are pre-existing issues in the main branch, not caused by this PR.
I verified this by:

Checking out the upstream main branch (without my changes)
Running pytest tests/test_base_transformers/test_get_feature_names_out_mixin.py
The same 73 tests fail there as well
All 27 tests for the new transformers pass:

test_arcsinh.py: 11 passed
test_geo_features.py: 16 passed
test_text/init.py: Pass
The stylechecks, test_type, and test_docs checks should all pass.

New transformers: - ArcSinhTransformer: Pseudo-log transformation for positive and negative values (addresses feature-engine#707) - TextFeatures: Extract 19 text features from string columns (new text module) - GeoDistanceTransformer: Calculate geographic distances using Haversine/Euclidean/Manhattan (addresses feature-engine#688) Includes: - Full test coverage (43 tests) - RST documentation for all transformers - Code style compliance (flake8, black, isort)

solegalli

Hey @ankitlade12

Thank you so much for these fantastic additions! They were much needed and long overdue :)

I only reviewed the files relevant for the GeoDistance transformer. The transformer is almost ready to be merged. I have a couple of nitpicks, a question regarding whether we should expose the optionality of a sanity check, and some more nitpicks on the documentation.

Would you be able to take a look at that?

In the ideal world, it would be great to have 1 PR per transformer. Any chance you could distribute the files relevant to each new transformer in different PRs?

If not, we take it from here. Thanks a lot!

In the meantime, I'll try and fix the tests in a separate PR. They are probably caused by the new version of sklearn :_(

feature_engine/creation/geo_features.py

tests/test_creation/test_geo_features.py

docs/user_guide/creation/GeoDistanceTransformer.rst

- Add validate_ranges parameter to control coordinate validation - Update docstrings to match library patterns - Refactor tests to standalone functions (remove class wrapper) - Add tests for validate_ranges parameter - Rewrite user guide with proper headings, explanatory text, and outputs - Create API documentation file - Add GeoDistanceTransformer to API doc index

ankitlade12 · 2026-01-01T15:44:49Z

Hey @ankitlade12

Thank you so much for these fantastic additions! They were much needed and long overdue :)

I only reviewed the files relevant for the GeoDistance transformer. The transformer is almost ready to be merged. I have a couple of nitpicks, a question regarding whether we should expose the optionality of a sanity check, and some more nitpicks on the documentation.

Would you be able to take a look at that?

In the ideal world, it would be great to have 1 PR per transformer. Any chance you could distribute the files relevant to each new transformer in different PRs?

If not, we take it from here. Thanks a lot!

In the meantime, I'll try and fix the tests in a separate PR. They are probably caused by the new version of sklearn :_(

ankitlade12 · 2026-01-01T15:45:32Z

Hi @solegalli,

Thank you for the detailed review! I've addressed all the feedback for the GeoDistanceTransformer:

geo_features.py:

Added
validate_ranges
parameter (default True) to allow users to control coordinate validation
Updated docstrings to match library patterns (removed validation-related sentences)
test_geo_features.py:

Refactored tests from class-based to standalone functions
Added tests for the new
validate_ranges
parameter
User Guide (GeoDistanceTransformer.rst):

Fixed heading levels (- instead of ~ for main sections)
Changed "Example" to "Python Demo"
Updated class references per style guide
Added explanatory text before all code blocks
Split distance method examples with individual outputs
Renamed "Converting to miles" to "Using different output units"
Updated pipeline section with explanatory sentence, fit, and output
Removed API Reference section
API Documentation:

Created
docs/api_doc/creation/GeoDistanceTransformer.rst
Added to
docs/api_doc/creation/index.rst

Regarding separate PRs per transformer - I'll keep them in this single PR for now if that's okay. Let me know if there's anything else that needs adjustment!

solegalli · 2026-01-08T12:08:34Z

docs/user_guide/creation/GeoDistanceTransformer.rst

+            drop_original=True
+        )),
+        ('scaler', StandardScaler()),
+        ('regressor', RandomForestRegressor(n_estimators=10, random_state=42))


Random forests do not need feature scaling. Could we change for linear regression?

solegalli · 2026-01-08T12:21:19Z

tests/test_creation/test_geo_features.py

+    assert 2400 < X_tr["geo_distance"].iloc[0] < 2500
+
+
+def test_same_location_zero_distance():


we should add a pytest line to test this for all distances and all metrics. Something like these 2 lines here:

feature_engine/tests/test_creation/test_math_features.py

Line 308 in d87a781

@pytest.mark.parametrize("_varnames", [None, ["var1", "var2"]])

solegalli · 2026-01-08T12:23:54Z

tests/test_creation/test_geo_features.py

+    )
+    X_tr = transformer.fit_transform(X)
+
+    assert X_tr["geo_distance"].iloc[0] > 0


what's the expected value here? A note for the future contributor would help as a reminder. In addition, testing that the calculation is bigger than 0 seems like any value that we enter in the df would pass this test. Can we make it more specific?

solegalli · 2026-01-08T12:24:43Z

tests/test_creation/test_geo_features.py

+    )
+    X_tr = transformer.fit_transform(X)
+
+    assert X_tr["geo_distance"].iloc[0] > 0


same as for the previous test. Maybe I am not understanding the test. Are we testing that the functions return the expected values? I think so.

solegalli · 2026-01-08T12:27:47Z

tests/test_creation/test_geo_features.py

+
+    assert len(X_tr["geo_distance"]) == 3
+    # All distances should be positive
+    assert all(X_tr["geo_distance"] > 0)


This is a bit vague. We normally add the expected df and test that the returned df matches the expected one. I would be also happy to test the values, but it would be great to be more specific regarding what the output should be, one way or another.

solegalli · 2026-01-08T12:28:45Z

tests/test_creation/test_geo_features.py

+    assert all(X_tr["geo_distance"] > 0)
+
+
+def test_invalid_method_raises_error():


we normally add a pytest command here to test more than one value, for example "invalid", but then also a boolean or a number to make sure that those fail as well.

solegalli · 2026-01-08T12:28:59Z

tests/test_creation/test_geo_features.py

+        )
+
+
+def test_invalid_output_unit_raises_error():


same as for previous test.

solegalli · 2026-01-08T12:30:12Z

tests/test_creation/test_geo_features.py

+        transformer.fit(X)
+
+
+def test_invalid_latitude_range_raises_error():


I'd add a pytest command to test a value greater than 90 and smaller than -90.

solegalli · 2026-01-08T12:30:27Z

tests/test_creation/test_geo_features.py

+        transformer.fit(X)
+
+
+def test_invalid_longitude_range_raises_error():


same as for previous test

solegalli · 2026-01-08T12:31:44Z

tests/test_creation/test_geo_features.py

+    assert "geo_distance" in X_tr.columns
+
+
+def test_validate_ranges_parameter_validation():


here we'd add a pytest command to test more than one value, for Example "True" and a number or something else. 1 and 0 will pass though, because they are interpreted as booleans

solegalli · 2026-01-08T12:32:32Z

tests/test_creation/test_geo_features.py

+    )
+    transformer.fit(X)
+
+    assert hasattr(transformer, "variables_")


more than the attribute is there, I would test that the attribute contains the expected values.

solegalli · 2026-01-08T12:33:13Z

tests/test_creation/test_geo_features.py

+
+def test_get_feature_names_out():
+    """Test get_feature_names_out returns correct names."""
+    X = pd.DataFrame(


we are using this df twice, maybe instead of hardcoding we could add it as a fixture at the top of the file

solegalli · 2026-01-08T12:33:48Z

tests/test_creation/test_geo_features.py

+
+    feature_names = transformer.get_feature_names_out()
+    assert "geo_distance" in feature_names
+    assert len(feature_names) == 6  # 5 original + 1 new


it would be better to test that the feature names are the expected ones.

solegalli · 2026-01-08T12:49:07Z

Hey @ankitlade12

Thank you so much for the quick turnaround! I added some comments with a few nitpick requests and more specificity for the tests.

In addition, we should add the GeoDistanceTransformer to this file: https://github.com/feature-engine/feature_engine/blob/main/tests/test_creation/test_check_estimator_creation.py#L25

Make sure to set validate_ranges to false, because the tests run with some arbitrary hard-coded X from sklearn which values may be out of the range.

Could you please also remove the other transformers and make different PRs for them? There is to much going on here and it makes the review process harder.

After this, we would be ready to merge the GeoDistanceTransformer.

Thank you!

- Replace RandomForestRegressor with LinearRegression in docs (forests don't need scaling) - Add pytest parametrize for test_same_location_zero_distance (all methods and units) - Make test_euclidean_method and test_manhattan_method more specific with expected values - Make test_multiple_rows use pd.testing.assert_frame_equal with expected DataFrame - Parametrize all validation tests (invalid_method, invalid_output_unit, lat/lon range, validate_ranges) - Add fixtures for commonly used DataFrames - Test exact feature names in get_feature_names_out - Add test_geo_distance_transformer_in_pipeline to test_check_estimator_creation.py

…lity

ankitlade12 · 2026-01-08T15:59:07Z

Hey @ankitlade12

Thank you so much for the quick turnaround! I added some comments with a few nitpick requests and more specificity for the tests.

In addition, we should add the GeoDistanceTransformer to this file: https://github.com/feature-engine/feature_engine/blob/main/tests/test_creation/test_check_estimator_creation.py#L25

Make sure to set validate_ranges to false, because the tests run with some arbitrary hard-coded X from sklearn which values may be out of the range.

Could you please also remove the other transformers and make different PRs for them? There is to much going on here and it makes the review process harder.

After this, we would be ready to merge the GeoDistanceTransformer.

Thank you!

ankitlade12 · 2026-01-08T16:00:42Z

Hi @solegalli ,

Thank you for the detailed review! I've addressed all your feedback:

Changes Made

Documentation: Replaced RandomForestRegressor with LinearRegression in the pipeline example
Tests: Added @pytest.mark.parametrize for all validation tests, specific expected values, DataFrame fixtures, and exact feature name assertions
Estimator check: Added test_geo_distance_transformer_in_pipeline to test_check_estimator_creation.py (note: GeoDistanceTransformer requires 4 named columns so it can't work with sklearn's generic check_estimator, but the pipeline test verifies it works correctly)
Separate PRs: Created individual branches and PRs for each transformer

New Branches and PRs

Branch	Transformer
add-geodistance-transformer	GeoDistanceTransformer only
add-arcsinh-transformer	ArcSinhTransformer
add-text-features	TextFeatures

I have created PRs for all three branches. Please ignore the old branch add-arcsinh-text-geodistance-transformers - it contains all three transformers mixed together.

The new add-geodistance-transformer branch contains only the GeoDistanceTransformer with all your requested changes.

Thank you!

solegalli · 2026-01-11T14:21:04Z

Now split into:

#881

#880

#879

ankitlade12 added 6 commits December 26, 2025 12:43

Fix flake8 W391: remove trailing blank lines

965415d

Fix Sphinx docstring: escape feature_names_in_ with backticks

a9480aa

Fix mypy type errors in geo_features.py

b26564e

Fix RST doc: remove pipe characters that cause substitution error

a06b350

Fix docstring: remove pipe characters from arcsinh.py

c9190e8

solegalli reviewed Dec 31, 2025

View reviewed changes

ankitlade12 added 2 commits January 1, 2026 09:36

Fix flake8 line length error (E501)

227808d

solegalli linked an issue Jan 8, 2026 that may be closed by this pull request

Geo distance transformer #688

Open

solegalli reviewed Jan 8, 2026

View reviewed changes

tests/test_creation/test_geo_features.py Outdated

)

def test_invalid_output_unit_raises_error():

Copy link

Collaborator

solegalli Jan 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as for previous test.

solegalli reviewed Jan 8, 2026

View reviewed changes

ankitlade12 added 2 commits January 8, 2026 09:41

Clean up test file: remove section header comments for production qua…

7cfdc53

…lity

		assert 2400 < X_tr["geo_distance"].iloc[0] < 2500


		def test_same_location_zero_distance():

		assert all(X_tr["geo_distance"] > 0)


		def test_invalid_method_raises_error():

		transformer.fit(X)


		def test_invalid_latitude_range_raises_error():

		transformer.fit(X)


		def test_invalid_longitude_range_raises_error():

		assert "geo_distance" in X_tr.columns


		def test_validate_ranges_parameter_validation():

Uh oh!

Add ArcSinhTransformer, TextFeatures, and GeoDistanceTransformer #875

Are you sure you want to change the base?

Add ArcSinhTransformer, TextFeatures, and GeoDistanceTransformer #875

Uh oh!

Conversation

ankitlade12 commented Dec 26, 2025 • edited by solegalli Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

New Transformers

1. ArcSinhTransformer (transformation module)

2. TextFeatures (NEW text module)

3. GeoDistanceTransformer (creation module)

What's Included

Files Added/Modified

CI Status Note

Uh oh!

solegalli left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ankitlade12 commented Jan 1, 2026

Uh oh!

ankitlade12 commented Jan 1, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

solegalli commented Jan 8, 2026

Uh oh!

ankitlade12 commented Jan 8, 2026

Uh oh!

ankitlade12 commented Jan 8, 2026

Uh oh!

solegalli commented Jan 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ankitlade12 commented Dec 26, 2025 •

edited by solegalli

Loading