Skip to content

Conversation

@ankitlade12
Copy link

@ankitlade12 ankitlade12 commented Dec 26, 2025

Summary

This PR adds three new transformers to Feature-engine, expanding the library's capabilities for numerical transformation, text feature extraction, and geospatial data processing.


New Transformers

1. ArcSinhTransformer (transformation module)

Addresses Issue #707
Applies the inverse hyperbolic sine (arcsinh) transformation to numerical variables. Also known as the pseudo-logarithm, this transformation is useful for data that contains both positive and negative values.
Key Features:

  • Handles zero and negative values (unlike LogTransformer)
  • Supports loc and scale parameters for centering and normalization
  • Full inverse_transform support
  • For large |x|, behaves like ln(|x|) + ln(2), providing similar variance-stabilizing properties as log
    Use Cases: Financial data (profit/loss), sensor readings, any metric that can be positive or negative

2. TextFeatures (NEW text module)

Extracts 19 numerical features from text/string columns. Unlike scikit-learn's CountVectorizer or TfidfVectorizer which create sparse matrices, TextFeatures extracts metadata features that remain in DataFrame format.
Available Features:

  • char_count, word_count, sentence_count, avg_word_length
  • digit_count, uppercase_count, lowercase_count, special_char_count
  • whitespace_count, whitespace_ratio, digit_ratio, uppercase_ratio
  • has_digits, has_uppercase, is_empty
  • starts_with_uppercase, ends_with_punctuation
  • unique_word_count, unique_word_ratio
    Use Cases: NLP preprocessing, sentiment analysis feature engineering, text classification

3. GeoDistanceTransformer (creation module)

Addresses Issue #688
Calculates the distance between two geographical coordinate pairs (latitude/longitude) and adds the result as a new feature.
Distance Methods:

  • Haversine: Great-circle distance (default, most accurate)
  • Euclidean: Simple coordinate distance (fast)
  • Manhattan: Taxicab distance (grid approximation)
    Output Units: km, miles, meters, feet
    Validation: Checks latitude range (-90 to 90) and longitude range (-180 to 180)
    Use Cases: Real estate pricing, delivery optimization, ride-sharing, location-based ML

What's Included

  • Full test coverage with 43 passing tests
  • RST documentation for all three transformers
  • Code style compliance verified with flake8, black, and isort
  • NumPy-style docstrings with usage examples
  • All transformers follow Feature-engine's existing patterns and conventions

Files Added/Modified

New Files:

  • feature_engine/transformation/arcsinh.py
  • feature_engine/text/init.py
  • feature_engine/text/text_features.py
  • feature_engine/creation/geo_features.py
  • tests/test_transformation/test_arcsinh.py
  • tests/test_text/init.py
  • tests/test_creation/test_geo_features.py
  • docs/user_guide/transformation/ArcSinhTransformer.rst
  • docs/user_guide/text/index.rst
  • docs/user_guide/text/TextFeatures.rst
  • docs/user_guide/creation/GeoDistanceTransformer.rst

Modified Files:

  • feature_engine/transformation/init.py
  • feature_engine/creation/init.py
  • docs/user_guide/transformation/index.rst
  • docs/user_guide/creation/index.rst
  • docs/user_guide/index.rst

CI Status Note

The test failures in test_feature_engine_py312 and py313 are pre-existing issues in the main branch, not caused by this PR.
I verified this by:

  1. Checking out the upstream main branch (without my changes)
  2. Running pytest tests/test_base_transformers/test_get_feature_names_out_mixin.py
  3. The same 73 tests fail there as well
    All 27 tests for the new transformers pass:
  • test_arcsinh.py: 11 passed
  • test_geo_features.py: 16 passed
  • test_text/init.py: Pass
    The stylechecks, test_type, and test_docs checks should all pass.

New transformers:
- ArcSinhTransformer: Pseudo-log transformation for positive and negative values (addresses feature-engine#707)
- TextFeatures: Extract 19 text features from string columns (new text module)
- GeoDistanceTransformer: Calculate geographic distances using Haversine/Euclidean/Manhattan (addresses feature-engine#688)

Includes:
- Full test coverage (43 tests)
- RST documentation for all transformers
- Code style compliance (flake8, black, isort)
Copy link
Collaborator

@solegalli solegalli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @ankitlade12

Thank you so much for these fantastic additions! They were much needed and long overdue :)

I only reviewed the files relevant for the GeoDistance transformer. The transformer is almost ready to be merged. I have a couple of nitpicks, a question regarding whether we should expose the optionality of a sanity check, and some more nitpicks on the documentation.

Would you be able to take a look at that?

In the ideal world, it would be great to have 1 PR per transformer. Any chance you could distribute the files relevant to each new transformer in different PRs?

If not, we take it from here. Thanks a lot!

In the meantime, I'll try and fix the tests in a separate PR. They are probably caused by the new version of sklearn :_(

- Add validate_ranges parameter to control coordinate validation
- Update docstrings to match library patterns
- Refactor tests to standalone functions (remove class wrapper)
- Add tests for validate_ranges parameter
- Rewrite user guide with proper headings, explanatory text, and outputs
- Create API documentation file
- Add GeoDistanceTransformer to API doc index
@ankitlade12
Copy link
Author

Hey @ankitlade12

Thank you so much for these fantastic additions! They were much needed and long overdue :)

I only reviewed the files relevant for the GeoDistance transformer. The transformer is almost ready to be merged. I have a couple of nitpicks, a question regarding whether we should expose the optionality of a sanity check, and some more nitpicks on the documentation.

Would you be able to take a look at that?

In the ideal world, it would be great to have 1 PR per transformer. Any chance you could distribute the files relevant to each new transformer in different PRs?

If not, we take it from here. Thanks a lot!

In the meantime, I'll try and fix the tests in a separate PR. They are probably caused by the new version of sklearn :_(

@ankitlade12
Copy link
Author

Hi @solegalli,

Thank you for the detailed review! I've addressed all the feedback for the GeoDistanceTransformer:

geo_features.py:

Added
validate_ranges
parameter (default True) to allow users to control coordinate validation
Updated docstrings to match library patterns (removed validation-related sentences)
test_geo_features.py:

Refactored tests from class-based to standalone functions
Added tests for the new
validate_ranges
parameter
User Guide (GeoDistanceTransformer.rst):

Fixed heading levels (- instead of ~ for main sections)
Changed "Example" to "Python Demo"
Updated class references per style guide
Added explanatory text before all code blocks
Split distance method examples with individual outputs
Renamed "Converting to miles" to "Using different output units"
Updated pipeline section with explanatory sentence, fit, and output
Removed API Reference section
API Documentation:

Created
docs/api_doc/creation/GeoDistanceTransformer.rst
Added to
docs/api_doc/creation/index.rst

Regarding separate PRs per transformer - I'll keep them in this single PR for now if that's okay. Let me know if there's anything else that needs adjustment!

@solegalli solegalli linked an issue Jan 8, 2026 that may be closed by this pull request
drop_original=True
)),
('scaler', StandardScaler()),
('regressor', RandomForestRegressor(n_estimators=10, random_state=42))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random forests do not need feature scaling. Could we change for linear regression?

assert 2400 < X_tr["geo_distance"].iloc[0] < 2500


def test_same_location_zero_distance():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add a pytest line to test this for all distances and all metrics. Something like these 2 lines here:

@pytest.mark.parametrize("_varnames", [None, ["var1", "var2"]])

)
X_tr = transformer.fit_transform(X)

assert X_tr["geo_distance"].iloc[0] > 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the expected value here? A note for the future contributor would help as a reminder. In addition, testing that the calculation is bigger than 0 seems like any value that we enter in the df would pass this test. Can we make it more specific?

)
X_tr = transformer.fit_transform(X)

assert X_tr["geo_distance"].iloc[0] > 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as for the previous test. Maybe I am not understanding the test. Are we testing that the functions return the expected values? I think so.


assert len(X_tr["geo_distance"]) == 3
# All distances should be positive
assert all(X_tr["geo_distance"] > 0)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit vague. We normally add the expected df and test that the returned df matches the expected one. I would be also happy to test the values, but it would be great to be more specific regarding what the output should be, one way or another.

assert all(X_tr["geo_distance"] > 0)


def test_invalid_method_raises_error():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we normally add a pytest command here to test more than one value, for example "invalid", but then also a boolean or a number to make sure that those fail as well.

)


def test_invalid_output_unit_raises_error():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as for previous test.

transformer.fit(X)


def test_invalid_latitude_range_raises_error():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a pytest command to test a value greater than 90 and smaller than -90.

transformer.fit(X)


def test_invalid_longitude_range_raises_error():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as for previous test

assert "geo_distance" in X_tr.columns


def test_validate_ranges_parameter_validation():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we'd add a pytest command to test more than one value, for Example "True" and a number or something else. 1 and 0 will pass though, because they are interpreted as booleans

)
transformer.fit(X)

assert hasattr(transformer, "variables_")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

more than the attribute is there, I would test that the attribute contains the expected values.


def test_get_feature_names_out():
"""Test get_feature_names_out returns correct names."""
X = pd.DataFrame(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are using this df twice, maybe instead of hardcoding we could add it as a fixture at the top of the file


feature_names = transformer.get_feature_names_out()
assert "geo_distance" in feature_names
assert len(feature_names) == 6 # 5 original + 1 new
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be better to test that the feature names are the expected ones.

@solegalli
Copy link
Collaborator

Hey @ankitlade12

Thank you so much for the quick turnaround! I added some comments with a few nitpick requests and more specificity for the tests.

In addition, we should add the GeoDistanceTransformer to this file: https://github.com/feature-engine/feature_engine/blob/main/tests/test_creation/test_check_estimator_creation.py#L25

Make sure to set validate_ranges to false, because the tests run with some arbitrary hard-coded X from sklearn which values may be out of the range.

Could you please also remove the other transformers and make different PRs for them? There is to much going on here and it makes the review process harder.

After this, we would be ready to merge the GeoDistanceTransformer.

Thank you!

- Replace RandomForestRegressor with LinearRegression in docs (forests don't need scaling)
- Add pytest parametrize for test_same_location_zero_distance (all methods and units)
- Make test_euclidean_method and test_manhattan_method more specific with expected values
- Make test_multiple_rows use pd.testing.assert_frame_equal with expected DataFrame
- Parametrize all validation tests (invalid_method, invalid_output_unit, lat/lon range, validate_ranges)
- Add fixtures for commonly used DataFrames
- Test exact feature names in get_feature_names_out
- Add test_geo_distance_transformer_in_pipeline to test_check_estimator_creation.py
@ankitlade12
Copy link
Author

Hey @ankitlade12

Thank you so much for the quick turnaround! I added some comments with a few nitpick requests and more specificity for the tests.

In addition, we should add the GeoDistanceTransformer to this file: https://github.com/feature-engine/feature_engine/blob/main/tests/test_creation/test_check_estimator_creation.py#L25

Make sure to set validate_ranges to false, because the tests run with some arbitrary hard-coded X from sklearn which values may be out of the range.

Could you please also remove the other transformers and make different PRs for them? There is to much going on here and it makes the review process harder.

After this, we would be ready to merge the GeoDistanceTransformer.

Thank you!

@ankitlade12
Copy link
Author

Hi @solegalli ,

Thank you for the detailed review! I've addressed all your feedback:

Changes Made

  1. Documentation: Replaced RandomForestRegressor with LinearRegression in the pipeline example
  2. Tests: Added @pytest.mark.parametrize for all validation tests, specific expected values, DataFrame fixtures, and exact feature name assertions
  3. Estimator check: Added test_geo_distance_transformer_in_pipeline to test_check_estimator_creation.py (note: GeoDistanceTransformer requires 4 named columns so it can't work with sklearn's generic check_estimator, but the pipeline test verifies it works correctly)
  4. Separate PRs: Created individual branches and PRs for each transformer

New Branches and PRs

Branch Transformer
add-geodistance-transformer GeoDistanceTransformer only
add-arcsinh-transformer ArcSinhTransformer
add-text-features TextFeatures

I have created PRs for all three branches. Please ignore the old branch add-arcsinh-text-geodistance-transformers - it contains all three transformers mixed together.

The new add-geodistance-transformer branch contains only the GeoDistanceTransformer with all your requested changes.

Thank you!

@solegalli
Copy link
Collaborator

Now split into:

#881

#880

#879

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Geo distance transformer

2 participants