Add ArcSinhTransformer for inverse hyperbolic sine transformation #879
Open: ankitlade12 wants to merge 22 commits into feature-engine:main from ankitlade12:add-arcsinh-transformer
d775c62 Add ArcSinhTransformer for inverse hyperbolic sine transformation (ankitlade12)
841e30c Enhance ArcSinhTransformer docs: add to index/README, improve user gu… (ankitlade12)
6022a24 Add ArcSinhTransformer to standard estimator checks (ankitlade12)
201b032 Fix duplicate LogCpTransformer in estimator checks (ankitlade12)
0cb0023 Docs: Remove leading 'The' from ArcSinhTransformer references (ankitlade12)
bf44266 Docs: Remove 'the' before LogTransformer reference (ankitlade12)
b9d92d5 Docs: Rename Example section to Python demo (ankitlade12)
246cd2c Docs: Standardize section underlines to '---' (ankitlade12)
0c44f40 Docs: Add dataframe output to ArcSinhTransformer python demo (ankitlade12)
cc95d3a Docs: Update transformer setup text in ArcSinhTransformer demo (ankitlade12)
aeeddad Docs: Add commas around 'however' for grammar (ankitlade12)
e3e6440 Docs: Add transformed dataframe output to ArcSinhTransformer demo (ankitlade12)
a8d880a Docs: Clarify intro text for plotting code (ankitlade12)
8fad3b8 Docs: Add histogram plot image to ArcSinhTransformer guide (ankitlade12)
9fa7698 Docs: Replace np.allclose with dataframe output in inverse transform … (ankitlade12)
6a8fc64 Docs: Remove API Reference from User Guide (exists in api_doc) (ankitlade12)
bdcb271 Docstring: Clarify linear behavior of arcsinh for small x (ankitlade12)
73b9ef1 Docstring: Remove redundant 'does not learn parameters' sentence from… (ankitlade12)
f055b05 Tests: Add explicit value assertions for negative values in ArcSinh test (ankitlade12)
3f17f04 Tests: Add string and boolean to invalid_scale parameterization (ankitlade12)
3be7004 Docs: Add practical explanation for using loc and scale parameters (ankitlade12)
f990fde Docs: Add ArcSinhTransformer to api_doc index and update description (ankitlade12)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,5 @@ | ||
| ArcSinhTransformer | ||
| ================== | ||
|
|
||
| .. autoclass:: feature_engine.transformation.ArcSinhTransformer | ||
| :members: | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,195 @@ | ||
| .. _arcsinh_transformer: | ||
|
|
||
| .. currentmodule:: feature_engine.transformation | ||
|
|
||
| ArcSinhTransformer | ||
| ================== | ||
|
|
||
| :class:`ArcSinhTransformer()` applies the inverse hyperbolic sine transformation | ||
| (arcsinh) to numerical variables. Also known as the pseudo-logarithm, this | ||
| transformation is useful for data that contains both positive and negative values. | ||
|
|
||
| The transformation is: x → arcsinh((x - loc) / scale) | ||
|
|
||
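As a minimal sketch of the formula above, using only the standard library: `arcsinh_transform` and `arcsinh_closed_form` are illustrative helper names, not part of feature-engine, and the closed form `ln(z + sqrt(z**2 + 1))` is the standard identity for the inverse hyperbolic sine.

```python
import math


def arcsinh_transform(x, loc=0.0, scale=1.0):
    """Apply y = arcsinh((x - loc) / scale) to a single number."""
    return math.asinh((x - loc) / scale)


def arcsinh_closed_form(z):
    """Closed form of the same function: ln(z + sqrt(z**2 + 1))."""
    return math.log(z + math.sqrt(z * z + 1.0))


# The function is defined for negative, zero and positive inputs alike.
for value in (-1000.0, -1.0, 0.0, 1.0, 1000.0):
    print(value, arcsinh_transform(value))
```

With the defaults `loc=0` and `scale=1` this reduces to plain `arcsinh(x)`.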
| Comparison to LogTransformer and ArcsinTransformer | ||
| -------------------------------------------------- | ||
|
|
||
| - **LogTransformer**: ``log(x)`` requires ``x > 0``. If your data contains zeros or negative values, you cannot apply LogTransformer directly; you would need to shift the data first (e.g. with ``LogCpTransformer``) or remove the non-positive values. | ||
| - **ArcsinTransformer**: ``arcsin(sqrt(x))`` is typically used for proportions/ratios bounded between 0 and 1. It is not suitable for general unbounded numerical data. | ||
| - **ArcSinhTransformer**: ``arcsinh(x)`` works for **all real numbers** (positive, negative, and zero). It maps zero to zero (``arcsinh(0) = 0``) and is an odd function, so ``arcsinh(-x) = -arcsinh(x)``. | ||
|
|
||
| When to use ArcSinhTransformer: | ||
|
|
||
| - Your data contains zeros or negative values (e.g., profit/loss, debt, temperature). | ||
| - You want a log-like transformation to stabilize variance or compress extreme values. | ||
| - You don't want to add an arbitrary constant (shift) to make values positive. | ||
|
|
||
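The contrast with the logarithm can be sketched in plain NumPy. This snippet does not use the transformers themselves; it only compares the underlying functions on a small, made-up array.

```python
import numpy as np

values = np.array([-250.0, -1.0, 0.0, 1.0, 250.0])

# log is undefined for x <= 0: NumPy returns nan for negatives and -inf for zero.
with np.errstate(divide="ignore", invalid="ignore"):
    logged = np.log(values)

# arcsinh is defined everywhere and is odd: arcsinh(-x) == -arcsinh(x).
transformed = np.arcsinh(values)

print(logged)
print(transformed)
```

Only the two positive entries survive the logarithm, whereas every entry has a finite arcsinh value.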
| Intuitive Explanation of Parameters | ||
| ----------------------------------- | ||
|
|
||
| The transformation includes optional ``loc`` (location) and ``scale`` parameters: | ||
|
|
||
| .. math:: | ||
| y = \text{arcsinh}\left(\frac{x - \text{loc}}{\text{scale}}\right) | ||
|
|
||
| - **Why scale?** | ||
| The ``arcsinh(x)`` function is approximately linear near zero (for small ``x``) and approximately logarithmic for large absolute values of ``x``. | ||
| The "linear region" is roughly where the scaled value ``(x - loc) / scale`` lies between -1 and 1. | ||
| By adjusting ``scale``, you control which part of your data falls into this linear region versus the logarithmic region. | ||
| - If ``scale`` is large, more of your data falls in the linear region (behavior close to the original data, up to a rescaling). | ||
| - If ``scale`` is small, more of your data falls in the logarithmic region (stronger compression of values). | ||
| Common practice is to set ``scale`` to 1 or to the standard deviation of the variable. | ||
|
|
||
| - **Why loc?** | ||
| The ``loc`` parameter centers the data. The transition from negative logarithmic behavior to positive logarithmic behavior happens around ``x = loc``. | ||
| Common practice is to set ``loc`` to 0 or to the mean of the variable. | ||
|
|
||
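The two regimes described above can be checked numerically. The sketch below uses the standard approximations `arcsinh(z) ≈ z` for small `z` and `arcsinh(z) ≈ ln(2z)` for large positive `z`; `arcsinh_scaled` is an illustrative helper and the value 500 is made up.

```python
import math


def arcsinh_scaled(x, scale):
    """arcsinh of x after dividing by scale (loc is taken as 0 here)."""
    return math.asinh(x / scale)


x = 500.0
for scale in (1.0, 100.0, 10_000.0):
    z = x / scale
    # Small z: output is close to z (linear). Large z: close to ln(2 * z).
    print(f"scale={scale:>8}: z={z:<6} arcsinh={arcsinh_scaled(x, scale):.5f} "
          f"ln(2z)={math.log(2 * z):.5f}")
```

With `scale=10_000` the same observation sits in the linear region (the output is almost exactly `z`), while with `scale=1` it sits deep in the logarithmic region.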
| References | ||
| ---------- | ||
|
|
||
| For more details on the inverse hyperbolic sine transformation: | ||
|
|
||
| 1. `How should I transform non-negative data including zeros? <https://stats.stackexchange.com/questions/1444/how-should-i-transform-non-negative-data-including-zeros>`_ (StackExchange) | ||
| 2. `Interpreting Treatment Effects: Inverse Hyperbolic Sine Outcome Variable <https://blogs.worldbank.org/en/impactevaluations/interpreting-treatment-effects-inverse-hyperbolic-sine-outcome-variable-and>`_ (World Bank Blog) | ||
| 3. `Burbidge, J. B., Magee, L., & Robb, A. L. (1988). Alternative transformations to handle extreme values of the dependent variable. Journal of the American Statistical Association. <https://www.jstor.org/stable/2288929>`_ | ||
|
|
||
| Python demo | ||
| ----------- | ||
|
|
||
| Unlike :class:`LogTransformer()`, :class:`ArcSinhTransformer()` can handle | ||
| zero and negative values without requiring any preprocessing. | ||
|
|
||
| Let's create a dataframe with positive and negative values and apply the arcsinh | ||
| transformation: | ||
|
|
||
| .. code:: python | ||
|
|
||
| import numpy as np | ||
| import pandas as pd | ||
| import matplotlib.pyplot as plt | ||
| from sklearn.model_selection import train_test_split | ||
|
|
||
| from feature_engine.transformation import ArcSinhTransformer | ||
|
|
||
| # Create sample data with positive and negative values | ||
| np.random.seed(42) | ||
| X = pd.DataFrame({ | ||
| 'profit': np.random.randn(1000) * 10000, # std of 10000, so most values fall within about -30000 to 30000 | ||
| 'net_worth': np.random.randn(1000) * 50000, | ||
| }) | ||
|
|
||
| # Separate into train and test | ||
| X_train, X_test = train_test_split(X, test_size=0.3, random_state=0) | ||
|
|
||
| print(X.head()) | ||
|
|
||
| The dataframe contains positive and negative values: | ||
|
|
||
| .. code:: python | ||
|
|
||
| profit net_worth | ||
| 0 4967.141530 69967.771829 | ||
| 1 -1382.643012 46231.684146 | ||
| 2 6476.885381 2981.518496 | ||
| 3 15230.298564 -32346.838885 | ||
| 4 -2341.533747 34911.165681 | ||
|
|
||
| Now let's set up the ArcSinhTransformer and fit it to the training set: | ||
|
|
||
| .. code:: python | ||
|
|
||
| # Set up the arcsinh transformer | ||
| tf = ArcSinhTransformer(variables=['profit', 'net_worth']) | ||
|
|
||
| # Fit the transformer | ||
| tf.fit(X_train) | ||
|
|
||
| The transformer does not learn any parameters when applying the fit method. It does | ||
| check, however, that the variables are numerical. | ||
|
|
||
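To make the fit/transform behaviour described above concrete, here is a rough stand-in written with pandas and NumPy. `SketchArcSinh` is a hypothetical class for illustration only; it is not the implementation added in this PR, and the error type it raises is an assumption.

```python
import numpy as np
import pandas as pd


class SketchArcSinh:
    """Hypothetical stand-in, not the class added in this PR."""

    def __init__(self, variables, loc=0.0, scale=1.0):
        self.variables = list(variables)
        self.loc = loc
        self.scale = scale

    def fit(self, X):
        # Nothing is learned; only check that the selected columns are numeric.
        for col in self.variables:
            if not pd.api.types.is_numeric_dtype(X[col]):
                raise TypeError(f"Column {col!r} is not numerical")
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's dataframe untouched
        X[self.variables] = np.arcsinh((X[self.variables] - self.loc) / self.scale)
        return X


df = pd.DataFrame({"profit": [-1500.0, 0.0, 1500.0], "label": ["a", "b", "c"]})
out = SketchArcSinh(["profit"]).fit(df).transform(df)
print(out)
```

Passing the non-numeric `label` column to `fit` in this sketch raises a `TypeError`.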
| We can now transform the variables: | ||
|
|
||
| .. code:: python | ||
|
|
||
| # Transform the data | ||
| train_t = tf.transform(X_train) | ||
| test_t = tf.transform(X_test) | ||
|
|
||
| print(train_t.head()) | ||
|
|
||
| The dataframe with the transformed variables: | ||
|
|
||
| .. code:: python | ||
|
|
||
| profit net_worth | ||
| 105 8.997273 -11.552056 | ||
| 68 8.886371 -10.753000 | ||
| 479 10.016437 -10.686152 | ||
| 399 10.116836 -11.092693 | ||
| 434 10.310523 -9.723893 | ||
|
|
||
| The arcsinh transformation compresses extreme values while preserving the sign. We can inspect the distribution of the original and transformed variables with histograms: | ||
|
|
||
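The claim that extreme values are compressed while the sign is preserved can also be checked on a handful of numbers. The array below is made up for illustration and is unrelated to the demo dataframe.

```python
import numpy as np

profits = np.array([-1_000_000.0, -1_000.0, -10.0, 10.0, 1_000.0, 1_000_000.0])
compressed = np.arcsinh(profits)

# Signs are unchanged and the ordering of the values is preserved.
same_sign = np.array_equal(np.sign(profits), np.sign(compressed))
still_sorted = bool(np.all(np.diff(compressed) > 0))

# The range shrinks from two million units to roughly 29.
spread_before = profits.max() - profits.min()
spread_after = compressed.max() - compressed.min()

print(compressed)
print(same_sign, still_sorted, spread_before, spread_after)
```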
| .. code:: python | ||
|
|
||
| # Compare original and transformed distributions | ||
| fig, axes = plt.subplots(1, 2, figsize=(12, 4)) | ||
|
|
||
| X_train['profit'].hist(ax=axes[0], bins=50) | ||
| axes[0].set_title('Original profit') | ||
|
|
||
| train_t['profit'].hist(ax=axes[1], bins=50) | ||
| axes[1].set_title('Transformed profit') | ||
|
|
||
| plt.tight_layout() | ||
|
|
||
| .. image:: ../../images/arcsinh_profit_histogram.png | ||
|
|
||
| Using loc and scale parameters | ||
| ------------------------------ | ||
|
|
||
| :class:`ArcSinhTransformer()` supports location and scale parameters to | ||
| center and normalize data before transformation. | ||
|
|
||
| In practice, it is common to standardize the variable (zero mean, unit variance) | ||
| so that the center of the distribution falls in the linear region of the arcsinh | ||
| function, while the tails are compressed logarithmically. We can achieve this | ||
| by setting ``loc`` to the mean and ``scale`` to the standard deviation: | ||
|
|
||
| .. code:: python | ||
|
|
||
| # Center around mean and scale by std | ||
| tf = ArcSinhTransformer( | ||
| variables=['profit'], | ||
| loc=X_train['profit'].mean(), | ||
| scale=X_train['profit'].std() | ||
| ) | ||
|
|
||
| tf.fit(X_train) | ||
| train_t = tf.transform(X_train) | ||
|
|
||
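The effect of choosing `loc` as the mean and `scale` as the standard deviation can be sketched with NumPy alone. The synthetic data and variable names below are assumptions for illustration; `ddof=1` is used so the result matches what pandas' `.std()` returns by default.

```python
import numpy as np

rng = np.random.default_rng(0)
profit = rng.normal(loc=5_000.0, scale=10_000.0, size=1_000)

loc = profit.mean()
scale = profit.std(ddof=1)  # sample standard deviation, as in pandas

# Standardize, then apply arcsinh: the same formula as in the guide.
z = (profit - loc) / scale
transformed = np.arcsinh(z)

# Share of observations inside the roughly linear region |z| <= 1.
share_linear = np.mean(np.abs(z) <= 1.0)
print(share_linear)
```

For roughly normal data about two thirds of the observations land in the linear region, so only the tails are compressed.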
| Inverse transformation | ||
| ---------------------- | ||
|
|
||
| :class:`ArcSinhTransformer()` supports inverse transformation to recover | ||
| the original values: | ||
|
|
||
| .. code:: python | ||
|
|
||
| # Transform and then inverse transform | ||
| train_t = tf.transform(X_train) | ||
| train_recovered = tf.inverse_transform(train_t) | ||
|
|
||
| print(train_recovered.head()) | ||
|
|
||
| The recovered data: | ||
|
|
||
| .. code:: python | ||
|
|
||
| profit net_worth | ||
| 105 4040.508568 -51995.296356 | ||
| 68 3616.360250 -23385.060066 | ||
| 479 11195.749114 -21872.915016 | ||
| 399 12378.163120 -32844.713949 | ||
| 434 15023.570521 -8356.085689 | ||
|
|
||
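Mathematically, the inverse follows from solving `y = arcsinh((x - loc) / scale)` for `x`, which gives `x = sinh(y) * scale + loc`. The functions below are a sketch of that round trip in NumPy, not the transformer's own code, and the `loc` and `scale` values are arbitrary.

```python
import numpy as np


def forward(x, loc=0.0, scale=1.0):
    """y = arcsinh((x - loc) / scale)."""
    return np.arcsinh((x - loc) / scale)


def inverse(y, loc=0.0, scale=1.0):
    """sinh undoes arcsinh; then undo the scaling and the shift."""
    return np.sinh(y) * scale + loc


original = np.array([-51995.296356, -8356.085689, 0.0, 4040.508568, 15023.570521])
recovered = inverse(forward(original, loc=100.0, scale=2500.0), loc=100.0, scale=2500.0)

print(recovered)
```

The recovered values match the originals up to floating-point error, provided the same `loc` and `scale` are used in both directions.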
|
|
||