Add TextFeatures transformer for text feature extraction #880
base: main
Conversation
ankitlade12 commented on Jan 8, 2026
- Add TextFeatures class to extract features from text columns
- Support for features: char_count, word_count, digit_count, uppercase_count, etc.
- Add comprehensive tests with pytest parametrize
- Add user guide documentation
solegalli left a comment
Hi @ankitlade12
Thanks a lot!
Function-wise, I'd say this transformer is ready. I made a few suggestions on how to optimize the feature creation functions. Let me know if they make sense.
Other than that, we need the various docs files and we'll be good to go :)
Thanks again!
TEXT_FEATURES = {
    "char_count": lambda x: x.str.len(),
    "word_count": lambda x: x.str.split().str.len(),
    "sentence_count": lambda x: x.str.count(r"[.!?]+"),
this one is counting punctuation as a proxy for sentence count? did I get it right?
Yes, that's correct! It counts sentence-ending punctuation (., !, ?) as a proxy for sentence count. This is a simple heuristic that works well for most common text. It won't handle edge cases like abbreviations (e.g., 'Dr.', 'U.S.') or text without punctuation, but it's a reasonable approximation for basic text analysis.
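Both the heuristic and the edge cases mentioned can be illustrated with a standalone sketch (not part of the PR):

```python
import pandas as pd

s = pd.Series([
    "Hello. How are you?",         # two real sentences -> counted as 2
    "Dr. Smith visited the U.S.",  # one sentence, but abbreviations inflate the count -> 3
    "no punctuation at all",       # one sentence, but no terminator -> 0
])
sentence_count = s.str.count(r"[.!?]+")
print(sentence_count.tolist())  # [2, 3, 0]
```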
word counts, sentence counts, and various ratios and indicators.

A list of variables can be passed as an argument. Alternatively, the transformer
will automatically select and transform all variables of type object (string).
This makes sense for compatibility with our other classes; however, it might be a disaster for less experienced users who pass the transformer the entire dataset without a second thought.
I'm not sure what's best here: we could force the user to pass one or more text columns by not giving this parameter a default value. Or we could select only variables that actually contain text, perhaps by choosing those whose text lengths exceed a certain value (we'd need a separate function).
Thoughts?
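The length-based selection idea could look something like this. A hypothetical sketch: the helper name `find_text_variables` and the `min_avg_length` threshold are illustrative, not part of the PR:

```python
import pandas as pd

def find_text_variables(X: pd.DataFrame, min_avg_length: float = 20.0) -> list:
    """Hypothetical heuristic: keep object columns whose mean string length
    exceeds a threshold, i.e. columns that look like free text rather than
    short categorical labels."""
    candidates = X.select_dtypes(include="object").columns
    return [
        col for col in candidates
        if X[col].dropna().astype(str).str.len().mean() > min_avg_length
    ]

df = pd.DataFrame({
    "review": ["Great product, would absolutely buy it again!",
               "Arrived broken and support never answered my emails."],
    "color": ["red", "blue"],  # short categorical labels, filtered out
})
print(find_text_variables(df))  # ['review']
```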
Thanks for the feedback! I kept it consistent with other transformers in the library (like the encoders), which also default to auto-selecting object columns.
I agree there's a risk for less experienced users. Would you prefer one of these approaches?
- Keep the current behavior for consistency
- Make variables a required parameter
- Emit a UserWarning when auto-selecting multiple columns
Let me know which you'd prefer and I'll implement it!
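The UserWarning option could be sketched as follows; the helper `select_text_variables` is hypothetical and not part of the PR:

```python
import warnings
import pandas as pd

def select_text_variables(X: pd.DataFrame, variables=None):
    """Hypothetical sketch: auto-select object columns when `variables`
    is None, but warn when more than one is picked up implicitly."""
    if variables is None:
        variables = X.select_dtypes(include="object").columns.tolist()
        if len(variables) > 1:
            warnings.warn(
                f"Auto-selected {len(variables)} object columns {variables}; "
                f"pass `variables` explicitly to silence this warning.",
                UserWarning,
            )
    return variables
```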
X = check_X(X)

# Find or validate text variables
if self.variables is None:
If we were to stick to selecting all object variables, we have a function for this already. Check how it is done in the encoders. I still think that extracting features from all categorical variables is massive overkill. We need to think about what's best.
I kept the current behavior (variables=None auto-selects all object columns) for consistency with other Feature-engine transformers like the encoders. However, I'm happy to make variables a required parameter if you prefer a more explicit API for this transformer. What do you think is best for the library?
- Optimize avg_word_length using vectorized char_count / word_count
- Simplify unique_word_count using x.str.lower().str.split().apply(set).str.len()
- Rename unique_word_ratio to lexical_diversity (word_count / unique_word_count)
- Use _check_variables_input_value for variable validation
- Use find_categorical_variables for automatic variable selection
- Remove redundant docstring text
- Add comprehensive test assertions with expected values
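The optimizations listed above boil down to replacing per-row applies with column arithmetic. A minimal sketch, assuming only pandas:

```python
import pandas as pd

s = pd.Series(["the quick brown fox", "the the the"])

char_count = s.str.len()
word_count = s.str.split().str.len()

# avg_word_length as a vectorized ratio of existing columns
# (note char_count includes spaces, per the feature definitions above)
avg_word_length = char_count / word_count

# unique_word_count via set conversion; .str.len() also works on collections
unique_word_count = s.str.lower().str.split().apply(set).str.len()

# lexical_diversity as defined in the commit message
lexical_diversity = word_count / unique_word_count

print(unique_word_count.tolist())  # [4, 1]
print(lexical_diversity.tolist())  # [1.0, 3.0]
```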