Add TextFeatures transformer for text feature extraction #880
base: main
Conversation
ankitlade12 commented on Jan 8, 2026
- Add TextFeatures class to extract features from text columns
- Support for features: char_count, word_count, digit_count, uppercase_count, etc.
- Add comprehensive tests with pytest parametrize
- Add user guide documentation
solegalli left a comment
Hi @ankitlade12
Thanks a lot!
Function-wise, I'd say this transformer is ready. I made a few suggestions on how to optimize the feature creation functions. Let me know if they make sense.
Other than that, we need the various docs files and we'll be good to go :)
Thanks again!
TEXT_FEATURES = {
    "char_count": lambda x: x.str.len(),
    "word_count": lambda x: x.str.split().str.len(),
    "sentence_count": lambda x: x.str.count(r"[.!?]+"),
this one is counting punctuation as a proxy for sentence count? did I get it right?
Yes, that's correct! It counts sentence-ending punctuation (., !, ?) as a proxy for sentence count. This is a simple heuristic that works well for most common text. It won't handle edge cases like abbreviations (e.g., 'Dr.', 'U.S.') or text without punctuation, but it's a reasonable approximation for basic text analysis.
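Both the heuristic and the edge cases mentioned can be illustrated with a standalone sketch (not part of the PR):

```python
import pandas as pd

s = pd.Series([
    "Hello. How are you?",         # two real sentences -> counted as 2
    "Dr. Smith visited the U.S.",  # one sentence, but abbreviations inflate the count -> 3
    "no punctuation at all",       # one sentence, but no terminator -> 0
])
sentence_count = s.str.count(r"[.!?]+")
print(sentence_count.tolist())  # [2, 3, 0]
```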
word counts, sentence counts, and various ratios and indicators.

A list of variables can be passed as an argument. Alternatively, the transformer
will automatically select and transform all variables of type object (string).
This makes sense for compatibility with our other classes; however, it might be a disaster for less experienced users who pass the transformer the entire dataset without a second thought.
I'm not sure what's best here: we could force the user to pass one or more text columns by not giving this parameter a default value. Or we could select only variables that actually contain text, perhaps by choosing those whose text lengths exceed a certain value (we'd need a separate function).
Thoughts?
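The length-based selection idea could look something like this. A hypothetical sketch: the helper name `find_text_variables` and the `min_avg_length` threshold are illustrative, not part of the PR:

```python
import pandas as pd

def find_text_variables(X: pd.DataFrame, min_avg_length: float = 20.0) -> list:
    """Hypothetical heuristic: keep object columns whose mean string length
    exceeds a threshold, i.e. columns that look like free text rather than
    short categorical labels."""
    candidates = X.select_dtypes(include="object").columns
    return [
        col for col in candidates
        if X[col].dropna().astype(str).str.len().mean() > min_avg_length
    ]

df = pd.DataFrame({
    "review": ["Great product, would absolutely buy it again!",
               "Arrived broken and support never answered my emails."],
    "color": ["red", "blue"],  # short categorical labels, filtered out
})
print(find_text_variables(df))  # ['review']
```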
Thanks for the feedback! I kept it consistent with other transformers in the library (like the encoders), which also default to auto-selecting object columns.
I agree there's a risk for less experienced users. Would you prefer one of these approaches?
- Keep the current behavior for consistency
- Make variables a required parameter
- Emit a UserWarning when auto-selecting multiple columns
Let me know which you'd prefer and I'll implement it!
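The UserWarning option could be sketched as follows; the helper `select_text_variables` is hypothetical and not part of the PR:

```python
import warnings
import pandas as pd

def select_text_variables(X: pd.DataFrame, variables=None):
    """Hypothetical sketch: auto-select object columns when `variables`
    is None, but warn when more than one is picked up implicitly."""
    if variables is None:
        variables = X.select_dtypes(include="object").columns.tolist()
        if len(variables) > 1:
            warnings.warn(
                f"Auto-selected {len(variables)} object columns {variables}; "
                f"pass `variables` explicitly to silence this warning.",
                UserWarning,
            )
    return variables
```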
X = check_X(X)

# Find or validate text variables
if self.variables is None:
If we were to stick to selecting all object variables, we have a function for this already. Check how it is done in the encoders. I still think that extracting features from all categorical variables is massive overkill. We need to think about what's best.
I kept the current behavior (variables=None auto-selects all object columns) for consistency with other Feature-engine transformers like the encoders. However, I'm happy to make variables a required parameter if you prefer a more explicit API for this transformer. What do you think is best for the library?
- Optimize avg_word_length using vectorized char_count / word_count
- Simplify unique_word_count using x.str.lower().str.split().apply(set).str.len()
- Rename unique_word_ratio to lexical_diversity (word_count / unique_word_count)
- Use _check_variables_input_value for variable validation
- Use find_categorical_variables for automatic variable selection
- Remove redundant docstring text
- Add comprehensive test assertions with expected values
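The optimizations listed above boil down to replacing per-row applies with column arithmetic. A minimal sketch, assuming only pandas:

```python
import pandas as pd

s = pd.Series(["the quick brown fox", "the the the"])

char_count = s.str.len()
word_count = s.str.split().str.len()

# avg_word_length as a vectorized ratio of existing columns
# (note char_count includes spaces, per the feature definitions above)
avg_word_length = char_count / word_count

# unique_word_count via set conversion; .str.len() also works on collections
unique_word_count = s.str.lower().str.split().apply(set).str.len()

# lexical_diversity as defined in the commit message
lexical_diversity = word_count / unique_word_count

print(unique_word_count.tolist())  # [4, 1]
print(lexical_diversity.tolist())  # [1.0, 3.0]
```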