Implement Random Forest Classifier and Regressor from scratch (fixes #13537) #13610
Tejasrahane wants to merge 7 commits into TheAlgorithms:master
Implements Random Forest Classifier with:
- Decision Tree base learners from scratch
- Bootstrap sampling (bagging)
- Random feature selection at splits
- Majority voting aggregation
- Clear docstrings and example usage

Part of implementation for issue TheAlgorithms#13537

- Implemented DecisionTreeRegressor with MSE-based splitting
- Implemented RandomForestRegressor with bootstrap aggregating
- Added comprehensive docstrings and examples
- Includes doctest and demo usage with sklearn metrics
- Completes issue TheAlgorithms#13537 alongside the classifier implementation
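For readers skimming the commit messages, the three forest-level ingredients they name (bootstrap sampling, random feature selection, majority voting) can be sketched roughly as below. This is only an illustration, not the PR's code: the function names `bootstrap_sample`, `random_feature_subset`, and `majority_vote` are hypothetical, and class labels are assumed to be non-negative integers.

```python
import numpy as np


def bootstrap_sample(
    features: np.ndarray, labels: np.ndarray, rng: np.random.Generator
) -> tuple[np.ndarray, np.ndarray]:
    """Draw n rows with replacement (bagging). Hypothetical helper."""
    n_samples = features.shape[0]
    indices = rng.integers(0, n_samples, size=n_samples)
    return features[indices], labels[indices]


def random_feature_subset(
    n_features: int, max_features: int, rng: np.random.Generator
) -> np.ndarray:
    """Pick max_features distinct column indices. Hypothetical helper."""
    return rng.choice(n_features, size=max_features, replace=False)


def majority_vote(tree_predictions: np.ndarray) -> np.ndarray:
    """
    Aggregate an (n_trees, n_samples) array of integer labels by taking
    the most frequent label for each sample. Assumes labels are >= 0.
    """
    votes = np.asarray(tree_predictions)
    return np.array([np.bincount(column).argmax() for column in votes.T])
```

Each tree would be trained on one bootstrap sample, and `majority_vote` would combine the per-tree predictions at prediction time.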
Automated review generated by algorithms-keeper. If there's any problem regarding this review, please open an issue about it.
algorithms-keeper commands and options
algorithms-keeper actions can be triggered by commenting on this PR:
- @algorithms-keeper review to trigger the checks for only added pull request files
- @algorithms-keeper review-all to trigger the checks for all the pull request files, including the modified files. As we cannot post review comments on lines not part of the diff, this command will post all the messages in one comment.

NOTE: Commands are in beta and so this feature is restricted only to a member or owner of the organization.
        tree: The built tree structure
    """

    def __init__(self, max_depth=10, min_samples_split=2, n_features=None):
Please provide return type hint for the function: __init__. If the function does not return a value, please provide the type hint as: def function() -> None:
Please provide type hint for the parameter: max_depth
Please provide type hint for the parameter: min_samples_split
Please provide type hint for the parameter: n_features
        self.n_features = n_features
        self.tree = None

    def fit(self, X, y):
As there is no test file in this pull request nor any test function or class in the file machine_learning/random_forest_classifier.py, please provide doctest for the function fit
Please provide return type hint for the function: fit. If the function does not return a value, please provide the type hint as: def function() -> None:
Please provide descriptive name for the parameter: X
Please provide type hint for the parameter: X
Please provide descriptive name for the parameter: y
Please provide type hint for the parameter: y
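As a rough illustration of what these requests amount to (descriptive parameter names, type hints, an explicit `-> None`, and a doctest), here is a minimal sketch. `MajorityClassifier` is a hypothetical stand-in, not a class from this PR, and the names `features` and `labels` are just one possible choice of descriptive replacements for `X` and `y`.

```python
import doctest
from typing import Optional

import numpy as np


class MajorityClassifier:
    """Hypothetical toy estimator that remembers the most frequent label."""

    def __init__(self) -> None:
        self.majority_label: Optional[int] = None

    def fit(self, features: np.ndarray, labels: np.ndarray) -> None:
        """
        Store the most frequent label seen during training.

        >>> model = MajorityClassifier()
        >>> model.fit(np.array([[0.0], [1.0], [2.0]]), np.array([1, 1, 0]))
        >>> model.majority_label
        1
        """
        if features.shape[0] != labels.shape[0]:
            raise ValueError("features and labels must have the same length")
        # np.bincount assumes non-negative integer labels.
        self.majority_label = int(np.bincount(labels).argmax())


if __name__ == "__main__":
    doctest.testmod()
```

The doctest doubles as the usage example, so no separate test file is needed to satisfy the check.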
        self.n_features = X.shape[1] if not self.n_features else min(self.n_features, X.shape[1])
        self.tree = self._grow_tree(X, y)

    def _grow_tree(self, X, y, depth=0):
As there is no test file in this pull request nor any test function or class in the file machine_learning/random_forest_classifier.py, please provide doctest for the function _grow_tree
Please provide return type hint for the function: _grow_tree. If the function does not return a value, please provide the type hint as: def function() -> None:
Please provide descriptive name for the parameter: X
Please provide type hint for the parameter: X
Please provide descriptive name for the parameter: y
Please provide type hint for the parameter: y
Please provide type hint for the parameter: depth
            'right': right
        }

    def _best_split(self, X, y, feat_idxs):
As there is no test file in this pull request nor any test function or class in the file machine_learning/random_forest_classifier.py, please provide doctest for the function _best_split
Please provide return type hint for the function: _best_split. If the function does not return a value, please provide the type hint as: def function() -> None:
Please provide descriptive name for the parameter: X
Please provide type hint for the parameter: X
Please provide descriptive name for the parameter: y
Please provide type hint for the parameter: y
Please provide type hint for the parameter: feat_idxs
        split_idx, split_thresh = None, None

        for feat_idx in feat_idxs:
            X_column = X[:, feat_idx]
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_column
        for _ in range(self.n_estimators):
            # Bootstrap sampling
            indices = np.random.choice(n_samples, n_samples, replace=True)
            X_bootstrap = X[indices]
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_bootstrap
            feature_indices = np.random.choice(
                n_features, max_features, replace=False
            )
            X_bootstrap = X_bootstrap[:, feature_indices]
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_bootstrap
        return self

    def predict(self, X):
Please provide return type hint for the function: predict. If the function does not return a value, please provide the type hint as: def function() -> None:
Please provide descriptive name for the parameter: X
Please provide type hint for the parameter: X
        predictions = []

        for tree, feature_indices in self.trees:
            X_subset = X[:, feature_indices]
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_subset
    )

    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_train
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_test
for more information, see https://pre-commit.ci
        tree: The built tree structure
    """

    def __init__(self, max_depth=10, min_samples_split=2, n_features=None):
Please provide return type hint for the function: __init__. If the function does not return a value, please provide the type hint as: def function() -> None:
Please provide type hint for the parameter: max_depth
Please provide type hint for the parameter: min_samples_split
Please provide type hint for the parameter: n_features
        self.n_features = n_features
        self.tree = None

    def fit(self, X, y):
Please provide return type hint for the function: fit. If the function does not return a value, please provide the type hint as: def function() -> None:
As there is no test file in this pull request nor any test function or class in the file machine_learning/random_forest_classifier.py, please provide doctest for the function fit
Please provide type hint for the parameter: X
Please provide descriptive name for the parameter: X
Please provide type hint for the parameter: y
Please provide descriptive name for the parameter: y
        )
        self.tree = self._grow_tree(X, y)

    def _grow_tree(self, X, y, depth=0):
Please provide return type hint for the function: _grow_tree. If the function does not return a value, please provide the type hint as: def function() -> None:
As there is no test file in this pull request nor any test function or class in the file machine_learning/random_forest_classifier.py, please provide doctest for the function _grow_tree
Please provide type hint for the parameter: X
Please provide descriptive name for the parameter: X
Please provide type hint for the parameter: y
Please provide descriptive name for the parameter: y
Please provide type hint for the parameter: depth
            "right": right,
        }

    def _best_split(self, X, y, feat_idxs):
Please provide return type hint for the function: _best_split. If the function does not return a value, please provide the type hint as: def function() -> None:
As there is no test file in this pull request nor any test function or class in the file machine_learning/random_forest_classifier.py, please provide doctest for the function _best_split
Please provide type hint for the parameter: X
Please provide descriptive name for the parameter: X
Please provide type hint for the parameter: y
Please provide descriptive name for the parameter: y
Please provide type hint for the parameter: feat_idxs
        split_idx, split_thresh = None, None

        for feat_idx in feat_idxs:
            X_column = X[:, feat_idx]
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_column
        for _ in range(self.n_estimators):
            # Bootstrap sampling
            indices = np.random.choice(n_samples, n_samples, replace=True)
            X_bootstrap = X[indices]
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_bootstrap
            # Feature sampling
            feature_indices = np.random.choice(n_features, max_features, replace=False)
            X_bootstrap = X_bootstrap[:, feature_indices]
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_bootstrap
        return self

    def predict(self, X):
Please provide return type hint for the function: predict. If the function does not return a value, please provide the type hint as: def function() -> None:
Please provide type hint for the parameter: X
Please provide descriptive name for the parameter: X
        predictions = []

        for tree, feature_indices in self.trees:
            X_subset = X[:, feature_indices]
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_subset
    )

    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_train
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_test
…dule
- Annotate all function parameters and return types
- Rename variables to snake_case (x_column, x_bootstrap, x_subset, x_train/x_test)
- Add/expand doctests for public and core internal functions
- Address algorithms-keeper review comments
        self.n_features: Optional[int] = n_features
        self.tree: Optional[TreeNode] = None

    def fit(self, x: np.ndarray, y: np.ndarray) -> None:
Please provide descriptive name for the parameter: x
Please provide descriptive name for the parameter: y
        )
        self.tree = self._grow_tree(x, y, depth=0)

    def _grow_tree(self, x: np.ndarray, y: np.ndarray, depth: int = 0) -> TreeNode:
Please provide descriptive name for the parameter: x
Please provide descriptive name for the parameter: y
        }

    def _best_split(
        self, x: np.ndarray, y: np.ndarray, feat_indices: Sequence[int]
Please provide descriptive name for the parameter: x
Please provide descriptive name for the parameter: y
                    split_thresh = float(threshold)
        return split_idx, split_thresh

    def _information_gain(self, y: np.ndarray, x_column: np.ndarray, threshold: float) -> float:
Please provide descriptive name for the parameter: y
        ig = parent_entropy - child_entropy
        return float(ig)

    def _entropy(self, y: np.ndarray) -> float:
Please provide descriptive name for the parameter: y
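The `_information_gain` and `_entropy` bodies are not visible in the diff context above, so the following is only a sketch of how entropy-based gain is commonly computed, consistent with the visible line `ig = parent_entropy - child_entropy`. The standalone function names, the `<=` threshold convention, and the assumption of non-negative integer labels are all illustrative choices, not taken from the PR.

```python
import numpy as np


def entropy(labels: np.ndarray) -> float:
    """Shannon entropy (in bits) of non-negative integer class labels."""
    counts = np.bincount(labels)
    probabilities = counts[counts > 0] / labels.size
    return float(-np.sum(probabilities * np.log2(probabilities)))


def information_gain(
    labels: np.ndarray, feature_column: np.ndarray, threshold: float
) -> float:
    """Parent entropy minus the size-weighted entropy of the two children."""
    left_mask = feature_column <= threshold  # assumed split convention
    n_left = int(left_mask.sum())
    n_right = labels.size - n_left
    if n_left == 0 or n_right == 0:
        return 0.0  # a split that leaves one side empty gains nothing
    child_entropy = (n_left / labels.size) * entropy(labels[left_mask]) + (
        n_right / labels.size
    ) * entropy(labels[~left_mask])
    return entropy(labels) - child_entropy
```

With descriptive names such as `labels` and `feature_column`, the bot's naming complaint about `y` would also be addressed.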
        for _ in range(self.n_estimators):
            # Bootstrap sampling
            indices = np.random.choice(n_samples, n_samples, replace=True)
            X_bootstrap = X[indices]
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_bootstrap
            # Feature sampling
            feature_indices = np.random.choice(n_features, max_features, replace=False)
            X_bootstrap = X_bootstrap[:, feature_indices]
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_bootstrap
        return self

    def predict(self, X):
Please provide return type hint for the function: predict. If the function does not return a value, please provide the type hint as: def function() -> None:
Please provide type hint for the parameter: X
Please provide descriptive name for the parameter: X
        predictions = []

        for tree, feature_indices in self.trees:
            X_subset = X[:, feature_indices]
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_subset
    )

    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_train
Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_test
for more information, see https://pre-commit.ci
- Annotate all parameters and return types across tree and forest
- Rename variables to snake_case (x_bootstrap, x_subset, etc.)
- Add doctests for predict, _best_split, _calculate_mse, and class examples
- Replace RNG usage with numpy Generator for determinism
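The last item, switching to a numpy Generator for determinism, might look roughly like the sketch below. The helper name `bootstrap_indices` and its `seed` parameter are hypothetical; the earlier review snippets only show the legacy `np.random.choice` calls being replaced.

```python
from typing import Optional

import numpy as np


def bootstrap_indices(n_samples: int, seed: Optional[int] = None) -> np.ndarray:
    """
    Return n_samples row indices drawn with replacement.

    A fixed seed gives the same indices on every call, which is what
    makes doctests on randomised code reproducible.
    """
    rng = np.random.default_rng(seed)  # local Generator, no global state
    return rng.integers(0, n_samples, size=n_samples)
```

Because the Generator is created locally, two forests built with the same seed draw the same bootstrap samples without touching numpy's global random state.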
        self.n_features: Optional[int] = n_features
        self.tree: Optional[TreeNode] = None

    def fit(self, x: np.ndarray, y: np.ndarray) -> None:
Please provide descriptive name for the parameter: x
Please provide descriptive name for the parameter: y
        )
        self.tree = self._grow_tree(x, y, depth=0)

    def _grow_tree(self, x: np.ndarray, y: np.ndarray, depth: int = 0) -> TreeNode:
Please provide descriptive name for the parameter: x
Please provide descriptive name for the parameter: y
        }

    def _best_split(
        self, x: np.ndarray, y: np.ndarray, feat_indices: Sequence[int]
Please provide descriptive name for the parameter: x
Please provide descriptive name for the parameter: y
        return split_idx, split_thresh

    def _information_gain(
        self, y: np.ndarray, x_column: np.ndarray, threshold: float
Please provide descriptive name for the parameter: y
        ig = parent_entropy - child_entropy
        return float(ig)

    def _entropy(self, y: np.ndarray) -> float:
Please provide descriptive name for the parameter: y
        self.tree = self._grow_tree(x, y)
        return self

    def _grow_tree(self, x: np.ndarray, y: np.ndarray, depth: int = 0) -> TreeNodeReg:
Please provide descriptive name for the parameter: x
Please provide descriptive name for the parameter: y
            "right": right_subtree,
        }

    def _best_split(self, x: np.ndarray, y: np.ndarray, n_features: int) -> Optional[Dict[str, Any]]:
Please provide descriptive name for the parameter: x
Please provide descriptive name for the parameter: y
        mse_right = float(np.var(right_y)) if n_right > 0 else 0.0
        return (n_left / n_samples) * mse_left + (n_right / n_samples) * mse_right

    def predict(self, x: np.ndarray) -> np.ndarray:
Please provide descriptive name for the parameter: x
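The two context lines ending the MSE helper above suggest a size-weighted variance of the child nodes. A self-contained sketch of that calculation follows; the function name `weighted_child_mse` and its parameter names are hypothetical, and only the final two statements mirror what is visible in the diff.

```python
import numpy as np


def weighted_child_mse(left_targets: np.ndarray, right_targets: np.ndarray) -> float:
    """
    Size-weighted mean squared error of a candidate split.

    The MSE of a node around its own mean equals the variance of its
    targets, so np.var is used for each child.
    """
    n_left, n_right = left_targets.size, right_targets.size
    n_samples = n_left + n_right
    if n_samples == 0:
        return 0.0
    mse_left = float(np.var(left_targets)) if n_left > 0 else 0.0
    mse_right = float(np.var(right_targets)) if n_right > 0 else 0.0
    return (n_left / n_samples) * mse_left + (n_right / n_samples) * mse_right
```

A regressor tree would evaluate this for every candidate threshold and keep the split with the lowest value.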
        self.random_state: Optional[int] = random_state
        self.trees: List[Tuple[DecisionTreeRegressor, np.ndarray]] = []

    def fit(self, x: np.ndarray, y: np.ndarray) -> "RandomForestRegressor":
Please provide descriptive name for the parameter: x
Please provide descriptive name for the parameter: y
            self.trees.append((tree, feature_indices))
        return self

    def predict(self, x: np.ndarray) -> np.ndarray:
Please provide descriptive name for the parameter: x
for more information, see https://pre-commit.ci
Describe your change:
Description:
This PR implements Random Forest Classifier and Regressor from scratch as requested in issue #13537.
Classifier Implementation (random_forest_classifier.py):
Regressor Implementation (random_forest_regressor.py):
Both implementations are built from scratch without using sklearn's ensemble models, only using numpy for numerical operations and sklearn for demo/testing purposes.
Checklist:
Fixes #13537
References: