Implement Random Forest Classifier and Regressor from scratch (fixes #13537) #13610

Closed
Tejasrahane wants to merge 7 commits into TheAlgorithms:master from Tejasrahane:random-forest-implementation-13537

Conversation

@Tejasrahane
Contributor

Describe your change:

  • Add an algorithm?
  • Fix a bug or typo in an existing algorithm?
  • Add or change doctests? -- Note: Please avoid changing both code and tests in a single pull request.
  • Documentation change?

Description:

This PR implements Random Forest Classifier and Regressor from scratch as requested in issue #13537.

Classifier Implementation (random_forest_classifier.py):

  • Decision tree classifier using entropy-based information gain
  • Bootstrap sampling (bagging) for ensemble diversity
  • Random feature selection at each split
  • Majority voting for final predictions
  • Comprehensive doctests and examples
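A minimal sketch of the entropy-based information gain described above (names and signatures are illustrative, not necessarily the PR's actual code):

```python
import numpy as np


def entropy(labels: np.ndarray) -> float:
    """Shannon entropy (in bits) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    probabilities = counts / counts.sum()
    return float(-np.sum(probabilities * np.log2(probabilities)))


def information_gain(
    labels: np.ndarray, feature_column: np.ndarray, threshold: float
) -> float:
    """Entropy reduction from splitting on feature_column <= threshold."""
    left = labels[feature_column <= threshold]
    right = labels[feature_column > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0  # degenerate split: no information gained
    n_samples = len(labels)
    child_entropy = (len(left) / n_samples) * entropy(left) + (
        len(right) / n_samples
    ) * entropy(right)
    return entropy(labels) - child_entropy
```

A split that separates the classes perfectly drives the children's entropy to zero, so the gain equals the parent entropy.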

Regressor Implementation (random_forest_regressor.py):

  • Decision tree regressor using MSE/variance-based splitting
  • Bootstrap sampling with feature subsampling
  • Averaging of predictions from all trees
  • Comprehensive doctests and examples
  • Includes demo with sklearn datasets and metrics
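The MSE/variance-based split cost and the prediction averaging listed above can be sketched as follows (a simplified illustration, not the PR's exact code):

```python
import numpy as np


def weighted_mse(left_targets: np.ndarray, right_targets: np.ndarray) -> float:
    """Sample-weighted split cost; for a fixed split, the MSE around each
    child's mean equals that child's variance."""
    n_samples = len(left_targets) + len(right_targets)
    mse_left = float(np.var(left_targets)) if len(left_targets) else 0.0
    mse_right = float(np.var(right_targets)) if len(right_targets) else 0.0
    return (len(left_targets) / n_samples) * mse_left + (
        len(right_targets) / n_samples
    ) * mse_right


def aggregate_regression(per_tree_predictions: np.ndarray) -> np.ndarray:
    """Forest output for regression: mean over trees (axis 0 = trees)."""
    return per_tree_predictions.mean(axis=0)
```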

Both implementations are built from scratch without sklearn's ensemble models; numpy is used for numerical operations, and sklearn only for demo and testing purposes.

Checklist:

  • I have read CONTRIBUTING.md.
  • This pull request is all my own work -- I have not plagiarized.
  • I know that pull requests will not be merged if they fail the automated tests.
  • This PR only changes one algorithm file. To ease review, please open separate PRs for separate algorithms.
  • All new Python files are placed inside an existing directory.
  • All filenames are in all lowercase characters with no spaces or dashes.
  • All functions and variable names follow Python naming conventions.
  • All function parameters and return values are annotated with Python type hints.
  • All functions have doctests that pass the automated testing.
  • All new algorithms include at least one URL that points to Wikipedia or another similar explanation.
  • If this pull request resolves one or more open issues then the description above includes the issue number(s) with a closing keyword: "Fixes #ISSUE-NUMBER".

Fixes #13537

References:

Implements Random Forest Classifier with:
- Decision Tree base learners from scratch
- Bootstrap sampling (bagging)
- Random feature selection at splits
- Majority voting aggregation
- Clear docstrings and example usage

Part of implementation for issue TheAlgorithms#13537
- Implemented DecisionTreeRegressor with MSE-based splitting
- Implemented RandomForestRegressor with bootstrap aggregating
- Added comprehensive docstrings and examples
- Includes doctest and demo usage with sklearn metrics
- Completes issue TheAlgorithms#13537 alongside the classifier implementation
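The majority-voting aggregation mentioned in the commits above can be sketched like this (assuming per-tree predictions are non-negative integer class labels; `majority_vote` is an illustrative name):

```python
import numpy as np


def majority_vote(per_tree_predictions: np.ndarray) -> np.ndarray:
    """Forest output for classification: the most frequent label per sample
    (axis 0 = trees, axis 1 = samples)."""
    n_samples = per_tree_predictions.shape[1]
    return np.array(
        [np.bincount(per_tree_predictions[:, i]).argmax() for i in range(n_samples)]
    )
```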
@algorithms-keeper bot added the require descriptive names (This PR needs descriptive function and/or variable names), require tests (Tests [doctest/unittest/pytest] are required), and require type hints (https://docs.python.org/3/library/typing.html) labels Oct 20, 2025

@algorithms-keeper bot left a comment

Automated review generated by algorithms-keeper. If there's any problem regarding this review, please open an issue about it.

algorithms-keeper commands and options

algorithms-keeper actions can be triggered by commenting on this PR:

  • @algorithms-keeper review to trigger the checks for only added pull request files
  • @algorithms-keeper review-all to trigger the checks for all the pull request files, including the modified files. As we cannot post review comments on lines not part of the diff, this command will post all the messages in one comment.

NOTE: Commands are in beta and so this feature is restricted only to a member or owner of the organization.

tree: The built tree structure
"""

def __init__(self, max_depth=10, min_samples_split=2, n_features=None):

Please provide return type hint for the function: __init__. If the function does not return a value, please provide the type hint as: def function() -> None:

Please provide type hint for the parameter: max_depth

Please provide type hint for the parameter: min_samples_split

Please provide type hint for the parameter: n_features

self.n_features = n_features
self.tree = None

def fit(self, X, y):

As there is no test file in this pull request nor any test function or class in the file machine_learning/random_forest_classifier.py, please provide doctest for the function fit

Please provide return type hint for the function: fit. If the function does not return a value, please provide the type hint as: def function() -> None:

Please provide descriptive name for the parameter: X

Please provide type hint for the parameter: X

Please provide descriptive name for the parameter: y

Please provide type hint for the parameter: y

self.n_features = X.shape[1] if not self.n_features else min(self.n_features, X.shape[1])
self.tree = self._grow_tree(X, y)

def _grow_tree(self, X, y, depth=0):

As there is no test file in this pull request nor any test function or class in the file machine_learning/random_forest_classifier.py, please provide doctest for the function _grow_tree

Please provide return type hint for the function: _grow_tree. If the function does not return a value, please provide the type hint as: def function() -> None:

Please provide descriptive name for the parameter: X

Please provide type hint for the parameter: X

Please provide descriptive name for the parameter: y

Please provide type hint for the parameter: y

Please provide type hint for the parameter: depth

'right': right
}

def _best_split(self, X, y, feat_idxs):

As there is no test file in this pull request nor any test function or class in the file machine_learning/random_forest_classifier.py, please provide doctest for the function _best_split

Please provide return type hint for the function: _best_split. If the function does not return a value, please provide the type hint as: def function() -> None:

Please provide descriptive name for the parameter: X

Please provide type hint for the parameter: X

Please provide descriptive name for the parameter: y

Please provide type hint for the parameter: y

Please provide type hint for the parameter: feat_idxs

split_idx, split_thresh = None, None

for feat_idx in feat_idxs:
X_column = X[:, feat_idx]

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_column

for _ in range(self.n_estimators):
# Bootstrap sampling
indices = np.random.choice(n_samples, n_samples, replace=True)
X_bootstrap = X[indices]

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_bootstrap

feature_indices = np.random.choice(
n_features, max_features, replace=False
)
X_bootstrap = X_bootstrap[:, feature_indices]

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_bootstrap


return self

def predict(self, X):

Please provide return type hint for the function: predict. If the function does not return a value, please provide the type hint as: def function() -> None:

Please provide descriptive name for the parameter: X

Please provide type hint for the parameter: X

predictions = []

for tree, feature_indices in self.trees:
X_subset = X[:, feature_indices]

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_subset

)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_train

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_test

@algorithms-keeper bot added the awaiting reviews (This PR is ready to be reviewed) label Oct 20, 2025
@algorithms-keeper bot added the tests are failing (Do not merge until tests pass) label Oct 20, 2025
@Tejasrahane reopened this Oct 21, 2025

@algorithms-keeper bot left a comment


tree: The built tree structure
"""

def __init__(self, max_depth=10, min_samples_split=2, n_features=None):

Please provide return type hint for the function: __init__. If the function does not return a value, please provide the type hint as: def function() -> None:

Please provide type hint for the parameter: max_depth

Please provide type hint for the parameter: min_samples_split

Please provide type hint for the parameter: n_features

self.n_features = n_features
self.tree = None

def fit(self, X, y):

Please provide return type hint for the function: fit. If the function does not return a value, please provide the type hint as: def function() -> None:

As there is no test file in this pull request nor any test function or class in the file machine_learning/random_forest_classifier.py, please provide doctest for the function fit

Please provide type hint for the parameter: X

Please provide descriptive name for the parameter: X

Please provide type hint for the parameter: y

Please provide descriptive name for the parameter: y

)
self.tree = self._grow_tree(X, y)

def _grow_tree(self, X, y, depth=0):

Please provide return type hint for the function: _grow_tree. If the function does not return a value, please provide the type hint as: def function() -> None:

As there is no test file in this pull request nor any test function or class in the file machine_learning/random_forest_classifier.py, please provide doctest for the function _grow_tree

Please provide type hint for the parameter: X

Please provide descriptive name for the parameter: X

Please provide type hint for the parameter: y

Please provide descriptive name for the parameter: y

Please provide type hint for the parameter: depth

"right": right,
}

def _best_split(self, X, y, feat_idxs):

Please provide return type hint for the function: _best_split. If the function does not return a value, please provide the type hint as: def function() -> None:

As there is no test file in this pull request nor any test function or class in the file machine_learning/random_forest_classifier.py, please provide doctest for the function _best_split

Please provide type hint for the parameter: X

Please provide descriptive name for the parameter: X

Please provide type hint for the parameter: y

Please provide descriptive name for the parameter: y

Please provide type hint for the parameter: feat_idxs

split_idx, split_thresh = None, None

for feat_idx in feat_idxs:
X_column = X[:, feat_idx]

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_column

for _ in range(self.n_estimators):
# Bootstrap sampling
indices = np.random.choice(n_samples, n_samples, replace=True)
X_bootstrap = X[indices]

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_bootstrap


# Feature sampling
feature_indices = np.random.choice(n_features, max_features, replace=False)
X_bootstrap = X_bootstrap[:, feature_indices]

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_bootstrap


return self

def predict(self, X):

Please provide return type hint for the function: predict. If the function does not return a value, please provide the type hint as: def function() -> None:

Please provide type hint for the parameter: X

Please provide descriptive name for the parameter: X

predictions = []

for tree, feature_indices in self.trees:
X_subset = X[:, feature_indices]

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_subset

)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_train

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_test

…dule

- Annotate all function parameters and return types
- Rename variables to snake_case (x_column, x_bootstrap, x_subset, x_train/x_test)
- Add/expand doctests for public and core internal functions
- Address algorithms-keeper review comments
@algorithms-keeper bot removed the require tests (Tests [doctest/unittest/pytest] are required) label Oct 21, 2025

@algorithms-keeper bot left a comment


self.n_features: Optional[int] = n_features
self.tree: Optional[TreeNode] = None

def fit(self, x: np.ndarray, y: np.ndarray) -> None:

Please provide descriptive name for the parameter: x

Please provide descriptive name for the parameter: y

)
self.tree = self._grow_tree(x, y, depth=0)

def _grow_tree(self, x: np.ndarray, y: np.ndarray, depth: int = 0) -> TreeNode:

Please provide descriptive name for the parameter: x

Please provide descriptive name for the parameter: y

}

def _best_split(
self, x: np.ndarray, y: np.ndarray, feat_indices: Sequence[int]

Please provide descriptive name for the parameter: x

Please provide descriptive name for the parameter: y

split_thresh = float(threshold)
return split_idx, split_thresh

def _information_gain(self, y: np.ndarray, x_column: np.ndarray, threshold: float) -> float:

Please provide descriptive name for the parameter: y

ig = parent_entropy - child_entropy
return float(ig)

def _entropy(self, y: np.ndarray) -> float:

Please provide descriptive name for the parameter: y

for _ in range(self.n_estimators):
# Bootstrap sampling
indices = np.random.choice(n_samples, n_samples, replace=True)
X_bootstrap = X[indices]

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_bootstrap


# Feature sampling
feature_indices = np.random.choice(n_features, max_features, replace=False)
X_bootstrap = X_bootstrap[:, feature_indices]

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_bootstrap


return self

def predict(self, X):

Please provide return type hint for the function: predict. If the function does not return a value, please provide the type hint as: def function() -> None:

Please provide type hint for the parameter: X

Please provide descriptive name for the parameter: X

predictions = []

for tree, feature_indices in self.trees:
X_subset = X[:, feature_indices]

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_subset

)

# Split the data
X_train, X_test, y_train, y_test = train_test_split(

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_train

Variable and function names should follow the snake_case naming convention. Please update the following name accordingly: X_test

pre-commit-ci bot and others added 2 commits October 21, 2025 02:43
- Annotate all parameters and return types across tree and forest
- Rename variables to snake_case (x_bootstrap, x_subset, etc.)
- Add doctests for predict, _best_split, _calculate_mse, and class examples
- Replace RNG usage with numpy Generator for determinism
@algorithms-keeper bot removed the require type hints (https://docs.python.org/3/library/typing.html) label Oct 21, 2025

@algorithms-keeper bot left a comment


self.n_features: Optional[int] = n_features
self.tree: Optional[TreeNode] = None

def fit(self, x: np.ndarray, y: np.ndarray) -> None:

Please provide descriptive name for the parameter: x

Please provide descriptive name for the parameter: y

)
self.tree = self._grow_tree(x, y, depth=0)

def _grow_tree(self, x: np.ndarray, y: np.ndarray, depth: int = 0) -> TreeNode:

Please provide descriptive name for the parameter: x

Please provide descriptive name for the parameter: y

}

def _best_split(
self, x: np.ndarray, y: np.ndarray, feat_indices: Sequence[int]

Please provide descriptive name for the parameter: x

Please provide descriptive name for the parameter: y

return split_idx, split_thresh

def _information_gain(
self, y: np.ndarray, x_column: np.ndarray, threshold: float

Please provide descriptive name for the parameter: y

ig = parent_entropy - child_entropy
return float(ig)

def _entropy(self, y: np.ndarray) -> float:

Please provide descriptive name for the parameter: y

self.tree = self._grow_tree(x, y)
return self

def _grow_tree(self, x: np.ndarray, y: np.ndarray, depth: int = 0) -> TreeNodeReg:

Please provide descriptive name for the parameter: x

Please provide descriptive name for the parameter: y

"right": right_subtree,
}

def _best_split(self, x: np.ndarray, y: np.ndarray, n_features: int) -> Optional[Dict[str, Any]]:

Please provide descriptive name for the parameter: x

Please provide descriptive name for the parameter: y

mse_right = float(np.var(right_y)) if n_right > 0 else 0.0
return (n_left / n_samples) * mse_left + (n_right / n_samples) * mse_right

def predict(self, x: np.ndarray) -> np.ndarray:

Please provide descriptive name for the parameter: x

self.random_state: Optional[int] = random_state
self.trees: List[Tuple[DecisionTreeRegressor, np.ndarray]] = []

def fit(self, x: np.ndarray, y: np.ndarray) -> "RandomForestRegressor":

Please provide descriptive name for the parameter: x

Please provide descriptive name for the parameter: y

self.trees.append((tree, feature_indices))
return self

def predict(self, x: np.ndarray) -> np.ndarray:

Please provide descriptive name for the parameter: x


Labels

  • awaiting reviews (This PR is ready to be reviewed)
  • require descriptive names (This PR needs descriptive function and/or variable names)
  • tests are failing (Do not merge until tests pass)


Development

Successfully merging this pull request may close these issues.

Implement Random Forest Classifier and Regressor from Scratch

1 participant