Random Forest in Machine Learning

Rajan Lagah
2 min read · Aug 4, 2020


The main idea is to build many randomly decorrelated decision trees and output the result that the majority of the trees agree on.
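The majority-vote part of that idea can be sketched in a few lines (the tree predictions below are made up purely for illustration):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs from 5 decision trees for a single sample
tree_outputs = ["cat", "dog", "cat", "cat", "dog"]
print(majority_vote(tree_outputs))  # cat
```

For regression, the forest averages the tree outputs instead of voting.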

Step 1

Create the bootstrapped data. Choose n observations (where n is the size of the original dataset) from the original data at random, with replacement, so the same observation can appear more than once.
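Bootstrapping is just sampling with replacement. A minimal sketch with a toy DataFrame (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"x1": [1, 2, 3, 4, 5], "y": [0, 1, 0, 1, 1]})

# Draw n rows (same size as the original) with replacement;
# some rows will repeat and others will be left out entirely
bootstrap = df.sample(n=len(df), replace=True, random_state=1)
print(bootstrap)
```

The rows that never get drawn are the "out-of-bag" observations, which become useful for validation later.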

Step 2

Creating decision trees. For each tree we choose 1 feature at random as the root of the tree and then select 2 other features at random (without repetition) for the splits below it.

Now we have created 1 tree. With the same procedure we will create many other trees.

Next we take the observations that were left out of the bootstrapped data (the out-of-bag data) and use them to check the validity of the trees. The prediction is whatever the majority of the decision trees say. We keep a record of the error, then repeat Step 1, and in Step 2 choose 3 (2 + 1) features randomly to create a new set of decision trees. We validate the results and repeat this process again and again, finally keeping the set of trees with the least error.
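In scikit-learn this search over the number of features per split corresponds to trying different `max_features` values and comparing out-of-bag scores. A rough sketch on synthetic data (the data and the range of values tried are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: y depends mostly on the first two features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

best_score, best_m = -np.inf, None
for m in range(2, 6):  # number of features considered at each split
    model = RandomForestRegressor(
        n_estimators=100, max_features=m,
        oob_score=True, bootstrap=True, random_state=1,
    )
    model.fit(X, y)
    # oob_score_ is the R^2 measured on out-of-bag samples
    if model.oob_score_ > best_score:
        best_score, best_m = model.oob_score_, m

print(best_m, round(best_score, 3))
```

The out-of-bag score plays the same role as the held-out validation described above: each tree is scored on the rows it never saw during bootstrapping.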

Simple Implementation

Grab the data for training and testing from here.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

test_path = './../data/test_clean.csv'
train_path = './../data/train_clean.csv'

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)
test_df = test_df.drop(['SalePrice'], axis=1)

# Hold out the last 200 rows of the training data for validation
valid_df = train_df[-200:]
train_df = train_df[:-200]

X_train = train_df.drop(['SalePrice'], axis=1)
y_train = train_df['SalePrice']
X_valid = valid_df.drop(['SalePrice'], axis=1)
y_valid = valid_df['SalePrice']

rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(X_train, y_train)
preds = rf_model.predict(X_valid)
mean_squared_error(y_valid, preds)

The output will be a large number, but we have not tuned the hyperparameters yet.
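One common way to tune those hyperparameters is a grid search with cross-validation. A minimal sketch on synthetic data (the parameter grid and the data here are assumptions, not the values used for the housing dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the housing data so the sketch runs on its own
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=300)

param_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", 1.0],
    "max_depth": [None, 10],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

`search.best_estimator_` can then be used in place of the untuned `rf_model` above.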

Advantages of Random Forest

  • Can be used for both classification and regression problems
  • Has enough randomness that the model is protected from overfitting, if we choose an appropriate number of trees
  • Handles outliers well
  • It is much less sensitive to data variation than a single decision tree, since individual decision trees can change drastically with small changes in the data

Drawbacks

  • Since many trees are built, and larger datasets generally call for more trees, the forest can occupy a lot of memory.
  • Training and prediction can be slow; by default scikit-learn runs on a single thread, though it can parallelize across trees via the n_jobs parameter.
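The slowness can be mitigated in scikit-learn, which can build trees in parallel. A small sketch on generated data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=1)

# n_jobs=-1 uses all available CPU cores to fit the trees in parallel
model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=1)
model.fit(X, y)
print(len(model.estimators_))  # 200 fitted trees
```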

If you find this helpful please applaud. If you have suggestions please share them via email at rajanlagah@gmail.com
