Random Forest in Machine Learning

Rajan Lagah
2 min read · Aug 4, 2020


The main idea is to build many randomly decorrelated decision trees and output the result that the majority of the trees agree on.
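The majority-vote part of that idea can be sketched in a few lines (the tree predictions below are made up purely for illustration):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical outputs from 5 decision trees for a single sample
tree_outputs = ["cat", "dog", "cat", "cat", "dog"]
print(majority_vote(tree_outputs))  # cat
```

For regression, the forest averages the tree outputs instead of voting.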

Step 1

Create the bootstrapped data. Choose n observations (where n is the size of the original dataset) from the original data at random, with replacement, so the same observation can appear more than once.
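Bootstrapping is just sampling with replacement. A minimal sketch with a toy DataFrame (the column names are made up):

```python
import pandas as pd

df = pd.DataFrame({"x1": [1, 2, 3, 4, 5], "y": [0, 1, 0, 1, 1]})

# Draw n rows (same size as the original) with replacement;
# some rows will repeat and others will be left out entirely
bootstrap = df.sample(n=len(df), replace=True, random_state=1)
print(bootstrap)
```

The rows that never get drawn are the "out-of-bag" observations, which become useful for validation later.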

Step 2

Creating decision trees. For each tree we choose 1 feature at random as the root of the tree and then select 2 other features at random (without repetition) for the splits below it.

Now we have created 1 tree. With the same procedure we will create many other trees.

Next we take the observations that were left out of the bootstrapped data (the out-of-bag data) and use them to check the validity of the trees. The prediction is whatever the majority of the decision trees say. We keep a record of the error, then repeat Step 1, and in Step 2 choose 3 (2 + 1) features randomly to create a new set of decision trees. We validate the results and repeat this process again and again, finally keeping the set of trees with the least error.
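In scikit-learn this search over the number of features per split corresponds to trying different `max_features` values and comparing out-of-bag scores. A rough sketch on synthetic data (the data and the range of values tried are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: y depends mostly on the first two features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

best_score, best_m = -np.inf, None
for m in range(2, 6):  # number of features considered at each split
    model = RandomForestRegressor(
        n_estimators=100, max_features=m,
        oob_score=True, bootstrap=True, random_state=1,
    )
    model.fit(X, y)
    # oob_score_ is the R^2 measured on out-of-bag samples
    if model.oob_score_ > best_score:
        best_score, best_m = model.oob_score_, m

print(best_m, round(best_score, 3))
```

The out-of-bag score plays the same role as the held-out validation described above: each tree is scored on the rows it never saw during bootstrapping.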

Simple Implementation

Grab the data for training and testing from here.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

test_path = './../data/test_clean.csv'
train_path = './../data/train_clean.csv'

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)
test_df = test_df.drop(['SalePrice'], axis=1)

# Hold out the last 200 rows of the training data for validation
valid_df = train_df[-200:]
train_df = train_df[:-200]

X_train = train_df.drop(['SalePrice'], axis=1)
y_train = train_df['SalePrice']
X_valid = valid_df.drop(['SalePrice'], axis=1)
y_valid = valid_df['SalePrice']

rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(X_train, y_train)
preds = rf_model.predict(X_valid)
mean_squared_error(y_valid, preds)

The output will be a large number, but we have not tuned the hyperparameters yet.
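One common way to tune those hyperparameters is a grid search with cross-validation. A minimal sketch on synthetic data (the parameter grid and the data here are assumptions, not the values used for the housing dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the housing data so the sketch runs on its own
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.1, size=300)

param_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", 1.0],
    "max_depth": [None, 10],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=1),
    param_grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

`search.best_estimator_` can then be used in place of the untuned `rf_model` above.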

Advantages of Random Forest

  • Can be used for both classification and regression problems
  • Has enough randomness that the model is protected from overfitting, if we choose an appropriate number of trees
  • Handles outliers well
  • It is much less sensitive to data variation than a single decision tree, since individual decision trees can change drastically with small changes in the data

Drawbacks

  • Since many trees are built, and larger datasets generally call for more trees, the forest can occupy a lot of memory.
  • Training and prediction can be slow; by default scikit-learn runs on a single thread, though it can parallelize across trees via the n_jobs parameter.
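The slowness can be mitigated in scikit-learn, which can build trees in parallel. A small sketch on generated data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=1)

# n_jobs=-1 uses all available CPU cores to fit the trees in parallel
model = RandomForestRegressor(n_estimators=200, n_jobs=-1, random_state=1)
model.fit(X, y)
print(len(model.estimators_))  # 200 fitted trees
```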

If you find this helpful please applaud. If you have suggestions please share them via email at rajanlagah@gmail.com
