Data cleaning for Machine Learning Algorithms to eat.

6 min readAug 1, 2020

We will be using kaggle’s Housing Price competition's data.

Why we are using that data ?

Because it do have lot of missing values in both continuous and categorical category.

Download data from here and save it in folder and fire up jupyter notebook

Importing required packages

import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Inspecting data

train_path = './train.csv'
test_path = './test.csv'
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

Now we have data in pandas dataframe so we can look up what is in data by

train_df.head()

output will be first 5 rows of data.

Checking if Null

Now 2 ways to check if any feature have null value is

— train_df.isnull().sum()

— And other is by using heatmaps

train_df.isnull().sum()

and you will see the list of few features with number of null values, This is not clear so we will now plot the heatmap to check the null values

sns.heatmap(train_df.isnull(),yticklabels=False,cbar=False)

But if we see number of feature by using

train_df.shape

output will be

(1460, 81)

and we are sure that heatmap is not telling is about 81 features so to do that

plt.figure(figsize = (16,5))
sns.heatmap(train_df.isnull(),yticklabels=False,cbar=False)

Now we can see density of missing values for each feature. We can replace features with float values with mean and null in features with integer values can be replace by mean or mode and null values in features with string values (categorical features ) can be replace by mode

But …

How we can know which feature is categorical ( with strings as values ) or continuous ?

train_df.info()

output of this will tell us what type of data is in that. Float/integer/object

we will first start by removing features having lots ( >50% ) values as null

We have to replace that from both test and train data

train_df.drop(['Alley'],axis=1,inplace=True)
test_df.drop(['Alley'],axis=1,inplace=True)

or you can use

train_df.drop(['PoolQC','Fence','MiscFeature','GarageYrBlt','FireplaceQu'],axis=1,inplace=True)
test_df.drop(['PoolQC','Fence','MiscFeature','GarageYrBlt','FireplaceQu '],axis=1,inplace=True)

to replace many in one go.

Handling Categorical features

train_df['BsmtCond'] = train_df['BsmtCond'].fillna(train_df['BsmtCond'].mode())
test_df['BsmtCond'] = test_df['BsmtCond'].fillna(test_df['BsmtCond'].mode())
train_df['BsmtQual'] = train_df['BsmtQual'].fillna(train_df['BsmtQual'].mode())
test_df['BsmtQual'] = test_df['BsmtQual'].fillna(test_df['BsmtQual'].mode())
train_df['FireplaceQu'] = train_df['FireplaceQu'].fillna(train_df['FireplaceQu'].mode())
test_df['FireplaceQu'] = test_df['FireplaceQu'].fillna(test_df['FireplaceQu'].mode())
train_df['GarageFinish'] = train_df['GarageFinish'].fillna(train_df['GarageFinish'].mode())
test_df['GarageFinish'] = test_df['GarageFinish'].fillna(test_df['GarageFinish'].mode())
test_df['GarageQual'] = test_df['GarageQual'].fillna(test_df['GarageFinish'].mode())
train_df['GarageQual'] = train_df['GarageQual'].fillna(train_df['GarageQual'].mode())
test_df['BsmtFinType1'] = test_df['BsmtFinType1'].fillna(test_df['BsmtFinType1'].mode())
train_df['BsmtFinType1'] = train_df['BsmtFinType1'].fillna(train_df['BsmtFinType1'].mode())
test_df['BsmtFinType2'] = test_df['BsmtFinType2'].fillna(test_df['BsmtFinType2'].mode())
train_df['BsmtFinType2'] = train_df['BsmtFinType2'].fillna(train_df['BsmtFinType2'].mode())train_df['GarageType'] = train_df['GarageType'].fillna(train_df['GarageType'].mode())
test_df['GarageType'] = test_df['GarageType'].fillna(test_df['GarageType'].mode())train_df['MasVnrType'] = train_df['MasVnrType'].fillna(train_df['MasVnrType'].mode())
test_df['MasVnrType'] = test_df['MasVnrType'].fillna(test_df['MasVnrType'].mode())

And handling continuous variables by

train_df['MasVnrArea'] = train_df['MasVnrArea'].fillna(train_df['MasVnrArea'].mean())
test_df['MasVnrArea'] = test_df['MasVnrArea'].fillna(test_df['MasVnrArea'].mean())
train_df['TotalBsmtSF'] = train_df['TotalBsmtSF'].fillna(train_df['TotalBsmtSF'].mean())
test_df['TotalBsmtSF'] = test_df['TotalBsmtSF'].fillna(test_df['TotalBsmtSF'].mean())
train_df['BsmtFinSF2'] = train_df['BsmtFinSF2'].fillna(train_df['BsmtFinSF2'].mean())
test_df['BsmtFinSF2'] = test_df['BsmtFinSF2'].fillna(test_df['BsmtFinSF2'].mean())
train_df['BsmtUnfSF'] = train_df['BsmtUnfSF'].fillna(train_df['BsmtUnfSF'].mean())
test_df['BsmtUnfSF'] = test_df['BsmtUnfSF'].fillna(test_df['BsmtUnfSF'].mean())train_df['BsmtHalfBath'] = train_df['BsmtHalfBath'].fillna(train_df['BsmtHalfBath'].mean())
test_df['BsmtHalfBath'] = test_df['BsmtHalfBath'].fillna(test_df['BsmtHalfBath'].mean())train_df['BsmtFullBath'] = train_df['BsmtFullBath'].fillna(train_df['BsmtFullBath'].mean())
test_df['BsmtFullBath'] = test_df['BsmtFullBath'].fillna(test_df['BsmtFullBath'].mean())train_df['GarageCars'] = train_df['GarageCars'].fillna(train_df['GarageCars'].mean())
test_df['GarageCars'] = test_df['GarageCars'].fillna(test_df['GarageCars'].mean())train_df['GarageArea'] = train_df['GarageArea'].fillna(train_df['GarageArea'].mean())
test_df['GarageArea'] = test_df['GarageArea'].fillna(test_df['GarageArea'].mean())

Now if we do

plt.figure(figsize = (16,5))
sns.heatmap(train_df.isnull(),yticklabels=False,cbar=False)

We have still some points left. Hmm….

If we see the heatmap it is showing GarageType with some white values but we have handle that thing above. That is because we have to do

train_df['BsmtCond'] = train_df['BsmtCond'].fillna(train_df['BsmtCond'].mode()[0])
test_df['BsmtCond'] = test_df['BsmtCond'].fillna(test_df['BsmtCond'].mode()[0])test_df['GarageQual'] = test_df['GarageQual'].fillna(test_df['GarageQual'].mode()[0])
train_df['GarageQual'] = train_df['GarageQual'].fillna(train_df['GarageQual'].mode()[0])test_df['GarageCond'] = test_df['GarageCond'].fillna(test_df['GarageCond'].mode()[0])
train_df['GarageCond'] = train_df['GarageCond'].fillna(train_df['GarageCond'].mode()[0])test_df['GarageFinish'] = test_df['GarageFinish'].fillna(test_df['GarageFinish'].mode()[0])
train_df['GarageFinish'] = train_df['GarageFinish'].fillna(train_df['GarageFinish'].mode()[0])

test_df['BsmtFinType1'] = test_df['BsmtFinType1'].fillna(test_df['BsmtFinType1'].mode()[0])
train_df['BsmtFinType1'] = train_df['BsmtFinType1'].fillna(train_df['BsmtFinType1'].mode()[0])test_df['BsmtFinSF1'] = test_df['BsmtFinSF1'].fillna(test_df['BsmtFinSF1'].mode()[0])
train_df['BsmtFinSF1'] = train_df['BsmtFinSF1'].fillna(train_df['BsmtFinSF1'].mode()[0])test_df['BsmtFinSF2'] = test_df['BsmtFinSF2'].fillna(test_df['BsmtFinSF2'].mode()[0])
train_df['BsmtFinSF2'] = train_df['BsmtFinSF2'].fillna(train_df['BsmtFinSF2'].mode()[0])

now if we `train_df[‘GarageType’].isnull().sum()` then the output will be `0`

Lets check heatmap to know how many null values are left

plt.figure(figsize = (16,5))sns.heatmap(train_df.isnull(),yticklabels=False,cbar=False)

we have lot less null values left now, we will remove the row containing null value now.

train_df.dropna(inplace=True)
train_df.dropna(inplace=True)

and now heatmap is clean

plt.figure(figsize = (16,5))sns.heatmap(train_df.isnull(),yticklabels=False,cbar=False)

Now the final step will be converting strings to numbers as we cant feed strings to out Math formula ( machine learning algorithms )

Category to codes

We have to keep in mind that if we have 4 categories in test data and 3rd category is missing in test data then 4th class in test set will be 3 which will be wrong.

example you have 2 category is in Pet feature in train set. like Cat,Dog now you will replace all the Cats with 0 and all the dogs with 1. Now 1 represents dogs and 0 represents cat, But if in test set no cat is present and only have one category then dog will be replaced by 0. So 0 in train is cat and 0 in test set is dog which will give wrong results.

So to handle this problem we will combine test and train set and then will calculate the categorie’s code.

train_df.shape
o/p: (1412, 74)

( we will later need index to separate test and train set later )

Combining train and test set

final_df = pd.concat([train_df,test_df],axis=0)

Now to convert categories in any categorical feature we will test it on 1 feature.

We will get number of categories Street feature contains and then after converting to its categories code we will verify that they are correct

to get classes in features

final_df['Street'].value_counts()

output will be

Now to we will convert 2859 Pave to 2859s 1 and 12 Grvl to 12s 0. ( similarly for n categories we will use n integers )

final_df['Street'] = final_df['Street'].astype('category')
final_df['Street'] = final_df['Street'].cat.codes

now if we check

final_df['Street'].value_counts()

we got

So we will do this to every categorical feature

final_df['MSZoning'] = final_df['MSZoning'].astype('category')
final_df['MSZoning'] = final_df['MSZoning'].cat.codesfinal_df['LotShape'] = final_df['LotShape'].astype('category')
final_df['LotShape'] = final_df['LotShape'].cat.codesfinal_df['LandContour'] = final_df['LandContour'].astype('category')
final_df['LandContour'] = final_df['LandContour'].cat.codes
final_df['Utilities'] = final_df['Utilities'].astype('category')
final_df['Utilities'] = final_df['Utilities'].cat.codesfinal_df['GarageType'] = final_df['GarageType'].astype('category')
final_df['GarageType'] = final_df['GarageType'].cat.codesfinal_df['LotConfig'] = final_df['LotConfig'].astype('category')
final_df['LotConfig'] = final_df['LotConfig'].cat.codesfinal_df['LandSlope'] = final_df['LandSlope'].astype('category')
final_df['LandSlope'] = final_df['LandSlope'].cat.codesfinal_df['Neighborhood'] = final_df['Neighborhood'].astype('category')
final_df['Neighborhood'] = final_df['Neighborhood'].cat.codesfinal_df['Heating'] = final_df['Heating'].astype('category')
final_df['Heating'] = final_df['Heating'].cat.codesfinal_df['Exterior1st'] = final_df['Exterior1st'].astype('category')
final_df['Exterior1st'] = final_df['Exterior1st'].cat.codesfinal_df['Condition1'] = final_df['Condition1'].astype('category')
final_df['Condition1'] = final_df['Condition1'].cat.codesfinal_df['Condition2'] = final_df['Condition2'].astype('category')
final_df['Condition2'] = final_df['Condition2'].cat.codesfinal_df['BldgType'] = final_df['BldgType'].astype('category')
final_df['BldgType'] = final_df['BldgType'].cat.codesfinal_df['HouseStyle'] = final_df['HouseStyle'].astype('category')
final_df['HouseStyle'] = final_df['HouseStyle'].cat.codesfinal_df['RoofStyle'] = final_df['RoofStyle'].astype('category')
final_df['RoofStyle'] = final_df['RoofStyle'].cat.codesfinal_df['RoofMatl'] = final_df['RoofMatl'].astype('category')
final_df['RoofMatl'] = final_df['RoofMatl'].cat.codesfinal_df['MasVnrType'] = final_df['MasVnrType'].astype('category')
final_df['MasVnrType'] = final_df['MasVnrType'].cat.codesfinal_df['Exterior2nd'] = final_df['Exterior2nd'].astype('category')
final_df['Exterior2nd'] = final_df['Exterior2nd'].cat.codesfinal_df['ExterQual'] = final_df['ExterQual'].astype('category')
final_df['ExterQual'] = final_df['ExterQual'].cat.codesfinal_df['ExterCond'] = final_df['ExterCond'].astype('category')
final_df['ExterCond'] = final_df['ExterCond'].cat.codesfinal_df['Foundation'] = final_df['Foundation'].astype('category')
final_df['Foundation'] = final_df['Foundation'].cat.codesfinal_df['BsmtExposure'] = final_df['BsmtExposure'].astype('category')
final_df['BsmtExposure'] = final_df['BsmtExposure'].cat.codesfinal_df['BsmtQual'] = final_df['BsmtQual'].astype('category')
final_df['BsmtQual'] = final_df['BsmtQual'].cat.codesfinal_df['BsmtCond'] = final_df['BsmtCond'].astype('category')
final_df['BsmtCond'] = final_df['BsmtCond'].cat.codesfinal_df['BsmtExposure'] = final_df['BsmtExposure'].astype('category')
final_df['BsmtExposure'] = final_df['BsmtExposure'].cat.codesfinal_df['BsmtFinType1'] = final_df['BsmtFinType1'].astype('category')
final_df['BsmtFinType1'] = final_df['BsmtFinType1'].cat.codesfinal_df['HeatingQC'] = final_df['HeatingQC'].astype('category')
final_df['HeatingQC'] = final_df['HeatingQC'].cat.codesfinal_df['CentralAir'] = final_df['CentralAir'].astype('category')
final_df['CentralAir'] = final_df['CentralAir'].cat.codesfinal_df['Electrical'] = final_df['Electrical'].astype('category')
final_df['Electrical'] = final_df['Electrical'].cat.codesfinal_df['KitchenQual'] = final_df['KitchenQual'].astype('category')
final_df['KitchenQual'] = final_df['KitchenQual'].cat.codesfinal_df['Functional'] = final_df['Functional'].astype('category')
final_df['Functional'] = final_df['Functional'].cat.codesfinal_df['GarageFinish'] = final_df['GarageFinish'].astype('category')
final_df['GarageFinish'] = final_df['GarageFinish'].cat.codesfinal_df['GarageQual'] = final_df['GarageQual'].astype('category')
final_df['GarageQual'] = final_df['GarageQual'].cat.codesfinal_df['GarageCond'] = final_df['GarageCond'].astype('category')
final_df['GarageCond'] = final_df['GarageCond'].cat.codesfinal_df['PavedDrive'] = final_df['PavedDrive'].astype('category')
final_df['PavedDrive'] = final_df['PavedDrive'].cat.codesfinal_df['LotConfig'] = final_df['LotConfig'].astype('category')
final_df['LotConfig'] = final_df['LotConfig'].cat.codesfinal_df['LandSlope'] = final_df['LandSlope'].astype('category')
final_df['LandSlope'] = final_df['LandSlope'].cat.codesfinal_df['SaleType'] = final_df['SaleType'].astype('category')
final_df['SaleType'] = final_df['SaleType'].cat.codesfinal_df['SaleCondition'] = final_df['SaleCondition'].astype('category')
final_df['SaleCondition'] = final_df['SaleCondition'].cat.codesfinal_df['BsmtFinType2'] = final_df['BsmtFinType2'].astype('category')
final_df['BsmtFinType2'] = final_df['BsmtFinType2'].cat.codes

Finally

We will now separate test and train set using

clean_train_df = final_df.iloc[:1412,:]
clean_test_df = final_df.iloc[1412:,:]

and now our data is clean and ready for any machine learning algorithm to rock.

More you can do.

Instead of assigning random categorical codes we can fix them by self. Idea is to categorical value that reduce the target value will be given ‘0’ value and the one which increases the most will be given ’n’. Where ’n’ is number of unique categorical values of that feature

from pandas.api.types import CategoricalDtype
df['MSZoning'] = df['MSZoning'].astype(
                               CategoricalDtype(
                                categories=[
                                        'FV',
                                        'C (all)', 
                                        'RL',
                                        'RM', 
                                        'RH'], 
                               ordered = True)).cat.codes

If you have any suggestions please mail me that on rajanlagah@gmail.com.