Data cleaning for Machine Learning Algorithms to eat.
We will be using kaggle’s Housing Price competition's data.
Why we are using that data ?
Because it do have lot of missing values in both continuous and categorical category.
Download data from here and save it in folder and fire up jupyter notebook
Importing required packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Inspecting data
train_path = './train.csv'
test_path = './test.csv'
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)
Now we have data in pandas dataframe so we can look up what is in data by
train_df.head()
output will be first 5 rows of data.
Checking if Null
Now 2 ways to check if any feature have null value is
— train_df.isnull().sum()
— And other is by using heatmaps
train_df.isnull().sum()
and you will see the list of few features with number of null values, This is not clear so we will now plot the heatmap to check the null values
sns.heatmap(train_df.isnull(),yticklabels=False,cbar=False)
But if we see number of feature by using
train_df.shape
output will be
(1460, 81)
and we are sure that heatmap is not telling is about 81 features so to do that
plt.figure(figsize = (16,5))
sns.heatmap(train_df.isnull(),yticklabels=False,cbar=False)
Now we can see density of missing values for each feature. We can replace features with float values with mean and null in features with integer values can be replace by mean or mode and null values in features with string values (categorical features ) can be replace by mode
But …
How we can know which feature is categorical ( with strings as values ) or continuous ?
train_df.info()
output of this will tell us what type of data is in that. Float/integer/object
we will first start by removing features having lots ( >50% ) values as null
- We have to replace that from both test and train data
train_df.drop(['Alley'],axis=1,inplace=True)
test_df.drop(['Alley'],axis=1,inplace=True)
or you can use
train_df.drop(['PoolQC','Fence','MiscFeature','GarageYrBlt','FireplaceQu'],axis=1,inplace=True)
test_df.drop(['PoolQC','Fence','MiscFeature','GarageYrBlt','FireplaceQu '],axis=1,inplace=True)
to replace many in one go.
Handling Categorical features
train_df['BsmtCond'] = train_df['BsmtCond'].fillna(train_df['BsmtCond'].mode())
test_df['BsmtCond'] = test_df['BsmtCond'].fillna(test_df['BsmtCond'].mode())
train_df['BsmtQual'] = train_df['BsmtQual'].fillna(train_df['BsmtQual'].mode())
test_df['BsmtQual'] = test_df['BsmtQual'].fillna(test_df['BsmtQual'].mode())
train_df['FireplaceQu'] = train_df['FireplaceQu'].fillna(train_df['FireplaceQu'].mode())
test_df['FireplaceQu'] = test_df['FireplaceQu'].fillna(test_df['FireplaceQu'].mode())
train_df['GarageFinish'] = train_df['GarageFinish'].fillna(train_df['GarageFinish'].mode())
test_df['GarageFinish'] = test_df['GarageFinish'].fillna(test_df['GarageFinish'].mode())
test_df['GarageQual'] = test_df['GarageQual'].fillna(test_df['GarageFinish'].mode())
train_df['GarageQual'] = train_df['GarageQual'].fillna(train_df['GarageQual'].mode())
test_df['BsmtFinType1'] = test_df['BsmtFinType1'].fillna(test_df['BsmtFinType1'].mode())
train_df['BsmtFinType1'] = train_df['BsmtFinType1'].fillna(train_df['BsmtFinType1'].mode())
test_df['BsmtFinType2'] = test_df['BsmtFinType2'].fillna(test_df['BsmtFinType2'].mode())
train_df['BsmtFinType2'] = train_df['BsmtFinType2'].fillna(train_df['BsmtFinType2'].mode())train_df['GarageType'] = train_df['GarageType'].fillna(train_df['GarageType'].mode())
test_df['GarageType'] = test_df['GarageType'].fillna(test_df['GarageType'].mode())train_df['MasVnrType'] = train_df['MasVnrType'].fillna(train_df['MasVnrType'].mode())
test_df['MasVnrType'] = test_df['MasVnrType'].fillna(test_df['MasVnrType'].mode())
And handling continuous variables by
train_df['MasVnrArea'] = train_df['MasVnrArea'].fillna(train_df['MasVnrArea'].mean())
test_df['MasVnrArea'] = test_df['MasVnrArea'].fillna(test_df['MasVnrArea'].mean())
train_df['TotalBsmtSF'] = train_df['TotalBsmtSF'].fillna(train_df['TotalBsmtSF'].mean())
test_df['TotalBsmtSF'] = test_df['TotalBsmtSF'].fillna(test_df['TotalBsmtSF'].mean())
train_df['BsmtFinSF2'] = train_df['BsmtFinSF2'].fillna(train_df['BsmtFinSF2'].mean())
test_df['BsmtFinSF2'] = test_df['BsmtFinSF2'].fillna(test_df['BsmtFinSF2'].mean())
train_df['BsmtUnfSF'] = train_df['BsmtUnfSF'].fillna(train_df['BsmtUnfSF'].mean())
test_df['BsmtUnfSF'] = test_df['BsmtUnfSF'].fillna(test_df['BsmtUnfSF'].mean())train_df['BsmtHalfBath'] = train_df['BsmtHalfBath'].fillna(train_df['BsmtHalfBath'].mean())
test_df['BsmtHalfBath'] = test_df['BsmtHalfBath'].fillna(test_df['BsmtHalfBath'].mean())train_df['BsmtFullBath'] = train_df['BsmtFullBath'].fillna(train_df['BsmtFullBath'].mean())
test_df['BsmtFullBath'] = test_df['BsmtFullBath'].fillna(test_df['BsmtFullBath'].mean())train_df['GarageCars'] = train_df['GarageCars'].fillna(train_df['GarageCars'].mean())
test_df['GarageCars'] = test_df['GarageCars'].fillna(test_df['GarageCars'].mean())train_df['GarageArea'] = train_df['GarageArea'].fillna(train_df['GarageArea'].mean())
test_df['GarageArea'] = test_df['GarageArea'].fillna(test_df['GarageArea'].mean())
Now if we do
plt.figure(figsize = (16,5))
sns.heatmap(train_df.isnull(),yticklabels=False,cbar=False)
We have still some points left. Hmm….
If we see the heatmap it is showing GarageType with some white values but we have handle that thing above. That is because we have to do
train_df['BsmtCond'] = train_df['BsmtCond'].fillna(train_df['BsmtCond'].mode()[0])
test_df['BsmtCond'] = test_df['BsmtCond'].fillna(test_df['BsmtCond'].mode()[0])test_df['GarageQual'] = test_df['GarageQual'].fillna(test_df['GarageQual'].mode()[0])
train_df['GarageQual'] = train_df['GarageQual'].fillna(train_df['GarageQual'].mode()[0])test_df['GarageCond'] = test_df['GarageCond'].fillna(test_df['GarageCond'].mode()[0])
train_df['GarageCond'] = train_df['GarageCond'].fillna(train_df['GarageCond'].mode()[0])test_df['GarageFinish'] = test_df['GarageFinish'].fillna(test_df['GarageFinish'].mode()[0])
train_df['GarageFinish'] = train_df['GarageFinish'].fillna(train_df['GarageFinish'].mode()[0])
test_df['BsmtFinType1'] = test_df['BsmtFinType1'].fillna(test_df['BsmtFinType1'].mode()[0])
train_df['BsmtFinType1'] = train_df['BsmtFinType1'].fillna(train_df['BsmtFinType1'].mode()[0])test_df['BsmtFinSF1'] = test_df['BsmtFinSF1'].fillna(test_df['BsmtFinSF1'].mode()[0])
train_df['BsmtFinSF1'] = train_df['BsmtFinSF1'].fillna(train_df['BsmtFinSF1'].mode()[0])test_df['BsmtFinSF2'] = test_df['BsmtFinSF2'].fillna(test_df['BsmtFinSF2'].mode()[0])
train_df['BsmtFinSF2'] = train_df['BsmtFinSF2'].fillna(train_df['BsmtFinSF2'].mode()[0])
now if we `train_df[‘GarageType’].isnull().sum()` then the output will be `0`
Lets check heatmap to know how many null values are left
plt.figure(figsize = (16,5))sns.heatmap(train_df.isnull(),yticklabels=False,cbar=False)
we have lot less null values left now, we will remove the row containing null value now.
train_df.dropna(inplace=True)
train_df.dropna(inplace=True)
and now heatmap is clean
plt.figure(figsize = (16,5))sns.heatmap(train_df.isnull(),yticklabels=False,cbar=False)
Now the final step will be converting strings to numbers as we cant feed strings to out Math formula ( machine learning algorithms )
Category to codes
We have to keep in mind that if we have 4 categories in test data and 3rd category is missing in test data then 4th class in test set will be 3 which will be wrong.
example you have 2 category is in Pet feature in train set. like Cat,Dog now you will replace all the Cats with 0 and all the dogs with 1. Now 1 represents dogs and 0 represents cat, But if in test set no cat is present and only have one category then dog will be replaced by 0. So 0 in train is cat and 0 in test set is dog which will give wrong results.
So to handle this problem we will combine test and train set and then will calculate the categorie’s code.
train_df.shape
o/p: (1412, 74)
( we will later need index to separate test and train set later )
Combining train and test set
final_df = pd.concat([train_df,test_df],axis=0)
Now to convert categories in any categorical feature we will test it on 1 feature.
We will get number of categories Street feature contains and then after converting to its categories code we will verify that they are correct
to get classes in features
final_df['Street'].value_counts()
output will be
Now to we will convert 2859 Pave to 2859s 1 and 12 Grvl to 12s 0. ( similarly for n categories we will use n integers )
final_df['Street'] = final_df['Street'].astype('category')
final_df['Street'] = final_df['Street'].cat.codes
now if we check
final_df['Street'].value_counts()
we got
So we will do this to every categorical feature
final_df['MSZoning'] = final_df['MSZoning'].astype('category')
final_df['MSZoning'] = final_df['MSZoning'].cat.codesfinal_df['LotShape'] = final_df['LotShape'].astype('category')
final_df['LotShape'] = final_df['LotShape'].cat.codesfinal_df['LandContour'] = final_df['LandContour'].astype('category')
final_df['LandContour'] = final_df['LandContour'].cat.codes
final_df['Utilities'] = final_df['Utilities'].astype('category')
final_df['Utilities'] = final_df['Utilities'].cat.codesfinal_df['GarageType'] = final_df['GarageType'].astype('category')
final_df['GarageType'] = final_df['GarageType'].cat.codesfinal_df['LotConfig'] = final_df['LotConfig'].astype('category')
final_df['LotConfig'] = final_df['LotConfig'].cat.codesfinal_df['LandSlope'] = final_df['LandSlope'].astype('category')
final_df['LandSlope'] = final_df['LandSlope'].cat.codesfinal_df['Neighborhood'] = final_df['Neighborhood'].astype('category')
final_df['Neighborhood'] = final_df['Neighborhood'].cat.codesfinal_df['Heating'] = final_df['Heating'].astype('category')
final_df['Heating'] = final_df['Heating'].cat.codesfinal_df['Exterior1st'] = final_df['Exterior1st'].astype('category')
final_df['Exterior1st'] = final_df['Exterior1st'].cat.codesfinal_df['Condition1'] = final_df['Condition1'].astype('category')
final_df['Condition1'] = final_df['Condition1'].cat.codesfinal_df['Condition2'] = final_df['Condition2'].astype('category')
final_df['Condition2'] = final_df['Condition2'].cat.codesfinal_df['BldgType'] = final_df['BldgType'].astype('category')
final_df['BldgType'] = final_df['BldgType'].cat.codesfinal_df['HouseStyle'] = final_df['HouseStyle'].astype('category')
final_df['HouseStyle'] = final_df['HouseStyle'].cat.codesfinal_df['RoofStyle'] = final_df['RoofStyle'].astype('category')
final_df['RoofStyle'] = final_df['RoofStyle'].cat.codesfinal_df['RoofMatl'] = final_df['RoofMatl'].astype('category')
final_df['RoofMatl'] = final_df['RoofMatl'].cat.codesfinal_df['MasVnrType'] = final_df['MasVnrType'].astype('category')
final_df['MasVnrType'] = final_df['MasVnrType'].cat.codesfinal_df['Exterior2nd'] = final_df['Exterior2nd'].astype('category')
final_df['Exterior2nd'] = final_df['Exterior2nd'].cat.codesfinal_df['ExterQual'] = final_df['ExterQual'].astype('category')
final_df['ExterQual'] = final_df['ExterQual'].cat.codesfinal_df['ExterCond'] = final_df['ExterCond'].astype('category')
final_df['ExterCond'] = final_df['ExterCond'].cat.codesfinal_df['Foundation'] = final_df['Foundation'].astype('category')
final_df['Foundation'] = final_df['Foundation'].cat.codesfinal_df['BsmtExposure'] = final_df['BsmtExposure'].astype('category')
final_df['BsmtExposure'] = final_df['BsmtExposure'].cat.codesfinal_df['BsmtQual'] = final_df['BsmtQual'].astype('category')
final_df['BsmtQual'] = final_df['BsmtQual'].cat.codesfinal_df['BsmtCond'] = final_df['BsmtCond'].astype('category')
final_df['BsmtCond'] = final_df['BsmtCond'].cat.codesfinal_df['BsmtExposure'] = final_df['BsmtExposure'].astype('category')
final_df['BsmtExposure'] = final_df['BsmtExposure'].cat.codesfinal_df['BsmtFinType1'] = final_df['BsmtFinType1'].astype('category')
final_df['BsmtFinType1'] = final_df['BsmtFinType1'].cat.codesfinal_df['HeatingQC'] = final_df['HeatingQC'].astype('category')
final_df['HeatingQC'] = final_df['HeatingQC'].cat.codesfinal_df['CentralAir'] = final_df['CentralAir'].astype('category')
final_df['CentralAir'] = final_df['CentralAir'].cat.codesfinal_df['Electrical'] = final_df['Electrical'].astype('category')
final_df['Electrical'] = final_df['Electrical'].cat.codesfinal_df['KitchenQual'] = final_df['KitchenQual'].astype('category')
final_df['KitchenQual'] = final_df['KitchenQual'].cat.codesfinal_df['Functional'] = final_df['Functional'].astype('category')
final_df['Functional'] = final_df['Functional'].cat.codesfinal_df['GarageFinish'] = final_df['GarageFinish'].astype('category')
final_df['GarageFinish'] = final_df['GarageFinish'].cat.codesfinal_df['GarageQual'] = final_df['GarageQual'].astype('category')
final_df['GarageQual'] = final_df['GarageQual'].cat.codesfinal_df['GarageCond'] = final_df['GarageCond'].astype('category')
final_df['GarageCond'] = final_df['GarageCond'].cat.codesfinal_df['PavedDrive'] = final_df['PavedDrive'].astype('category')
final_df['PavedDrive'] = final_df['PavedDrive'].cat.codesfinal_df['LotConfig'] = final_df['LotConfig'].astype('category')
final_df['LotConfig'] = final_df['LotConfig'].cat.codesfinal_df['LandSlope'] = final_df['LandSlope'].astype('category')
final_df['LandSlope'] = final_df['LandSlope'].cat.codesfinal_df['SaleType'] = final_df['SaleType'].astype('category')
final_df['SaleType'] = final_df['SaleType'].cat.codesfinal_df['SaleCondition'] = final_df['SaleCondition'].astype('category')
final_df['SaleCondition'] = final_df['SaleCondition'].cat.codesfinal_df['BsmtFinType2'] = final_df['BsmtFinType2'].astype('category')
final_df['BsmtFinType2'] = final_df['BsmtFinType2'].cat.codes
Finally
We will now separate test and train set using
clean_train_df = final_df.iloc[:1412,:]
clean_test_df = final_df.iloc[1412:,:]
and now our data is clean and ready for any machine learning algorithm to rock.
More you can do.
- Instead of assigning random categorical codes we can fix them by self. Idea is to categorical value that reduce the target value will be given ‘0’ value and the one which increases the most will be given ’n’. Where ’n’ is number of unique categorical values of that feature
from pandas.api.types import CategoricalDtype
df['MSZoning'] = df['MSZoning'].astype(
CategoricalDtype(
categories=[
'FV',
'C (all)',
'RL',
'RM',
'RH'],
ordered = True)).cat.codes
If you have any suggestions please mail me that on rajanlagah@gmail.com.