Steps for Exploratory Data Analysis
It is said that data scientists spend most of their time cleaning data and preparing it for the model. Here I am going to create a cheat sheet of EDA steps with the related code.
- Checking null values
- Box plot
- Checking distribution and skewness
- Checking Correlation
- Regplot for feature w.r.t target feature
- Hexagonal bin
Categorical Data
- Bar plot
Checking for null values
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train_df = pd.read_csv('./train.csv')
We will create a heatmap so that we can see whether each feature has null values or not.
plt.figure(figsize=(20,9))
sns.heatmap(train_df.isnull(), yticklabels=False, cbar=False)  # null cells show up in a different colour
We will remove all the null values, and after that the heatmap will show no highlighted cells. I have discussed how to do that here.
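As a minimal sketch of that cleanup (assuming the same train_df; the 50% threshold and the median/mode fill strategy are just illustrative choices, not the only way):

# Drop columns where more than half of the values are null (threshold is an assumption)
null_ratio = train_df.isnull().mean()
train_df = train_df.drop(columns=null_ratio[null_ratio > 0.5].index)

# Fill the remaining nulls: median for numeric columns, mode for object columns
for col in train_df.columns[train_df.isnull().any()]:
    if train_df[col].dtype == 'object':
        train_df[col] = train_df[col].fillna(train_df[col].mode()[0])
    else:
        train_df[col] = train_df[col].fillna(train_df[col].median())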
Box Plot
import seaborn as sns

sns.set(style="whitegrid")
data = [0., 1, 12, 15, 11, 0., 20., 31., 34, 22, 70, 22, 26.]
ax = sns.boxplot(data=data)
The output will be:
The left and right edges of the box are the 25th and 75th percentiles respectively, with the median in the centre. More about box plots here.
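To connect the plot back to the numbers, here is a quick sketch that computes the same quartiles directly from the toy data list above:

import numpy as np

data = [0., 1, 12, 15, 11, 0., 20., 31., 34, 22, 70, 22, 26.]
q1, median, q3 = np.percentile(data, [25, 50, 75])
print(q1, median, q3)  # left edge of the box, centre line, right edge of the box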
Checking Distribution of a Feature
- Here we will check the skewness of the data in each feature.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train_csv = pd.read_csv('./train.csv')
Now we loop over the features and plot each feature's distribution, labelling each plot with its skewness:
plt.figure(figsize=(25,20))
for i in range(len(train_csv.columns)):
    if i < 28:  # the 7 x 4 grid has room for 28 plots
        plt.subplot(7, 4, i + 1)
        plt.subplots_adjust(hspace=0.5, wspace=0.5)
        ax = sns.distplot(train_csv[train_csv.columns[i]])
        ax.legend(["Skewness: {:.2f}"
                   .format(train_csv[train_csv.columns[i]].skew())],
                  fontsize='xx-large')
Now let's list the feature names.
train_csv.columns
# o/p Index(['MSSubClass','LotFrontage', 'LotArea', 'OverallCond', 'Age', ...
Significance
As we know, heavily skewed data is bad for many models, so this graph helps us spot it.
- Skewness below -1 or above +1 means the feature is highly skewed.
Facts
- The area under the density curve is 1.
- To decrease skewness we can apply a log transform:
import numpy as np

for i in skewed_features:  # list of highly skewed column names (one way to build it is sketched below)
    df[i] = np.log(df[i] + 1)  # log(1 + x) pulls in the long right tail
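The skewed_features list is not defined in the snippet above; one way to build it (a sketch, assuming df holds the numeric training data and reusing the ±1 threshold from above) is:

# Columns whose absolute skewness is above 1 (the threshold is an assumption)
skewed_features = [col for col in df.select_dtypes(include='number').columns
                   if abs(df[col].skew()) > 1]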
Correlation
If you have a lot of features, you can drop the less correlated ones or merge some features. To check correlation we can plot a heatmap of the correlation matrix. Note that correlation is highly sensitive to outliers.
plt.figure(figsize=(10,8))  # to adjust the figure size
ax = sns.heatmap(train[features_array].corr(), cmap="coolwarm", annot=True, linewidth=3)

# Some matplotlib versions cut off the top and bottom rows of the heatmap; this restores them
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
If you have a lot of features, use this to keep only the ones highly correlated with the target.
hig_corr = train.corr()
hig_corr_features = hig_corr.index[abs(hig_corr["SalePrice"])>=0.5]
hig_corr_features
SalePrice is the target, and hig_corr_features is the array of features that are highly correlated with it.
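One simple way to act on this (a sketch; note that hig_corr_features also contains SalePrice itself, since every column has correlation 1 with itself) is to keep only those columns:

# Keep only the features that are highly correlated with the target (plus the target itself)
train_reduced = train[hig_corr_features]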
Plot target w.r.t feature
From the heatmap above we can see which features are highly correlated, and then we can plot each of those features against the target:
plt.figure(figsize=(16,9))
for i in range(len(hig_corr_features)):
    if i <= 9:
        plt.subplot(3, 4, i + 1)
        plt.subplots_adjust(hspace=0.5, wspace=0.5)
        sns.regplot(data=train,
                    x=hig_corr_features[i],
                    y='SalePrice')
Here we can see that a linear model will fit well.
Hexagonal bin
In scatter plots like the ones above, the data can look like a mess of dots. So now we will create hexagonal bins, group the points that fall inside each bin, and colour each bin according to the density of points under that hexagonal area.
We can use the pandas hexbin function:
train_df.plot.hexbin(x='LotArea', y='YrSold', gridsize=20)
Violin plot
A box plot only tells us summary statistics of the data. A violin plot also shows the density of the data's spread.
tips = sns.load_dataset("tips")
ax = sns.violinplot(x="day", y="total_bill", data=tips)
Categorical Data
Non-continuous, discrete data, for example:
- Days of the week
- Weather type
Get the list of categorical features with:
obj_feat = list(data.loc[:, data.dtypes == 'object'].columns.values)
obj_feat
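Before moving to grouped bar charts, here is a quick sketch (assuming the data frame and the obj_feat list from above) that bar-plots the value counts of a single categorical feature; the choice of obj_feat[0] is arbitrary:

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
sns.countplot(x=obj_feat[0], data=data)  # one bar per category of the first categorical feature
plt.show()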
Bar chart
It is different from a histogram: the x-axis holds categorical variables rather than the numerical values of some feature. We can also group features to see relationships between them.
ax = train.groupby(['YrSold','SaleType'])['SaleType'].count().unstack(0).plot.bar(figsize=(14,8))
The above code is from here.
And the output will be:
Conclusion
I tried to mention all the basic ones. If you have some other useful EDA technique, please share it with me at rajanlagah@gmail.com so we can create an awesome EDA blog for starters. I will also keep updating this post as I learn more ML.