Steps for Exploratory Data Analysis
It is said that data scientists spend most of their time cleaning data and preparing it for the model. Here I am going to create a cheat sheet of EDA steps with the related code.
- Checking null values
- Box plot
- Checking distribution and skewness
- Checking Correlation
- Regplot for feature w.r.t target feature
- Hexagonal bin
Categorical Data
- Bar plot
Checking for null values
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train_df = pd.read_csv('./train.csv')
We will create a heatmap so that we can see whether each feature has null values or not.
plt.figure(figsize=(20,9))
sns.heatmap(train_df.isnull(), yticklabels=False, cbar=False)  # null cells show up in a different colour
We will remove all the null values, and after that the heatmap will show no highlighted cells. I have discussed how to do that here.
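As a minimal sketch of that cleanup (assuming the same train_df; the 50% threshold and the median/mode fill strategy are just illustrative choices, not the only way):

# Drop columns where more than half of the values are null (threshold is an assumption)
null_ratio = train_df.isnull().mean()
train_df = train_df.drop(columns=null_ratio[null_ratio > 0.5].index)

# Fill the remaining nulls: median for numeric columns, mode for object columns
for col in train_df.columns[train_df.isnull().any()]:
    if train_df[col].dtype == 'object':
        train_df[col] = train_df[col].fillna(train_df[col].mode()[0])
    else:
        train_df[col] = train_df[col].fillna(train_df[col].median())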
Box Plot
import seaborn as sns

sns.set(style="whitegrid")
data = [0., 1, 12, 15, 11, 0., 20., 31., 34, 22, 70, 22, 26.]
ax = sns.boxplot(data=data)
The output will be:
The left and right edges of the box are the 25th and 75th percentiles respectively, with the median in the centre. More about box plots here.
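To connect the plot back to the numbers, here is a quick sketch that computes the same quartiles directly from the toy data list above:

import numpy as np

data = [0., 1, 12, 15, 11, 0., 20., 31., 34, 22, 70, 22, 26.]
q1, median, q3 = np.percentile(data, [25, 50, 75])
print(q1, median, q3)  # left edge of the box, centre line, right edge of the box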
Checking Distribution of a Feature
- Here we will check the skewness of the data in each feature.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train_csv = pd.read_csv('./train.csv')
Now we loop over the features and plot each feature's distribution, labelling each plot with its skewness:
plt.figure(figsize=(25,20))
for i in range(len(train_csv.columns)):
    if i < 28:  # the 7 x 4 grid has room for 28 plots
        plt.subplot(7, 4, i + 1)
        plt.subplots_adjust(hspace=0.5, wspace=0.5)
        ax = sns.distplot(train_csv[train_csv.columns[i]])
        ax.legend(["Skewness: {:.2f}"
                   .format(train_csv[train_csv.columns[i]].skew())],
                  fontsize='xx-large')
Now let's list the feature names.
train_csv.columns
# o/p Index(['MSSubClass','LotFrontage', 'LotArea', 'OverallCond', 'Age', ...
Significance
As we know, heavily skewed data is bad for many models, so this graph helps us spot it.
- Skewness below -1 or above +1 means the feature is highly skewed.
Facts
- The area under the density curve is 1.
- To decrease skewness we can apply a log transform:
import numpy as np

for i in skewed_features:  # list of highly skewed column names (one way to build it is sketched below)
    df[i] = np.log(df[i] + 1)  # log(1 + x) pulls in the long right tail
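The skewed_features list is not defined in the snippet above; one way to build it (a sketch, assuming df holds the numeric training data and reusing the ±1 threshold from above) is:

# Columns whose absolute skewness is above 1 (the threshold is an assumption)
skewed_features = [col for col in df.select_dtypes(include='number').columns
                   if abs(df[col].skew()) > 1]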
Correlation
If you have a lot of features, you can drop the less correlated ones or merge some features. To check correlation we can plot a heatmap of the correlation matrix. Note that correlation is highly sensitive to outliers.
plt.figure(figsize=(10,8))  # to adjust the figure size
ax = sns.heatmap(train[features_array].corr(), cmap="coolwarm", annot=True, linewidth=3)

# Some matplotlib versions cut off the top and bottom rows of the heatmap; this restores them
bottom, top = ax.get_ylim()
ax.set_ylim(bottom + 0.5, top - 0.5)
If you have a lot of features, use this to keep only the ones highly correlated with the target.
hig_corr = train.corr()
hig_corr_features = hig_corr.index[abs(hig_corr["SalePrice"])>=0.5]
hig_corr_features
SalePrice is the target, and hig_corr_features is the array of features that are highly correlated with it.
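One simple way to act on this (a sketch; note that hig_corr_features also contains SalePrice itself, since every column has correlation 1 with itself) is to keep only those columns:

# Keep only the features that are highly correlated with the target (plus the target itself)
train_reduced = train[hig_corr_features]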
Plot target w.r.t feature
From the heatmap above we can see which features are highly correlated, and then we can plot each of those features against the target:
plt.figure(figsize=(16,9))
for i in range(len(hig_corr_features)):
    if i <= 9:
        plt.subplot(3, 4, i + 1)
        plt.subplots_adjust(hspace=0.5, wspace=0.5)
        sns.regplot(data=train,
                    x=hig_corr_features[i],
                    y='SalePrice')
Here we can see that a linear model will fit well.
Hexagonal bin
In scatter plots like the ones above, the data can look like a mess of dots. So now we will create hexagonal bins, group the points that fall inside each bin, and colour each bin according to the density of points under that hexagonal area.
We can use the pandas hexbin function:
train_df.plot.hexbin(x='LotArea', y='YrSold', gridsize=20)
Violin plot
A box plot only tells us summary statistics of the data. A violin plot also shows the density of the data's spread.
tips = sns.load_dataset("tips")
ax = sns.violinplot(x="day", y="total_bill", data=tips)
Categorical Data
Non-continuous, discrete data, for example:
- Days of the week
- Weather type
Get the list of categorical features with:
obj_feat = list(data.loc[:, data.dtypes == 'object'].columns.values)
obj_feat
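Before moving to grouped bar charts, here is a quick sketch (assuming the data frame and the obj_feat list from above) that bar-plots the value counts of a single categorical feature; the choice of obj_feat[0] is arbitrary:

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
sns.countplot(x=obj_feat[0], data=data)  # one bar per category of the first categorical feature
plt.show()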
Bar chart
It is different from a histogram: the x-axis holds categorical variables rather than the numerical values of some feature. We can also group features to see relationships between them.
ax = train.groupby(['YrSold','SaleType'])['SaleType'].count().unstack(0).plot.bar(figsize=(14,8))
The above code is from here.
And the output will be:
Conclusion
I tried to mention all the basic ones. If you have some other useful EDA technique, please share it with me at rajanlagah@gmail.com so we can create an awesome EDA blog for starters. I will also keep updating this post as I learn more ML.