Exploratory Data Analysis

Machine learning is an application of artificial intelligence (AI) that lets computers learn from given data without being explicitly programmed. Nowadays computers are powerful enough to be trained on large amounts of data in very little time. A data scientist also needs to know how the data varies, how it is categorized and how it is distributed. With the help of Exploratory Data Analysis (EDA) we can draw conclusions about the data that a human can observe through graphs, charts and values.

Definition

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, to spot anomalies, to test hypotheses and to check assumptions with the help of summary statistics and graphical representations.

Explanation of EDA with the sample Iris dataset:

I am taking the Iris dataset as a sample dataset and performing EDA on it. The Iris dataset contains four features:

1. sepal_length
2. sepal_width
3. petal_length
4. petal_width

and three classes:

1. setosa
2. virginica
3. versicolor

Please click here to get more information about the Iris dataset.

The first step is to import the required libraries and then read the data file.

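# Libraries for data handling, plotting and numerical work
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset from a CSV file in the working directory
iris = pd.read_csv("iris.csv")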
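If iris.csv is not at hand, the same read_csv call can be tried on a few rows of CSV text held in memory. This is only a minimal sketch: it assumes the file has a header row with the five column names used in this post, and the six rows are a small hand-typed sample, not the full dataset.

```python
import io

import pandas as pd

# Hypothetical stand-in for iris.csv: a header plus six sample rows.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
6.3,3.3,6.0,2.5,virginica
5.8,2.7,5.1,1.9,virginica
"""

# read_csv accepts a file-like object, so StringIO can replace a file path.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)
print(list(sample.columns))
```

The variable name sample is used here so that it does not overwrite the real iris dataframe.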

Now we have to show the number of rows and columns in the dataframe; the shape attribute provides that. After that we have to find out which columns the dataframe contains. The columns attribute returns an Index holding the column names.

print("Shape of dataframe is", iris.shape)
print("Columns of dataframe are", iris.columns)

We have to see what the data looks like, so we display the first few rows. The head() function provides that functionality.

iris.head()
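Besides looking at the first rows, the summary statistics mentioned in the definition can be printed in one call. The following is a small sketch using describe() on a hand-typed sample of two columns; the values are assumptions for illustration, and on the real data the call would simply be iris.describe().

```python
import pandas as pd

# Small illustrative sample, not the full dataset.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = sample.describe()
print(summary)
```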

Data may contain null values, and those have to be filled with some values. So first of all we have to find out how many nulls each column contains.

iris.isna().sum()
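If some columns do report nulls, one possible way to fill them is with the median of each column. This is a sketch on hypothetical data with two gaps; the choice of the median is an assumption made here, not something the dataset prescribes, and the mean or a fixed value could be used the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
sample = pd.DataFrame({
    "sepal_length": [5.1, np.nan, 7.0, 6.4],
    "sepal_width": [3.5, 3.0, np.nan, 3.2],
})
print(sample.isna().sum())

# Fill the gaps in each numeric column with that column's median.
filled = sample.fillna(sample.median())
print(filled.isna().sum())
```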

Here species is the target column, which means we have to classify the data by species. So we have to check which unique species exist in the dataframe and how many rows each one has.

iris["species"].value_counts()
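Once the classes are known, it can help to compare a feature across them. The sketch below counts the rows per species and then takes the mean petal length of each class with groupby; the six rows are a hand-typed sample used only for illustration.

```python
import pandas as pd

# Small illustrative sample with two rows per species.
sample = pd.DataFrame({
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "species": ["setosa", "setosa", "versicolor",
                "versicolor", "virginica", "virginica"],
})

# Number of rows in each class.
counts = sample["species"].value_counts()
print(counts)

# Mean petal length within each class.
means = sample.groupby("species")["petal_length"].mean()
print(means)
```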

2-D Scatter Plot

Two-dimensional scatterplots visualize a relation (correlation) between two variables X and Y. Individual data points are represented in two-dimensional space, where the axes represent the variables. The two coordinates that determine the location of each point correspond to its specific values on the two variables.

sns.set_style("whitegrid")
# Colour the points by species; recent seaborn versions use height instead of size
sns.FacetGrid(iris, hue="species", height=4) \
    .map(plt.scatter, "sepal_length", "sepal_width") \
    .add_legend()
plt.show()

sns.set_style("whitegrid")
# Pairwise scatter plots of all four features, coloured by species
sns.pairplot(iris, hue="species", height=3)
plt.show()
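The relation that a scatter plot shows can also be expressed as a number. The following sketch computes the Pearson correlation matrix with corr() on a hand-typed six-row sample; the values are assumptions for illustration, and on the real data the species column would have to be left out first because it is not numeric.

```python
import pandas as pd

# Small illustrative sample of the four numeric features.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
})

# Pearson correlation between every pair of columns (values from -1 to 1).
corr = sample.corr()
print(corr.round(2))
```

A value close to 1 or -1 corresponds to points lying near a straight line in the matching scatter plot.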

Histogram

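A histogram shows how the values of a single variable are distributed. If the kurtosis value is greater than 3, which is the kurtosis of a normal distribution, the distribution has heavier tails than a normal one, and that suggests there are outliers in the data. Such outliers should be inspected and, where appropriate, removed.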
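The check can be sketched in a few lines. Note that the kurt() method in pandas returns excess kurtosis (a normal distribution gives 0), so 3 is added here to compare against the threshold of 3. The values are hypothetical, with one deliberate outlier, and the 1.5 × IQR rule used to pick out the outlier is an assumption chosen for this example, not the only way to do it.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; the last value is a deliberate outlier.
values = pd.Series([5.1, 4.9, 5.0, 5.2, 4.8, 5.1, 5.0, 4.9, 5.0, 9.5])

# Histogram as bin counts: how many values fall into each of 5 equal bins.
counts, edges = np.histogram(values, bins=5)
print(counts)

# pandas gives excess kurtosis, so add 3 to compare with the value 3.
kurtosis = values.kurt() + 3
print(kurtosis)

# Flag values outside 1.5 * IQR of the quartiles (assumed rule).
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
outliers = values[is_outlier]
cleaned = values[~is_outlier]
print(outliers.tolist())
```

On the real data the same steps could be applied to one column at a time, for example iris["sepal_length"].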

So finally we can conclude that EDA is one of the best ways for humans to observe data, with the help of graphs and the characteristics of numerical data.

Parth Shah

Hey there, my name is Parth Shah. I am from Modasa (Gujarat).
Currently I am working at Tata Consultancy Services, Gandhinagar, as an Assistant System Engineer Trainee.
I completed my graduation in Computer Science at Dharmsinh Desai Institute of Technology, Nadiad (DDIT).
As a curious student I have developed an interest in data science and machine learning, so I am learning these subjects as well.
I love to code, to play with numbers and to play with data structures and algorithms for solving problems.
I am addicted to reading blogs and books related to finance and technology.
The motto of my life is to be happy and make others happy as much as I can.