Step by Step Exploratory Data Analysis

Building a machine learning model runs from data gathering through to model deployment. The detailed workflow of building a model is shown in my previous post. The first and foremost step in any machine learning problem is cleaning the data, also called data wrangling or exploratory data analysis. Here, cleaning the data simply means making it useful for model building, and Exploratory Data Analysis is the detailed process of doing so. The step-by-step process is shown below:

1. Data Analysis
2. Feature Engineering
3. Feature Selection
4. Model Building
5. Model Deployment

Data Analysis:
After collecting the data, a data scientist's next step is to clean it. Data cleaning, data wrangling and Exploratory Data Analysis are used here with the same meaning; Exploratory Data Analysis (EDA) is the most widely used term.
Data analysis has several substeps, which are listed below:

1. Basic data exploration: check the size and shape of the dataset and look at a summary of the data to get a clear idea of what the dataset contains.
2. Check for missing values: the data may be incomplete or contain NaN values, which make it harder for a machine learning model to predict well. Handle missing values either by dropping them (if there are few missing values) or by imputation (replacing them with the mean, median or mode), depending on the problem statement.
3. Check for duplicate values and drop them, because duplicates add no value to the model.
4. Handling Outliers: An important step in data analysis is handling outliers. An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. Outliers can negatively affect the statistical analysis and the training process of a machine learning algorithm, resulting in lower accuracy, so we need to detect them and handle them.
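Steps 1 to 3 above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the toy dataset, its column names (age, city) and the choice of median imputation are assumptions made for the example. In practice a library such as pandas is commonly used for these checks.

```python
from statistics import median

# Hypothetical toy dataset: each row is a dict, None marks a missing value.
rows = [
    {"age": 25, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 31, "city": "Pune"},
    {"age": 25, "city": "Pune"},   # exact duplicate of the first row
    {"age": 40, "city": None},
]

# 1. Basic exploration: number of rows and columns.
columns = list(rows[0].keys())
shape = (len(rows), len(columns))

# 2. Missing values: count them per column.
missing_counts = {col: sum(r[col] is None for r in rows) for col in columns}

# Impute the numeric column with the median of the observed values ...
observed_ages = [r["age"] for r in rows if r["age"] is not None]
age_median = median(observed_ages)
imputed = [{**r, "age": age_median if r["age"] is None else r["age"]} for r in rows]

# ... and drop rows that still contain a missing value.
complete = [r for r in imputed if all(v is not None for v in r.values())]

# 3. Duplicates: keep the first occurrence of each row, preserving order.
seen = set()
deduplicated = []
for r in complete:
    key = tuple(r[c] for c in columns)
    if key not in seen:
        seen.add(key)
        deduplicated.append(r)

print(shape, missing_counts, len(deduplicated))
```

Whether to drop or impute depends on how many values are missing and on the problem statement, as noted in step 2.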

Outlier Detection with IQR:
 
The picture above shows that outliers are located outside the upper bound and the lower bound, so the idea is simple to understand. But how can we know the number of outliers in each variable? To find it, we apply the IQR formula. First, calculate the IQR by subtracting Q1 from Q3 (IQR = Q3 - Q1). Second, calculate the upper bound as Q3 plus 1.5 times the IQR. Third, calculate the lower bound as Q1 minus 1.5 times the IQR.
Once we have the IQR, the upper bound and the lower bound, we can see which values exceed the upper bound and which values are smaller than the lower bound. Counting those values gives the number of outliers.
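The IQR rule can be sketched as follows. This is a minimal example using the standard library's statistics.quantiles; the sample values and the helper names iqr_bounds and find_outliers are made up for illustration, and other quartile conventions can give slightly different bounds.

```python
from statistics import quantiles

def iqr_bounds(values):
    """Return (lower_bound, upper_bound) using the 1.5 * IQR rule."""
    # method="inclusive" interpolates linearly between data points.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1                      # IQR = Q3 - Q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values):
    """Return the values that fall outside the IQR bounds."""
    lower, upper = iqr_bounds(values)
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample with one obviously extreme value.
sample = [10, 12, 12, 13, 14, 15, 16, 18, 95]

print(iqr_bounds(sample))          # (6.0, 22.0)
print(find_outliers(sample))       # [95]
print(len(find_outliers(sample)))  # number of outliers: 1
```

Here Q1 is 12 and Q3 is 16, so the IQR is 4, the lower bound is 12 - 6 = 6 and the upper bound is 16 + 6 = 22.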

Handling Outliers:
Here are 3 ways to handle outliers:
a. Dropping the outliers: You omit the outlier values.
b. Capping the outliers: You replace the outlier values with the bounds. Outliers above the upper bound are replaced by the upper bound value, and outliers below the lower bound are replaced by the lower bound value.
c. Replacing with new values: You replace the outlier values with the mean, median, or mode.
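The three options above might look like this in code. It is only a sketch on made-up values: the bounds come from the 1.5 * IQR rule, and for option (c) the median of the non-outlier values is used, which is one possible choice among mean, median and mode.

```python
from statistics import median, quantiles

# Hypothetical sample with one extreme value.
values = [10, 12, 12, 13, 14, 15, 16, 18, 95]

# Bounds from the 1.5 * IQR rule.
q1, _, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

def is_outlier(v):
    return v < lower or v > upper

# a. Dropping: keep only the values inside the bounds.
dropped = [v for v in values if not is_outlier(v)]

# b. Capping: clip every value into the range [lower, upper].
capped = [min(max(v, lower), upper) for v in values]

# c. Replacing: swap each outlier for the median of the non-outliers.
inlier_median = median(dropped)
replaced = [inlier_median if is_outlier(v) else v for v in values]

print(dropped)   # the value 95 is removed
print(capped)    # the value 95 becomes the upper bound, 22.0
print(replaced)  # the value 95 becomes the inlier median, 13.5
```

Dropping shrinks the dataset, while capping and replacing keep the row count unchanged, which matters when other columns of the same row are still useful.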

5. Converting categorical variables to numerical variables: As most machine learning models take only numerical data, we need to convert categorical variables into numerical variables. Common techniques are 1. Label Encoding 2. One Hot Encoding 3. Target Encoding.
6. After converting all variables to numerical, we need to scale the data, i.e. bring the features to a comparable range, using 1. Standardization or 2. Normalization.
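Steps 5 and 6 can be illustrated with a small standard-library sketch. The colors, targets and heights lists are invented for the example, and real projects usually rely on a library such as scikit-learn for these transformations. Note that target encoding should be computed on training data only, otherwise the target leaks into the features.

```python
from statistics import mean, pstdev

# Hypothetical categorical column and binary target.
colors = ["red", "green", "blue", "green"]
targets = [1, 0, 1, 1]

# Sorted so the category order is deterministic: ['blue', 'green', 'red'].
categories = sorted(set(colors))

# Label encoding: each category becomes an integer.
label_map = {cat: i for i, cat in enumerate(categories)}
label_encoded = [label_map[c] for c in colors]

# One-hot encoding: one 0/1 column per category.
one_hot = [[1 if c == cat else 0 for cat in categories] for c in colors]

# Target encoding: each category becomes the mean target of its rows.
target_means = {
    cat: mean(t for c, t in zip(colors, targets) if c == cat)
    for cat in categories
}
target_encoded = [target_means[c] for c in colors]

# Hypothetical numeric column to scale.
heights = [150.0, 160.0, 170.0, 180.0]

# Standardization: subtract the mean, divide by the standard deviation.
mu, sigma = mean(heights), pstdev(heights)
standardized = [(h - mu) / sigma for h in heights]

# Normalization (min-max): rescale into the range [0, 1].
lo, hi = min(heights), max(heights)
normalized = [(h - lo) / (hi - lo) for h in heights]

print(label_encoded)   # [2, 1, 0, 1]
print(one_hot)         # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
print(target_encoded)
print(standardized)
print(normalized)
```

Label encoding implies an order between categories, so one-hot encoding is often the safer choice for unordered categories with few distinct values.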

Feature Engineering:
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data. Some features have to be engineered based on the problem statement. Techniques include 1. Feature Binning 2. Feature Encoding etc.
Sometimes removing an unwanted feature is also feature engineering, because a feature that is unrelated to the problem degrades the performance of the model. The steps to do feature engineering are as follows:
  • Brainstorm features.
  • Create features.
  • Check how the features work with the model.
  • Start again from the first step until the features work well.
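As an example of creating a feature, here is a small sketch of feature binning. The bin edges (18 and 65) and the labels are assumptions chosen for illustration; in a real problem they would come from the problem statement or from looking at the data.

```python
from bisect import bisect_right

# Hypothetical bin edges and labels for an "age" feature.
edges = [18, 65]
labels = ["child", "adult", "senior"]

def bin_age(age):
    """Map a numeric age to a categorical age group."""
    # bisect_right counts how many edges are <= age, which is the bin index.
    return labels[bisect_right(edges, age)]

ages = [5, 17, 18, 42, 64, 65, 80]
age_groups = [bin_age(a) for a in ages]
print(age_groups)
# ['child', 'child', 'adult', 'adult', 'adult', 'senior', 'senior']
```

The new age_group feature can then be encoded and checked against the model, following the brainstorm, create and check loop listed above.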

Feature Selection:
After feature engineering, we need to select the most important features required for better accuracy; this is called feature selection. Feature selection is the process of selecting the subset of features/attributes (such as columns in tabular data) that are most relevant for the modeling and business objective of the problem. It basically helps in finding the most meaningful inputs from the data. Some of the methods of feature selection are:
  • Exhaustive Search
  • Filter Methods
  • Wrapper Methods
  • Embedded Methods
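
To make one of these concrete, here is a rough sketch of a simple filter method: score each feature by its absolute Pearson correlation with the target and keep the ones above a threshold. The feature names, the data and the 0.5 threshold are all hypothetical, and correlation is only one of several possible filter criteria.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical features and target (price is exactly 2 * area).
features = {
    "area": [50, 60, 80, 100, 120],
    "noise": [4, 1, 5, 1, 3],
}
price = [100, 120, 160, 200, 240]

threshold = 0.5  # assumed cut-off, tune it for the problem at hand

# Score every feature independently of any model (the "filter" idea).
scores = {name: abs(pearson(col, price)) for name, col in features.items()}
selected = [name for name, score in scores.items() if score >= threshold]

print(scores)
print(selected)  # ['area']
```

Wrapper and embedded methods differ in that they involve the model itself, either by training it on different feature subsets or by using the importance it learns during training.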

Model Building:
After choosing the right features for our model, we need to choose the right model for our problem statement. We then build the model and tweak its settings, choosing the values that give the best accuracy. These settings are called hyperparameters, and tweaking them is called hyperparameter tuning.
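
Hyperparameter tuning can be sketched as a small grid search. The example below is an assumption-laden toy: a hand-written 1-D k-nearest-neighbours classifier, a made-up training and validation split, and a grid of three values for the hyperparameter k. Libraries such as scikit-learn provide ready-made tools for the same idea.

```python
def knn_predict(train, x, k):
    """Predict the label of x by majority vote among its k nearest points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = [label for _, label in nearest]
    return max(set(votes), key=votes.count)

def accuracy(train, valid, k):
    """Fraction of validation points classified correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in valid)
    return hits / len(valid)

# Hypothetical (feature, label) pairs; (2.3, 1) is a noisy training point.
train = [(1.0, 0), (1.5, 0), (2.0, 0), (2.3, 1), (6.0, 1), (6.5, 1), (7.0, 1)]
valid = [(1.2, 0), (2.4, 0), (6.2, 1), (6.8, 1)]

# Grid of candidate values for the hyperparameter k (odd to avoid ties).
grid = [1, 3, 5]

# Evaluate every candidate on the validation set and keep the best one.
results = {k: accuracy(train, valid, k) for k in grid}
best_k = max(results, key=results.get)

print(results)  # {1: 0.75, 3: 1.0, 5: 1.0}
print(best_k)   # 3
```

With k = 1 the noisy training point misleads the prediction for 2.4, while larger values of k outvote it, which is why the search prefers k = 3 here.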

Model Deployment:
Model deployment is the final step of the workflow. After building our model, we need to deploy it on a cloud platform such as Google Cloud Platform, Microsoft Azure or Heroku. We can choose any one of them as per our convenience.

NOTE: This is the detailed process of EDA. It is not an exact process that must always be followed; based on the problem statement and our requirements, the process changes. What is described above is the general process commonly followed in the industry.
