Building a Machine Learning Model.

Building a Machine Learning Model.

Building  a Machine Learning Model.

We had seen what is Exploratory Data Analysis,Performance Metrics & Detailed workflow of model in our previous posts.Let's see how to build a machine learning model step by step in this article.The detailed step by step is beautifully depicted in the below picture.
Cartoon Infographic on Building the Machine Learning Model. (Drawn by Chanin Nantasenamat)

Initially after getting the dataset, we need to clean the data which is done by using Exploratory Data Analysis.Exploratory data analysis (EDA) is performed in order to gain a preliminary understanding and allow us to get acquainted with the dataset.

Three major EDA approaches  includes:

Descriptive statistics — Mean, median, mode, standard deviation
Data visualisations — Heat maps (discerning feature intra-correlation), box plot (visualize group differences), scatter plots (visualize correlations between features), principal component analysis (visualize distribution of clusters presented in the dataset), etc.
Data shaping — Pivoting data, grouping data, filtering data, etc.


Data Pre-Processing:

Data pre-processing (also known as data cleaning, data wrangling or data munging) is the process by which the data is subjected to various checks and scrutiny in order to remedy issues of missing values, spelling errors, normalizing/standardizing values such that they are comparable, transforming data (e.g. logarithmic transformation), etc.

“Garbage in, Garbage out.”
— George Fuechsel

As the above quote suggests, the quality of data is going to exert a big impact on the quality of the generated model. Therefore, to achieve the highest model quality, significant effort should be spent in the data pre-processing phase. It is said that data pre-processing could easily account for 80% of the time spent on data science projects while the actual model building phase and subsequent post-model analysis account for the remaining 20%.

Data Splitting:

Train-Test Split
In the development of machine learning models, it is desirable that the trained model perform well on new, unseen data. In order to simulate the new, unseen data, the available data is subjected to data splitting whereby it is split to 2 portions (sometimes referred to as the train-test split). Particularly, the first portion is the larger data subset that is used as the training set (such as accounting for 80% of the original data) and the second is normally a smaller subset and used as the testing set (the remaining 20% of the data). It should be noted that such data split is performed once.

Next, the training set is used to build a predictive model and such trained model is then applied on the testing set (i.e. serving as the new, unseen data) to make predictions. Selection of the best model is made on the basis of the model’s performance on the testing set and in efforts to obtain the best possible model, hyperparameter optimization may also be performed.


Train-Validation-Test Split
Another common approach for data splitting is to split the data to 3 portions: (1) training set, (2) validation set and (3) testing set. Similar to what was explained above, the training set is used to build a predictive model and is also evaluated on the validation set whereby predictions are made, model tuning can be made (e.g. hyperparameter optimization) and selection of the best performing model based on results of the validation set. As we can see, similar to what was performed above to the test set, here we do the same procedure on the validation set instead. Notice that the testing set is not involved in any of the model building and preparation. Thus, the testing set can truly act as the new, unseen data.

Cross-Validation:

In order to make the most economical use of the available data, an N-fold cross-validation (CV) is normally used whereby the dataset is partitioned to N folds (i.e. commonly 5-fold or 10-fold CV are used). In such N-fold CV, one of the fold is left out as the testing data while the remaining folds are used as the training data for model building.
For example, in a 5-fold CV, 1 fold is left out and used as the testing data while the remaining 4 folds are pooled together and used as the training data for model building. The trained model is then applied on the aforementioned left-out fold (i.e. the test data). This process is carried out iteratively until all folds had a chance to be left out as the testing data. As a result, we will have built 5 models (i.e. where each of the 5 folds have been left out as the testing set) where each of the 5 models contain associated performance metrics (which we will discuss soon in the forthcoming section). Finally, the metric values are based on the average performance computed from the 5 models.


In situations when N is equal to the number of data samples, we call this leave-one-out cross-validation. In this type of CV, each data sample represents a fold. For example, if N is equal to 30 then there are 30 folds (1 sample per fold). As in any other N-fold CV, 1 fold is left out as the testing set while the remaining 29 folds are used to build the model. Next, the built model is applied to make prediction on the left-out fold. As before, this process is performed iteratively for a total of 30 times; and the average performance from the 30 models are computed and used as the CV performance metric.

Model Building:
Now, comes the fun part where we finally get to use the meticulously prepared data for model building. Depending on the data type (qualitative or quantitative) of the target variable (commonly referred to as the Y variable) we are either going to be building a classification (if Y is qualitative) or regression (if Y is quantitative) model.

Learning Algorithms
Machine learning algorithms could be broadly categorised to one of three types:
Supervised learning — is a machine learning task that establishes the mathematical relationship between input X and output Y variables. Such X, Y pair constitutes the labeled data that are used for model building in an effort to learn how to predict the output from the input.
Unsupervised learning — is a machine learning task that makes use of only the input X variables. Such X variables are unlabeled data that the learning algorithm uses in modeling the inherent structure of the data.
Reinforcement learning — is a machine learning task that decides on the next course of action and it does this by learning through trial and error in an effort to maximize the reward.

Hyperparameter Tuning:
Hyperparameters are essentially parameters of the machine learning algorithm that directly impacts the learning process and prediction performance. As there are no “one-size fits all” hyperparameter settings that will universally work for all datasets therefore one will need to perform hyperparameter optimization (also known as hyperparameter tuning or model tuning).

Model Deployement:

Model deployement is the final part of machine learning.After making the model with best accuracy we need to deploy it in a cloud platform for an end user to use it.

The above images are beautifully depicted for clear understanding.

Note: Images are taken from internet.
***Thank you***

0 Response to "Building a Machine Learning Model."

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel