Practical 9: Predictive Modelling Analysis and Classification Models

Added on 2021/05/27

Practical 9 - Predictive Modelling
Answer each of the questions below using the examples and code provided in the working Python file:
1. How many features are there for the iris dataset? How many examples? How many labels?
There are four features in the iris dataset. These features are measured in centimetres.
The features are:
1. Sepal length
2. Sepal width
3. Petal length
4. Petal width
Each column is a feature (also known as a predictor, attribute, independent variable, input,
regressor or covariate).
There are 50 samples for each species of Iris flower (Iris setosa, Iris virginica and Iris versicolor).
This results in 150 records (examples), where each observation has the 4 features stated above. Each
row is an observation (also known as a sample, example, instance or record).
Labels are also known as targets. Each value that we predict is the response (also known as the
target, outcome, label or dependent variable).
Classification is a form of supervised learning in which the label is categorical. There are 150
labels in the iris dataset, one per example, falling under 3 categories:
0 = Setosa
1 = Versicolor
2 = Virginica
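The counts above can be checked directly. The sketch below assumes the copy of the iris data that ships with scikit-learn (`load_iris`), which may not be loaded the same way in the working Python file; the variable names are illustrative.

```python
from collections import Counter

from sklearn.datasets import load_iris

# Assumption: the scikit-learn copy of the iris dataset is used.
iris = load_iris()
X, y = iris.data, iris.target

n_examples, n_features = X.shape
print("examples:", n_examples)
print("features:", n_features, iris.feature_names)
print("labels:  ", len(y))

# Count how many examples carry each label code.
label_counts = Counter(int(label) for label in y)
for code, name in enumerate(iris.target_names):
    print(code, "=", name, "->", label_counts[code], "examples")
```

Each row of `X` is one example, each column one feature, and `y` holds one label per row.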
2. Why is it important to split the dataset into a training set and a test set? Why does a
classification model need to be trained on the training set while its prediction performance is
measured on the test set?
In machine learning, a model is an algorithm with parameters that are adjusted so that it performs
well on the task at hand, i.e. so that it predicts the values we want.
We train the model using data called the training data or training set. The training data already
contains the actual values that the model should predict, so the algorithm adjusts its parameters
to fit the data in the training set.
To find out whether the trained model is good overall, we use test data (the test set): separate
data for which we know the true values but which was never shown to the model during training.
If the trained model also performs well on the test set, we can say that the machine learning
model is good.

It is important to learn the predictive model (i.e. the classifier) on the training set and test its
performance on the test set. The purpose of predictive modelling is to create models that are able to
predict on future data. Hence it is important to keep training and test data separate and not to
use test data for learning predictive models.
A classification model can be used to predict the class label of unknown records. A classification
technique is a systematic approach to building classification models from an input set. The model
generated by a learning algorithm should both fit the input data well and correctly predict the class
labels of records it has never seen before.
First, a training set consisting of records whose class labels are known must be provided. The
training set is used to build a classification model, which is subsequently applied to the test set,
which consists of records whose class labels are withheld from the model.
Evaluation of the performance of a classification model is based on the counts of test records
correctly and incorrectly predicted by the model.
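The train/test procedure described above can be sketched as follows. The 70/30 split, the random seed and the choice of a linear SVM are illustrative assumptions, not necessarily the settings used in the working Python file.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Assumption: hold out 30% of the records, keeping the class proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = SVC(kernel="linear")
clf.fit(X_train, y_train)       # the model only ever learns from the training set

y_pred = clf.predict(X_test)    # the test labels are withheld at prediction time

# Evaluation: count the test records predicted correctly and incorrectly.
n_correct = int((y_pred == y_test).sum())
n_incorrect = len(y_test) - n_correct
test_accuracy = n_correct / len(y_test)
print(f"correct: {n_correct}, incorrect: {n_incorrect}, accuracy: {test_accuracy:.2f}")
```

The test labels `y_test` are used only after prediction, to count hits and misses.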
3. How can correlation analysis help identify the best features for the classification task? What
are the best features for the iris data based on the correlation analysis results?
Data correlation is the way in which one set of data corresponds to another. For a classification
problem, feature selection aims to select a subset of highly discriminant features, in other words
features that are capable of discriminating between samples that belong to different classes.
Because label information is available, the relevance of a feature is assessed by its ability to
distinguish between classes.
For example, a feature fi is said to be relevant to a class cj if fi and cj are highly correlated.
Classification is the problem of identifying to which of a set of categories a new observation belongs,
on the basis of a training set of data containing observations whose category membership is known.
Based on the correlation analysis results, we can see that the features petal_length and petal_width
are the best features for iris classification. Both are strongly correlated with the class label, and
the pair plots show that they are also highly correlated with each other.
If you train a model on features that have little or no correlation with the class label, it will
tend to give inaccurate results.
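One way to rank the features is to compute the correlation between each feature and the label. The sketch below uses the Pearson correlation with the integer-coded target, which treats the class codes 0, 1, 2 as if they were ordered; this is a simplifying assumption and may differ from the analysis in the working Python file.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Pearson correlation between each feature column and the integer class label.
correlations = {
    name: float(np.corrcoef(X[:, i], y)[0, 1])
    for i, name in enumerate(iris.feature_names)
}

# Rank features by the strength (absolute value) of their correlation.
ranked = sorted(correlations, key=lambda name: abs(correlations[name]), reverse=True)
for name in ranked:
    print(f"{name:20s} {correlations[name]:+.2f}")

best_two = ranked[:2]
print("two strongest features:", best_two)
```

With this measure the two petal measurements come out on top, which agrees with the answer above.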
4. Which class is easier to identify than the other two classes for the iris dataset? How can you
tell?
As per the correlation analysis results, the class Setosa (target value 0) is easier to identify than
the other two classes (1 = Versicolor, 2 = Virginica) for the iris dataset.
As is evident in the plotted graph, Setosa (represented by blue color) is easily separable and can be
distinguished from the other two species in the iris dataset. Its points form a cluster with a clear
boundary around it, which makes Setosa easy to classify and to separate from the other two classes.
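The separability of Setosa can also be checked numerically rather than by eye. This minimal sketch assumes the scikit-learn copy of the data, where column 2 is petal length, and tests whether a single threshold on that feature isolates Setosa.

```python
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target
petal_length = X[:, 2]  # column 2 is "petal length (cm)" in the scikit-learn copy

# If the largest Setosa petal is shorter than the smallest petal of the other
# two species, one threshold on petal length separates Setosa perfectly.
setosa_max = float(petal_length[y == 0].max())
others_min = float(petal_length[y != 0].min())
setosa_separable = setosa_max < others_min

# The same check between Versicolor and Virginica shows that their ranges overlap.
versicolor_max = float(petal_length[y == 1].max())
virginica_min = float(petal_length[y == 2].min())
other_two_overlap = versicolor_max >= virginica_min

print("Setosa max:", setosa_max, "| others min:", others_min)
print("Setosa separable by a threshold:", setosa_separable)
print("Versicolor and Virginica overlap:", other_two_overlap)
```

There is a gap between Setosa and the rest, while Versicolor and Virginica overlap on this feature, which is why they are harder to tell apart.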

5. Which classification model produces the better test result for the iris data: a linear SVM
trained on all features or a linear SVM trained on the two best features? What does this tell you?
The classifier test results include plots of the linear SVM decision boundaries, in which
different colored regions correspond to different classes.
As per the linear SVM classifier test performance, the linear SVM trained on all features gives a
better result than the linear SVM trained on the two best features for the iris dataset: the test
accuracy goes up to 95%, compared with the initial level of 85%. This suggests that the remaining two
features (sepal length and sepal width) still carry information that helps the classifier.
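The comparison can be reproduced in outline as follows. The split, seed and default SVM settings are assumptions, so the accuracies printed here will not necessarily match the 95% and 85% reported above; `linear_svm_test_accuracy` is a hypothetical helper introduced only for this sketch.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)


def linear_svm_test_accuracy(columns):
    """Train a linear SVM on the given feature columns and return test accuracy."""
    clf = SVC(kernel="linear")
    clf.fit(X_train[:, columns], y_train)
    return clf.score(X_test[:, columns], y_test)


acc_all = linear_svm_test_accuracy([0, 1, 2, 3])  # all four features
acc_best_two = linear_svm_test_accuracy([2, 3])   # petal length and petal width

print(f"all features:      {acc_all:.3f}")
print(f"two best features: {acc_best_two:.3f}")
```

Both models are trained and tested on the same split, so any difference comes from the feature set alone.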
6. Why does a linear SVM not produce a good result for the two moons example?
A linear SVM does not produce a good result for the two moons example because, although this is a
binary classification problem, the two classes form interleaving crescent shapes that cannot be well
separated by a straight line, which is the only kind of decision boundary a linear classifier can
draw.
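The limitation can be seen even on noise-free data. This sketch assumes scikit-learn's `make_moons` generator as the source of the two moons data; the sample size is arbitrary.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Assumption: noise-free two moons from scikit-learn's generator.
X, y = make_moons(n_samples=200, noise=0.0, random_state=0)

# Fit and score on the same points: even this "easy" setting cannot be
# solved perfectly, because the two crescents are not linearly separable.
linear = SVC(kernel="linear").fit(X, y)
train_accuracy = linear.score(X, y)
print(f"linear SVM accuracy on its own training data: {train_accuracy:.3f}")
```

Because one crescent curls into the other, any straight line leaves some points on the wrong side, so the training accuracy stays below 100%.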
7. Compare linear and kernel SVM in terms of predictive performance and training speed. What
conclusions can you draw?
a. Kernel SVM achieves better predictive performance, i.e. higher accuracy, than linear SVM.
Accuracy of linear SVM = 86.0%
Accuracy of kernel SVM = 93.4%
b. Kernel SVM produces a nonlinear decision boundary (a curve) to separate points from the two
classes, with the different regions shown in different colors, while linear SVM produces a linear
decision boundary (a line) to separate points from the two classes, which is not appropriate for
this case.
c. Although kernel SVM is effective, it is slower than linear SVM in training. When we
measured the average time by training both linear and kernel SVM classifiers 3 * 100 times,
the results were as follows:
Linear SVM:
100 loops, best of 3: 11 ms per loop
Kernel SVM:
100 loops, best of 3: 22.8 ms per loop
We can conclude that kernel SVM is the better classifier in terms of predictive performance,
while linear SVM is the better classifier in terms of training speed.
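A comparison along these lines can be sketched as below. The dataset size, noise level, RBF kernel with `gamma=2.0` and the simple timing loop are all assumptions; the figures above were presumably produced with a `timeit`-style benchmark on different data, so neither the accuracies nor the timings (nor even which model trains faster) need match. `fit_and_time` is a hypothetical helper.

```python
import time

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: two moons data from scikit-learn with mild noise.
X, y = make_moons(n_samples=1000, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)


def fit_and_time(clf, repeats=20):
    """Fit clf `repeats` times; return (test accuracy, mean seconds per fit)."""
    start = time.perf_counter()
    for _ in range(repeats):
        clf.fit(X_train, y_train)
    mean_seconds = (time.perf_counter() - start) / repeats
    return clf.score(X_test, y_test), mean_seconds


linear_acc, linear_time = fit_and_time(SVC(kernel="linear"))
kernel_acc, kernel_time = fit_and_time(SVC(kernel="rbf", gamma=2.0))

print(f"linear SVM: accuracy {linear_acc:.3f}, {linear_time * 1000:.2f} ms per fit")
print(f"kernel SVM: accuracy {kernel_acc:.3f}, {kernel_time * 1000:.2f} ms per fit")
```

Timings depend heavily on the machine, the data size and the kernel settings, so the speed comparison should be rerun on the actual practical data before drawing conclusions from it.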
8. Why do we need to perform parameter selection when training classification models for predictive
modelling?
We need to perform parameter selection when training classification models for predictive modelling
because the choice of parameter values affects the performance of the model, and a well-chosen
value further improves it, particularly on unseen data.

Based on the test results, we can see that both training and testing performance are affected by the
choice of parameter. For example, we used the regularisation parameter C to see its effect on
performance.
Increasing the value of C improves training performance but not testing performance, because the
model overfits.
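The effect of C can be explored with a simple sweep. The dataset, the RBF kernel and the grid of C values below are assumptions chosen for illustration; the practical may have used different data and values.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Assumption: noisy two moons data, so that overfitting has room to show up.
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = []
for C in [0.01, 0.1, 1, 10, 100, 1000]:
    clf = SVC(kernel="rbf", C=C).fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)  # performance on data the model has seen
    test_acc = clf.score(X_test, y_test)     # performance on held-out data
    results.append((C, train_acc, test_acc))

for C, train_acc, test_acc in results:
    print(f"C={C:<7} train={train_acc:.3f} test={test_acc:.3f}")
```

Comparing the two columns shows whether a gain in training accuracy carries over to the test set or is only the model fitting noise.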
9. Why can't we choose the classifier parameter that produces the best training performance?
We can’t choose the classifier parameter that produces the best training performance because
maximizing training accuracy rewards overly complex models that overfit the training data and then
predict poorly on new data.
Instead, cross-validation is an effective approach to parameter selection that uses only the
training dataset.
10. What is cross-validation and why is it an effective technique for parameter selection in
classifier training?
Cross-validation is used to assess the predictive performance of models and to judge how they
perform outside the sample, on a new dataset (also known as test data).
The motivation for using cross-validation is that when we fit a model, we are fitting it to a
training dataset. Without cross-validation we only have information on how the model performs on
in-sample data. Ideally, we would like to see how accurate the model's predictions are on new
data. In science, theories are judged by their predictive performance.
k-fold cross-validation is the variant most often recommended in machine learning: the training data
is split into k folds, and each fold in turn is held out for validation while the model is trained on
the remaining k - 1 folds.
Cross-validation is an effective technique for parameter selection in classifier training because it
uses data more efficiently, as every observation is used for both training and testing, and it
provides a more accurate estimate of out-of-sample accuracy.
In our example, the accuracy achieved by the kernel SVM classifier trained with the optimal
parameter is higher than that of the kernel SVM classifier trained with the default parameter
value. This validates the importance and effectiveness of parameter selection.
