Machine Learning: KNN, Logistic Regression, SVM, Random Forest

Machine Learning
Contents
Introduction
Task A: Understanding of the data
Task B: K-Nearest Neighbor Classification
Task C: Multiclass Logistic Regression with Elastic Net
Task D: Support Vector Machine (RBF Kernel)
Task E: Random Forest
Task F: Comparison of the different models
References
List of Figures
Figure 1: Cross-validation plot for the KNN model
Figure 2: Cross-validation plot for the logistic regression
Figure 3: Cross-validation plot for the SVM model
Figure 4: Cross-validation plot for the random forest
List of Tables
Table 1: Overview of the data
Table 2: F1 scores for the KNN model
Table 3: Grid-search results for the logistic regression
Table 4: Grid-search results for the random forest

Introduction
Machine learning and data analytics have become an important part of every industry, ranging from FMCG companies to health care and logistics. The availability of different types of data (both structured and unstructured), together with analytical and statistical techniques, has helped various industries make better business decisions (Acharjya & P 2016; Xie et al. 2018). From predicting the weather to predicting customer purchasing behavior, data analytics has become an important part of every business. This project therefore aims to implement some of these machine learning techniques and interpret the findings. It focuses mainly on classification techniques, namely random forest, support vector machine, KNN and logistic regression. All the techniques have been tested on the same data set so that a direct comparison can be made. The comparison of the models and the identification of the best model are presented in the last part of the project.
Task A: Understanding of the data
The first part focuses on understanding the given data set from various perspectives. For any analysis, whether big or small, understanding the data is very important: unless the data is understood properly, the appropriate analysis cannot be performed. In other words, to extract useful information from the given data, a clear understanding of it is essential (Ziafat & Shakeri 2014). In this section, the data is examined in terms of the aim of the project and the number of data points in the data set.
As the data and the given guidelines suggest, the main objective of the current data analysis project is to use different analytical and statistical techniques to profile individuals' behavior; in other words, to classify individuals into different groups based on their behavior.
Both the train and the test data sets show that different kinds of activities are included in the research. These are daily human activities, or more specifically physical activities, such as walking, lying down, climbing stairs and sitting.
In terms of size, the data set has 7352 rows, meaning 7352 instances, and 561 columns, meaning 561 features. The data were collected from 30 different individuals.
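To make this step concrete, the following is a minimal sketch of such an initial inspection in Python with pandas. The file names ("train.csv", "test.csv") and the column names ("subject", "Activity") are assumptions, not the actual names from the original notebook.

import pandas as pd

# Minimal sketch of the initial data check; file and column names are
# placeholders for the actual data set used in the project.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape)                 # expected: 7352 rows, 561 feature columns
print(train["subject"].nunique())  # expected: 30 individuals
print(train["Activity"].unique())  # the daily activities (walking, sitting, ...)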
Looking ahead to the results, among the different classification techniques used in the study, the SVM method gives the most promising results, with 96% accuracy.
Task B: K-Nearest Neighbor Classification
The second part of the project focuses on implementing the K-nearest neighbor (KNN) algorithm, one of the most popular classification techniques. All analyses in the current study have been carried out in the Python programming language.
Table 1: Overview of the data
An overview of the data is shown in the table above.
As per the given instructions, in the first stage a 10-fold cross-validation has been set up, with k ranging from a minimum of 1 to a maximum of 50. The results show that the F1 score (used as the evaluation metric throughout this analysis) is around 0.97 on the training data and 0.90 on the test data. A sketch of this step follows.
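The following sketch shows how such a cross-validation loop might look in scikit-learn. X_train and y_train are assumed to hold the 561 features and the activity labels, and the weighted F1 average is an assumption, since the report does not state which averaging was used.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# 10-fold cross-validation of KNN for k = 1..50, scored with the F1 metric.
# X_train/y_train and the "f1_weighted" averaging are assumptions.
k_values = range(1, 51)
mean_f1 = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train, y_train, cv=10, scoring="f1_weighted")
    mean_f1.append(scores.mean())

print(k_values[int(np.argmax(mean_f1))], max(mean_f1))  # best k and its mean F1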
Table 2: F1 scores for the KNN model
The next step is to run a grid search; the results are shown below:
mean: 0.89097, std: 0.03219, params: {'n_neighbors': 1},
mean: 0.88330, std: 0.04121, params: {'n_neighbors': 2},
mean: 0.90214, std: 0.03369, params: {'n_neighbors': 3},
mean: 0.90180, std: 0.03341, params: {'n_neighbors': 4},
mean: 0.90639, std: 0.03574, params: {'n_neighbors': 5},
mean: 0.90461, std: 0.03294, params: {'n_neighbors': 6},
mean: 0.90706, std: 0.03620, params: {'n_neighbors': 7},
mean: 0.90864, std: 0.03803, params: {'n_neighbors': 8},
mean: 0.90620, std: 0.03628, params: {'n_neighbors': 9},
mean: 0.90936, std: 0.03636, params: {'n_neighbors': 10},
mean: 0.90823, std: 0.03584, params: {'n_neighbors': 11},
mean: 0.90865, std: 0.03529, params: {'n_neighbors': 12},
mean: 0.90628, std: 0.03625, params: {'n_neighbors': 13},
mean: 0.90897, std: 0.03392, params: {'n_neighbors': 14},
mean: 0.90778, std: 0.03427, params: {'n_neighbors': 15},
mean: 0.90651, std: 0.03579, params: {'n_neighbors': 16},
mean: 0.90596, std: 0.03585, params: {'n_neighbors': 17},
mean: 0.90788, std: 0.03475, params: {'n_neighbors': 18},
mean: 0.90694, std: 0.03434, params: {'n_neighbors': 19},
mean: 0.90796, std: 0.03278, params: {'n_neighbors': 20},
mean: 0.90555, std: 0.03482, params: {'n_neighbors': 21},
mean: 0.90531, std: 0.03352, params: {'n_neighbors': 22},
mean: 0.90512, std: 0.03655, params: {'n_neighbors': 23},
mean: 0.90434, std: 0.03502, params: {'n_neighbors': 24},
mean: 0.90470, std: 0.03598, params: {'n_neighbors': 25},
mean: 0.90394, std: 0.03570, params: {'n_neighbors': 26},
mean: 0.90419, std: 0.03532, params: {'n_neighbors': 27},
mean: 0.90338, std: 0.03449, params: {'n_neighbors': 28},
mean: 0.90130, std: 0.03539, params: {'n_neighbors': 29},
mean: 0.90229, std: 0.03492, params: {'n_neighbors': 30},
mean: 0.90214, std: 0.03597, params: {'n_neighbors': 31},
mean: 0.90297, std: 0.03471, params: {'n_neighbors': 32},
mean: 0.90253, std: 0.03496, params: {'n_neighbors': 33},
mean: 0.90226, std: 0.03356, params: {'n_neighbors': 34},
mean: 0.90175, std: 0.03392, params: {'n_neighbors': 35},
mean: 0.90175, std: 0.03303, params: {'n_neighbors': 36},
mean: 0.90172, std: 0.03386, params: {'n_neighbors': 37},
mean: 0.90255, std: 0.03244, params: {'n_neighbors': 38},
mean: 0.90210, std: 0.03276, params: {'n_neighbors': 39},
mean: 0.90190, std: 0.03287, params: {'n_neighbors': 40},
mean: 0.90014, std: 0.03411, params: {'n_neighbors': 41},
mean: 0.90174, std: 0.03223, params: {'n_neighbors': 42},
mean: 0.90009, std: 0.03441, params: {'n_neighbors': 43},
mean: 0.90070, std: 0.03303, params: {'n_neighbors': 44},
mean: 0.90039, std: 0.03331, params: {'n_neighbors': 45},
mean: 0.90047, std: 0.03249, params: {'n_neighbors': 46},
mean: 0.89809, std: 0.03393, params: {'n_neighbors': 47},
mean: 0.89939, std: 0.03262, params: {'n_neighbors': 48},
Table: Grid-search results for the KNN model
The next task is to find the best estimator; for this, the grid search's best_estimator_ attribute has been used. Then the F1 score has been computed for each value of k from the grid search.
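A sketch of that grid search, under the same assumptions as above (the cv and scoring settings are not stated in the report):

from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Grid search over k; best_estimator_ is the attribute referred to above.
param_grid = {"n_neighbors": list(range(1, 51))}
knn_grid = GridSearchCV(KNeighborsClassifier(), param_grid,
                        cv=10, scoring="f1_weighted")
knn_grid.fit(X_train, y_train)

print(knn_grid.best_estimator_)  # the best KNN model found
print(knn_grid.best_params_)     # e.g. {'n_neighbors': 10}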
Cross-validation plot
In this section the results from the cross-validation are discussed and the corresponding plot is shown.
Figure 1: Cross-validation plot for the KNN model
On the basis of the cross-validation results, the F1 score continues to increase up to k = 10; in fact, F1 is highest at k = 10 and declines for larger k. On the basis of this result, the accuracy score and the confusion matrix have been computed.
F1 score: 0.906012889136
Accuracy: 0.906684764167
Confusion matrix:
[[534   2   1   0   0   0]
 [  0 409  78   0   0   4]
 [  0  47 485   0   0   0]
 [  0   0   0 486  10   0]
 [  0   0   0  51 331  38]
 [  0   0   0  36   8 427]]
As the results show, the F1 value is 0.90 and the accuracy is also 0.90, so it can be concluded that the model is about 90% accurate.
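These numbers can be reproduced along the following lines; the sketch reuses knn_grid from above and assumes X_test/y_test hold the test split.

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Evaluate the tuned KNN model on the held-out test data.
best_knn = knn_grid.best_estimator_  # already refit on the full training data
y_pred = best_knn.predict(X_test)

print(f1_score(y_test, y_pred, average="weighted"))  # ~0.906
print(accuracy_score(y_test, y_pred))                # ~0.907
print(confusion_matrix(y_test, y_pred))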
Task C: Multiclass Logistic Regression with Elastic Net
Another model used in the current project is multiclass logistic regression with an elastic-net penalty; the results from this model are discussed in this section (Armstrong 2012; Cerrito 2010; George, Seals & Aban 2014). One possible set-up is sketched below.
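The alpha/l1_ratio grid in the results below matches the parameters of scikit-learn's SGDClassifier with an elastic-net penalty, so a set-up along the following lines is plausible; this is an assumption about the original implementation, not a confirmed detail. (On scikit-learn versions before 1.1 the loss is called "log" rather than "log_loss".)

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# Elastic-net multiclass logistic regression via SGD; the grid mirrors the
# alpha/l1_ratio values in the results listed below. cv/scoring are assumptions.
param_grid = {"alpha": [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03],
              "l1_ratio": [0, 0.15, 0.5, 0.7, 1]}
logit = SGDClassifier(loss="log_loss", penalty="elasticnet", max_iter=1000)
logit_grid = GridSearchCV(logit, param_grid, cv=10, scoring="f1_weighted")
logit_grid.fit(X_train, y_train)

print(logit_grid.best_params_)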
The results from the grid search for the logistic regression are shown below:

[mean: 0.33910, std: 0.19486, params: {'alpha': 0.0001, 'l1_ratio': 0},
mean: 0.33151, std: 0.19988, params: {'alpha': 0.0001, 'l1_ratio': 0.15},
mean: 0.36839, std: 0.23374, params: {'alpha': 0.0001, 'l1_ratio': 0.5},
mean: 0.37234, std: 0.23505, params: {'alpha': 0.0001, 'l1_ratio': 0.7},
mean: 0.34980, std: 0.20884, params: {'alpha': 0.0001, 'l1_ratio': 1},
mean: 0.37182, std: 0.18679, params: {'alpha': 0.0003, 'l1_ratio': 0},
mean: 0.34282, std: 0.19412, params: {'alpha': 0.0003, 'l1_ratio': 0.15},
mean: 0.37101, std: 0.22326, params: {'alpha': 0.0003, 'l1_ratio': 0.5},
mean: 0.35400, std: 0.21213, params: {'alpha': 0.0003, 'l1_ratio': 0.7},
mean: 0.33693, std: 0.20672, params: {'alpha': 0.0003, 'l1_ratio': 1},
mean: 0.38480, std: 0.22782, params: {'alpha': 0.001, 'l1_ratio': 0},
mean: 0.37996, std: 0.20958, params: {'alpha': 0.001, 'l1_ratio': 0.15},
mean: 0.36685, std: 0.21883, params: {'alpha': 0.001, 'l1_ratio': 0.5},
mean: 0.40036, std: 0.20936, params: {'alpha': 0.001, 'l1_ratio': 0.7},
mean: 0.41256, std: 0.21517, params: {'alpha': 0.001, 'l1_ratio': 1},
mean: 0.38741, std: 0.19943, params: {'alpha': 0.003, 'l1_ratio': 0},
mean: 0.39637, std: 0.21459, params: {'alpha': 0.003, 'l1_ratio': 0.15},
mean: 0.40270, std: 0.23931, params: {'alpha': 0.003, 'l1_ratio': 0.5},
mean: 0.40973, std: 0.21253, params: {'alpha': 0.003, 'l1_ratio': 0.7},
mean: 0.39437, std: 0.23102, params: {'alpha': 0.003, 'l1_ratio': 1},
mean: 0.37772, std: 0.21328, params: {'alpha': 0.01, 'l1_ratio': 0},
mean: 0.38142, std: 0.21158, params: {'alpha': 0.01, 'l1_ratio': 0.15},
mean: 0.37323, std: 0.20258, params: {'alpha': 0.01, 'l1_ratio': 0.5},
mean: 0.36746, std: 0.22137, params: {'alpha': 0.01, 'l1_ratio': 0.7},
mean: 0.36198, std: 0.22322, params: {'alpha': 0.01, 'l1_ratio': 1},
mean: 0.31990, std: 0.19613, params: {'alpha': 0.03, 'l1_ratio': 0},
mean: 0.29392, std: 0.19807, params: {'alpha': 0.03, 'l1_ratio': 0.15},
mean: 0.29590, std: 0.17576, params: {'alpha': 0.03, 'l1_ratio': 0.5},
mean: 0.30318, std: 0.18595, params: {'alpha': 0.03, 'l1_ratio': 0.7},
mean: 0.29989, std: 0.17865, params: {'alpha': 0.03, 'l1_ratio': 1}]
Table 3: Grid-search results for the logistic regression
On the basis of the grid search, the best estimator is identified. In this case the best alpha value comes out to be 0.001, with l1_ratio equal to 0. Based on the best alpha and l1_ratio values, the cross-validation plot has been drawn, as sketched below.
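One way to draw that plot from the grid-search results is sketched below, reusing logit_grid from the earlier sketch; the exact appearance of the original figure is not known.

import matplotlib.pyplot as plt
import numpy as np

# Mean cross-validated F1 against alpha (log scale), one line per l1_ratio.
results = logit_grid.cv_results_
alphas = np.array([p["alpha"] for p in results["params"]])
ratios = np.array([p["l1_ratio"] for p in results["params"]])
means = np.array(results["mean_test_score"])

for r in sorted(set(ratios)):
    mask = ratios == r
    order = np.argsort(alphas[mask])
    plt.plot(alphas[mask][order], means[mask][order], marker="o",
             label=f"l1_ratio = {r}")
plt.xscale("log")
plt.xlabel("alpha")
plt.ylabel("mean F1 (10-fold CV)")
plt.legend()
plt.show()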
Figure 2: Cross-validation plot for the logistic regression
On the basis of the cross-validation results, the best alpha has been identified. However, this was all done on the training data; the same model can now be applied to the given test data.
F1 score: 0.916455898484
Confusion matrix:
[[537   0   0   0   0   0]
 [  5 336 145   2   0   3]
 [  0   7 523   2   0   0]
 [  0   0   0 494   2   0]
 [  0   0   0  16 394  10]
 [  0   0   0  46   3 422]]
As in the previous case, when the model is run on the test data the F1 score comes out to be 0.91; in other words, the model is about 91% accurate. The corresponding figure for the KNN model was 90%.
Task D: Support Vector Machine (RBF Kernel)
In this fourth section another popular classification technique, the support vector machine (SVM), is discussed; here the RBF kernel is used (Bhavsar & Panchal 2012).
The SVM has been optimized over the following two parameters: C, the penalty parameter of the error term, and gamma, the kernel coefficient of the RBF function. Here too, a grid search has been used to tune these hyperparameters, and the SVM best estimator has been identified.
The grid of C and gamma values is:
{'gamma': [1e-3, 1e-4],
 'C': [1, 10, 100, 1000]}
On the basis of the grid-search results, the optimal values are C = 1000 and gamma = 0.001.
The next step is to draw the cross-validation plot for the given values of C and gamma; the whole tuning step is sketched below.
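A sketch of the tuning step, using the C/gamma grid quoted above (the cv and scoring settings are again assumptions):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# RBF-kernel SVM tuned over C and gamma; the grid is the one quoted above.
param_grid = {"gamma": [1e-3, 1e-4],
              "C": [1, 10, 100, 1000]}
svm_grid = GridSearchCV(SVC(kernel="rbf"), param_grid,
                        cv=10, scoring="f1_weighted")
svm_grid.fit(X_train, y_train)

print(svm_grid.best_params_)  # reported above as C = 1000, gamma = 0.001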

Figure 3: Cross-validation plot for the SVM model
After the plot, the next step is to calculate the F1 score and the confusion matrix. For the SVM, the results are as follows:
F1 score: 0.965624534728
Confusion matrix:
[[537   0   0   0   0   0]
 [  0 436  53   0   0   2]
 [  0  12 520   0   0   0]
 [  0   0   0 493   3   0]
 [  0   0   0   4 406  10]
 [  0   0   0  17   0 454]]
As the results show, the F1 score is 0.96, which indicates about 96% accuracy when the SVM technique is used. This is the highest F1 score among the classifiers considered so far.
Task E: Random Forest
The last technique used in the current project is the random forest, another classification technique that is popular among researchers (Biau 2012).
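A sketch of how such a grid search might be set up is shown first (the random_state and the cv/scoring settings are assumptions); the grid-search results reported in the project follow.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Random forest tuned over the number of trees and the maximum tree depth;
# the grid matches the values in the results listed below.
param_grid = {"n_estimators": [200, 500, 700],
              "max_depth": [300, 500, 600]}
rf_grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                       cv=10, scoring="f1_weighted")
rf_grid.fit(X_train, y_train)

print(rf_grid.best_params_)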
[mean: 0.58191, std: 0.21445, params: {'n_estimators': 200, 'max_depth': 300},
mean: 0.57864, std: 0.21359, params: {'n_estimators': 500, 'max_depth': 300},
mean: 0.57928, std: 0.21264, params: {'n_estimators': 700, 'max_depth': 300},
mean: 0.58191, std: 0.21445, params: {'n_estimators': 200, 'max_depth': 500},
mean: 0.57864, std: 0.21359, params: {'n_estimators': 500, 'max_depth': 500},
mean: 0.57928, std: 0.21264, params: {'n_estimators': 700, 'max_depth': 500},
mean: 0.58191, std: 0.21445, params: {'n_estimators': 200, 'max_depth': 600},
mean: 0.57864, std: 0.21359, params: {'n_estimators': 500, 'max_depth': 600},
mean: 0.57928, std: 0.21264, params: {'n_estimators': 700, 'max_depth': 600}]
Table 4: Grid-search results for the random forest
The grid-search results for the random forest are shown in the table above. The tuning has been carried out over the number of trees (n_estimators) and the maximum depth of each tree (max_depth). To show the results visually, the figure below plots the F1 scores.
Figure 4: Cross-validation plot for the random forest
The F1 score and the confusion matrix for the random forest are shown below:
F1 score: 0.928839558732
Confusion matrix:
[[537   0   0   0   0   0]
 [  0 441  50   0   0   0]
 [  0  42 490   0   0   0]
 [  0   0   0 482   9   5]
 [  0   0   0  20 357  43]
 [  0   0   0  34   6 431]]
As the results show, the F1 value is 0.92, which indicates that the model is about 92% accurate.
Task F: Comparison of the different models
Different classification models have been discussed in the current project, and the F1 score has been used for evaluation. Other metrics could also be used, such as precision, recall and accuracy; however, the F1 score is generally considered the most reliable for this kind of task, so it has been used here. On the basis of the F1 results across the different models, the best model turns out to be the SVM, because its F1 score was the highest (96%). The outcome could differ if other metrics were used for the comparison. It should also be noted that different types of classifiers have their own advantages and limitations, and the choice of model also depends on the type of data and the main aim of the classification. A sketch of such a side-by-side comparison is given below.
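Such a comparison could be produced along the following lines, reusing the tuned estimators from the sketches above; classification_report prints precision, recall and F1 per class for each model.

from sklearn.metrics import classification_report

# Side-by-side comparison of the four tuned models on the test data.
models = {"KNN": knn_grid.best_estimator_,
          "Logistic regression": logit_grid.best_estimator_,
          "SVM": svm_grid.best_estimator_,
          "Random forest": rf_grid.best_estimator_}
for name, model in models.items():
    y_pred = model.predict(X_test)
    print(name)
    print(classification_report(y_test, y_pred))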
References:
Acharjya, DP & P, KA 2016, ‘A Survey on Big Data Analytics: Challenges, Open Research Issues and Tools’, International Journal of Advanced Computer Science and Applications, vol. 7, no. 2, pp. 511–518.
Armstrong, JS 2012, ‘Illusions in regression analysis’, International Journal of Forecasting, vol. 6, pp. 689–694.
Bhavsar, H & Panchal, MH 2012, ‘A Review on Support Vector Machine for Data
Classification’, International Journal of Advanced Research in Computer Engineering &
Technology, vol. 1, no. 10, pp. 185–189.
Biau, G 2012, ‘Analysis of a Random Forests Model’, Journal of Machine Learning Research,
vol. 13, pp. 1063–1095.
Cerrito, PB 2010, The Difference Between Predictive Modeling and Regression, Louisville.
George, B, Seals, S & Aban, I 2014, ‘Survival analysis and regression models’, NCBI, vol. 21,
no. 4, pp. 686–694.
Xie, Ji, Song, Z, Li, Y, Zhang, Y, Yu, H, Zhan, J, Ma, Z, Qiao, Y, Zhang, J & Guo, J 2018, ‘A
Survey on Machine Learning-Based Mobile Big Data Analysis: Challenges and Applications’,
Wireless Communications and Mobile Computing, vol. 2018, pp. 1–19.
Ziafat, H & Shakeri, M 2014, ‘Using Data Mining Techniques in Customer Segmentation’, Int.
Journal of Engineering Research and Applications, vol. 4, no. 9, pp. 70–79.