Machine Learning: Binary & Multiclass Classification with Logistic Regression

Machine learning project
Contents
Introduction
Task A: Binary Classification
Section 1.1: Data Munging
Section 1.2: Training the Logistic Regression Models
Section 1.3: Choosing the Best Hyperparameter
Task B: Multiclass Classification
Section 2.2: Choosing the Best Hyperparameter
References
List of Tables
Table 1: Results for the summary statistics of the variables in the data set
Table 2: Summary statistics after the mean value imputation for the missing values
Introduction
The current project aims to implement logistic regression techniques to solve a real-world problem. For the analysis, both two-class and multi-class classification techniques have been used. Furthermore, the concepts of overfitting and underfitting have also been applied. The main purpose of the project is to use the classification techniques learned in class to solve a real-world problem (Forbes, 2016; Kambatla, Kollias, Kumar, & Grama, 2014; Picciano, 2012).
In recent times, with the increase in the volume and variety of data generated, it has become important for every business and every sector to use data analysis techniques to solve business problems. Previous research has shown that data-backed business processes and strategies are more effective and cost-efficient than traditional strategies, which were mostly based on the experience and intuition of the decision maker. This process has changed: with the development of machine learning techniques and statistical modelling, it is now possible to predict future values more accurately than ever before. In machine learning, algorithms are trained on historical data, and once the optimal model is finalized, the same model is applied to the test data. These techniques are very powerful and are widely used in many organizations (Belle et al., 2015; Cao, Chychyla, & Stewart, 2015; Chen, Chiang, & Storey, 2012; Peisker & Dalai, 2015).
Task A: Binary Classification
This section presents the results for the binary classification. The first part shows the results of the data munging, followed by the implementation of the logistic regression models. The third part focuses on choosing the best hyperparameter.
Python has been used for the analysis in this section; it has become one of the most widely used languages among data scientists, software developers, and data analysts.
Section 1.1: Data Munging
The results of the data munging are shown in the following analysis, where the first step is to import the required libraries and load the data.
import pandas as pd
import numpy as np
#importing the wisconsin_data
df_train = pd.read_csv("train_wbcd.csv")
df_test = pd.read_csv("test_wbcd.csv")
After importing, the data is already divided into a training set and a test set. The model will first be fitted on the training data, where the optimal model is developed; once the optimal model is developed, the same model is applied to the test data. For computing summary statistics (and, later, imputation values) over all observations, the two sets are combined into a single frame:
#combining the training and the test data
train_test = df_train.append(df_test)
The next task is to identify the target variable, which is also known as the response variable. This is the main variable of interest in any model. Traditionally it is also called the dependent variable (Jonker & Pennink, 2010; Kumar, 2014).
#identification of the response variable
df_train.columns
Out[43]:
Index([u'Patient_ID', u'Diagnosis', u'f1', u'f2', u'f3', u'f4', u'f5', u'f6',
u'f7', u'f8', u'f9', u'f10', u'f11', u'f12', u'f13', u'f14', u'f15',
u'f16', u'f17', u'f18', u'f19', u'f20', u'f21', u'f22', u'f23', u'f24',
u'f25', u'f26', u'f27', u'f28', u'f29', u'f30'],
dtype='object')
Examining the frequency distribution of the target variable shows that the data set is reasonably balanced.
#examining the frequency distribution of the data
df_train['Diagnosis'].value_counts()
#It is a balanced distribution
B 58
M 42
dtype: int64
The next step is to examine whether there are missing values in the variables, because missing data can significantly affect the results of the analysis.
#examining the missing values in the data
train_all_desc=train_test.describe()
train_all_desc
Summary statistics of the variables
This section shows the summary statistics of the variables. For the descriptive statistics, the mean, standard deviation, minimum, and maximum values have been used.
The describe() output covers Patient_ID and the thirty features f1–f30, with rows for count, mean, std, min, 25%, 50%, 75%, and max. The full 31-column table was garbled in extraction; a reconstructed excerpt of representative columns is shown below. Note the count of 117 for f21, against 120 for every other column, which signals the missing values handled in the next step.

        Patient_ID      f1           f21          f30
count   1.200000e+02    120.000000   117.000000   120.000000
mean    2.713975e+07    14.066150    16.338632    0.084869
std     1.163707e+08    3.847441     5.420700     0.016399
min     8.670000e+03    7.729000     8.952000     0.059050
25%     8.658465e+05    11.747500    12.820000    0.073960
50%     9.043275e+05    13.490000    15.290000    0.081645
75%     8.736826e+06    15.325000    18.330000    0.094778
max     9.112962e+08    27.420000    …            …

(The max-row values for f21–f30 did not survive extraction.)
Table 1 Results for the summary statistics of the variables in the data set.
In [45]:
#feature with missing values
train_all_desc.columns[train_all_desc.ix["count"]<len(train_test)]
Out[45]:
Index([u'f21'], dtype='object')
In [47]:
On the basis of the missing value analysis, it has been found that one variable (f21) has missing values. To handle them, the missing values are replaced by the mean of the series; this is appropriate because all variables with missing values are numerical. Other than mean value imputation, median or mode imputation could also be used, as sketched after the imputation code below (Macdonald & Headlam, 2010).
#imputing the missing values
missing_feature = train_all_desc.columns[train_all_desc.ix["count"]<len(train_test)]
In [48]:
print df_train[missing_feature].dtypes
f21 float64
dtype: object
In [50]:
# impute with mean value of the series
df_train[missing_feature]=df_train[missing_feature].fillna(train_test[missing_feature].mean())
df_test[missing_feature]=df_test[missing_feature].fillna(train_test[missing_feature].mean())
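As an aside, the median and mode imputations mentioned above would only change the fill value. A minimal sketch, assuming the same train_test, df_train, and missing_feature objects (the copies are hypothetical, to leave df_train untouched):
#median imputation: more robust than the mean for skewed features
df_train_med = df_train.copy()
df_train_med[missing_feature] = df_train_med[missing_feature].fillna(train_test[missing_feature].median())
#mode imputation: .mode() can return several rows, so take the first
df_train_mode = df_train.copy()
df_train_mode[missing_feature] = df_train_mode[missing_feature].fillna(train_test[missing_feature].mode().iloc[0])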
In [51]:
df_train.describe()
Out[51]:
After imputation, every column in the training data, including f21, has a full count of 100. The full describe() output was again garbled in extraction; a reconstructed excerpt of the same representative columns:

        Patient_ID         f1          f21         f30
count   100.000000         100.000000  100.000000  100.000000
mean    11607911.590000    14.225920   16.483803   0.083999
std     27376003.746093    3.729963    5.196886    0.014823
min     8670.000000        7.729000    9.077000    0.059050
25%     865035.000000      11.880000   13.057500   0.073960
50%     901301.500000      13.600000   15.410000   0.081660
75%     2821689.250000     15.707500   18.922500   0.093808
max     91979701.000000    25.220000   31.010000   0.128400

Table 2 Summary statistics after the mean value imputation for the missing values.
In [53]:
Y=df_train['Diagnosis']
In [115]:
Section 1.2: Training the Logistic Regression Models
This section shows the results for the logistic regression models. After handling the missing values and normalizing the data, the models can be fitted.
To run the models, the required module is imported first:
#fitting of the first logistic model
from sklearn import linear_model
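The notebook uses a feature matrix X, a test-feature frame df_test_sub, and the preprocessing module without showing how they were created. The following is a hypothetical reconstruction of those omitted steps, assuming the features are every column except Patient_ID and Diagnosis, standardized to zero mean and unit variance:
from sklearn import preprocessing
#hypothetical reconstruction: drop the ID and target columns, then
#standardize the training features (the test set is scaled later, in In [117])
feature_cols = [c for c in df_train.columns if c not in ('Patient_ID', 'Diagnosis')]
df_test_sub = df_test[feature_cols]
X = pd.DataFrame(preprocessing.scale(df_train[feature_cols]), columns=feature_cols)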
For the first model, L1 regularization is tested with the regularization strength alpha set to 0.1. The model is shown below:
#First model with L1 regularization
clf_l1 = linear_model.SGDClassifier(alpha=0.1, penalty='l1', random_state=1)
clf_l1.fit(X, Y)
Out[115]:
SGDClassifier(alpha=0.1, average=False, class_weight=None, epsilon=0.1,
eta0=0.0, fit_intercept=True, l1_ratio=0.15,
learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
penalty='l1', power_t=0.5, random_state=1, shuffle=True, verbose=0,
warm_start=False)
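One caveat: SGDClassifier defaults to loss='hinge' (visible in the output above), which fits a linear SVM rather than a logistic regression. To train an actual logistic regression with SGD, the log loss would be specified explicitly; a minimal sketch for the scikit-learn version used here (recent releases call it 'log_loss'):
#logistic regression proper under SGD: the log loss yields probabilistic outputs
clf_log = linear_model.SGDClassifier(alpha=0.1, loss='log', penalty='l1', random_state=1)
clf_log.fit(X, Y)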
In [116]:
#Second model with L2 regularization
clf_l2 = linear_model.SGDClassifier(alpha=0.1, penalty='l2',random_state=1)
clf_l2.fit(X, Y)
Out[116]:
SGDClassifier(alpha=0.1, average=False, class_weight=None, epsilon=0.1,
eta0=0.0, fit_intercept=True, l1_ratio=0.15,
learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
penalty='l2', power_t=0.5, random_state=1, shuffle=True, verbose=0,
warm_start=False)
In [117]:
# Now, normalising the test data
df_test_scaled = preprocessing.scale(df_test_sub)
In [118]:
#After normalizing, the next step is to predict test data
pred_l1 = clf_l1.predict(df_test_scaled)
pred_l2 = clf_l2.predict(df_test_scaled)
In [120]:
pred_l2
Out[120]:
array(['B', 'B', 'B', 'B', 'B', 'M', 'B', 'M', 'B', 'M', 'M', 'B', 'B',
'B', 'B', 'B', 'B', 'M', 'B', 'M'],
dtype='|S1')
In [121]:
Section 1.3: Choosing the Best Hyperparameter
After fitting the models, they are evaluated on the basis of their accuracy, precision, and recall values (Cerrito, 2010; George, Seals, & Aban, 2014). These values for both models have been calculated and are shown below.
#checking for model fit
#accuracy
from sklearn.metrics import accuracy_score
print accuracy_score(df_test['Diagnosis'], pred_l1)
print accuracy_score(df_test['Diagnosis'], pred_l2)
0.65
0.9
This shows that the accuracy of the first model is 0.65, meaning the prediction is 65% accurate, while the accuracy of the second model is 0.9, i.e. 90% accurate. So, in terms of accuracy, the second model is better than the first (Armstrong, 2012; Cerrito, 2010).
In [95]:
#precision
from sklearn.metrics import precision_score
print precision_score(df_test['Diagnosis'], pred_l1, pos_label='M')
print precision_score(df_test['Diagnosis'], pred_l2, pos_label='M')
0.444444444444
0.833333333333
The precision results are shown above: the precision of the first model is 0.44, whereas that of the second model is 0.83, much higher. Precision is the share of predicted positives ('M') that are truly positive, TP / (TP + FP); for the second model, the confusion matrix shown later gives 5 / (5 + 1) ≈ 0.83. So, in terms of precision as well, the second model performs better than the first.
In [122]:
#recall
from sklearn.metrics import recall_score
print recall_score(df_test['Diagnosis'], pred_l1, pos_label='M')
print recall_score(df_test['Diagnosis'], pred_l2, pos_label='M')
0.666666666667
0.833333333333
The recall results are shown above: the recall of the first model is 0.67, whereas that of the second model is 0.83. Recall is the share of actual positives that are correctly identified, TP / (TP + FN).
In [97]:
#f1
from sklearn.metrics import f1_score
print f1_score(df_test['Diagnosis'], pred_l1, pos_label='M')
print f1_score(df_test['Diagnosis'], pred_l2, pos_label='M')
0.533333333333
0.833333333333
In [123]:
Finally, the last metric used to compare the models is the F1 score. As shown above, the F1 of the first model is 0.53, whereas that of the second model is 0.83. The F1 score is the harmonic mean of precision and recall, F1 = 2 × (precision × recall) / (precision + recall); for the first model, 2 × (0.444 × 0.667) / (0.444 + 0.667) ≈ 0.53 (Rajasekar, Philominathan, & Chinnathambi, 2013).
#confusion matrix
from sklearn.metrics import confusion_matrix
print confusion_matrix(df_test['Diagnosis'], pred_l1)
print confusion_matrix(df_test['Diagnosis'], pred_l2)
[[9 5]
 [2 4]]
[[13 1]
 [ 1 5]]
In these matrices the rows are the true classes (B, M) and the columns the predicted classes: the first model misclassifies 5 B cases and 2 M cases, while the second misclassifies only one case of each class, consistent with its higher precision and recall.
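The per-class precision, recall, and F1 values can also be obtained in a single call; a short sketch using scikit-learn's classification_report:
from sklearn.metrics import classification_report
#per-class precision, recall, f1-score and support in one table
print classification_report(df_test['Diagnosis'], pred_l2)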
In [124]:
from sklearn.cross_validation import ShuffleSplit
ss = ShuffleSplit(n=len(X)-1, n_iter=100, test_size=0.25)
In [125]:
#finding the best hyperparameter based on average accuracy
#L1
dict_l1 = []
for i in [0.1, 1, 3, 10, 33, 100, 333, 1000, 3333, 10000, 33333]:
    accu = []
    for train_index, test_index in ss:
        clf_l1 = linear_model.SGDClassifier(alpha=i, penalty='l1', random_state=1)
        clf_l1.fit(X.ix[list(train_index)], Y.ix[list(train_index)])
        pred = clf_l1.predict(X.ix[list(test_index)])
        accu.append(accuracy_score(pred, Y[test_index]))
    print i, sum(accu) / float(len(accu))
    dict_l1.append(sum(accu) / float(len(accu)))
0.1 0.9572
1 0.5456
3 0.4964
10 0.5168
33 0.532
100 0.504
333 0.4932
1000 0.5036
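The loop above collects the average cross-validated accuracy for each candidate alpha but stops short of picking one. A minimal sketch of that final selection step, assuming the dict_l1 list built above:
import numpy as np
#pair each candidate alpha with its average CV accuracy and keep the best
alphas = [0.1, 1, 3, 10, 33, 100, 333, 1000, 3333, 10000, 33333]
best_alpha = alphas[int(np.argmax(dict_l1))]
print 'best alpha:', best_alpha
#with the accuracies shown above, alpha = 0.1 (average accuracy 0.9572) wins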