Implementing Logistic Regression for Binary and Multiclass Classification

Added on 2023/06/07

AI Summary

This project implements logistic regression techniques for binary and multiclass classification. It covers data munging, training logistic regression models, and choosing the best hyperparameters. The project uses Python for analysis. The purpose is to apply classification techniques to real-world problems using machine learning algorithms.

Machine learning project
Contents
Introduction
Task A: Binary Classification
Section 1.1 : Data Munging
Section 1.2 : Logistic Regression train logistic regression models
Section 1.3 : Choosing the best hyper parameter
Task B: Multiclass Classification
Section 2.1 : Read and understand the data, create a default One-versus Rest Classifier
Section 2.2 : Choosing the best hyper-parameter
References
List of Tables
Table 1 Results for the summary statistics of the variables in the data set
Table 2 Summary statistics after the mean value imputation for the missing values

Introduction
This project implements logistic regression to solve real-world classification problems. Both
two-class and multi-class classification are used in the analysis, and the concepts of overfitting
and underfitting are applied when choosing between models. The main purpose of the project is to
apply the classification techniques learned in class to real-world problems (Forbes, 2016;
Kambatla, Kollias, Kumar, & Grama, 2014; Picciano, 2012).
With the recent growth in the volume and variety of data being generated, data analysis
techniques have become an important tool for solving business problems in every sector. Previous
research indicates that data-backed business processes and strategies are more effective and
cost-efficient than traditional strategies, which relied mostly on the experience and intuition of
the decision maker. With the development of machine learning techniques and statistical
modelling, it is now possible to predict future values more accurately than before. In machine
learning, an algorithm is trained on historical data and, once the optimal model is finalized, the
same model is applied to the test data. These techniques are powerful and are widely used in many
organizations (Belle et al., 2015; Cao, Chychyla, & Stewart, 2015; Chen, Chiang, & Storey, 2012;
Peisker & Dalai, 2015).
Task A: Binary Classification
This section presents the results for binary classification. The first part covers data munging,
the second part the implementation of the logistic regression models, and the third part the
choice of the best hyper-parameter.
Python is used for the analysis in this section. It has become one of the most widely used
languages among data scientists, software developers and data analysts.

Section 1.1 : Data Munging
The data munging steps are shown below. The first step is to import the required libraries and
read the Wisconsin training and test files.
import pandas as pd
import numpy as np
#importing the wisconsin_data
df_train = pd.read_csv("train_wbcd.csv")
df_test = pd.read_csv("test_wbcd.csv")
The data are supplied as separate training and test files. The model is first fitted on the
training data to develop the optimal model, and that same model is then applied to the test data.
For the summary statistics and the missing-value check, the two files are combined into one data
frame.
#combining the training and the test data
train_test=pd.concat([df_train, df_test])
The next task is to identify the target variable, also known as the response variable. This is
the main variable of interest in every model; traditionally it is also called the dependent
variable (Jonker & Pennink, 2010; Kumar, 2014). Here it is the Diagnosis column.
#identification of the response variable
df_train.columns
Out[43]:
Index([u'Patient_ID', u'Diagnosis', u'f1', u'f2', u'f3', u'f4', u'f5', u'f6',
u'f7', u'f8', u'f9', u'f10', u'f11', u'f12', u'f13', u'f14', u'f15',
u'f16', u'f17', u'f18', u'f19', u'f20', u'f21', u'f22', u'f23', u'f24',
u'f25', u'f26', u'f27', u'f28', u'f29', u'f30'],
dtype='object')

Examining the frequency distribution of the response variable shows that the two classes are
roughly balanced: 58 benign (B) and 42 malignant (M) cases in the training data.
#examining the frequency distribution of the data
df_train['Diagnosis'].value_counts()
#It is a balanced distribution
B 58
M 42
dtype: int64
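The class counts above also give a simple reference point for the accuracies reported later. The following is a minimal sketch in plain Python 3 (the report itself uses Python 2 and pandas); it rebuilds the label list from the two reported counts and computes the class shares and the accuracy of a trivial classifier that always predicts the majority class. The variable names are illustrative only.

```python
from collections import Counter

# Class counts reported for the training data: 58 "B" and 42 "M".
labels = ["B"] * 58 + ["M"] * 42

counts = Counter(labels)
total = sum(counts.values())

# Share of each class in the training data.
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the most common class.
majority_label, majority_count = counts.most_common(1)[0]
baseline_accuracy = majority_count / total

print(proportions)
print(majority_label, baseline_accuracy)
```

Any fitted model should beat this majority-class baseline on the training distribution to be considered useful.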
The next step is to examine whether there is missing data in the variables, because missing data
can significantly affect the results of the analysis.
#examining the missing values in the data
train_all_desc=train_test.describe()
train_all_desc
Summary statistics of the variables
This section presents the summary statistics of the variables. The descriptive statistics
reported are the count, mean, standard deviation, minimum, quartiles and maximum.
         Patient_ID           f1           f2           f3           f4           f5           f6           f7           f8           f9  ...
count  1.200000e+02   120.000000   120.000000   120.000000   120.000000   120.000000   120.000000   120.000000   120.000000   120.000000  ...
mean   2.713975e+07    14.066150    19.478167    91.673000   656.777500     0.095900     0.105612     0.089105     0.048096     0.181353  ...
std    1.163707e+08     3.847441     4.628016    26.746211   391.581714     0.013672     0.056227     0.085173     0.042449     0.028151  ...
min    8.670000e+03     7.729000    10.820000    47.980000   178.800000     0.068830     0.023440     0.000000     0.000000     0.106000  ...
25%    8.658465e+05    11.747500    15.800000    75.022500   425.100000     0.084708     0.061532     0.025110     0.017745     0.161875  ...
50%    9.043275e+05    13.490000    19.030000    86.715000   564.150000     0.095530     0.095840     0.062905     0.031260     0.180050  ...
75%    8.736826e+06    15.325000    21.947500   100.525000   730.925000     0.104325     0.130425     0.131950     0.069862     0.196700  ...
max    9.112962e+08    27.420000    32.470000   186.900000  2501.000000     0.132600     0.311400     0.426400     0.184500     0.256900  ...

               f21          f22          f23          f24          f25          f26          f27          f28          f29          f30
count   117.000000   120.000000   120.000000   120.000000   120.000000   120.000000   120.000000   120.000000   120.000000   120.000000
mean     16.338632    25.806500   107.882333   901.981667     0.133227     0.264194     0.275375     0.114706     0.293515     0.084869
std       5.420700     6.662576    37.724362   649.977408     0.023285     0.159738     0.211003     0.074700     0.065194     0.016399
min       8.952000    14.100000    56.650000   240.100000     0.071170     0.027290     0.000000     0.000000     0.156600     0.059050
25%      12.820000    19.687500    83.722500   510.275000     0.118325     0.145925     0.093132     0.060362     0.249375     0.073960
50%      15.290000    26.010000    98.245000   727.100000     0.134300     0.237700     0.249600     0.091630     0.282050     0.081645
75%      18.330000    30.922500   123.500000  1029.750000     0.149300     0.357925     0.400900     0.164575     0.324250     0.094778
max    (not shown in the source)
Table 1 Results for the summary statistics of the variables in the data set.
In [45]:

#feature with missing values
train_all_desc.columns[train_all_desc.ix["count"]<len(train_test)]
Out[45]:
Index([u'f21'], dtype='object')
In [47]:
The missing value analysis shows that one variable, f21, has missing values. These are replaced
by the mean value of the series, which is appropriate because f21 is a numerical variable. Other
imputation methods, such as mode or median imputation, could also be used (Macdonald & Headlam,
2010).
#imputing the missing values
missing_feature = train_all_desc.columns[train_all_desc.ix["count"]<len(train_test)]
In [48]:
print df_train[missing_feature].dtypes
f21 float64
dtype: object
In [50]:
# impute with mean value of the series
df_train[missing_feature]=df_train[missing_feature].fillna(train_test[missing_feature].mean())

df_test[missing_feature]=df_test[missing_feature].fillna(train_test[missing_feature].mean())
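The imputation step above can be illustrated without pandas. The following is a minimal sketch in plain Python 3, assuming a single numeric column in which missing entries are stored as NaN; the helper name `impute` and the sample values are hypothetical and are not taken from the data set. It also shows the median variant mentioned in the text.

```python
import math
from statistics import mean, median

def impute(values, strategy="mean"):
    """Replace NaN entries with the mean or median of the observed entries."""
    observed = [v for v in values if not math.isnan(v)]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if math.isnan(v) else v for v in values]

# Hypothetical column with two missing entries (not taken from the data set).
column = [10.0, float("nan"), 14.0, 30.0, float("nan")]

filled_mean = impute(column, "mean")
filled_median = impute(column, "median")
print(filled_mean)
print(filled_median)
```

With a skewed column such as this one, the mean and the median give noticeably different fill values, which is one reason to compare both strategies.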
In [51]:
df_train.describe()
Out[51]:
            Patient_ID           f1           f2           f3           f4           f5           f6           f7           f8           f9  ...
count       100.000000   100.000000   100.000000   100.000000   100.000000   100.000000   100.000000   100.000000   100.000000   100.000000  ...
mean   11607911.590000    14.225920    19.207900    92.722600   666.375000     0.095696     0.106129     0.090364     0.049346     0.179893  ...
std    27376003.746093     3.729963     4.732476    25.924925   366.768846     0.013496     0.057694     0.084449     0.042066     0.027482  ...
min        8670.000000     7.729000    10.820000    47.980000   178.800000     0.068830     0.023440     0.000000     0.000000     0.106000  ...
25%      865035.000000    11.880000    15.607500    75.667500   430.825000     0.084645     0.062065     0.025070     0.019017     0.161700  ...
50%      901301.500000    13.600000    18.805000    87.355000   572.050000     0.094985     0.096485     0.066145     0.032565     0.179700  ...
75%     2821689.250000    15.707500    21.917500   103.650000   768.325000     0.103825     0.130025     0.135875     0.076825     0.193400  ...
max    91979701.000000    25.220000    32.470000   171.500000  1878.000000     0.132600     0.311400     0.426400     0.184500     0.255600  ...

               f21          f22          f23          f24          f25          f26          f27          f28          f29          f30
count   100.000000   100.000000   100.000000   100.000000   100.000000   100.000000   100.000000   100.000000   100.000000   100.000000
mean     16.483803    25.380800   108.925200   909.191000     0.132563     0.265144     0.278176     0.117597     0.289196     0.083999
std       5.196886     6.689072    36.432902   597.843396     0.022108     0.161632     0.210617     0.075227     0.058586     0.014823
min       9.077000    14.100000    57.170000   248.000000     0.071170     0.027290     0.000000     0.000000     0.156600     0.059050
25%      13.057500    19.510000    84.055000   514.925000     0.119275     0.156575     0.093762     0.060362     0.246875     0.073960
50%      15.410000    25.670000    98.245000   727.100000     0.134300     0.237700     0.256650     0.104250     0.279600     0.081660
75%      18.922500    30.870000   125.450000  1110.250000     0.147875     0.357050     0.400900     0.173950     0.320600     0.093808
max      31.010000    45.410000   211.700000  2944.000000     0.187800     0.758400     0.960800     0.291000     0.475300     0.128400
Table 2 Summary statistics after the mean value imputation for the missing values.
In [53]:
Y=df_train['Diagnosis']
In [115]:
Section 1.2 : Logistic Regression train logistic regression models
This section presents the results for the logistic regression models. Once the missing values
have been handled and the data have been normalized, the models can be fitted.
To fit a model, the library first has to be imported, which is done with the following code.
#fitting of the first logistic model
from sklearn import linear_model
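The first model uses L1 regularization with the regularization strength alpha set to 0.1; the
second model, fitted after it, uses L2 regularization with the same strength. Note that the
fitted estimator printed below reports loss='hinge', the default loss of SGDClassifier, which
corresponds to a linear support vector machine; a logistic regression fit requires the logistic
loss (loss='log' in this version of scikit-learn). The model is shown below: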
#First model with L1 regularization
clf_l1 = linear_model.SGDClassifier(alpha=0.1, penalty='l1', random_state=1)

clf_l1.fit(X, Y)
Out[115]:
SGDClassifier(alpha=0.1, average=False, class_weight=None, epsilon=0.1,
eta0=0.0, fit_intercept=True, l1_ratio=0.15,
learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
penalty='l1', power_t=0.5, random_state=1, shuffle=True, verbose=0,
warm_start=False)
In [116]:
#Second model with L2 regularization
clf_l2 = linear_model.SGDClassifier(alpha=0.1, penalty='l2',random_state=1)
clf_l2.fit(X, Y)
Out[116]:
SGDClassifier(alpha=0.1, average=False, class_weight=None, epsilon=0.1,
eta0=0.0, fit_intercept=True, l1_ratio=0.15,
learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
penalty='l2', power_t=0.5, random_state=1, shuffle=True, verbose=0,
warm_start=False)
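For reference, the quantity that an SGD-trained linear classifier minimizes can be written as the average loss over the samples plus alpha times a penalty on the weights. The following plain Python 3 sketch illustrates that objective under this assumption; it is not the report's code, and the weights and margins are hypothetical values. It also contrasts the logistic loss with the hinge loss that appears in the printed estimators.

```python
import math

def log_loss(margin):
    """Logistic loss for one sample; margin = y * (w . x + b) with y in {-1, +1}."""
    return math.log(1.0 + math.exp(-margin))

def hinge_loss(margin):
    """Hinge loss for one sample (the loss of a linear SVM)."""
    return max(0.0, 1.0 - margin)

def penalty(weights, kind):
    """L1 penalty sums absolute weights; L2 penalty is half the sum of squares."""
    if kind == "l1":
        return sum(abs(w) for w in weights)
    return 0.5 * sum(w * w for w in weights)

def objective(margins, weights, alpha, kind, loss=log_loss):
    """Average loss over the samples plus alpha times the penalty."""
    data_term = sum(loss(m) for m in margins) / len(margins)
    return data_term + alpha * penalty(weights, kind)

# Hypothetical weights and margins, for illustration only.
weights = [0.5, -2.0, 0.0]
margins = [2.0, 0.0, -1.0]
print(objective(margins, weights, alpha=0.1, kind="l1"))
print(objective(margins, weights, alpha=0.1, kind="l2"))
print(objective(margins, weights, alpha=0.1, kind="l2", loss=hinge_loss))
```

The L1 penalty grows linearly with each weight and tends to push small weights to exactly zero, whereas the L2 penalty mainly shrinks large weights.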
In [117]:
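# Now, normalising the test data
# (df_test_sub is not defined in the cells shown; it is presumably the feature columns of df_test)
from sklearn import preprocessing
df_test_scaled = preprocessing.scale(df_test_sub)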
In [118]:
#After normalizing, the next step is to predict test data
pred_l1 = clf_l1.predict(df_test_scaled)
pred_l2 = clf_l2.predict(df_test_scaled)
In [120]:
pred_l2
Out[120]:
array(['B', 'B', 'B', 'B', 'B', 'M', 'B', 'M', 'B', 'M', 'M', 'B', 'B',
'B', 'B', 'B', 'B', 'M', 'B', 'M'],
dtype='|S1')
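The scaling step above standardizes the test data with its own mean and standard deviation. A common alternative is to compute the statistics on the training data and reuse them for the test data, so that both sets are transformed identically. The following plain Python 3 sketch shows the idea for one column; the helper names and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def fit_scaler(column):
    """Return the mean and population standard deviation of a training column."""
    return mean(column), pstdev(column)

def transform(column, mu, sigma):
    """Standardize values with previously fitted statistics."""
    return [(v - mu) / sigma for v in column]

# Hypothetical feature values, for illustration only.
train_col = [2.0, 4.0, 6.0, 8.0]
test_col = [5.0, 9.0]

mu, sigma = fit_scaler(train_col)
train_scaled = transform(train_col, mu, sigma)
test_scaled = transform(test_col, mu, sigma)  # reuse the training statistics
print(mu, sigma)
print(test_scaled)
```

With a test set of only 20 rows, as here, test-set statistics can differ noticeably from the training statistics, so the two approaches may give different predictions.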

In [121]:
Section 1.3 : Choosing the best hyper parameter
After fitting, the models are evaluated on the test data in terms of accuracy, precision and
recall (Cerrito, 2010; George, Seals, & Aban, 2014). These values are calculated for both models
and shown below.
#checking for model fit
#accuracy
from sklearn.metrics import accuracy_score
print accuracy_score(df_test['Diagnosis'], pred_l1)
print accuracy_score(df_test['Diagnosis'], pred_l2)
0.65
0.9
The accuracy of the first model is 0.65, meaning that 65 % of its predictions are correct, while
the accuracy of the second model is 0.9, or 90 %. In terms of accuracy the second model is
therefore better than the first (Armstrong, 2012; Cerrito, 2010).
In [95]:
#precision
from sklearn.metrics import precision_score
print precision_score(df_test['Diagnosis'], pred_l1, pos_label='M')
print precision_score(df_test['Diagnosis'], pred_l2, pos_label='M')
0.444444444444
0.833333333333
The precision results are shown above. The precision of the first model is 0.44, whereas that of
the second model is 0.83, which is much higher. In terms of precision, too, the second model
performs better than the first.


In [122]:
#recall
from sklearn.metrics import recall_score
print recall_score(df_test['Diagnosis'], pred_l1, pos_label='M')
print recall_score(df_test['Diagnosis'], pred_l2, pos_label='M')
0.666666666667
0.833333333333
The recall results are shown above. The recall of the first model is 0.67, whereas the recall of
the second model is 0.83.
In [97]:
#f1
from sklearn.metrics import f1_score
print f1_score(df_test['Diagnosis'], pred_l1, pos_label='M')
print f1_score(df_test['Diagnosis'], pred_l2, pos_label='M')
0.533333333333
0.833333333333
In [123]:
The last measure used to compare the models is the F1 score. As shown above, the F1 score of the
first model is 0.53, whereas that of the second model is 0.83. The F1 score is the harmonic mean
of precision and recall, that is, twice the product of precision and recall divided by their sum
(Rajasekar, Philominathan, & Chinnathambi, 2013).
#confusion matrix
from sklearn.metrics import confusion_matrix
print confusion_matrix(df_test['Diagnosis'], pred_l1)
print confusion_matrix(df_test['Diagnosis'], pred_l2)
[[9 5]
[2 4]]
[[13 1]
[ 1 5]]
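The accuracy, precision, recall and F1 values reported in this section can be recomputed directly from the two confusion matrices. The sketch below is plain Python 3 and assumes the usual scikit-learn layout with labels in sorted order, so that rows are the true classes B and M and columns are the predicted classes B and M, with M as the positive class. The function name is illustrative.

```python
def binary_metrics(matrix):
    """Metrics for the positive class from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Confusion matrices reported above (rows: true B, true M; columns: predicted B, predicted M).
l1_matrix = [[9, 5], [2, 4]]
l2_matrix = [[13, 1], [1, 5]]

for name, matrix in [("L1", l1_matrix), ("L2", l2_matrix)]:
    acc, prec, rec, f1 = binary_metrics(matrix)
    print(name, round(acc, 3), round(prec, 3), round(rec, 3), round(f1, 3))
```

For the L1 model this gives 13/20 correct predictions, a precision of 4/9 and a recall of 4/6, which agree with the scores printed earlier.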

In [124]:
from sklearn.cross_validation import ShuffleSplit
ss = ShuffleSplit(n=len(X)-1,n_iter=100, test_size=0.25)
In [125]:
#finding the best hyperparameter based on average accuracy
#L1
from sklearn.cross_validation import ShuffleSplit
dict_l1=[]
for i in [0.1,1,3,10,33,100,333,1000, 3333, 10000, 33333]:
    accu=[]
    for train_index, test_index in ss:
        #print list(train_index)
        #print("%s %s" % (train_index, test_index))
        clf_l1 = linear_model.SGDClassifier(alpha=i, penalty='l1', random_state=1)
        clf_l1.fit(X.ix[list(train_index)], Y.ix[list(train_index)])
        pred=clf_l1.predict(X.ix[list(test_index)])
        accu.append( accuracy_score(pred,Y[test_index]))
    print i,sum(accu)/float(len(accu))
    dict_l1.append(sum(accu)/float(len(accu)))
0.1 0.9572
1 0.5456
3 0.4964
10 0.5168
33 0.532
100 0.504
333 0.4932
1000 0.5036

3333 0.5164
10000 0.5056
33333 0.5172
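The loop above averages the accuracy over repeated random train/validation splits. As a rough illustration of what such a splitter does, the following plain Python 3 sketch generates the index pairs itself; it is a simplified stand-in written for this explanation, not the scikit-learn implementation, and the sample size of 100 is a hypothetical value.

```python
import random

def shuffle_split(n_samples, n_iter, test_size, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits."""
    rng = random.Random(seed)
    n_test = int(round(test_size * n_samples))
    for _ in range(n_iter):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The first n_test shuffled indices form the test part, the rest the training part.
        yield indices[n_test:], indices[:n_test]

splits = list(shuffle_split(n_samples=100, n_iter=3, test_size=0.25))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each iteration draws a fresh permutation, so a given row can appear in the validation part of several splits; averaging over many splits reduces the noise of a single split.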
In [127]:
#L2
from sklearn.cross_validation import ShuffleSplit
ss = ShuffleSplit(n=len(X)-1,n_iter=100, test_size=0.25, random_state=1)
dict_l2=[]
for i in {0.001, 0.003, 0.01, 0.03, 0.1,0.3,1,3,10,33}:
    accu=[]
    for train_index, test_index in ss:
        #print list(train_index)
        #print("%s %s" % (train_index, test_index))
        clf_l2 = linear_model.SGDClassifier(alpha=i, penalty='l2', random_state=1)
        clf_l2.fit(X.ix[list(train_index)], Y.ix[list(train_index)])
        pred=clf_l2.predict(X.ix[list(test_index)])
        accu.append( accuracy_score(pred,Y[test_index]))
    print i,sum(accu)/float(len(accu))
    dict_l2.append(sum(accu)/float(len(accu)))
33 0.5824
3 0.9108
0.1 0.9792
1 0.9632
10 0.7992
0.001 0.972
0.3 0.9752
0.03 0.9728
0.003 0.9728
0.01 0.9704
In [159]:
alp_q1_l2=[0.001, 0.003, 0.01, 0.03, 0.1,0.3,1,3,10,33]


alp_q1_l1=[0.1,1,3,10,33,100,333,1000, 3333, 10000, 33333]
In [160]:
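#NOTE: dict_l2 was filled in set-iteration order (33, 3, 0.1, ...), not in the order of alp_q1_l2,
#so the index lookup for L2 returns 0.01 although the highest mean accuracy above (0.9792) is for alpha=0.1
best_alpha_l1=alp_q1_l1[dict_l1.index(max(dict_l1))]
best_alpha_l2=alp_q1_l2[dict_l2.index(max(dict_l2))]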
print best_alpha_l1
print best_alpha_l2
0.1
0.01
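The L2 grid search iterates over a set, whose iteration order need not match the ordered list that is later used to look up the best value. The sketch below, in plain Python 3, copies the mean accuracies from the L2 output above and shows both the order-dependent lookup and an order-independent alternative that stores each alpha together with its score. The variable names are illustrative.

```python
# Mean validation accuracies for the L2 penalty, copied from the output above.
l2_scores = {
    0.001: 0.972, 0.003: 0.9728, 0.01: 0.9704, 0.03: 0.9728, 0.1: 0.9792,
    0.3: 0.9752, 1: 0.9632, 3: 0.9108, 10: 0.7992, 33: 0.5824,
}

# Keeping alpha and score together removes any dependence on iteration order.
best_alpha = max(l2_scores, key=l2_scores.get)
print(best_alpha, l2_scores[best_alpha])

# Reproducing the lookup used in the report: scores stored in the printed
# (set-iteration) order, then indexed into the sorted list of alphas.
printed_order = [33, 3, 0.1, 1, 10, 0.001, 0.3, 0.03, 0.003, 0.01]
scores_in_printed_order = [l2_scores[a] for a in printed_order]
ordered_alphas = sorted(l2_scores)
mismatched = ordered_alphas[scores_in_printed_order.index(max(scores_in_printed_order))]
print(mismatched)
```

Under these reported accuracies, the order-independent lookup selects alpha = 0.1, while the index-based lookup returns 0.01 because the two sequences are ordered differently.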
In [161]:
#Running the model using the best alpha & lambda
#since the alpha & lambda can only be within 0 to 1, we will take the best out of them.
#L1
clf_l1_opt = linear_model.SGDClassifier(alpha=best_alpha_l1, penalty='l1')
clf_l1_opt.fit(X, Y)
pred_l1=clf_l1_opt.predict(df_test_scaled)
In [162]:
#L2
clf_l2_opt = linear_model.SGDClassifier(alpha=best_alpha_l2, penalty='l2')
clf_l2_opt.fit(X, Y)
pred_l2=clf_l2_opt.predict(df_test_scaled)
In [163]:
#checking for model fit
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
#accuracy

from sklearn.metrics import accuracy_score
print accuracy_score(df_test['Diagnosis'], pred_l1)
print accuracy_score(df_test['Diagnosis'], pred_l2)
#precision
from sklearn.metrics import precision_score
print precision_score(df_test['Diagnosis'], pred_l1, pos_label='M')
print precision_score(df_test['Diagnosis'], pred_l2, pos_label='M')
#recall
print recall_score(df_test['Diagnosis'], pred_l1, pos_label='M')
print recall_score(df_test['Diagnosis'], pred_l2, pos_label='M')
#f1
from sklearn.metrics import f1_score
print f1_score(df_test['Diagnosis'], pred_l1, pos_label='M')
print f1_score(df_test['Diagnosis'], pred_l2, pos_label='M')
#confusion matrix
from sklearn.metrics import confusion_matrix
print confusion_matrix(df_test['Diagnosis'], pred_l1)
print confusion_matrix(df_test['Diagnosis'], pred_l2)
0.9
0.95
0.833333333333
1.0
0.833333333333
0.833333333333
0.833333333333

0.909090909091
[[13 1]
[ 1 5]]
[[14 0]
[ 1 5]]
In [164]:
#top 5 features
type(clf_l1_opt.coef_)
Out[164]:
numpy.ndarray
In [165]:
df=pd.DataFrame(X.columns)
In [166]:
df['coef']=clf_l1_opt.coef_[0]
In [167]:
#choose top 5 features from the list below
df_final_features=df.sort('coef',ascending=False)
print df_final_features[:5]
0 coef
21 f22 0.480102
22 f23 0.447370
26 f27 0.393079
27 f28 0.293350
7 f8 0.204748
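The ranking above sorts the coefficients by their signed value, so a feature with a large negative coefficient would not appear in the top five even though it influences the prediction strongly. The plain Python 3 sketch below contrasts the two orderings. It uses the five coefficients printed above plus one hypothetical negative entry, named `f_hypothetical`, which is added only for illustration and is not part of the report's results.

```python
# The five coefficients printed above, plus one hypothetical negative value
# ("f_hypothetical") added only to show the effect of sorting by magnitude.
coefficients = {
    "f22": 0.480102,
    "f23": 0.447370,
    "f27": 0.393079,
    "f28": 0.293350,
    "f8": 0.204748,
    "f_hypothetical": -0.600000,
}

# Ranking by signed value, as in the report.
by_value = sorted(coefficients, key=coefficients.get, reverse=True)[:5]

# Ranking by absolute value, which treats both directions of influence alike.
by_magnitude = sorted(coefficients, key=lambda name: abs(coefficients[name]), reverse=True)[:5]

print(by_value)
print(by_magnitude)
```

Which ordering is appropriate depends on whether the goal is to find the features that most increase the predicted class or the features with the largest influence in either direction.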


Task B: Multiclass Classification
This is the second task of the project. The main focus of this section is to implement
multi-class classification by combining binary classifiers. The binary classifier is the same
L1-regularized logistic regression used in the previous section. The first task in this section
is to understand the given data set and to create a One-versus-Rest classifier.
Section 2.1 : Read and understand the data, create a default One-versus Rest Classifier
To create the classifier, the required libraries are first imported with the following code.
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
In [60]:
After importing the necessary libraries, the next step is to import the data set for the
analysis. In this case a reduced version of the MNIST data set of handwritten digits is used.
According to the information provided, there are no missing values in the data set.
#importing the data
df_minst = pd.read_csv("reduced_mnist.csv")

In [61]:
df_minst.shape
Out[61]:
(2520, 785)
In [63]:
type(df_minst)
Out[63]:
pandas.core.frame.DataFrame
In [64]:
df_minst.columns
Out[64]:
Index([u'label', u'pixel0', u'pixel1', u'pixel2', u'pixel3', u'pixel4',
u'pixel5', u'pixel6', u'pixel7', u'pixel8',
...
u'pixel774', u'pixel775', u'pixel776', u'pixel777', u'pixel778',
u'pixel779', u'pixel780', u'pixel781', u'pixel782', u'pixel783'],
dtype='object', length=785)
In [65]:
#creating separate data for target & attributes
Y=df_minst['label']
X=df_minst.drop('label', axis=1)
In [66]:
print Y.shape
print X.shape
(2520L,)
(2520, 784)

In [67]:
#1.1 number of data points
print len(df_minst)
#1.2 total number of features
print len(df_minst.columns)
#1.3.unique labels in the data
print set(df_minst.label)
Understanding the data set
Number of data points : 2520
Total number of columns : 785
Unique labels in the data : set([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
As the results show, there are 2520 data points, that is, 2520 rows, in the data set. There are
785 columns: one label column and 784 pixel features (pixel0 to pixel783). The label takes ten
distinct values, the digits 0 to 9.
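The 784 pixel columns correspond to the 28 x 28 pixels of an MNIST image stored as one flat row. The plain Python 3 sketch below maps a flat pixel index back to a grid position. It assumes row-major ordering (pixelN at row N // 28 and column N % 28), which is the usual convention for this column naming but is not stated in the report; the placeholder values are not real pixel data.

```python
SIDE = 28  # MNIST images are 28 x 28 pixels, giving 784 values per image

def pixel_position(index, side=SIDE):
    """Map a flat pixel index to (row, column), assuming row-major order."""
    return divmod(index, side)

def to_grid(flat, side=SIDE):
    """Reshape a flat list of side*side values into a list of rows."""
    return [flat[r * side:(r + 1) * side] for r in range(side)]

flat_image = list(range(SIDE * SIDE))  # placeholder values, not real pixel data
grid = to_grid(flat_image)
print(pixel_position(0), pixel_position(783))
print(len(grid), len(grid[0]))
```

Viewing a row as a grid in this way is mainly useful for plotting individual digits; the classifiers in this section work on the flat 784-value rows directly.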
In [68]:
Section 2.2 : Choosing the best hyper-parameter
In this section the results for the best hyper-parameter are shown. However, before the analysis
is carried out, the given data needs to be divided into training and test data.
#2.1 Spliting the data into training & validation
from sklearn.cross_validation import train_test_split
import numpy
data_X = numpy.array(X) #convert array to numpy type array
x_train ,x_test = train_test_split(data_X,test_size=0.3)
data_Y = numpy.array(Y)


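#NOTE: data_X and data_Y are split by two separate calls, so each is shuffled independently
#and the rows of x_train are no longer paired with the labels in y_train
y_train, y_test = train_test_split(data_Y, test_size=0.3)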
print len(x_train),len(x_test),len(y_train), len(y_test)
1764 756 1764 756
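When features and labels are split by two independent random calls, each call draws its own permutation, and the pairing between a row and its label is lost. The plain Python 3 sketch below shows a split that applies one shared permutation to both sequences. The function name and the toy data are hypothetical; in scikit-learn the same effect is obtained by passing both arrays to a single `train_test_split` call.

```python
import random

def train_test_split_together(features, labels, test_size, seed=0):
    """Split features and labels with one shared permutation of row indices."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(test_size * len(indices)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Toy data in which each label can be read off its feature row.
features = [[i, i % 10] for i in range(2520)]
labels = [row[1] for row in features]

x_train, x_test, y_train, y_test = train_test_split_together(features, labels, 0.3)
print(len(x_train), len(x_test), len(y_train), len(y_test))

# Every row is still paired with its own label after the split.
print(all(row[1] == y for row, y in zip(x_train, y_train)))
```

The sizes match those printed above (1764 training rows and 756 test rows), but here each feature row stays aligned with its label.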
In [69]:
After splitting the data, the next step is to build a One-vs-Rest classifier, which is
constructed as follows.
# 2.2 Building a Onevsrest classifier
#NOTE: this estimator is only constructed here; the model fitted below is LogisticRegression with multi_class='ovr'
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC
ovr=OneVsRestClassifier(LinearSVC(random_state=0, penalty='l1', C=1, loss='log'))
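The One-vs-Rest strategy turns a ten-class problem into ten binary problems, one per digit, and predicts the class whose binary classifier is most confident. The plain Python 3 sketch below illustrates the two building blocks, the binary target construction and the final decision rule. The function names and the decision scores are hypothetical and are not produced by the models in this report.

```python
def one_vs_rest_targets(labels, positive_class):
    """Binary targets for one classifier: 1 for the chosen class, 0 for the rest."""
    return [1 if label == positive_class else 0 for label in labels]

def one_vs_rest_predict(scores_by_class):
    """Pick the class whose binary classifier gives the highest score."""
    return max(scores_by_class, key=scores_by_class.get)

labels = [3, 7, 3, 0, 9]
targets_for_3 = one_vs_rest_targets(labels, positive_class=3)
print(targets_for_3)

# Hypothetical decision scores from ten binary classifiers for one image.
scores = {digit: -1.0 for digit in range(10)}
scores[7] = 0.8
scores[1] = 0.2
predicted = one_vs_rest_predict(scores)
print(predicted)
```

Because each binary classifier is trained separately, their scores are only roughly comparable; the argmax rule is nevertheless the standard way to combine them.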
In [70]:
The variables should be scaled before the model is fitted, so the following code is run.
#scaling the variables before fitting model
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(x_train)
X_test = scaler.transform(x_test)
In [71]:
x_train.shape
Out[71]:
(1764L, 784L)
In [72]:
#fitting the model
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(C=1,
                         multi_class='ovr',
                         penalty='l1', solver='newton-cg', tol=0.1)
clf.fit(X_train, y_train)
Out[72]:
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr',
penalty='l1', random_state=None, solver='newton-cg', tol=0.1,
verbose=0)
In [73]:
The next step is to predict the test data with the model fitted on the training data.
#predicting test data
pred_l1 = clf.predict(X_test)
In [74]:
#performance measures
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
print accuracy_score(y_test, pred_l1)
print precision_score(y_test, pred_l1)
print recall_score(y_test, pred_l1)
The evaluation of the prediction model is done on the basis of the accuracy, precision and the
recall. The results for all the three are shown below.
0.101851851852
0.102700912205
0.101851851852

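As the results show, the accuracy is 0.10, meaning that only about 10 % of the predictions are
correct. With ten digit classes this is about the accuracy expected from random guessing, which
is consistent with the features and the labels having been split by separate train_test_split
calls, so that the rows of x_train no longer correspond to the labels in y_train. The precision,
the share of predictions for a class that are actually correct, is also 0.10, and the recall, the
share of the true members of a class that are found, is 0.10 as well.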
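An accuracy close to 0.10 on a ten-class problem is what one would expect when predictions and labels are unrelated. The plain Python 3 sketch below illustrates this with synthetic balanced labels, which are an assumption made for the illustration: even a prediction that is correct for every row scores at about chance level once the labels are shuffled independently of the rows.

```python
import random

rng = random.Random(0)

# Balanced ten-class labels, as a stand-in for the digit labels.
true_labels = [i % 10 for i in range(2520)]

# A "perfect" prediction for every row ...
predictions = list(true_labels)

# ... scored against labels that were shuffled independently of the rows.
shuffled_labels = list(true_labels)
rng.shuffle(shuffled_labels)

matches = sum(p == y for p, y in zip(predictions, shuffled_labels))
accuracy = matches / len(predictions)
print(round(accuracy, 3))
```

The exact value depends on the random seed, but it stays near one in ten, which is the pattern seen in the scores above.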
In [75]:
#For validation
from sklearn.cross_validation import ShuffleSplit
ss = ShuffleSplit(n=len(df_minst)-1000,n_iter=10, test_size=0.3, random_state=1)
dict_l1=[]
for i in [0.1, 1, 3, 10, 33, 100, 333, 1000, 3333, 10000, 33333]:
    accu=[]
    for train_index, test_index in ss:
        #print list(train_index)
        #print("%s %s" % (train_index, test_index))
        clf = LogisticRegression(C=i,multi_class='ovr',penalty='l1', solver='newton-cg', tol=0.1)
        clf.fit(X.ix[list(train_index)], Y.ix[list(train_index)])
        pred=clf.predict(X.ix[list(test_index)])
        accu.append( accuracy_score(pred,Y[test_index]))
    print i,sum(accu)/float(len(accu))
    dict_l1.append(sum(accu)/float(len(accu)))
0.1 0.812719298246
1 0.810087719298
3 0.809429824561
10 0.809429824561
33 0.809429824561
100 0.808771929825
333 0.809429824561
1000 0.809429824561
3333 0.809429824561
10000 0.809210526316


33333 0.808771929825
In [78]:
The accuracy on the training data is shown below. It is computed with the same splits and the
same values of C so that it can be compared with the validation accuracy above.
#For training accuracy
from sklearn.cross_validation import ShuffleSplit
ss = ShuffleSplit(n=len(df_minst)-1000,n_iter=10, test_size=0.3, random_state=1)
dict_train_l1=[]
for i in [0.1, 1, 3, 10, 33, 100, 333, 1000, 3333, 10000, 33333]:
    accu=[]
    for train_index, test_index in ss:
        #print list(train_index)
        #print("%s %s" % (train_index, test_index))
        clf = LogisticRegression(C=i,multi_class='ovr',penalty='l1', solver='newton-cg', tol=0.1)
        clf.fit(X.ix[list(train_index)], Y.ix[list(train_index)])
        pred=clf.predict(X.ix[list(train_index)])
        accu.append( accuracy_score(pred,Y[train_index]))
    print i,sum(accu)/float(len(accu))
    dict_train_l1.append(sum(accu)/float(len(accu)))
0.1 1.0
1 1.0
3 1.0
10 1.0
33 1.0
100 1.0
333 1.0
1000 1.0
3333 1.0
10000 1.0

33333 1.0
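Comparing the two sets of accuracies gives a direct view of overfitting: the gap between training and validation accuracy. The plain Python 3 sketch below copies the mean accuracies from the two outputs above and computes the gap for every value of C. The variable names are illustrative.

```python
c_values = [0.1, 1, 3, 10, 33, 100, 333, 1000, 3333, 10000, 33333]

# Mean accuracies copied from the validation and training outputs above.
validation = [0.812719298246, 0.810087719298, 0.809429824561, 0.809429824561,
              0.809429824561, 0.808771929825, 0.809429824561, 0.809429824561,
              0.809429824561, 0.809210526316, 0.808771929825]
training = [1.0] * len(c_values)

# Gap between training and validation accuracy for each value of C.
gaps = {c: t - v for c, t, v in zip(c_values, training, validation)}

# Value of C with the highest validation accuracy.
best_c = c_values[validation.index(max(validation))]

for c in c_values:
    print(c, round(gaps[c], 4))
print(best_c)
```

The training accuracy is 1.0 for every value of C while the validation accuracy stays near 0.81, so the gap of roughly 0.19 barely changes across the grid; this suggests that the model fits the training splits perfectly regardless of C in this range.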
In [86]:
#Plot the accuracy for training & validation
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(alp2,dict_l1,"r")
plt.plot(alp2,dict_train_l1,"b")
#plotting together
plt.show()
In [76]:
Lastly, after computing the training and validation accuracy, the next step is to find the best
value of the hyper-parameter. The following code was written and run for this purpose.
#finding the best aplha parameter
alp2= [0.1, 1, 3, 10, 33, 100, 333, 1000, 3333, 10000, 33333]

opt_alp2=alp2[dict_l1.index(max(dict_l1))]
print opt_alp2
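The selected value is shown below. On the basis of the validation results, the best value is 0.1.
Note that this value is passed to LogisticRegression as C, which is the inverse of the
regularization strength, so a smaller C means stronger regularization.
#opt_alp2 is the best hyper parameter
0.1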
In [77]:
The best value was identified in the previous step, so the model is now refitted with it. The
results are shown below.
#running the model with best hyper parameter
clf = LogisticRegression(C=opt_alp2,
multi_class='ovr',
penalty='l1', solver='newton-cg', tol=0.1)
clf.fit(X_train, y_train)
pred_l1 = clf.predict(X_test)
print accuracy_score(y_test, pred_l1)
print precision_score(y_test, pred_l1)
print recall_score(y_test, pred_l1)
0.0899470899471
0.0924919185535
0.0899470899471
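After refitting the model with the best value, the accuracy is 0.0899, the precision is 0.0925
and the recall is 0.0899. The accuracy is again at the level of random guessing for ten classes
and far below the validation accuracy of about 0.81 obtained above, where the features and labels
were indexed together. This points to the mismatch between x_train and y_train introduced by the
separate splits rather than to the choice of hyper-parameter.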


References
Armstrong, J. S. (2012). Illusions in regression analysis. International Journal of Forecasting, 6,
689–694.
Belle, A., Thiagarajan, R., Soroushmehr, S. M. R., Navidi, F., Beard, D. A., & Najarian, K.
(2015). Big Data Analytics in Healthcare. BioMed Research International, 2015, 370194.
https://doi.org/10.1155/2015/370194
Cao, M., Chychyla, R., & Stewart, T. (2015). Big Data Analytics in Financial Statement Audits.
Accounting Horizons, 29(2), 423–449.
Cerrito, P. B. (2010). The Difference Between Predictive Modeling and Regression. Louisville.
Chen, H., Chiang, R., & Storey, V. (2012). Business intelligence and analytics: From big data to
big impact. MIS Quarterly, 36(4), 1165–1188.
Forbes. (2016). Roundup Of Analytics, Big Data & BI Forecasts And Market Estimates,
2016.
George, B., Seals, S., & Aban, I. (2014). Survival analysis and regression models. NCBI, 21(4),
686–694.
Jonker, J., & Pennink, B. (2010). The Essence of Research Methodology: A Concise Guide for
Master and PhD Students in Management Science. Springer Science & Business Media.
Kambatla, K., Kollias, G., Kumar, V., & Grama, A. (2014). Trends in big data analytics. Journal
of Parallel and Distributed Computing, 74(7), 2561–2573.

Kumar, R. (2014). Research Methodology: A Step-by-Step Guide for Beginners. SAGE
Publications.
Macdonald, S., & Headlam, N. (2010). Research Methods Handbook. Manchester.
Peisker, A., & Dalai, S. (2015). Data Analytics for Rural Development. Indian Journal of
Science and Technology, 8, 50–60.
Picciano, A. G. (2012). The Evolution of Big Data and Learning Analytics in American Higher
Education. Journal of Asynchronous Learning Networks, 16(3), 9–20.
Rajasekar, S., Philominathan, P., & Chinnathambi, V. (2013). Research methodology. Tamilnadu
India.
