
Implementing Logistic Regression for Binary and Multiclass Classification


Added on 2023-06-07

Machine learning project
Contents
Introduction
Task A: Binary Classification
Section 1.1 : Data Munging
Section 1.2 : Training the logistic regression models
Section 1.3 : Choosing the best hyper-parameter
Task B: Multiclass Classification
Section 2.2 : Choosing the best hyper-parameter
References
List of Tables
Table 1 Results for the summary statistics of the variables in the data set
Table 2 Summary statistics after the mean value imputation for the missing values
Introduction
The aim of the current project is to implement logistic regression techniques to solve a real-world problem. Both two-class and multi-class classification techniques have been used, and the concepts of overfitting and underfitting have also been applied in the analysis. The main purpose of the project is to apply the classification techniques learned in class to a real-world problem (Forbes, 2016; Kambatla, Kollias, Kumar, & Grama, 2014; Picciano, 2012).
In recent times, with the increase in the volume and variety of the data being generated, it has become important for every business and sector to use data analysis techniques to solve business problems. Previous research has shown that data-backed business processes and strategies are more effective and cost-efficient than traditional strategies, which were mostly based on the previous experience and intuition of the decision maker. With the development of machine learning techniques and statistical modelling, it is now possible to predict future values more accurately than ever before. In machine learning, algorithms are trained on historical data; once the optimal model is finalized, the same model is applied to the test data. These techniques are very powerful and are widely used in many organizations (Belle et al., 2015; Cao, Chychyla, & Stewart, 2015; Chen, Chiang, & Storey, 2012; Peisker & Dalai, 2015).
Task A: Binary Classification
This section presents the results for the binary classification task. The first part shows the results of the data munging, followed by the implementation of the logistic regression models. The third part focuses on choosing the best hyper-parameter.
Python has been used for the analysis in this section; it has become one of the most widely used languages among data scientists, software developers and data analysts alike.
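The workflow just described (train logistic regression models, then select the best hyper-parameter on held-out data) can be sketched roughly as follows. This is an illustrative sketch only: it assumes scikit-learn's LogisticRegression and uses a synthetic stand-in for the data set, and the grid of regularisation strengths is not necessarily the one used later in the report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the data: 30 numeric features, 2 classes
X, y = make_classification(n_samples=200, n_features=30, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Try several values of the inverse regularisation strength C and
# keep the one that scores best on the validation split
best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    acc = model.score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc

print(best_C, round(best_acc, 3))
```

The same pattern scales to any hyper-parameter grid; only the model kept at the end is applied to the test data.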
Section 1.1 : Data Munging
The results of the data munging are shown in the following analysis. The first step is to import the required libraries, pandas and NumPy.
import pandas as pd
import numpy as np
#importing the wisconsin_data
df_train = pd.read_csv("train_wbcd.csv")
df_test = pd.read_csv("test_wbcd.csv")
The data set arrives already divided into training and test data. The model will first be fitted on the training data to develop the optimal model; once the optimal model is developed, the same model is applied to the test data. For computing summary statistics over all observations, the two parts are combined into a single frame.
#combining the training and the test data for summary statistics
train_test = pd.concat([df_train, df_test])
The next task is to find the target variable which is also known as the response variable. This is
the main variable of interest in the every model. Traditionally it is also known as the dependent
variable(Jonker, J. and Pennink, 2010; Kumar, 2014).
#identification of the response variable
df_train.columns
Out[43]:
Index(['Patient_ID', 'Diagnosis', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6',
       'f7', 'f8', 'f9', 'f10', 'f11', 'f12', 'f13', 'f14', 'f15',
       'f16', 'f17', 'f18', 'f19', 'f20', 'f21', 'f22', 'f23', 'f24',
       'f25', 'f26', 'f27', 'f28', 'f29', 'f30'],
      dtype='object')
Examining the frequency distribution of the target variable shows that the data set is reasonably balanced (58 cases of class B against 42 of class M).
#examining the frequency distribution of the data
df_train['Diagnosis'].value_counts()
#It is a balanced distribution
B 58
M 42
dtype: int64
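The same balance check can be expressed as proportions with value_counts(normalize=True). The sketch below uses a toy series built from the counts reported above, since only those counts are available here.

```python
import pandas as pd

# Toy series with the same class counts as reported above (58 B, 42 M)
diagnosis = pd.Series(["B"] * 58 + ["M"] * 42)

# normalize=True turns raw counts into proportions
proportions = diagnosis.value_counts(normalize=True)
print(proportions)
```

With 58% B and 42% M, neither class dominates, so no resampling is needed before fitting.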
The next step is to examine whether there are missing values in the variables, because missing data can significantly affect the results of the analysis.
#examining the missing values in the data
train_all_desc=train_test.describe()
train_all_desc
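describe() reveals missing values only indirectly, through a reduced count. A more direct check is isnull().sum(); the sketch below runs it on a small toy frame (the column names are illustrative, mirroring the real data).

```python
import numpy as np
import pandas as pd

# Toy frame: f21 has one missing value, f1 has none
df = pd.DataFrame({"f1": [1.0, 2.0, 3.0],
                   "f21": [0.5, np.nan, 0.7]})

# Count NaNs per column and keep only the columns that have any
missing_counts = df.isnull().sum()
missing_cols = missing_counts[missing_counts > 0]
print(missing_cols)
```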
Summary statistics of the variables
This section shows the summary statistics of the variables. The descriptive statistics used are the count, mean, standard deviation, quartiles, and the minimum and maximum values.
        Patient_ID    f1          f2          f3          f4
count   1.200000e+02  120.000000  120.000000  120.000000  120.000000
mean    2.713975e+07  14.066150   19.478167   91.673000   656.777500
std     1.163707e+08  3.847441    4.628016    26.746211   391.581714
min     8.670000e+03  7.729000    10.820000   47.980000   178.800000
25%     8.658465e+05  11.747500   15.800000   75.022500   425.100000
50%     9.043275e+05  13.490000   19.030000   86.715000   564.150000
75%     8.736826e+06  15.325000   21.947500   100.525000  730.925000
max     9.112962e+08  27.420000   32.470000   186.900000  2501.000000
Table 1 Results for the summary statistics of the variables in the data set (condensed: columns f5-f30 are omitted for readability; note that the count for f21 is 117 against 120 for every other column, i.e. f21 has three missing values).
In [45]:
#features with missing values
train_all_desc.columns[train_all_desc.loc["count"] < len(train_test)]
Out[45]:
Index(['f21'], dtype='object')
On the basis of the missing value analysis, only one variable, f21, contains missing values. To handle them, the missing values are replaced by the mean value of the series, which is appropriate because f21 is a numerical variable. Other imputations could also be used, such as mode value imputation or median value imputation (Macdonald & Headlam, 2010).
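The three strategies just mentioned differ mainly in their sensitivity to outliers. The sketch below compares them on a toy series (the values are illustrative, not taken from the data set).

```python
import numpy as np
import pandas as pd

# Toy column with one missing value and one outlier (100.0)
s = pd.Series([1.0, 2.0, np.nan, 100.0])

mean_imputed = s.fillna(s.mean())      # pulled upward by the outlier
median_imputed = s.fillna(s.median())  # robust to the outlier
mode_imputed = s.fillna(s.mode()[0])   # more natural for categorical data

# The imputed value at the missing position under each strategy
print(mean_imputed[2], median_imputed[2], mode_imputed[2])
```

With a skewed numerical column, median imputation is often the safer default; mean imputation, as used here, is reasonable when the distribution has no extreme outliers.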
#imputing the missing values
missing_feature = train_all_desc.columns[train_all_desc.loc["count"] < len(train_test)]
In [48]:
print(df_train[missing_feature].dtypes)
f21 float64
dtype: object
In [50]:
# impute with mean value of the series
df_train[missing_feature] = df_train[missing_feature].fillna(train_test[missing_feature].mean())
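A quick sanity check confirms that the fillna step leaves no missing values behind. The sketch below runs the same step on a toy stand-in for the training frame, not the actual data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training frame, with one gap in f21
df_train = pd.DataFrame({"f21": [10.0, np.nan, 14.0]})

# Same fillna-with-mean step as above, then confirm nothing is missing
df_train["f21"] = df_train["f21"].fillna(df_train["f21"].mean())
assert df_train["f21"].isnull().sum() == 0
print(df_train["f21"].tolist())
```

Note that fillna with the column mean (here 12.0, the mean of 10.0 and 14.0) leaves the column mean itself unchanged, which is one reason mean imputation is popular for numerical features.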