Batch and Stochastic Gradient Descent for Logistic Regression

Table of Contents
1.1 Batch Gradient Descent
a) Logistic regression setting of w
1.2 Stochastic Gradient Descent
1.1 Batch Gradient Descent
Considering a binary classification problem with data D = \{(x_i, y_i)\}_{i=1}^{N}, the negative log-likelihood is

NLL(D, w) = -\sum_{i=1}^{N} \left[ (1 - y_i)\log\big(1 - \sigma(w^T x_i)\big) + y_i \log \sigma(w^T x_i) \right]

Given the following definitions:

f(x) = \sigma(w^T x)

L(w) = NLL(D, w) = -\sum_{i=1}^{N} \left[ (1 - y_i)\log\big(1 - \sigma(w^T x_i)\big) + y_i \log \sigma(w^T x_i) \right]

where w and x_i are vectors and p(x_i) is shorthand for p(y_i = 1 \mid x_i) = \sigma(w^T x_i).

\nabla_w L(w) = -\sum_{i=1}^{N} \nabla_w \left[ (1 - y_i)\log\big(1 - \sigma(w^T x_i)\big) + y_i \log \sigma(w^T x_i) \right]

For the second term, the product rule gives (since \nabla_w y_i = 0):

\nabla_w \left[ y_i \log \sigma(w^T x_i) \right] = 0 \cdot \log \sigma(w^T x_i) + y_i \cdot \nabla_w \log \sigma(w^T x_i) = y_i \big(1 - \sigma(w^T x_i)\big)\, x_i

For the first term:

\nabla_w \left[ (1 - y_i)\log\big(1 - \sigma(w^T x_i)\big) \right] = (1 - y_i) \cdot \frac{1}{1 - \sigma(w^T x_i)} \cdot \big(-\sigma(w^T x_i)\big)\big(1 - \sigma(w^T x_i)\big)\, x_i = -(1 - y_i)\,\sigma(w^T x_i)\, x_i

Combining the two:

\nabla_w L(w) = \sum_{i=1}^{N} \left[ (1 - y_i)\,\sigma(w^T x_i) - y_i\big(1 - \sigma(w^T x_i)\big) \right] x_i

The derivation of the gradient of the negative log-likelihood therefore gives

\nabla_w NLL(D, w) = \sum_{i=1}^{N} \big(\sigma(w^T x_i) - y_i\big)\, x_i
a) Logistic regression setting of w
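Under this setting, w is updated with the full-batch gradient derived above, w \leftarrow w - \eta \sum_{i=1}^{N} (\sigma(w^T x_i) - y_i)\, x_i. Below is a minimal NumPy sketch of that batch update; the synthetic data, the learning rate eta, and the iteration count are illustrative assumptions rather than part of the original assignment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll_gradient(w, X, y):
    # grad = sum_i (sigma(w^T x_i) - y_i) x_i, i.e. the result of the derivation above
    return X.T @ (sigmoid(X @ w) - y)

def batch_gradient_descent(X, y, eta=0.01, n_iters=1000):
    # Every iteration uses the full data set (batch gradient descent)
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w = w - eta * nll_gradient(w, X, y)
    return w

# Illustrative usage on synthetic data (assumed, not from the assignment)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.5, -2.0, 0.5])
y = (rng.uniform(size=200) < sigmoid(X @ w_true)).astype(float)
w_hat = batch_gradient_descent(X, y, eta=0.01)
```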
1.2 Stochastic Gradient Descent
a) Every patient admission is likewise associated with a binary label y \in \{+1, -1\}. Every day of an admission in which the patient eventually tests positive is labelled +1, and -1 otherwise. In this manner every patient admission p^{(i)} consists of m_i (feature vector, label) pairs:

p^{(i)} = \{(x_t^{(i)}, y_t^{(i)})\}_{t=1}^{m_i}
For a single day the model assigns the likelihood

p(y_t \mid x_t, w) = \sigma(y_t\, w^T x_t) = \frac{1}{1 + \exp(-y_t\, w^T x_t)}

and the maximum-likelihood weights are

w_{ML} = \arg\max_w \{\, p(y \mid X, w) \,\}
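Assuming the per-day likelihood takes the logistic form shown above, with y_t \in \{+1, -1\}, one stochastic gradient step looks as follows; the step size eta and the function name sgd_step are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, x_t, y_t, eta=0.01):
    """One stochastic gradient step on -log p(y_t | x_t, w), with y_t in {+1, -1}.

    -log p(y_t | x_t, w) = log(1 + exp(-y_t * w^T x_t)), whose gradient with
    respect to w is -y_t * sigmoid(-y_t * w^T x_t) * x_t.
    """
    grad = -y_t * sigmoid(-y_t * (w @ x_t)) * x_t
    return w - eta * grad
```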
b)
This results in multiple predictions for every patient admission from the coefficient vector w_t, one corresponding to each day of the admission.
We consider a model evaluation scheme that takes these predictions into account, yet still yields a single measure of performance, which can be regarded as the physician feedback label Y_t.
One could envision a validation scheme in which the performance of the classifier is evaluated for every day independently.
While complete, this evaluation still lacks meaning from a clinical point of view, since it is not clear how to interpret the utility of a classifier that correctly classifies a patient on m days out of a total of n days of the admission period; this quantity can be computed in terms of w_{t+1}.
Derivation
\|x_t - w_{t+1}\|^2 < \|x_t - w_t\|^2

\Leftrightarrow \|x_t\|^2 - 2\,w_{t+1}^T x_t + \|w_{t+1}\|^2 < \|x_t\|^2 - 2\,w_t^T x_t + \|w_t\|^2

\Leftrightarrow w_{t+1}^T x_t - \tfrac{1}{2}\|w_{t+1}\|^2 > w_t^T x_t - \tfrac{1}{2}\|w_t\|^2

\Leftrightarrow (w_{t+1} - w_t)^T x_t + \tfrac{1}{2}\big(\|w_t\|^2 - \|w_{t+1}\|^2\big) > 0
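The reconstructed chain of inequalities above can be checked numerically; the sketch below (with assumed random vectors) verifies that the first and last conditions are equivalent.

```python
import numpy as np

# Check: ||x_t - w_new||^2 < ||x_t - w_old||^2
#   <=>  w_new^T x_t - ||w_new||^2 / 2  >  w_old^T x_t - ||w_old||^2 / 2
rng = np.random.default_rng(1)
for _ in range(1000):
    x_t, w_old, w_new = rng.normal(size=(3, 5))
    closer = np.sum((x_t - w_new) ** 2) < np.sum((x_t - w_old) ** 2)
    larger = (w_new @ x_t - w_new @ w_new / 2) > (w_old @ x_t - w_old @ w_old / 2)
    assert closer == larger
```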
c) The time complexity of the update rule from (b) if x_t is very sparse.
The quantity that must be computed for the update is

\nabla_w L(w_t) = \sum_{i=1}^{N} x_i \big(\sigma(w_t^T x_i) - y_i\big) + \lambda\, w_t

and for a single stochastic update only the term x_t\big(\sigma(w_t^T x_t) - y_t\big) is needed. Since each column of X has only a few non-zero entries, x_t is very sparse, and the inner product w_t^T x_t together with the update of w along the support of x_t costs time proportional to the number of non-zero entries of x_t rather than to the full dimension.
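A minimal sketch of how the update can exploit the sparsity of x_t, assuming x_t is a SciPy CSR row and y_t \in \{0, 1\}; the eager treatment of the \lambda w term shown here still touches every coordinate, so in practice it is often applied lazily (that choice, like the function name, is an assumption for illustration, not part of the original).

```python
import numpy as np
from scipy import sparse

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_sgd_step(w, x_t, y_t, eta=0.01, lam=0.0):
    """One update using a sparse x_t (1 x d CSR row), with y_t in {0, 1}.

    The data term only touches the non-zero coordinates of x_t, so its cost is
    proportional to nnz(x_t) rather than to the full dimension d.
    """
    idx = x_t.indices                 # columns of the non-zero entries
    vals = x_t.data                   # their values
    margin = w[idx] @ vals            # w^T x_t restricted to the support of x_t
    residual = sigmoid(margin) - y_t  # sigma(w^T x_t) - y_t
    # Eager L2 shrinkage costs O(d); a lazy scheme avoids this full pass.
    w_new = w * (1.0 - eta * lam) if lam > 0.0 else w.copy()
    w_new[idx] -= eta * residual * vals
    return w_new

# Illustrative usage (assumed data)
d = 10_000
x_t = sparse.random(1, d, density=0.001, format="csr", random_state=0)
w = np.zeros(d)
w = sparse_sgd_step(w, x_t, y_t=1.0, eta=0.1)
```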
d) A large η leads to oscillations and instability; a small η leads to slow convergence.
Bounds on the learning rate η: 0 < η < 2/λ_max.
There are clear trade-offs between using a large versus a small η. A large η accelerates learning but can be unstable, while a small η is stable but results in slower learning. It is therefore desirable to begin with a large η and decrease it over time. Averaging over past inputs leads to stable weight dynamics, which requires a small η, whereas fast adaptation requires a large η. The learning rule can also be obtained from an error function E; the corresponding gradient and step-size bound are

\nabla_w L(w_t) = \sum_{i=1}^{N} x_i \big(\sigma(w_t^T x_i) - y_i\big) + \lambda\, w_t, \qquad 0 < \eta < \frac{2}{\lambda_{\max}}
E denotes the squared error over all patterns.
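To make the trade-off concrete, here is a small sketch of a decaying step-size schedule and of the stability bound for the squared error E; the schedule eta_t = eta0 / (1 + t / tau) and its constants are illustrative assumptions.

```python
import numpy as np

def decayed_learning_rate(t, eta0=0.5, tau=100.0):
    # Start with a relatively large eta (fast adaptation) and shrink it over
    # time (stable convergence), as argued above.
    return eta0 / (1.0 + t / tau)

def stability_bound(X):
    # For the squared error E = 0.5 * ||Xw - y||^2 the Hessian is X^T X, and
    # gradient descent is stable for 0 < eta < 2 / lambda_max(X^T X).
    lam_max = np.linalg.eigvalsh(X.T @ X).max()
    return 2.0 / lam_max

# Illustrative usage (assumed data)
X = np.random.default_rng(2).normal(size=(50, 4))
print(stability_bound(X), decayed_learning_rate(t=10))
```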
e) A regression model that uses the L1 regularization technique is called Lasso regression, and a model that uses L2 is called Ridge regression. The key difference between the two is the penalty term: Ridge regression adds the "squared magnitude" of the coefficients as a penalty term to the loss function, giving

L(w) + \frac{\mu}{2}\|w\|_2^2, \quad \text{where } \mu \text{ is a constant.}
Under L1 regularization, many entries of w_t are driven exactly to zero, and the quadratic term \frac{\mu}{2}\|w\|_2^2 is replaced by the L1 penalty; the derivation of the L1-regularized w_t is

\frac{\mu}{2}\|w\|_2^2 \;\longrightarrow\; \mu\|w\|_1 = \mu \sum_j |w_j|

\hat{w}_t = \arg\min_w \Big[ L(w) + \mu\|w\|_1 \Big]

and the cost of carrying out this update of w_t determines its time complexity.
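To contrast the two penalty terms in code, the sketch below shows the Ridge gradient, the Lasso subgradient, and the standard soft-thresholding (proximal) step often used for the L1 term; the proximal step and the symbol names are assumptions for illustration, not something stated in the original.

```python
import numpy as np

def ridge_penalty_grad(w, mu):
    # L2 (Ridge): penalty (mu / 2) * ||w||_2^2, gradient mu * w (smooth)
    return mu * w

def lasso_penalty_subgrad(w, mu):
    # L1 (Lasso): penalty mu * ||w||_1, subgradient mu * sign(w) (non-smooth at 0)
    return mu * np.sign(w)

def soft_threshold(w, eta, mu):
    # Proximal step for the L1 penalty: shrinks every coordinate by eta * mu and
    # clips it at zero, which is what makes Lasso solutions sparse.
    return np.sign(w) * np.maximum(np.abs(w) - eta * mu, 0.0)

# Both penalty updates touch all d coordinates, so each costs O(d) per step.
w = np.array([0.8, -0.05, 0.0, 1.3])
print(ridge_penalty_grad(w, mu=0.1))
print(lasso_penalty_subgrad(w, mu=0.1))
print(soft_threshold(w, eta=0.5, mu=0.2))
```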