
Foundations of Machine Learning

   

Added on  2022-08-25

Question 1: Foundations of Machine Learning
a) Occam’s razor is a problem-solving principle which serves as a valuable mental model: it
cuts away improbable explanations.
Occam’s razor holds that the simplest adequate explanation is preferable to one that is
more complex, because simpler explanations are easier to verify and to apply. Thus,
Occam’s razor tells us to desist from looking for needlessly complex solutions to a problem
and to aim for the simplest one that works given the conditions. In machine learning, for
instance, among hypotheses that fit the training data equally well, we should prefer the
simpler one.
b) Consider the set of training data illustrated below. Is it possible to get a 0% training error
using a linear classifier?
Yes
c) Using the same data for question 1 (b), assume you’ve been asked to train a Linear
Discriminant Analysis classifier (basic linear classifier). Give the formula for this
hypothesis, making clear what the intrinsic variables are to train. Given a set of labels,
how would you determine the values of the intrinsic variables?
Here, the goal is to find a projection direction ω along which the means of the two groups
differ as much as possible relative to how the projected data are spread. Let μ_j denote the
mean of X for Y = j, j = 0, 1, and let Σ denote the covariance matrix of X. Writing
U = ω^T X, it follows that, for j = 0, 1, E(U | Y = j) = E(ω^T X | Y = j) = ω^T μ_j and
Var(U) = ω^T Σ ω. The separation is given by

J(ω) = (ω^T μ_0 − ω^T μ_1)² / (ω^T Σ ω)

The intrinsic variables are ω together with the estimates μ_0, μ_1 and Σ. Given a set of
labels, they are determined by computing the class means and the pooled covariance from the
labelled data and choosing ω to maximize J(ω); the maximizer is ω ∝ Σ⁻¹(μ_0 − μ_1).
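This training step can be sketched in NumPy. The version below is a minimal illustration, assuming binary labels in {0, 1} and a pooled within-class covariance estimate; the function name is made up for this sketch:

```python
import numpy as np

def fisher_lda_direction(X, y):
    """Fisher LDA direction for binary labels y in {0, 1}.

    Maximizing J(w) = (w^T mu0 - w^T mu1)^2 / (w^T Sigma w) gives
    w proportional to Sigma^{-1} (mu0 - mu1), where Sigma is the
    pooled within-class covariance of X.
    """
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class covariance: sum the class scatter matrices,
    # then normalize by the pooled degrees of freedom.
    Sigma = (np.cov(X0, rowvar=False) * (len(X0) - 1)
             + np.cov(X1, rowvar=False) * (len(X1) - 1)) / (len(X) - 2)
    # Solve Sigma w = mu0 - mu1 rather than inverting Sigma explicitly.
    w = np.linalg.solve(Sigma, mu0 - mu1)
    return w / np.linalg.norm(w)
```

The returned unit vector is the projection direction; a classification threshold on ω^T x (e.g. the midpoint of the projected class means) completes the classifier.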
d) For a linear regressor working on 2-dimensional feature data, give pseudo code for
gradient descent optimization of the intrinsic parameters. Include the correct partial
derivative of the cost function for each of the intrinsic variables to optimize.
We determine the gradient of the cost function with respect to each model parameter θ_j.
The partial derivative of the MSE cost with respect to θ_j is

∂/∂θ_j MSE(θ) = (2/n) Σ_{i=1}^{n} (θ^T x^(i) − y^(i)) x_j^(i)

Gradient descent then initializes θ and repeats, until convergence, the update
θ_j ← θ_j − η · ∂/∂θ_j MSE(θ) for every j simultaneously, where η is the learning rate.
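The requested pseudocode can be sketched concretely in NumPy. This is a minimal batch-gradient-descent version, assuming a fixed learning rate and a bias term appended to the two features; the function name and default values are illustrative:

```python
import numpy as np

def gradient_descent(X, y, lr=0.1, n_iters=1000):
    """Batch gradient descent for linear regression on 2-D features.

    X: (n, 2) feature matrix. A bias column is prepended internally, so the
    intrinsic parameters are theta = (bias, theta_1, theta_2).
    Uses the gradient d/d(theta_j) MSE = (2/n) sum_i (theta^T x_i - y_i) x_ij.
    """
    n = len(X)
    Xb = np.hstack([np.ones((n, 1)), X])  # prepend the bias feature
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iters):
        # All partial derivatives at once: (2/n) * Xb^T (Xb theta - y)
        grad = (2.0 / n) * Xb.T @ (Xb @ theta - y)
        theta -= lr * grad                # simultaneous update of every theta_j
    return theta
```

For example, on noiseless data generated as y = 1 + 2x₁ − 3x₂, the loop recovers θ ≈ (1, 2, −3).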
e) Is this data separable, yes or no? If it is, is it linearly or non-linearly separable?
Yes, it is separable, but only non-linearly.
f) What do you estimate is the Bayes Error for separating the stars and dots with a linear
classifier?
For a linear classifier, the best achievable error here is approximately 0.5: since the
classes are only non-linearly separable, no linear decision boundary does much better than
chance.
g) Suppose you were to train a standard decision tree, with branching factor two and
monothetic evaluation functions at each node. Is it possible to train the hypothesis until
all leaf nodes are pure? How many levels would you expect the trained tree to have?
Yes, it is possible to train the hypothesis until all leaf nodes are pure, since repeated
binary axis-aligned splits can eventually isolate every training point. The trained tree
would be expected to have five levels.
h) How many leaf nodes would you expect to have if you trained a standard decision tree on
this data? Draw the partitioning of the input space that is the result of this tree onto the
answer sheet
12 leaf nodes are expected.
Question 2
a) Give the formal definition of overfitting and two ways to avoid overfitting.
Overfitting is a modelling error that arises when a function is fit too closely to a limited
set of data points, so that it captures noise in the training data and generalizes poorly to
new data. Two ways to avoid overfitting are: (1) regularization, which penalizes overly
complex models, and (2) holding out validation data (e.g. cross-validation) to select the
model and stop training before it fits the noise.
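A standard way to avoid overfitting is regularization: penalizing large weights so the model cannot fit noise too closely. A minimal ridge-regression sketch in NumPy; the penalty strength alpha is illustrative and would normally be tuned, e.g. by cross-validation:

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Linear least squares with an L2 penalty (ridge regression).

    The penalty alpha * ||w||^2 shrinks the weights toward zero, which
    reduces overfitting at the cost of some bias. alpha = 0 recovers
    ordinary least squares.
    """
    n_features = X.shape[1]
    # Closed-form solution: w = (X^T X + alpha I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)
```

Increasing alpha yields smaller weights and a smoother, less overfit model.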
