Foundations of Machine Learning
Question 1: Foundations of Machine Learning
a) Occam’s razor is a problem-solving principle which serves as a valuable mental model: it shaves away improbable explanations. Occam’s razor holds that the simplest explanation is preferable to one that is more complex, because simpler explanations are easier to verify and to apply. Thus, Occam’s razor suggests that we should desist from looking for complex solutions to a problem and aim at the simplest one that works given the conditions. In machine learning, for instance, if a linear model and a much more complex model explain the training data equally well, Occam’s razor favours the linear model.
b) Consider the set of training data illustrated below. Is it possible to get a 0% training error
using a linear classifier?
Yes
c) Using the same data for question 1 (b), assume you’ve been asked to train a Linear
Discriminant Analysis classifier (basic linear classifier). Give the formula for this
hypothesis, making clear what the intrinsic variables are to train. Given a set of labels,
how would you determine the values of the intrinsic variables?
Here, the goal is to find a projection U = ω^T X under which the means of the two groups are separated as widely as possible relative to how they are spread. Let μ_j denote the mean of X for Y = j, j = 0, 1, and Σ the covariance matrix of X. It follows that, for j = 0, 1, E(U | Y = j) = E(ω^T X | Y = j) = ω^T μ_j and Var(U) = ω^T Σ ω. The separation is given by

J(ω) = (ω^T μ_0 − ω^T μ_1)^2 / (ω^T Σ ω)

The intrinsic variables to train are the projection weights ω. Given a set of labels, we estimate μ_0, μ_1 and Σ from the labelled training data and choose ω to maximize J(ω), which yields ω ∝ Σ^(-1)(μ_0 − μ_1).
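As an illustration, here is a minimal Python sketch of this estimation, assuming the training data arrives as a NumPy array X (one row per sample) with binary labels y; all variable names are illustrative.

import numpy as np

def fit_lda_direction(X, y):
    """Estimate the LDA direction w proportional to inv(Sigma) (mu0 - mu1)."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled within-class estimate of the covariance Sigma.
    scatter = (np.cov(X0, rowvar=False) * (len(X0) - 1)
               + np.cov(X1, rowvar=False) * (len(X1) - 1))
    sigma = scatter / (len(X) - 2)
    return np.linalg.solve(sigma, mu0 - mu1)   # direction maximizing J(w)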
d) For a linear regressor working on 2-dimensional feature data, give pseudo code for
gradient descent optimization of the intrinsic parameters. Include the correct partial
derivative of the cost function for each of the intrinsic variables to optimize.
We determine the gradient of the cost function with respect to each model parameter θ_j. The partial derivative of the MSE cost with respect to θ_j is

∂MSE(θ)/∂θ_j = (2/n) Σ_{i=1}^{n} (θ^T x^(i) − y^(i)) x_j^(i)

and gradient descent repeatedly moves every θ_j a small step against this gradient, as in the sketch below.
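A minimal sketch of the requested pseudo code in Python, assuming a bias term θ_0 handled via a column of ones, a fixed learning rate, and a fixed iteration count (all illustrative choices rather than part of the original question):

import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Batch gradient descent for linear regression on 2-D feature data."""
    n = len(X)
    Xb = np.hstack([np.ones((n, 1)), X])       # prepend a bias column of ones
    theta = np.zeros(Xb.shape[1])              # intrinsic parameters [theta_0, theta_1, theta_2]
    for _ in range(n_iters):
        residuals = Xb @ theta - y             # theta^T x^(i) - y^(i) for every i
        grad = (2.0 / n) * (Xb.T @ residuals)  # the partial derivatives given above
        theta -= lr * grad                     # simultaneous update of every theta_j
    return theta

# Illustrative check on synthetic data: y = 1 + 2*x1 - 3*x2 plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 1 + 2 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.1, size=200)
print(gradient_descent(X, y))                  # approaches [1, 2, -3]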
e) Is this data separable, yes or no? If it is, is it linearly or non-linearly separable?
Yes. Non-linearly separable.
f) What do you estimate is the Bayes Error for separating the stars and dots with a linear
classifier?
The Bayes error for a linear classifier is approximately 0.5: since the data is only non-linearly separable, a linear boundary performs close to chance on it.
g) Suppose you were to train a standard decision tree, with branching factor two and
monothetic evaluation functions at each node. Is it possible to train the hypothesis until
all leaf nodes are pure? How many levels would you expect the trained tree to have?
It is possible to train the hypothesis until all leaf nodes are pure, since binary monothetic splits can keep partitioning the input space until each region contains points of a single class (provided no two differently labelled points coincide). The trained tree would be expected to have five levels.
h) How many leaf nodes would you expect to have if you trained a standard decision tree on
this data? Draw the partitioning of the input space that is the result of this tree onto the
this data? Draw the partitioning of the input space that is the result of this tree onto the answer sheet.
12 leaf nodes are expected.
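To illustrate parts (g) and (h), the sketch below grows a scikit-learn decision tree to purity and reads off its depth and leaf count; the arrays X and y are made-up placeholders standing in for the plotted stars and dots, so the printed numbers will not match the figure.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Placeholder 2-D data standing in for the figure's dots (0) and stars (1).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 2))
y = (np.sin(X[:, 0]) + X[:, 1] > 5).astype(int)

# max_depth=None lets the tree split until every leaf node is pure.
clf = DecisionTreeClassifier(criterion="gini", max_depth=None).fit(X, y)
print("levels:", clf.get_depth())         # depth of the fully grown tree
print("leaf nodes:", clf.get_n_leaves())  # number of pure leaves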
Question 2
a) Give the formal definition of overfitting and two ways to avoid overfitting.
Overfitting is a modelling error that arises when a function fits a limited set of training data points too closely, capturing their noise as well as the underlying pattern. Formally, a hypothesis h overfits the training data if there exists an alternative hypothesis h′ that has a larger error than h on the training examples but a smaller error than h over the whole distribution of instances.
In order to avoid overfitting, the following ways can be adopted:
i) Simplifying the model
Here, the complexity of the model is lowered, for example by removing layers or reducing the number of units, making the network smaller.
ii) Adding dropout layers
Dropout layers randomly deactivate a fraction of the units (and hence the connections) between layers during training, which prevents units from co-adapting and in turn reduces overfitting; a sketch of the mechanism follows this list.
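A minimal NumPy sketch of (inverted) dropout applied to one layer's activations, assuming a keep probability of 0.8; this illustrates the mechanism only and is not any particular library's API.

import numpy as np

def dropout(activations, keep_prob=0.8, training=True):
    """Inverted dropout: zero out units at random during training."""
    if not training:
        return activations                          # no dropout at test time
    mask = np.random.rand(*activations.shape) < keep_prob
    # Scale by 1/keep_prob so expected activations match test time.
    return activations * mask / keep_prob

# Example: drop roughly 20% of a layer's five activations.
h = np.array([0.5, 1.2, -0.3, 0.8, 2.0])
print(dropout(h, keep_prob=0.8))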
b) Compare and contrast the benefits and disadvantages between using cross-validation and
a fixed train/validation/test partition data split to obtain the generalization error.
Cross-validation is used to compare the performance of candidate prediction algorithms built on the training set, and the algorithm with the best average validation performance is selected. Its benefit is that every observation is used for both training and validation, so the estimate of the generalization error has lower variance and does not depend on a single partition; its disadvantage is the cost of training each model several times. A fixed train/validation/test split is cheaper, since each model is trained once, but the error estimate depends on the particular partition and some data is withheld from training. In both schemes, the test set is reserved for the finally chosen model, which has not yet been evaluated in terms of its performance on unseen real-world data.
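A minimal sketch of 5-fold cross-validation with scikit-learn, assuming placeholder data X, y and linear regression as the candidate algorithm:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Placeholder data: 100 samples, 2 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Each of the 5 folds serves once as validation and four times as training.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("per-fold R^2:", scores, "mean:", scores.mean())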
c) Ruben is designing a new binary classifier. So far, he’s got as classifier the function h(x) = w^T x. He tries this on a datapoint x = [1, 2]^T with weights w = [4.5, 0.25]^T. What is the prediction made by this classifier?

h(x) = w^T x = [4.5 0.25] [1, 2]^T = (4.5 × 1) + (0.25 × 2) = 5
d) Finish Ruben’s equation to return a true classifier. Given the same datapoint and weights, what is now the prediction?

Adding a bias term b = 4.5 gives

h(x) = w^T x + b = [4.5 0.25] [1, 2]^T + 4.5 = (4.5 × 1) + (0.25 × 2) + 4.5 = 9.5

Since 9.5 > 0, applying a sign threshold to this score predicts the positive class.
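A small Python check of both computations, treating the added 4.5 as a bias term and using a sign threshold as one illustrative way to turn the score into a class label:

import numpy as np

x = np.array([1.0, 2.0])
w = np.array([4.5, 0.25])
b = 4.5                                   # bias from the working above

score = w @ x                             # part (c): w^T x = 5.0
score_with_bias = w @ x + b               # part (d): w^T x + b = 9.5
label = 1 if score_with_bias > 0 else 0   # sign threshold -> class label
print(score, score_with_bias, label)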
e)