Question 1: Foundations of Machine Learning

a) Occam's razor is a problem-solving principle that serves as a valuable mental model: it cuts away improbable explanations. Occam's razor holds that the simplest explanation is preferable to one that is more complex, because simpler explanations are easier to verify and to act on. Thus, Occam's razor tells us to desist from seeking complex solutions to a problem and to aim instead at what works given the conditions. For instance, when two accounts of the earth's origin explain the observations equally well, Occam's razor directs us to accept the simpler one.

b) Consider the set of training data illustrated below. Is it possible to get a 0% training error using a linear classifier?
Yes.

c) Using the same data as in question 1(b), assume you have been asked to train a Linear Discriminant Analysis classifier (a basic linear classifier). Give the formula for this hypothesis, making clear what the intrinsic variables to train are. Given a set of labels, how would you determine the values of the intrinsic variables?

Here, the goal is to obtain two groups whose means differ as much as possible relative to how they are spread. The hypothesis projects each input onto a direction $\omega$, $U = \omega^T X$, so the intrinsic variable to train is the weight vector $\omega$. Taking $\mu_j$ to represent the mean of $X$ for $Y = j$, $j = 0, 1$, and $\Sigma$ the covariance matrix of $X$, it follows that, for $j = 0, 1$,

$$E(U \mid Y = j) = E(\omega^T X \mid Y = j) = \omega^T \mu_j, \qquad \mathrm{Var}(U) = \omega^T \Sigma \omega.$$

The separation is given by

$$J(\omega) = \frac{(\omega^T \mu_0 - \omega^T \mu_1)^2}{\omega^T \Sigma \omega}.$$

Given labelled data, we estimate $\mu_0$, $\mu_1$ and $\Sigma$ by the class-conditional sample means and the pooled sample covariance, then choose the $\omega$ that maximizes $J(\omega)$; the maximizer is $\omega \propto \Sigma^{-1}(\mu_0 - \mu_1)$. (A code sketch of this procedure follows part (d).)

d) For a linear regressor working on 2-dimensional feature data, give pseudo code for gradient descent optimization of the intrinsic parameters. Include the correct partial derivative of the cost function for each of the intrinsic variables to optimize.

We compute the gradient of the cost function with respect to each model parameter $\theta_j$. For the MSE cost, the partial derivatives are

$$\frac{\partial}{\partial \theta_j} \mathrm{MSE}(\theta) = \frac{2}{n} \sum_{i=1}^{n} \left( \theta^T x^{(i)} - y^{(i)} \right) x_j^{(i)},$$

and each gradient-descent step updates every parameter against its partial derivative, $\theta_j \leftarrow \theta_j - \eta \, \partial \mathrm{MSE}(\theta) / \partial \theta_j$, where $\eta$ is the learning rate. (The requested pseudo code is sketched below, after the part (c) example.)
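For part (c), here is a minimal runnable sketch of the fitting procedure, assuming NumPy; the dataset, and the names X, y and w, are made up for illustration, since the assignment's actual figure is not reproduced in this document.

    import numpy as np

    # Hypothetical stand-in for the plotted training data: rows of X are
    # 2-D feature vectors, y holds the class labels 0/1.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 2.5],
                  [6.0, 7.0], [7.0, 6.0], [8.0, 7.5]])
    y = np.array([0, 0, 0, 1, 1, 1])

    # Class-conditional sample means mu_0 and mu_1.
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)

    # Pooled within-class covariance estimate of Sigma.
    X0c = X[y == 0] - mu0
    X1c = X[y == 1] - mu1
    Sigma = (X0c.T @ X0c + X1c.T @ X1c) / (len(X) - 2)

    # The omega maximizing J(omega) is proportional to Sigma^{-1}(mu0 - mu1).
    w = np.linalg.solve(Sigma, mu0 - mu1)

    # Classify by projecting onto w and thresholding at the midpoint of
    # the projected class means (w @ mu0 > w @ mu1 by construction).
    threshold = w @ (mu0 + mu1) / 2.0
    predictions = (X @ w < threshold).astype(int)
    print(w, predictions)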
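And for part (d), a minimal sketch of batch gradient descent on 2-dimensional feature data, written as runnable Python rather than prose pseudo code; the learning rate, iteration count and synthetic data are illustrative assumptions.

    import numpy as np

    def gradient_descent(X, y, eta=0.1, n_iters=1000):
        """Fit y ~ theta_0 + theta_1*x_1 + theta_2*x_2 by batch gradient
        descent on the MSE cost."""
        n = len(X)
        Xb = np.c_[np.ones(n), X]       # prepend a 1s column for the bias theta_0
        theta = np.zeros(Xb.shape[1])   # intrinsic parameters [theta_0, theta_1, theta_2]
        for _ in range(n_iters):
            residuals = Xb @ theta - y  # theta^T x^(i) - y^(i) for every i
            # Partial derivatives of the cost, one per theta_j:
            #   d MSE / d theta_j = (2/n) * sum_i (theta^T x^(i) - y^(i)) * x_j^(i)
            grad = (2.0 / n) * (Xb.T @ residuals)
            theta -= eta * grad         # step against the gradient
        return theta

    # Hypothetical data drawn from y = 1 + 2*x_1 + 3*x_2, so the fitted
    # parameters should come out close to [1, 2, 3].
    rng = np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=(100, 2))
    y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]
    print(gradient_descent(X, y))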
e) Is this data separable, yes or no? If it is, is it linearly or non-linearly separable?
Yes, it is separable, but non-linearly rather than linearly.

f) What do you estimate is the Bayes Error for separating the stars and dots with a linear classifier?
The Bayes Error is approximately 0.5, since the data is only non-linearly separable.

g) Suppose you were to train a standard decision tree, with branching factor two and monothetic evaluation functions at each node. Is it possible to train the hypothesis until all leaf nodes are pure? How many levels would you expect the trained tree to have?
It is possible to train the hypothesis until all leaf nodes are pure. The trained tree would be expected to have five levels.

h) How many leaf nodes would you expect to have if you trained a standard decision tree on this data? Draw the partitioning of the input space that is the result of this tree onto the answer sheet.
12 leaf nodes are expected. (A code sketch of the procedure in (g) and (h) is given below.)
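As a hedged illustration of parts (g) and (h), the sketch below grows a binary, monothetic decision tree until every leaf is pure, using scikit-learn; the data is a made-up stand-in, since the assignment's figure is not reproduced here, so the printed depth and leaf count will not match the answers above.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical stand-in for the plotted stars and dots: an XOR-like,
    # non-linearly separable labelling of random 2-D points.
    rng = np.random.default_rng(1)
    X = rng.uniform(0.0, 10.0, size=(60, 2))
    y = ((X[:, 0] > 5.0) ^ (X[:, 1] > 5.0)).astype(int)

    # CART trees use monothetic (single-feature threshold) tests with
    # branching factor two; max_depth=None keeps splitting until every
    # leaf node is pure.
    tree = DecisionTreeClassifier(criterion="entropy", max_depth=None).fit(X, y)
    print("levels:", tree.get_depth())
    print("leaf nodes:", tree.get_n_leaves())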
Question 2

a) Give the formal definition of overfitting and two ways to avoid overfitting.

Overfitting is a modelling error that arises when a function is fit too closely to a limited set of data points, so that it captures the noise in those points and generalizes poorly to unseen data. To avoid overfitting, the following approaches can be adopted:
i) Simplifying the model. Here, the complexity of the model is lowered by removing elements or reducing their number, making the network smaller.
ii) Adding dropout layers. A dropout layer randomly deactivates a fraction of the units between layers during training, which in turn prevents overfitting. (A dropout sketch appears at the end of this question.)

b) Compare and contrast the benefits and disadvantages of using cross-validation versus a fixed train/validation/test partition of the data to obtain the generalization error.

Cross-validation is adopted to compare the performance of the prediction algorithms built on the training set, and we pick the algorithm that produces the best output; since every point serves for both training and validation across the folds, the estimate is less sensitive to a single lucky or unlucky split, at the cost of fitting the model several times. With a fixed partition, the test set evaluates the chosen best prediction algorithm, which has not yet been assessed on unseen real-world data; this is cheaper, but the resulting error estimate depends on the particular split drawn. (A sketch contrasting the two schemes follows part (d).)

c) Ruben is designing a new binary classifier. So far, his classifier is the function $h(x) = w^T x$. He tries this on a datapoint $x = [1, 2]^T$ with weights $w = [4.5, 0.25]^T$. What is the prediction made by this classifier?

$$h(x) = w^T x = \begin{bmatrix} 4.5 & 0.25 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} = (4.5 \times 1) + (0.25 \times 2) = 5$$

d) Finish Ruben's equation to return a true classifier. Given the same datapoint and weights, what is now the prediction?

Adding a bias term $b = 4.5$:

$$h(x) = w^T x + b = \begin{bmatrix} 4.5 & 0.25 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \end{bmatrix} + 4.5 = (4.5 \times 1) + (0.25 \times 2) + 4.5 = 9.5$$
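A minimal NumPy check of the arithmetic in parts (c) and (d); the bias value 4.5 is taken from the answer above, and thresholding at zero to get a class label is an added assumption.

    import numpy as np

    w = np.array([4.5, 0.25])   # weights from part (c)
    x = np.array([1.0, 2.0])    # datapoint from part (c)
    b = 4.5                     # bias term added in part (d)

    print(w @ x)                # part (c): w^T x = 5.0
    print(w @ x + b)            # part (d): w^T x + b = 9.5
    print(int(w @ x + b > 0))   # thresholding at zero yields the class label, here 1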
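For part (b), a sketch contrasting the two evaluation schemes, assuming scikit-learn and using its bundled iris data purely for illustration:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)

    # Fixed partition: a single train/test split, cheap to compute, but
    # the score depends on which points happen to land in the test set.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    print("fixed split:", model.fit(X_tr, y_tr).score(X_te, y_te))

    # Cross-validation: five fits, every point validated exactly once,
    # so the averaged score is less sensitive to one particular split.
    print("5-fold CV:", cross_val_score(model, X, y, cv=5).mean())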
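And for part (a)(ii), a minimal PyTorch sketch of a dropout layer; the layer sizes and the dropout probability of 0.5 are illustrative assumptions.

    import torch
    import torch.nn as nn

    # A small network with a dropout layer between the hidden and output
    # layers: during training, each hidden activation is zeroed with
    # probability 0.5, which discourages co-adaptation and overfitting.
    model = nn.Sequential(
        nn.Linear(2, 16),
        nn.ReLU(),
        nn.Dropout(p=0.5),
        nn.Linear(16, 1),
    )

    model.train()           # dropout is active during training
    x = torch.randn(4, 2)   # a hypothetical batch of 2-D inputs
    print(model(x))

    model.eval()            # dropout is disabled at evaluation time
    print(model(x))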