Introduction to Statistical Learning: Problem Set 1 Solutions
STATS216v Introduction to Statistical Learning
Stanford University, Summer 2018
Problem Set 1

Question 1:
1 (a): This is a supervised learning problem and an example of regression. We are interested in prediction: we want to predict the most promising spot to dig. The number of observations is n = 80 and the number of predictors is p = 24.
1 (b): This is a supervised learning problem and an example of classification. We are interested in prediction: we want to predict whether to display advertisement A or advertisement B to each customer. The number of observations is n = 300 and the number of predictors is p = 3 (age, zip code, and gender).
1 (c): This is a supervised learning problem and an example of regression. We are interested in inference: we want to discover which factors are associated with the unemployment rate across different U.S. cities. The number of observations is n = 400 and the number of predictors is p = 6 (population, state, average income, crime rate, percentage of students who graduate high school, and unemployment level).
1 (d): This is an unsupervised learning problem. For the students in the application pool we do not observe a response (the subtypes are not known in advance), so the task is to discover groups of similar applicants.
1 (e): This is a supervised learning problem and an example of classification. We are interested in prediction: we want to predict the type of each cell based on a few measurements. The number of observations is n = 68 and the number of predictors is p = 3 (the number of branch points, the number of active processes, and the average process length).

Question 2:
2 (a): We prefer an inflexible regression model, since the number of predictors (genes) p is extremely large while the number of observations (patients) n is small; a flexible model would overfit.
2 (b): We prefer a flexible regression model, since the number of predictors (math, science, and history grades in the 7th grade) p is small while the number of observations (students) n is extremely large.
2 (c): We prefer an inflexible regression model, since the variation in the data is high; a flexible model would fit the noise.
2 (d): We prefer a flexible regression model, since the variation in the data is low; a flexible model will also do better at capturing any non-linear effects.
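The reasoning in Question 2 (and in Question 3 below) follows from the standard bias-variance decomposition of the expected test error (a textbook result; see, e.g., Chapter 2 of An Introduction to Statistical Learning). For a test point $x_0$ with response $y_0 = f(x_0) + \varepsilon$,

$E\big[(y_0 - \hat{f}(x_0))^2\big] = \mathrm{Var}\big(\hat{f}(x_0)\big) + \big[\mathrm{Bias}\big(\hat{f}(x_0)\big)\big]^2 + \mathrm{Var}(\varepsilon).$

A more flexible model has lower bias but higher variance. When n is small or the noise $\mathrm{Var}(\varepsilon)$ is large, the variance term dominates and an inflexible model generalizes better; a large n keeps the variance of a flexible fit under control.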
Question 3:
3 (a): The flexible model performs better. With a large sample size, a flexible model fits the data well and outperforms an inflexible one.
3 (b): The flexible model performs worse. A flexible model overfits when the number of observations is small.
3 (c): The flexible model performs worse. A flexible method would fit the noise in the error terms and increase the variance.
3 (d): The flexible model performs better. The additional degrees of freedom allow it to fit the data well.
3 (e): The flexible model performs worse, as it would fit the noise in the error terms and increase the variance.
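To make the flexibility trade-off concrete, here is a minimal simulation sketch (the data-generating process, the helper function sim.test.mse, and the chosen noise levels are all illustrative assumptions, not part of the assignment). It compares the test MSE of an inflexible linear fit against a flexible smoothing spline on a large, clean sample and on a small, noisy one:

R Code:
# Illustration only: flexible vs. inflexible fits on simulated data
set.seed(1)
sim.test.mse = function(n, sigma) {
  x = runif(n, 0, 10)
  y = sin(x) + rnorm(n, sd = sigma)        # non-linear truth plus noise
  x.test = runif(1000, 0, 10)
  y.test = sin(x.test) + rnorm(1000, sd = sigma)
  fit.lin = lm(y ~ x)                      # inflexible model
  fit.spl = smooth.spline(x, y, df = 20)   # flexible model
  c(linear = mean((y.test - predict(fit.lin, data.frame(x = x.test)))^2),
    spline = mean((y.test - predict(fit.spl, x.test)$y)^2))
}
sim.test.mse(n = 1000, sigma = 0.1)  # large n, low noise: the spline does better
sim.test.mse(n = 30, sigma = 2)      # small n, high noise: the spline tends to overfit

With the large, low-noise sample the spline's extra flexibility pays off; with the small, noisy sample it chases the noise, which is exactly the pattern described in (a)-(c) above.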
Question 4: The solutions were obtained in RStudio; the relevant output is shown below.

> summary(College)
 Private        Apps           Accept          Enroll       Top10perc
 No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00
 Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00
           Median : 1558   Median : 1110   Median : 434   Median :23.00
           Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56
           3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00
           Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00
   Top25perc      F.Undergrad     P.Undergrad       Outstate       Room.Board
 Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340   Min.   :1780
 1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320   1st Qu.:3597
 Median : 54.0   Median : 1707   Median :  353.0   Median : 9990   Median :4200
 Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441   Mean   :4358
 3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925   3rd Qu.:5050
 Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700   Max.   :8124
     Books          Personal         PhD            Terminal       S.F.Ratio
 Min.   :  96.0   Min.   : 250   Min.   :  8.00   Min.   : 24.0   Min.   : 2.50
 1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00   1st Qu.: 71.0   1st Qu.:11.50
 Median : 500.0   Median :1200   Median : 75.00   Median : 82.0   Median :13.60
 Mean   : 549.4   Mean   :1341   Mean   : 72.66   Mean   : 79.7   Mean   :14.09
 3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00   3rd Qu.: 92.0   3rd Qu.:16.50
 Max.   :2340.0   Max.   :6800   Max.   :103.00   Max.   :100.0   Max.   :39.80
  perc.alumni        Expend        Grad.Rate
 Min.   : 0.00   Min.   : 3186   Min.   : 10.00
 1st Qu.:13.00   1st Qu.: 6751   1st Qu.: 53.00
 Median :21.00   Median : 8377   Median : 65.00
 Mean   :22.74   Mean   : 9660   Mean   : 65.46
 3rd Qu.:31.00   3rd Qu.:10830   3rd Qu.: 78.00
 Max.   :64.00   Max.   :56233   Max.   :118.00
> length(which(College$Private=="Yes"))
[1] 565
Out of 777 colleges, 565 are private and 212 are not.

> length(which(College$Elite=="Yes"))
[1] 78
There are 78 elite universities.
R Code:
# Set the working directory, then read the data
College=read.csv('College.csv',header=TRUE)
# View the data
View(College)
# Use the first column as row names, then remove it
rownames(College)=College[,1]
College=College[,-1]
View(College)
# Summary of the data
summary(College)
# Scatterplot of the column PhD versus the column Grad.Rate
plot(College$Grad.Rate,College$PhD,xlab="Grad.Rate",ylab="PhD",main="Scatter Plot")
# Number of private colleges
length(which(College$Private=="Yes"))
# (g) Create the Elite variable
Elite=rep("No",nrow(College))
Elite[College$Top10perc>50]="Yes"
Elite=as.factor(Elite)
College=data.frame(College,Elite)
# How many elite universities are there?
length(which(College$Elite=="Yes"))
# Box plot of Outstate by Elite (Elite is a factor, so plot() draws boxplots)
plot(College$Elite,College$Outstate,xlab="Elite University",ylab="Out of State tuition in USD",main="Outstate Tuition Plot")
# (h) Histograms
par(mfrow=c(2,2))
hist(College$Top10perc,col=4,xlab="Top 10%",ylab="Count",main="")
hist(College$Top25perc,col=6,xlab="Top 25%",ylab="Count",main="")
hist(College$Books,col=2,xlab="Books",ylab="Count",main="")
hist(College$PhD,col=3,xlab="PhD",ylab="Count",main="")

Question 5: From the histogram of train.y in part (c), one can see that the variable is negatively skewed.
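As a quick numerical check of that skew, one can compute the sample skewness directly (a minimal sketch; it assumes als.rData has been loaded and train.y exists, as in the R code at the end of this question):

R Code:
# Illustration only: quantify the skew of train.y
summary(train.y)              # mean below the median suggests left skew
m=mean(train.y); s=sd(train.y)
mean(((train.y-m)/s)^3)       # sample skewness: a negative value indicates negative skew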
(d) One can study the relationships between the different predictors by examining the pairwise scatter plots produced by pairs(train.X[,21:25]) (code below).

(e)
> coef(summary(lm.model))[1:20,1:4]
                                 Estimate   Std. Error      t value     Pr(>|t|)
(Intercept)                  5.2523261623 1.005248e+01  0.522490777 6.014605e-01
Onset.Delta                 -0.0004375411 4.516149e-05 -9.688368476 3.700894e-21
Symptom.Speech              -0.0225458435 9.120369e-02 -0.247203188 8.048088e-01
Symptom.WEAKNESS            -0.1030724618 8.140383e-02 -1.266186858 2.057820e-01
Site.of.Onset.Onset..Bulbar -0.3672390249 2.492628e-01 -1.473300789 1.410284e-01
Site.of.Onset.Onset..Limb   -0.2722652619 2.516364e-01 -1.081979013 2.795589e-01
Race...Caucasian            -0.1472308983 9.490391e-02 -1.551368083 1.211739e-01
Age                         -0.0005163468 1.814937e-03 -0.284498398 7.760955e-01
Sex.Female                  -0.0605416162 9.573716e-02 -0.632373219 5.273077e-01
Sex.Male                     0.0195376288 8.868793e-02  0.220296371 8.256916e-01
Mother                      -0.0462694066 7.581925e-02 -0.610259353 5.418479e-01
Family                       0.0067961369 5.629768e-02  0.120717895 9.039421e-01
Study.Arm.PLACEBO           -3.1549354316 1.918113e+00 -1.644811989 1.003665e-01
Study.Arm.ACTIVE            -3.0519454033 1.917047e+00 -1.592003115 1.117439e-01
max.alsfrs.score             0.0622267218 7.830124e-02  0.794709319 4.269974e-01
min.alsfrs.score            -0.1666353650 8.465840e-02 -1.968326306 4.934477e-02
last.alsfrs.score            0.4090457993 2.074172e-01  1.972092254 4.891257e-02
mean.alsfrs.score           -0.1208903623 3.538618e-01 -0.341631529 7.327100e-01
num.alsfrs.score.visits     -0.0058420080 7.139877e-01 -0.008182225 9.934735e-01
sum.alsfrs.score            -0.0778763045 7.899171e-02 -0.985879479 3.244639e-01

We observe that R^2 is 0.46, which suggests the fit is not very good; the RMSE on the training data is 0.4138632. The error produced by a simple linear regression on these data is rather high, but the fit could still be useful for variable selection. The bias-variance tradeoff implies that a predictor with less bias comes at the cost of larger variance.

R Code:
# (a)
a=load('als.rData')
# (b)
length(train.X)
length(train.y)
length(test.X)
length(test.y)
# (c)
summary(train.y)        # summary of train.y
hist(train.y,breaks=40) # histogram
# (d)
colnames(train.X)[1:20]
pairs(train.X[,21:25])
# Fit the regression model
lm.model=lm(train.y~.,data=data.frame(train.y,train.X))
# First 20 coefficients
coef(summary(lm.model))[1:20,1:4]
# R squared
summary(lm.model)$r.squared
pred=predict(lm.model)
# Training RMSE
RMSE=sqrt(mean((pred-train.y)^2))
RMSE
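The RMSE above is computed on the training data. Two natural follow-ups, sketched below under the assumption that lm.model, test.X, and test.y are exactly as created in the code above, are to evaluate the model on the held-out test set and to list the predictors whose coefficients are significant at the 5% level as a crude first pass at variable selection:

R Code:
# Sketch (assumes lm.model, test.X and test.y from the code above)
# Held-out test RMSE
pred.test=predict(lm.model,newdata=data.frame(test.X))
RMSE.test=sqrt(mean((pred.test-test.y)^2))
RMSE.test
# Predictors significant at the 5% level
coefs=coef(summary(lm.model))
rownames(coefs)[coefs[,"Pr(>|t|)"]<0.05]

Among the 20 coefficients shown in the table above, only Onset.Delta is strongly significant, with min.alsfrs.score and last.alsfrs.score just below the 5% threshold, which is consistent with the modest R^2 of the full model.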