This document is an R-Programming assignment that includes questions related to data analysis, normality tests, multivariate analysis, MANOVA, PCA, and factor analysis. The assignment requires the use of R code and interpretation of the results.
Running head: R-PROGRAMMING

R-Programming Assignment
By (Name of Student)
(Institutional Affiliation)
(Date of Submission)
Question 1 (25 marks)

a) Describe the structure of the film.txt data. (2 marks)

The film data set contains five variables, among them the four film thickness measurements TopLeft, TopRight, BottomRight and BottomLeft. All five variables hold numerical measurements.

b) Produce and interpret univariate QQ plots and histograms and univariate Shapiro-Wilk tests of normality for each of the four film thickness variables. Which is the most non-normally distributed variable? (5 marks)

[Figure: Histogram of TopRight (x-axis: TopRight; y-axis: Frequency)]

c) Produce and interpret perspective and contour plots for the top-right and top-left film thickness variables. What is an inherent problem with using these plots to assess MVN? (3 marks)
d) Do the analysis necessary to provide the results of the Mardia, Henze-Zirkler and Royston tests of MVN based on all four film thickness variables. Include in your interpretation: (10 marks)
• The Chi-Square QQ plot. Describe how it is constructed and its relationship to the univariate normal QQ plots as part of your interpretation.
• What is a key limitation of these MVN statistical tests?
e) One way to try and meet the MVN assumption could be to remove some of the variables from the multivariate analysis (do not perform this analysis). Suggest three additional ways that you might improve univariate and multivariate normality for data sets in general. (3 marks)
f) In part e) we suggested removing some variables to try and help the data approach MVN. Suggest one other reason why reducing the number of variables used in multivariate analysis may be important (this question does not relate to this particular data set). (2 marks)

Question 2 (25 marks)

The data file ‘iris.txt’ contains data for four flower characteristics variables for three species of iris. Provide R code, output and written interpretation for parts a) to f) of this question.

a) Produce a draftsman display for the 4 flower characteristics variables. Interpret these plots, relating back to the original data where it may add to the interpretation. What are the y and x axes on plot [3,2] of the draftsman plot? (4 marks)
b) In the context of MANOVA, list the dependent and independent variables and define the relationship that the MANOVA would test. (2 marks)
c) Produce the correlation matrix for the flower characteristics variables. Provide an interpretation of the correlations and indicate what they suggest about the potential for the variables to be MVN distributed (do not test for MVN). (4 marks)
d) Using MANOVA in R, test for differences in ‘flower characteristics’ between the three species. Include tests using all four test statistics covered in this course and interpret the output (assume the assumption of MVN is met). (5 marks)
e) Why is a small Wilks’ lambda statistic likely to indicate significant differences between at least some groups? Which of the four tests used in part d) would be the best to interpret if there are concerns about multivariate normality or covariance equality? (5 marks)
f) Produce output that specifically compares each of the groups with each other (you should have 3 comparisons) using Hotelling’s T² (the t-test equivalent) and a significance level of 0.05. Determine the multiple test corrected significance level. Do not provide R output; instead reproduce and complete the following table for all comparisons and interpret. How may sample sizes have affected these results and those in part c)? (5 marks)

Comparison                 Hotelling’s p-value    Significant (Y/N)    Significant after correction (Y/N)
Species 1    Species 2
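As a starting point for Question 2 parts d) and f), the following is a minimal sketch rather than a model answer. It assumes that R's built-in `iris` data frame has the same content as ‘iris.txt’, that a Bonferroni adjustment is an acceptable multiple test correction, and it uses a hypothetical helper, `hotelling_p`, written here from the standard two-sample T² formula and its F transformation.

```r
# Assumption: R's built-in `iris` data frame stands in for 'iris.txt'.
data(iris)
Y <- as.matrix(iris[, 1:4])   # four flower characteristics (dependent variables)

# One-way MANOVA: do the mean vectors differ between the three species?
fit <- manova(Y ~ Species, data = iris)
tests <- c("Pillai", "Wilks", "Hotelling-Lawley", "Roy")
manova_p <- sapply(tests, function(tt) {
  summary(fit, test = tt)$stats["Species", "Pr(>F)"]
})
print(manova_p)

# Hypothetical helper: two-sample Hotelling's T^2 p-value via the F transformation.
hotelling_p <- function(x, y) {
  n1 <- nrow(x); n2 <- nrow(y); p <- ncol(x)
  d  <- colMeans(x) - colMeans(y)                                 # mean difference
  Sp <- ((n1 - 1) * cov(x) + (n2 - 1) * cov(y)) / (n1 + n2 - 2)   # pooled covariance
  T2 <- (n1 * n2 / (n1 + n2)) * drop(t(d) %*% solve(Sp, d))
  Fstat <- T2 * (n1 + n2 - p - 1) / (p * (n1 + n2 - 2))
  pf(Fstat, p, n1 + n2 - p - 1, lower.tail = FALSE)
}

# All pairwise species comparisons (3 for three groups).
pairs_idx <- combn(levels(iris$Species), 2)
alpha <- 0.05
alpha_corrected <- alpha / ncol(pairs_idx)   # Bonferroni (assumed correction)

pairwise <- data.frame(
  species1 = pairs_idx[1, ],
  species2 = pairs_idx[2, ],
  p_value  = apply(pairs_idx, 2, function(g) {
    hotelling_p(Y[iris$Species == g[1], ], Y[iris$Species == g[2], ])
  })
)
pairwise$significant <- pairwise$p_value < alpha
pairwise$significant_corrected <- pairwise$p_value < alpha_corrected
print(pairwise)
```

The `pairwise` data frame has one row per comparison and mirrors the columns of the table above, so its values can be copied into the table by hand as the question requires.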
Question 3 (25 marks)

The data file ‘usair.dat’ contains data for seven air quality variables measured across 41 United States cities. Provide R code, output and written interpretation for all analyses.

a) Produce the correlation and covariance matrices. Explain the difference between these matrices in detail (i.e. explain clearly how the values are adjusted mathematically and the effect of these changes). Would using the covariance matrix in PCA on the USair data be appropriate? Why? (5 marks)
b) Perform PCA on the 7 variables using the prcomp function. Discuss the eigenvalues, % variation and scree plot and how they influence your decision on how many PCs to interpret from this analysis. Remember to keep in mind the overall purpose of PCA. (5 marks)
c) Interpret the first PC. Include the Z equation and a plot of the loadings on the first PC in your answer. (4 marks)
d) What is the correlation between the first and second PCs and what does this tell you? (2 marks)
e) Produce and interpret a biplot based on the first 2 PCs. In particular, explain your interpretation of the air quality variables in city 1 compared to city 11 and city 9. Relate your interpretation back to the original data. (5 marks)
f) Was this a useful analysis for this data set? Explain. (4 marks)

Question 4 (25 marks)

For this question you will continue to use the data file ‘usair.dat’ from Question 3. Provide R code, output and written interpretation for all analyses.

a) Perform parallel analysis and evaluate how many PCs should be used in FA. Compare to your choice of number of PCs in Q3 b). (3 marks)
b) Explain in your own words how the parallel analysis works. (5 marks)
c) Perform a Factor Analysis on all 7 variables (apply no rotation) using the number of factors you identified in part a). Interpret the output, including the following (10 marks):
• Variance explained
• Chi-square test
• Variable loadings
• Difference in uniqueness values for the variables wind.speed and annual.precip
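The PCA in Question 3 b) and d) and the parallel analysis in Question 4 a) and b) can be sketched in base R as follows. This is an illustration under stated assumptions, not the expected answer: ‘usair.dat’ is not reproduced here, so the first seven columns of the built-in `mtcars` data set are used as a stand-in, and the 95th percentile of 500 simulated eigenvalue sets is an assumed retention rule (the mean is another common choice).

```r
# Assumption: the first seven columns of the built-in `mtcars` data stand in
# for the seven variables of 'usair.dat', which is not available here.
X <- mtcars[, 1:7]
n <- nrow(X); p <- ncol(X)

# PCA on standardised variables (equivalent to PCA on the correlation matrix).
pca <- prcomp(X, center = TRUE, scale. = TRUE)
eig <- pca$sdev^2                 # eigenvalues of the correlation matrix
pct <- 100 * eig / sum(eig)       # % variation explained by each PC
print(round(rbind(eigenvalue = eig, percent = pct, cumulative = cumsum(pct)), 2))

# PC scores are uncorrelated by construction (Question 3 d).
pc12_cor <- cor(pca$x[, 1], pca$x[, 2])
print(pc12_cor)

# Parallel analysis: simulate uncorrelated normal data with the same number of
# rows and columns, and record the eigenvalues of each simulated correlation matrix.
set.seed(123)                     # arbitrary seed, for reproducibility only
n_sim <- 500
sim_eig <- replicate(n_sim, {
  Z <- matrix(rnorm(n * p), nrow = n, ncol = p)
  eigen(cor(Z), symmetric = TRUE, only.values = TRUE)$values
})

# Threshold for each component: 95th percentile of the simulated eigenvalues.
threshold <- apply(sim_eig, 1, quantile, probs = 0.95)

# Retain the leading components whose observed eigenvalue exceeds the threshold.
n_keep <- sum(cumprod(eig > threshold))
print(n_keep)
```

With the real data, `X` would be the data frame read from ‘usair.dat’, and `screeplot(pca, type = "lines")` would draw the scree plot to compare against `threshold`.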
d) Repeat the FA with a varimax rotation and calculate the communalities. Interpret (7 marks):
• Explain the aim and features of a varimax rotation
• Changes in the variable loadings
• The communalities.
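To show how the unrotated and varimax-rotated factor analyses in Question 4 c) and d) relate, here is a hedged sketch using `factanal` from base R. Because ‘usair.dat’ is not available here, it assumes the built-in covariance list `ability.cov` (six ability tests) as a stand-in and fixes the number of factors at 2 for illustration only. In the assignment that number would come from the parallel analysis.

```r
# Assumption: the built-in covariance list `ability.cov` (6 variables, with
# n.obs stored in the list) stands in for the 'usair.dat' variables.
k <- 2   # illustrative number of factors, not derived from the assignment data

fa_none <- factanal(factors = k, covmat = ability.cov, rotation = "none")
fa_vmx  <- factanal(factors = k, covmat = ability.cov, rotation = "varimax")

# Plain loading matrices (variables in rows, factors in columns).
L_none <- unclass(loadings(fa_none))
L_vmx  <- unclass(loadings(fa_vmx))
print(round(cbind(L_none, L_vmx), 2))   # unrotated loadings, then rotated

# Proportion of total variance explained by each factor.
var_none <- colSums(L_none^2) / nrow(L_none)
var_vmx  <- colSums(L_vmx^2)  / nrow(L_vmx)
print(round(rbind(unrotated = var_none, varimax = var_vmx), 3))

# Communality of each variable: sum of its squared loadings across factors.
comm_none <- rowSums(L_none^2)
comm_vmx  <- rowSums(L_vmx^2)
print(round(cbind(communality = comm_vmx, uniqueness = fa_vmx$uniquenesses), 3))

# Chi-square test that k factors are sufficient.
fit_test <- c(statistic = unname(fa_none$STATISTIC),
              dof       = unname(fa_none$dof),
              p_value   = unname(fa_none$PVAL))
print(fit_test)
```

An orthogonal rotation such as varimax redistributes variance between the factors but leaves each variable's communality, and the total variance explained, unchanged. Comparing `comm_none` with `comm_vmx` makes this visible. With the real data, the data frame read from ‘usair.dat’ would be passed as the first argument in place of `covmat`.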