Data Analysis Report of Causes of Death in Queensland
Added on 2023/03/31
This report presents the analysis of deaths due to various reasons from 1997 to 2017 in Queensland. The report tries to draw significant conclusions that can be useful for practical life and further studies.
Author Name
Data analysis report of Causes of death in Queensland

Table of Contents
Introduction
Data setup
Exploratory Data Analysis
    One Variable Analysis
    Two Variable Analysis
Advanced Analysis
    K-means Cluster Analysis
    Linear regression Analysis
Conclusion
Reflections
Reference and Bibliography
Introduction

This report analyses deaths in Queensland by cause from 1997 to 2017 and tries to draw conclusions that are useful in practice and for further study. A k-means cluster analysis groups the years into subgroups to bring out differences in the average death rates across causes. Linear regressions show how the number of deaths changes from year to year. The data were collected from the Australian government website. Data cleaning was completed in Excel 2014 and the analysis was done with the open-source statistical package R.

Data setup

The data were not ready for analysis, so some changes were made in Excel: the variable names were edited and the data set was transposed so that each variable is described across years. The Excel file was then imported into R for further analysis with the following code.

    ucdq <- readxl::read_xlsx(file.choose())  # choose the Excel file interactively and read it into R

The following code loads the libraries used in the analysis.

    library(RColorBrewer)
    library(ggplot2)
    library(stats4)
    library(cluster)

To omit missing values the na.omit function is used; some of the calls are shown below (Little and Rubin 2019).

    # na.omit() returns the non-missing values of each column; the result is
    # not assigned here, so ucdq itself is left unchanged
    na.omit(ucdq$`Cause of death`)
    na.omit(ucdq$`Certain infectious and parasitic diseases`)
    na.omit(ucdq$`Neoplasms (cancer)`)
    na.omit(ucdq$`Trachea, bronchus and lung`)
    na.omit(ucdq$`Melanoma of skin`)
    na.omit(ucdq$Breast)

Exploratory Data Analysis

One Variable Analysis

Each variable holds information about the observations. In this section, diseases of the nervous system and mental and behavioural disorders are chosen for one variable analysis.
The R code for the summary statistics and boxplot of diseases of the nervous system is shown below:

    summary(ucdq$`Diseases of the nervous system`)  # summary statistics
    boxplot(ucdq$`Diseases of the nervous system`, col = "red")  # boxplot

Table 1: Summary statistics of deaths due to diseases of the nervous system
Table 1 gives the minimum and maximum of deaths due to diseases of the nervous system. The mean is 988.3, the first quartile is 697.0 and the second quartile is 1227. A boxplot can also reveal outliers; the boxplot in Figure 1 shows none. On average, about 988 people died each year from diseases of the nervous system.

Figure 1: Box plot for deaths due to diseases of the nervous system

The R code for the summary statistics and boxplot of mental and behavioural disorders is shown below:

    summary(ucdq$`Mental and behavioural disorders`)  # summary statistics
    boxplot(ucdq$`Mental and behavioural disorders`, col = "green")  # boxplot

Table 2: Summary statistics of deaths due to mental and behavioural disorders

Table 2 gives the median and mean of deaths due to mental and behavioural disorders. The mean is 951, the first quartile is 468 and the second quartile is 1342. A boxplot can also reveal outliers; the boxplot in Figure 2 shows none. On average, about 951 people died each year from mental and behavioural disorders (Brastein et al. 2018).
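The outlier reading of a boxplot can also be checked numerically. The sketch below uses an invented vector of yearly death counts (not the Queensland data) and base R's `boxplot.stats`, which applies the same 1.5 × IQR whisker rule that `boxplot` uses by default. The variable names are illustrative assumptions.

```r
# Hypothetical yearly death counts for illustration only (not the report's data)
deaths <- c(610, 655, 697, 720, 790, 850, 910, 960, 1010, 1080,
            1150, 1227, 1290, 1350, 1400)

# Minimum, quartiles, median, mean and maximum, as printed by summary()
stats <- summary(deaths)
print(stats)

# boxplot.stats() returns the points lying beyond the whiskers in $out
outliers <- boxplot.stats(deaths)$out
n_outliers <- length(outliers)
cat("Number of outliers:", n_outliers, "\n")
```

An empty `$out` vector corresponds to a boxplot with no points drawn beyond the whiskers.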
Figure 2: Box plot for deaths due to mental and behavioural disorders

Two Variable Analysis

Each two variable analysis deals with a pair of variables. The two pairs chosen are mental and behavioural disorders with year, and diseases of the nervous system with year. The R code for the first pair is shown below:

    # after the transpose, the column `Cause of death` holds the year
    cor(ucdq$`Cause of death`, ucdq$`Mental and behavioural disorders`)  # correlation = 0.9739592
    plot(ucdq$`Cause of death`, ucdq$`Mental and behavioural disorders`,
         main = "Number of serious mental disorder over the year",
         xlab = "Year", ylab = "Mental and behavioural disorders")  # scatter plot

The correlation between these variables is high, with a coefficient close to 1. Figure 3 shows a scatter plot in which the linear relation of mental and behavioural disorders against year is flatter up to 2005 and steeper afterwards. This implies that deaths from mental and behavioural disorders have risen faster since 2005 (Ho et al. 2018).

Figure 3: Scatter plot for deaths due to mental and behavioural disorders against year

The R code for the second pair is shown below:

    cor(ucdq$`Cause of death`, ucdq$`Diseases of the nervous system`)  # correlation = 0.970241
    plot(ucdq$`Cause of death`, ucdq$`Diseases of the nervous system`,
         main = "Number of serious diseases of the nervous system over the year")  # scatter plot

The correlation between these variables is also high, with a coefficient close to 1. Figure 4 shows deaths due to diseases of the nervous system by year. The association between the two variables is linear: the number of deaths due to diseases of the nervous system has increased over the years (Muktadar et al. 2018).
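The change in steepness around 2005 described for Figure 3 can be quantified by fitting a separate straight line to each period. This is a minimal sketch on synthetic data: the counts are invented, only the split year 2005 is taken from the text, and the variable names are assumptions.

```r
# Synthetic series for illustration: slow growth up to 2005, faster growth after
year <- 1997:2017
deaths <- ifelse(year <= 2005,
                 400 + 10 * (year - 1997),
                 480 + 80 * (year - 2005))

# Overall Pearson correlation between year and deaths
r <- cor(year, deaths)

# Slope of a straight-line fit within each period
early <- year <= 2005
slope_early <- unname(coef(lm(deaths[early] ~ year[early]))[2])
slope_late  <- unname(coef(lm(deaths[!early] ~ year[!early]))[2])

cat("correlation:", round(r, 3), "\n")
cat("slope up to 2005:", round(slope_early, 1), "\n")
cat("slope after 2005:", round(slope_late, 1), "\n")
```

A high overall correlation can coexist with two quite different slopes, which is why the scatter plot is worth reading alongside the coefficient.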
Figure 4: Scatter plot for deaths due to diseases of the nervous system against year

Advanced Analysis

K-means Cluster Analysis

Cluster analysis groups observations so that those in the same cluster have similar values across the variables. Here the observations being grouped are the years, so year is treated as the target variable and the number of clusters is the number of year groups formed (Gaudet, Begon and Tremblay, 2019). The R code used for the k-means cluster analysis is shown below:

    clustucdq <- ucdq  # working copy; assumed, since the original listing never creates clustucdq
    clustucdq$`Cause of death` <- NULL  # drop the year column (the target variable)
    clustucdq.stand <- scale(clustucdq[-1])  # standardise the variables; [-1] also drops the first remaining column

    # Plot the within-groups sum of squares for 1 to nc clusters to help choose the number of clusters
    wssplot <- function(data, nc = 15, seed = 1234) {
      wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
      for (i in 2:nc) {
        set.seed(seed)
        wss[i] <- sum(kmeans(data, centers = i)$withinss)
      }
      plot(1:nc, wss, type = "b", xlab = "Number of Clusters",
           ylab = "Within groups sum of squares")
    }

    wssplot(clustucdq.stand)  # plot used to pick the number of clusters
    resultclust <- kmeans(clustucdq.stand, 2)  # k-means cluster analysis with 2 clusters
    resultclust$centers  # table of normalized centre values for each cluster
    plot(resultclust$centers)  # group mean scatter plot
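The listing above stops at the cluster centres and does not show which years fall in each cluster. One way to recover that, sketched here on synthetic data with hypothetical column names (`cause_a`, `cause_b`), is to summarise the `cluster` component of the `kmeans` result by year.

```r
set.seed(1234)  # k-means starts from random centres, so fix the seed

# Synthetic stand-in for the report's data: 21 years, two causes with a
# level shift after 2007 (values and column names are invented)
year <- 1997:2017
toy <- data.frame(
  cause_a = ifelse(year <= 2007, 100, 300) + rnorm(21, sd = 5),
  cause_b = ifelse(year <= 2007, 50, 150) + rnorm(21, sd = 5)
)

# Standardise the columns, then cluster the years into two groups
toy_scaled <- scale(toy)
fit <- kmeans(toy_scaled, centers = 2, nstart = 10)

# First and last year covered by each cluster
year_ranges <- tapply(year, fit$cluster, range)
print(year_ranges)
```

The cluster labels 1 and 2 are arbitrary, so only the grouping of years is meaningful, not which group gets which number.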
Figure 5: WSS figure to determine the clusters

Figure 5 shows that the optimum number of clusters is 2. The table below presents the centre value of each variable for each cluster (Adolfsson, Ackerman and Brownstein 2019).

Table 3: Normalized centre values of the variables for the 2 clusters

Cluster 1   Cluster 2   Variable
   -0.854       0.939   Neoplasms (cancer)
   -0.845       0.930   Trachea, bronchus and lung
   -0.783       0.861   Melanoma of skin
   -0.753       0.828   Breast
   -0.764       0.840   Female genital organs
   -0.821       0.903   Male genital organs
   -0.235       0.259   Diseases of blood and blood-forming organs and certain disorders involving the immune mechanism
   -0.853       0.938   Endocrine, nutritional and metabolic diseases
   -0.862       0.948   Diabetes mellitus
   -0.827       0.910   Mental and behavioural disorders
   -0.814       0.895   Diseases of the nervous system
    0.491      -0.540   Diseases of the circulatory system
    0.732      -0.805   Ischaemic heart disease
    0.380      -0.418   Cerebrovascular disease (stroke)
   -0.655       0.720   Diseases of the respiratory system
    0.157      -0.173   Influenza and pneumonia
   -0.762       0.839   Chronic lower respiratory diseases
   -0.768       0.845   Diseases of the digestive system
   -0.657       0.723   Diseases of skin and subcutaneous tissue
   -0.820       0.902   Diseases of musculoskeletal system and connective tissue
   -0.623       0.685   Diseases of the genitourinary system
   -0.412       0.453   Certain conditions originating in the perinatal period
   -0.473       0.520   Congenital malformations, deformations and chromosomal abnormalities
   -0.198       0.218   Symptoms, signs and abnormal clinical and laboratory findings, not elsewhere classified
   -0.810       0.891   External causes of morbidity and mortality
    0.422      -0.465   Transport accidents
   -0.790       0.869   Falls
    0.001      -0.001   Accidental drowning and submersion
   -0.682       0.750   Intentional self-harm (suicide)
    0.243      -0.267   Assault (includes homicide)

Figure 6: Scatter plot showing the group mean of two clusters

The figure presents the group means of the variables for the two clusters. The years fall into two groups: the first contains 1997 to 2007 and the second contains 2008 to 2017.

Linear regression Analysis

A linear regression analysis shows how a dependent variable is associated with one or more independent variables, which helps in prediction. In this section, two pairs of variables are chosen for two linear regression analyses (Melie-Garcia et al. 2018): certain infectious and parasitic diseases against year, and neoplasms (cancer) against year. The two regressions are discussed below.

The R code used for the linear regression of deaths due to certain infectious and parasitic diseases against year is shown below:

    plot(ucdq$`Cause of death`, ucdq$`Certain infectious and parasitic diseases`)  # scatter plot
    cor(ucdq$`Cause of death`, ucdq$`Certain infectious and parasitic diseases`)  # correlation = 0.9270985
    lreg <- lm(ucdq$`Certain infectious and parasitic diseases` ~ ucdq$`Cause of death`)  # linear regression
    summary(lreg)  # summary of the linear regression
    abline(lreg, col = "red", lty = 2, lwd = 2)  # add the regression line to the scatter plot

Figure 7: Scatter plot for deaths due to certain infectious and parasitic diseases against year

The figure above shows an upward linear relation between the two variables. The table below presents the regression result (Minoglou and Komilis 2018).

Table 4: Regression result for regressing certain infectious and parasitic diseases on year

Table 4 shows that the model explains 85.95% of the variation in deaths (its R-squared) and that the independent variable is statistically significant. The p-value of the F-statistic is close to zero, so the model with year as its independent variable is significant at any conventional significance level. Based on this result, Figure 8 presents the regression line.
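The 85.95% figure is consistent with the correlation reported earlier: in a regression with a single predictor, R-squared equals the squared Pearson correlation, and 0.9270985² ≈ 0.8595. The short check below illustrates this identity on synthetic data (the values are invented for illustration) and also shows one way to extract the F-test p-value from a fitted model.

```r
set.seed(42)

# Synthetic yearly counts with an upward trend plus noise (illustration only)
year <- 1997:2017
deaths <- 200 + 12 * (year - 1997) + rnorm(21, sd = 30)

fit <- lm(deaths ~ year)
s <- summary(fit)

r <- cor(year, deaths)
r_squared <- s$r.squared

# p-value of the overall F-test, computed from the stored F statistic
f <- s$fstatistic
p_value <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))

cat("r^2 from cor():", r^2, "\n")
cat("R^2 from lm(): ", r_squared, "\n")
cat("F-test p-value:", p_value, "\n")
```

R-squared measures the share of variance explained by the fitted line rather than prediction accuracy in a classification sense.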
Figure 8: Regression line for deaths due to certain infectious and parasitic diseases against year

The R code used for the linear regression of deaths due to neoplasms (cancer) against year is shown below:

    plot(ucdq$`Cause of death`, ucdq$`Neoplasms (cancer)`)  # scatter plot
    cor(ucdq$`Cause of death`, ucdq$`Neoplasms (cancer)`)  # correlation = 0.9847033
    lreg1 <- lm(ucdq$`Neoplasms (cancer)` ~ ucdq$`Cause of death`)  # linear regression
    summary(lreg1)  # regression result
    abline(lreg1, col = "red", lty = 2, lwd = 2)  # add the regression line to the scatter plot

Figure 9: Scatter plot for deaths due to neoplasms (cancer) against year

As in the previous regression, the figure above shows an upward linear relation between the two variables. The table below presents the regression result.

Table 5: Regression result for regressing neoplasms (cancer) on year
Table 5 shows that the model explains 96.96% of the variation in deaths (its R-squared) and that the independent variable is statistically significant. The p-value of the F-statistic is close to zero, so the model with year as its independent variable is significant at any conventional significance level. Based on this result, Figure 10 presents the regression line.

Figure 10: Regression line for deaths due to neoplasms (cancer) against year

Conclusion

The one variable analysis presents the average yearly deaths due to diseases of the nervous system and due to mental and behavioural disorders. There were no outliers, so no single year unduly influences these averages. The two variable analysis presents the linear association of mental and behavioural disorders with year, shown in Figure 3, and of diseases of the nervous system with year, shown in Figure 4. Both analyses show a positive linear relation. The k-means cluster analysis produces two groups of years and gives the normalized average deaths for each group. The linear regressions model certain infectious and parasitic diseases against year and neoplasms (cancer) against year. Both regressions show a statistically significant relationship. This means that average deaths from both causes included in the analysis have been increasing over the years.

Reflections

There are a great many techniques for analysing qualitative and quantitative variables. The data used here form a time series, so time series analysis could be used to predict which diseases will cause more or fewer deaths in the future. This could give scientists ample time to research those diseases and control their effects.
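As a rough illustration of the forecasting idea, the sketch below extrapolates a fitted linear trend a few years ahead with `predict`. This is plain trend extrapolation on invented data, not a full time series model, and the horizon of 2018 to 2020 is an arbitrary assumption.

```r
set.seed(7)

# Invented yearly death counts with a linear trend (illustration only)
history <- data.frame(year = 1997:2017)
history$deaths <- 500 + 25 * (history$year - 1997) + rnorm(21, sd = 40)

trend <- lm(deaths ~ year, data = history)

# Point forecasts with 95% prediction intervals for three future years
future <- data.frame(year = 2018:2020)
forecast <- predict(trend, newdata = future, interval = "prediction")
print(cbind(future, forecast))
```

The prediction intervals widen as the forecast moves further from the observed years, which is one reason a dedicated time series method may be preferable for longer horizons.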
Reference and Bibliography

Adolfsson, A., Ackerman, M. and Brownstein, N.C., 2019. To cluster, or not to cluster: An analysis of clusterability methods. Pattern Recognition, 88, pp.13-26.
Brastein, O.M., Perera, D.W.U., Pfeifer, C. and Skeie, N.O., 2018. Parameter estimation for grey-box models of building thermal behaviour. Energy and Buildings, 169, pp.58-68.
Chatfield, C., 2018. Introduction to multivariate analysis. Routledge.
Cox, D.R., 2018. Analysis of survival data. Routledge.
Gaudet, S., Begon, M. and Tremblay, J., 2019. Cluster analysis using physical performance and self-report measures to identify shoulder injury in overhead female athletes. Journal of science and medicine in sport, 22(3), pp.269-274.
Ho, J., Tumkaya, T., Aryal, S., Choi, H. and Claridge-Chang, A., 2018. Moving beyond P values: Everyday data analysis with estimation plots. BioRxiv, p.377978.
Little, R.J. and Rubin, D.B., 2019. Statistical analysis with missing data (Vol. 793). Wiley.
Melie-Garcia, L., Draganski, B., Ashburner, J. and Kherif, F., 2018. Multiple Linear Regression: Bayesian Inference for Distributed and Big Data in the Medical Informatics Platform of the Human Brain Project. BioRxiv, p.242883.
Minoglou, M. and Komilis, D., 2018. Describing health care waste generation rates using regression modeling and principal component analysis. Waste management, 78, pp.811-818.
Muktadar, A.K., Gangaiah, M., Chrcanovic, B.R. and Chowdhary, R., 2018. Evaluation of the effect of self cutting and nonself cutting thread designed implant with different thread depth on variable insertion torques: An histomorphometric analysis in rabbits. Clinical implant dentistry and related research, 20(4), pp.507-514.