This report presents the analysis of road accidents from 1989 to 2019 using techniques like cluster analysis and linear regression analysis. The data contains mostly qualitative data and can be used for further advanced studies. The report describes the data using one variable analysis and two variable analysis.
Contribute Materials
Your contribution can guide someone’s learning journey. Share your
documents today.
Data Analysis report of Road Crashes from 1989 to 2019 Name of the Student: Name of the University: Author Note:
Secure Best Marks with AI Grader
Need help grading? Try our AI Grader for instant feedback on your assignments.
Table of Contents Introduction...............................................................................................................................2 Data setup..................................................................................................................................2 Explanatory Data Analysis..........................................................................................................2 One variable Analysis “AGE”..................................................................................................2 One variable Analysis “SPEED LIMIT”.....................................................................................4 Two Variable Analysis............................................................................................................5 Two variable analysis “GENDER & CRASH TYPE”...............................................................5 Two variable analysis “CRASH TYPE & ROAD USER”..........................................................5 Advanced Analysis......................................................................................................................6 Clustering...............................................................................................................................6 Concept of k-means Cluster Analysis.................................................................................6 Clustering Analysis.............................................................................................................6 Linear regression Analysis......................................................................................................9 Conclusion................................................................................................................................10 Reflections................................................................................................................................10 Reference and Bibliography:....................................................................................................12 Page1 Data analysis report of road crashes from 1989 to 2019
Introduction The report presents the analysis of road accidents form 1989 to 2019. There are techniques like cluster analysis and linear regression analysis which are used here. The data contains mostly qualitative data and these can be used for further advance studies like prediction of speed limit that encourages multiple crashes (Cox 2018). The paper describes the data using one variable analysis and two variable analysis. The data is available on the Australian government website. Excel is used to clean and R is used to perform the analysis. Data setup The data file used in the analysis was copied in a new excel workbook and saved as xlsx file. The data was then edited by using EXCEL 2013 where the observations were deleted which contains the value -9 as it indicates missing value and unknown information. After cleaning the data xlsx file was uploaded in the R workspace for further steps towards analysis. The command used to upload the file is mentioned below: Uploading data file:crash <- readxl::read_xlsx(file.choose()) There are three libraries that were loaded for the analysis. The libraries are “stats”, “RColorBrewer” and “ggplot2” for statistical functions and data visualisation. library(stats) library(RColorBrewer) library(ggplot2) Before going to the analysis part, the initial step was to remove the missing values from the data set which were automatically assigned as NA. The following command was used to omit the NA values from the data set. na.omit(mydata$`Crash Type`) na.omit(mydata$`Bus Involvement`) na.omit(mydata$`Articulated Truck Involvement`) na.omit(mydata$`Speed Limit`) na.omit(mydata$`Road User`) na.omit(mydata$Gender) na.omit(mydata$Age) The above commands ensures that the further steps were not going to be disturbed by the missing values. Hence, the analysis proceeded for the one variable analysis and so on. Explanatory Data Analysis One variable Analysis “AGE” summary(mydata$Age) boxplot(mydata$Age,col = "blue") Page2 Data analysis report of road crashes from 1989 to 2019
Figure 1: Histogram of Age The figure 1 presents the frequency distribution of age and the shape of distribution which is left skewed. The table 1 shows the average age of the observed sample which is 41. The quartiles are presented by the box plot too. Table 1: Summary statistics of age Figure 2: Box plot of age The figure 2 and table 1 presents the lower upper quartile which is 23 and 57 respectively. The additional feature of the box plot is showing the outliers which is not present in the age variable (Wickham 2016). Page3 Data analysis report of road crashes from 1989 to 2019
Paraphrase This Document
Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
One variable Analysis “SPEED LIMIT” hist(mydata$`Speed Limit`,col = "green") summary(mydata$`Speed Limit`) Figure 3: Histogram of speed limit The figure 3 presents the histogram for the frequency distribution for the speed limit which is uneven. The histogram shows that the most of the individuals’ speed limit is around 100. The table 2 presents the average speed limit that is 82.93. Table 2: Summary statistics of speed limit The lower quartile and the upper quartile of speed limit is 60 and 100 which means most of the people drive with a speed limit that ranges between 60 and 100. This is presented in the box plot of speed limit. Figure 4: Boxplot for the speed limit Page4 Data analysis report of road crashes from 1989 to 2019
Two Variable Analysis Two variable analysis “GENDER & CRASH TYPE” GenderCrash<- table(mydata$Gender,mydata$`Crash Type`) barplot(table(mydata$Gender),col="yellow") Figure 5: Distribution of gender in the sample The figure 5 presents the gender frequency through the bar plot. This presents that the male are more likely prone to the crashes. The table 3 presents the frequency for male and female across the 3 categories of crash type that presents that male are mostly engaged in single crashes. The number of single crashes across gender is higher than multiple and pedestrian crash. Table 3: Frequency of gender against crash type Two variable analysis “CRASH TYPE & ROAD USER” CrashRoad<- table(mydata$`Crash Type`,mydata$`Road User`) barplot(table(AA$`Crash Type`),col="sea green") Page5 Data analysis report of road crashes from 1989 to 2019
Figure 6: Distribution of crash type in the sample The figure 6 presents the crash type frequency through the bar plot. This presents that the single and multiple crashes are more likely. The table 4 presents the frequency for crash types across the categories of road users that presents that drivers are mostly engaged in single and multiple crashes. The number of single crashes is higher than multiple crashes for the driver where it is lower in case off motorcycle pillion passenger and motorcycle rider. Table 4: Frequency of crash against road user Advanced Analysis Clustering Concept of k-means Cluster Analysis K means clustering separates n number of observations into k number of clusters with the nearest mean value of the k cluster (Cohenet al.2015). The cluster contains the observations that has similar mean and the different cluster contains the observations that has different mean from other observations belongs to the other cluster. In k means method, Euclidean distance is presented as metric and the variance is presented as the measure of scatter of cluster (Shukriet al.2018). The k is set manually so it is important to follow proper process to determine the k or conduct a test for the specific value of k to check the reliability of the k, otherwise the cluster analysis will not be able to perform well. Clustering Analysis The wssplot function is generated to plot the wss to determine the numbers of cluster (Kassambara 2017). The required command is mentioned below: wssplot<- function(data, nc=15,seed=1234) #To determine the optimal number of cluster. { wss<- (nrow(data)-1)*sum(apply(data,2,var)) Page6 Data analysis report of road crashes from 1989 to 2019
Secure Best Marks with AI Grader
Need help grading? Try our AI Grader for instant feedback on your assignments.
for(i in 2:nc){ set.seed(seed) wss[i]<-sum(kmeans(data,centers = i)$withinss) } plot(1:nc,wss,type="b",xlab="Numbber of Clusters", ylab="within groups sum of squares") } wssplot(clustdata) The following diagram presents the group sum of squares to the number of clusters. The kinked point of the graph indicates the number of cluster. Here the graph is kinked where the number of cluster is equal to 4 and it is clearly seen in the below figure. Figure 7: Histogram of Age Now, the k means clustering can be applied with the 4 clusters. Thee command in R for k means cluster is mentioned below where the command for the k means of the two variables is also presented which gives a k means table and the plot of k means cluster: kmeansclustdata<- kmeans(clustdata,4)#k means cluster analysis kmeansclustdata$centers#k means table plot(kmeansclustdata$centers)#plot for the k means cluster Page7 Data analysis report of road crashes from 1989 to 2019
The below table shows the cluster means that is generated from the cluster analysis. The table shows 4 cluster means for each variable. This implies that in 1stcluster the average speed and average age is 64.65 and 27.103 respectively. Table 5: Cluster means of speed limit and age The below figure presents 4 points for 4 clusters which represents that the observations will be close to these points or the points are the mean of speed limit across average age of each cluster. Figure 8: Cluster mean plot of speed limit and age Page8 Data analysis report of road crashes from 1989 to 2019
Linear regression Analysis Linear regression presents the relation between two or more variable where a variable is dependent and the other variables are independent (Berger, Maurer and Celli 2018). The linear model is presented is as below: Y=a + b*X Where Y is the dependent variable and X is independent variable. The “a” is the intercept term and “b” is the intercept term of the model. This parameters have their own interpretation. The intercept term presents the value of Y when X is zero and the slope term presents the change in Y due to one unit of change in X. This analysis is mainly used to check find relation between variables and to predict the dependent variable. x=1:28973# X is assigned as the age y=clustdata$`Speed Limit`#Y is assigned as the speed limit linearreg<-lm(y~x)# linear regression summary(linearreg)#Regression Result abline(linearreg,col="green",lty=2, lwd=2)#Plot for the estimated trend line The regression result is presented using the summary function in the following table where the residuals summary statistics, estimated coefficients and corresponding standard error and p-value, adjusted R2and F-stat is available (Faraway 2016). The coefficients standard errors and the p-values are very small which indicates that the coefficients are statistically significant with minimum error. F-statistics is F (28971, 0.0094) = 6.74 is significant that indicates the model is significant with the variables (Hanley 2016). However the R2is 0.00023 which is very small that says the goodness of fit of the model is worse or the age cannot explain the speed limit. Table 6: Regression Result The eestimated regression equation is presented in the below figure which is in green colour showing a linear relation between age and speed limit. The intercept and the slope of the trendline is significant. Page9 Data analysis report of road crashes from 1989 to 2019
Paraphrase This Document
Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
Figure 9: Trend line obtained from the regression Conclusion The variables age and speed limit is described through the one variable analysis where the location of variables through mean, shape of the variables through histogram and the outliers through boxplot is presented. Gender vs. crash type and crash type vs. road users is described through two variable analysis where the frequency is presented for gender against crash type and the crash type against road users. The conclusion from two variable analysis says that male are most likely to engage in crashes and the males engage in single crashes mostly. Most of the road users are driver who engage in crash and they mostly engage in single crash. The linear regression analysis presents a significant relation between age and speed limit. However the model cannot be used to predict the speed limit for a specific age. Reflections The cluster analysis groups the observations in 4 clusters here. The wssplot helped to determine the number of optimum clusters.Statistical techniques are used for qualitative and quantitative analysis and also used for the prediction. The data set used in this analysis is mostly qualitative and this paper is focused on the quantitative analysis. Hence there is a lot to analyse. For example, a logit and tobit model can be used for the binary response variables and ordered response variables. The binary response variable in this model is gender and the ordered response variable is crash type. Page10 Data analysis report of road crashes from 1989 to 2019
Page11 Data analysis report of road crashes from 1989 to 2019
Reference and Bibliography: Berger,P.D.,Maurer,R.E.andCelli,G.B.,2018.IntroductiontoSimpleRegression. InExperimental Design(pp. 483-503). Springer, Cham. Cohen, M.B., Elder, S., Musco, C., Musco, C. and Persu, M., 2015, June. Dimensionality reduction for k-means clustering and low rank approximation. InProceedings of the forty- seventh annual ACM symposium on Theory of computing(pp. 163-172). ACM. Cox, D.R., 2018.Analysis of binary data. Routledge. Faraway, J.J., 2016.Linear models with R. Chapman and Hall/CRC. Hanley, J.A., 2016. Simple and multiple linear regression: sample size considerations.Journal of clinical epidemiology,79, pp.112-119. Kassambara, A., 2017.Practical guide to cluster analysis in R: unsupervised machine learning(Vol. 1). STHDA. Lem, S., Onghena, P., Verschaffel, L. and Van Dooren, W., 2017. The power of refutational text:changingintuitionsabouttheinterpretationofboxplots.EuropeanJournalof Psychology of Education,32(4), pp.537-550. Neuendorf, K.A., 2016.The content analysis guidebook. Sage. Shukri, S., Faris, H., Aljarah, I., Mirjalili, S. and Abraham, A., 2018. Evolutionary static and dynamic clustering algorithms based on multi-verse optimizer.Engineering Applications of Artificial Intelligence,72, pp.54-66. Wickham, H., 2016.ggplot2: elegant graphics for data analysis. Springer. Page12 Data analysis report of road crashes from 1989 to 2019