Individual Data Analysis Project: UK Schizophrenia Prescriptions
Added on 2022/08/16

Abstract
This study examines whether demographic and regional variables can explain differences in the
rates of prescriptions for schizophrenia and other related psychosis across districts in the
United Kingdom. Prescription rates for two periods, 2010/11 and 2015/16, are analysed together
with district demographics. Correlation analysis and cluster analysis are applied to assess the
effect of the other variables on the rates of prescriptions. The Mann-Whitney U test and spatial
analysis are applied to compare the rates of prescriptions across the two periods. The study
also assesses whether the rates of prescriptions can be predicted using multiple linear
regression. The results indicate that region, proportion of gender, proportion of age groups and
population do not have any significant effect on the rates of prescriptions for schizophrenia
and other related psychosis in the United Kingdom.

Contents
Introduction
Data
Methodology
Analysis Results
Data Preparation
Descriptive Analysis
Inferential Analysis
Spatial Analysis
Correlation Analysis
Cluster Analysis
Independent Sample T-test
Mann Whitney U Test
Prediction Analysis
Conclusion
References
Appendix: R Script

Introduction
The section of the population diagnosed with schizophrenia and other related psychosis
represents a vulnerable group (Martin & Marie, 2018). This makes research into medication for
schizophrenia and other related psychosis very significant. Understanding the distribution of
prescriptions for schizophrenia and other related psychosis is important in planning for this
vulnerable group. This study is interested in understanding that distribution in the United
Kingdom.
The study aims to determine whether differences in the prescriptions for schizophrenia
and other related psychosis in the United Kingdom can be explained using other variables. The
study also investigates the trend in prescriptions for schizophrenia and other related psychosis
in the United Kingdom by comparing the values for the years 2010/11 and 2015/16. Additionally,
the study explores the possibility of predicting the rates of prescriptions for schizophrenia
and other related psychosis in the United Kingdom.
Data
The data used in this study is a merged data set built from three other data sets, as
described in the Data Preparation section below. The three initial data sets are: schizophrenia,
demographics for sex and demographics for age groups. The observations in the merged data
(303 in total) are districts in the United Kingdom. The variable summary for the merged data
set is given below in Table 1: Summary Variable Description.
Table 1: Summary Variable Description
|Variable |Description |Type |Scale |
|:----------------|:--------------------------------------------------------------------------|:-----------------------------------|:-------|
|R2010_11 |Rates of prescriptions for schizophrenia and related psychosis for 2010/11. |Dependent variable (Numeric) |Ratio |
|R2015_16 |Rates of prescriptions for schizophrenia and related psychosis for 2015/16. |Dependent variable (Numeric) |Ratio |
|Region |Regional units in the United Kingdom. |Independent variable (Categorical) |Nominal |
|0 to 15 |Number of individuals falling under the 0 to 15 age group. |Independent variable (Numeric) |Ratio |
|16 plus |Number of individuals falling under the 16 and older age group. |Independent variable (Numeric) |Ratio |
|Female |Number of females in given district. |Independent variable (Numeric) |Ratio |
|Male |Number of males in given district. |Independent variable (Numeric) |Ratio |
|Total Population |Total population in given district. |Independent variable (Numeric) |Ratio |
Methodology
To investigate the effect of regional location on the rates of prescriptions for
schizophrenia and related psychosis in the United Kingdom, cluster analysis is applied. Cluster
analysis is a data analysis technique that groups entries according to how homogeneous they
are (Daie & Li, 2016; Beibei, Bo, Weiwei, & Ying, 2017). The k-means technique is used
specifically. K-means clustering is a non-hierarchical clustering method that requires the
number of desired clusters to be specified in advance (Liu & Denxiao, 2015; Malki & Rizk,
2016). Correlation analysis is applied to investigate the relationship between the numerical
independent variables and the two dependent variables. Correlation analysis is a technique for
evaluating relationships that provides information on the magnitude and direction of the
relationship between variables (Howitt & Cramer, 2010; Everitt & Skrondal, 2010).
To compare the 2010/11 and 2015/16 rates of prescriptions for schizophrenia and related
psychosis for trend, this study applies spatial analysis and the independent samples t-test.
Spatial analysis is a data analysis technique concerned with mapping data by geographical
location (Danielle, 2019; Schubert, Zimek, & Kriegel, 2012). The independent samples t-test, on
the other hand, is a statistical test of whether two populations that are independent of each
other differ significantly in their means (Barbara & Susan, 2014; Keller, 2015).
Prediction analysis refers to the generation of statistical models that are applied in
forecasting values of the variable of interest (Sakr, Elhajj, Mitri, & Wejinya, 2010). In this
instance, the interest is in forecasting the rate of prescriptions for schizophrenia and related
psychosis. The 2015/16 rate, being the more recent, is used as the dependent variable in a
multiple linear regression. In the context of prediction analysis, the multiple linear regression
model produces a linear equation for forecasting the dependent variable, which is the subject
of the equation (Jaulin, 2010; Cortes & Mohri, 2014).
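As a rough illustration of the prediction workflow described above, the sketch below fits a linear regression on a training subset and scores it on held-out rows. The data are synthetic, and the names `d`, `pop`, `rate`, `train_set` and `test_set`, as well as the 60/40 split, are assumptions made for illustration. The regression output in Table 12 reports 169 residual degrees of freedom with 12 estimated coefficients, which would be consistent with roughly 60% of the 303 districts being used for training, but the study's exact splitting procedure is not stated.

```r
set.seed(5)                      # reproducible synthetic data
n <- 300
d <- data.frame(pop = runif(n, 50000, 350000))
d$rate <- 27 + 0.00002 * d$pop + rnorm(n, sd = 6)

# Hold out 40% of the rows for testing (assumed split, not taken from the study).
train_idx <- sample(seq_len(n), size = round(0.6 * n))
train_set <- d[train_idx, ]
test_set  <- d[-train_idx, ]

fit    <- lm(rate ~ pop, data = train_set)       # fit on training rows only
pred   <- predict(fit, newdata = test_set)       # forecast the held-out rows
rmse   <- sqrt(mean((test_set$rate - pred)^2))   # out-of-sample error
adj_r2 <- summary(fit)$adj.r.squared             # in-sample adjusted R-squared
c(rmse = rmse, adj_r2 = adj_r2)
```

Reporting an out-of-sample error alongside the adjusted R-squared gives a more direct view of forecasting quality than the in-sample fit alone.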
Analysis Results
Data Preparation
Data preparation broadly covers the data cleaning and data transformation methods applied
to a dataset to make its format more suitable for analysis (Arif & Mujtaba, 2015). In this study,
the three data sets on schizophrenia, demographics for sex and demographics for age groups
were merged into one dataset. This transformation involved comparing the observations in each
of the datasets and retaining only the entries common to all of them, for consistency. The
identifying factor used to match observations was the district name variable present in all
three initial data sets. This transformation reduced the number of entries from 391 in the
demographics for sex and demographics for age groups data sets to 323 in the merged data
set, and from 326 in the schizophrenia data set to 323 in the merged data set.
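The matching step described above can be sketched with base R's `merge()`, which performs an inner join by default and therefore keeps only the districts present in every table. The miniature tables, the column `Age0to15` and the district values below are hypothetical; only the key names `Area` and `lad2014_name` follow the appendix script, which uses a different (sort and bind) approach.

```r
# Hypothetical miniature versions of the three source tables.
schizophrenia <- data.frame(Area = c("Adur", "Barnet", "Camden"),
                            R2015_16 = c(30.1, 28.4, 35.2))
by_age <- data.frame(lad2014_name = c("Adur", "Barnet", "Camden", "Dover"),
                     Age0to15 = c(11000, 80000, 40000, 20000))
by_sex <- data.frame(lad2014_name = c("Adur", "Barnet", "Dover"),
                     Female = c(32000, 195000, 58000))

# Inner joins on the district name: rows without a match in the other table are dropped.
merged <- merge(schizophrenia, by_age, by.x = "Area", by.y = "lad2014_name")
merged <- merge(merged, by_sex, by.x = "Area", by.y = "lad2014_name")

print(merged)
nrow(merged)  # 2: only Adur and Barnet appear in all three tables
```

Joining on the key avoids relying on the three tables being sorted identically before their columns are combined.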
The data preparation also involved checking for missing entries in the merged data set.
The results of the check on Missingness are as given below in Table 2: Missingness Check
Results. From the results in the table, none of the variables had any missing entries.
Table 2: Missingness Check Results
|variable | n_miss| pct_miss|
|:----------------|------:|--------:|
|Reg Code | 0| 0|
|Region | 0| 0|
|LA_Code11 | 0| 0|
|LA_Code14 | 0| 0|
|Area | 0| 0|
|R2010_11 | 0| 0|
|R2015_16 | 0| 0|
|0 to 15 | 0| 0|
|16 plus | 0| 0|
|Female | 0| 0|
|Male | 0| 0|
|Total Population | 0| 0|
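For readers without the `naniar` package, a comparable per-variable missingness summary can be produced in base R. This is a minimal sketch on a hypothetical three-row data frame `df`; it is not the code used to produce Table 2.

```r
# Hypothetical data frame with two missing cells.
df <- data.frame(Region = c("East", "London", NA),
                 R2015_16 = c(30.8, NA, 28.1),
                 Male = c(62000, 75000, 47000))

n_miss   <- colSums(is.na(df))          # missing cells per column
pct_miss <- 100 * n_miss / nrow(df)     # as a percentage of rows

miss_summary <- data.frame(variable = names(n_miss),
                           n_miss = as.integer(n_miss),
                           pct_miss = round(as.numeric(pct_miss), 1))
print(miss_summary)
```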
A check for outliers was also conducted during data preparation. It is important to
check for outliers and remedy them at the data preparation stage, since outliers can have a
large impact on analysis results (Zimek, Schubert, & Kriegel, 2012). The check was carried out
using boxplots, displayed from Figure 1: Rates of Prescriptions for Schizophrenia and Related
Psychosis (2010/11 and 2015/16) Boxplots to Figure 4: Total Population Boxplot below.
Figure 1: Rates of Prescriptions for Schizophrenia and Related Psychosis (2010/11 and 2015/16) Boxplots
Figure 2: Age Groups (0 to 15 and 16 plus) Boxplots
Figure 3: Gender (Male and Female) Boxplots
Figure 4: Total Population Boxplot
From the figures above, we note that all the variables had outliers. In total, 20 entries
were identified as containing outliers and were excluded from the merged data set, reducing
the number of entries in the data set from 323 to 303.
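The boxplot rule behind this check flags any value lying more than 1.5 interquartile ranges below the first quartile or above the third quartile, which is also what the `is_outlier` helper in the appendix computes. The sketch below applies that rule to a small hypothetical vector `rates`; the values are invented for illustration.

```r
# Flag values outside the 1.5 * IQR fences used by a standard boxplot.
is_outlier <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))
  fence <- 1.5 * IQR(x)
  x < q[1] - fence | x > q[2] + fence
}

rates <- c(25, 27, 29, 30, 31, 32, 33, 35, 36, 80)  # hypothetical rates
flagged <- is_outlier(rates)

which(flagged)             # position of the flagged value (the 80)
cleaned <- rates[!flagged] # entries kept after exclusion
length(cleaned)
```

Here the quartiles are 29.25 and 34.5, so the fences sit at 21.375 and 42.375 and only the value 80 is excluded.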
Descriptive Analysis
The frequency counts for the Region variable in the dataset are given below in Table 3:
Region Summary Statistics. The results in the table indicate that most entries, 66, came from
the South East region, while the North East region contributed the fewest entries, 11. This
results from a combination of the distribution of districts in the United Kingdom and the data
preparation process.
Table 3: Region Summary Statistics
|Region | Count|
|:------------------------|--:|
|East | 45|
|East Midlands | 39|
|London | 31|
|North East | 11|
|North West | 34|
|South East | 66|
|South West | 34|
|West Midlands | 27|
|Yorkshire and The Humber | 16|
The summary statistics for the measures of central tendency for the numeric variables in
the dataset are given below in Table 4: Summary Statistics for Numeric Variables (Measures
of Central Tendency). From the table, we observe that, on average, the rate of prescriptions
for schizophrenia and related psychosis was slightly lower in 2015-2016 (31.11) than in
2010-2011 (31.33). The table also indicates that, on average, there are more individuals in the
16 plus age group (123 224) than in the 0 to 15 age group (28 796). On average, the number
of females (76 942) marginally exceeds that of males (75 077). The average population per
district in the data set is 152 020.
Table 4: Summary Statistics for Numeric Variables (Measures of Central Tendency)
| | R2010_11 | R2015_16 | 0 to 15 | 16 plus |
|:--|:-------------|:-------------|:-------------|:--------------|
| |Min. :18.37 |Min. :17.19 |Min. : 350 |Min. : 1981 |
| |1st Qu.:26.51 |1st Qu.:26.15 |1st Qu.:17431 |1st Qu.: 79569 |
| |Median :30.79 |Median :30.80 |Median :23298 |Median :103872 |
| |Mean :31.33 |Mean :31.11 |Mean :28796 |Mean :123224 |
| |3rd Qu.:35.70 |3rd Qu.:35.44 |3rd Qu.:36592 |3rd Qu.:152508 |
| |Max. :48.79 |Max. :49.01 |Max. :77411 |Max. :284031 |
| | Female | Male |Total Population |
|:--|:--------------|:--------------|:----------------|
| |Min. : 1143 |Min. : 1188 |Min. : 2331 |
| |1st Qu.: 49235 |1st Qu.: 47846 |1st Qu.: 96990 |
| |Median : 64974 |Median : 62374 |Median :127522 |
| |Mean : 76942 |Mean : 75077 |Mean :152020 |
| |3rd Qu.: 96144 |3rd Qu.: 94036 |3rd Qu.:190592 |
| |Max. :175082 |Max. :181865 |Max. :353215 |
The summary statistics for the measures of spread for the numeric variables in the dataset
are as given below in Table 5: Summary Statistics for Numeric Variables (Measures of Spread).
Table 5: Summary Statistics for Numeric Variables (Measures of Spread)
| |Variance |sd|
|:----------------|:-------------|:--------|
|R2010_11 |42.22 |6.5 |
|R2015_16 |41.72 |6.46 |
|0 to 15 |259058364.58 |16095.29 |
|16 plus |3833371515.85 |61914.23 |
|Female |1509697681.6 |38854.83 |
|Male |1502586761.15 |38763.21 |
|Total Population |6016085411.48 |77563.43 |
Inferential Analysis
Spatial Analysis
The spatial analysis of the data on the rates of prescriptions of schizophrenia and related
psychosis for 2010-11 and 2015-16 produces the maps below in Figure 5: Spatial Analysis
Results. Comparison of the two maps reveals that the distribution of the rates of prescriptions
of schizophrenia and related psychosis remains the same for the years 2010-11 and 2015-16, with
districts in Scotland, Northern Ireland and Wales remaining in yellow, signifying high rates of
prescriptions of schizophrenia and related psychosis. England, on the other hand, has a mixture
of yellow (high rates of prescriptions of schizophrenia and related psychosis) and orange (mid
rates of prescriptions of schizophrenia and related psychosis) in both periods.
Figure 5: Spatial Analysis Results
Correlation Analysis
The correlation analysis between the numeric independent variables and the first
dependent variable, the 2010-11 rate of prescriptions for schizophrenia and related psychosis,
gives the results below in Table 6: Correlation Analysis Results 1. The results in the table
indicate that the numeric independent variables all have weak positive correlations with the
first dependent variable (between about 0.21 and 0.23). However, there are notably strong
positive correlations among the numeric independent variables themselves (all above 0.96). The
same can be observed in the scatterplot matrix in the lower triangle of the results given below
in Figure 6: Scatterplots for the Correlation Analysis Results 1.
Table 6: Correlation Analysis Results 1
| | R2010_11| 0 to 15| 16 plus| Female| Male| Total Population|
|:----------------|---------:|---------:|---------:|---------:|---------:|----------------:|
|R2010_11 | 1.0000000| 0.2056785| 0.2253343| 0.2208869| 0.2239064| 0.2225514|
|0 to 15 | 0.2056785| 1.0000000| 0.9651794| 0.9757876| 0.9787519| 0.9779561|
|16 plus | 0.2253343| 0.9651794| 1.0000000| 0.9982100| 0.9974352| 0.9985256|
|Female | 0.2208869| 0.9757876| 0.9982100| 1.0000000| 0.9971865| 0.9992980|
|Male | 0.2239064| 0.9787519| 0.9974352| 0.9971865| 1.0000000| 0.9992947|
|Total Population | 0.2225514| 0.9779561| 0.9985256| 0.9992980| 0.9992947| 1.0000000|
Figure 6: Scatterplots for the Correlation Analysis Results 1
The correlation analysis between the numeric independent variables and the second
dependent variable, the 2015-16 rate of prescriptions for schizophrenia and related psychosis,
gives the results below in Table 7: Correlation Analysis Results 2. The results in the table
indicate that the numeric independent variables all have weak positive correlations with the
second dependent variable (between about 0.28 and 0.30). These correlations are marginally
higher than those with the first dependent variable. Similarly, there are notably strong
positive correlations among the numeric independent variables. The same can also be observed
in the scatterplot matrix in the lower triangle of the results given below in Figure 7:
Scatterplots for the Correlation Analysis Results 2.
Table 7: Correlation Analysis Results 2
| | R2015_16| 0 to 15| 16 plus| Female| Male| Total Population|
|:----------------|---------:|---------:|---------:|---------:|---------:|----------------:|
|R2015_16 | 1.0000000| 0.2834975| 0.2916749| 0.2873700| 0.2955402| 0.2916555|
|0 to 15 | 0.2834975| 1.0000000| 0.9651794| 0.9757876| 0.9787519| 0.9779561|
|16 plus | 0.2916749| 0.9651794| 1.0000000| 0.9982100| 0.9974352| 0.9985256|
|Female | 0.2873700| 0.9757876| 0.9982100| 1.0000000| 0.9971865| 0.9992980|
|Male | 0.2955402| 0.9787519| 0.9974352| 0.9971865| 1.0000000| 0.9992947|
|Total Population | 0.2916555| 0.9779561| 0.9985256| 0.9992980| 0.9992947| 1.0000000|
Figure 7: Scatterplots for the Correlation Analysis Results 2
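The pattern in both correlation tables, weak correlations with the rate but near-perfect correlations among the population counts, can be reproduced with a small sketch. The data below are synthetic and the variable names `female`, `male`, `total` and `rate` are illustrative assumptions. The point is that counts such as a total and its parts move together almost exactly, so `cor()` returns values close to 1 between them.

```r
set.seed(1)
n <- 200
female <- round(runif(n, 20000, 150000))
male   <- round(female * runif(n, 0.93, 1.02))   # roughly as many males as females
total  <- female + male                          # total is the sum of its parts
rate   <- 25 + 0.00003 * total + rnorm(n, sd = 6) # weakly related to population size

# Pearson correlation matrix (the default method of cor()).
cor_matrix <- cor(data.frame(rate, female, male, total))
round(cor_matrix, 3)
```

Because the population variables carry almost the same information, including all of them as predictors adds little beyond including any one of them.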
Cluster Analysis
The summary statistics for the cluster analysis for the first dependent variable, 2010-11
rate of prescriptions for schizophrenia and related psychosis are as given below in Table 8:
Cluster Analysis Results 1. Three clusters were considered in the cluster analysis for the first
dependent variable. From the table we observe that the three clusters had 67, 116 and 120 entries
respectively with cluster 1 having the highest value of within clusters sum of squares. The results
of the analysis are as displayed below in Figure 8: Cluster Analysis Results 1 Visualization.
From the plot, we note that there is an even distribution of entries from every region in each of
the clusters.
Table 8: Cluster Analysis Results 1
| Clusters| Centers| Size| withinss|
|--------:|--------:|----:|---------:|
| 1| 40.46470| 67| 861.3258|
| 2| 24.89630| 116| 774.6052|
| 3| 32.45873| 120| 570.2628|
Figure 8: Cluster Analysis Results 1 Visualization
The summary statistics for the cluster analysis for the second dependent variable, the
2015-16 rate of prescriptions for schizophrenia and related psychosis, are given below in Table 9:
Cluster Analysis Results 2. Three clusters were also considered in the cluster analysis for the
second dependent variable. From the table we observe that the three clusters had 67, 122 and
114 entries respectively, with cluster 3 having the highest within-cluster sum of squares. The
results of the analysis are displayed below in Figure 9: Cluster Analysis Results 2
Visualization. From the plot, we again note that there is an even distribution of entries from
every region in each of the clusters.
Table 9: Cluster Analysis Results 2
| Clusters| Centers| Size| withinss|
|--------:|--------:|----:|---------:|
| 1| 40.29245| 67| 617.4851|
| 2| 32.14725| 122| 542.7169|
| 3| 24.59080| 114| 814.8554|
Figure 9: Cluster Analysis Results 2 Visualization
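A summary of the kind shown in Tables 8 and 9 (one centre, size and within-cluster sum of squares per cluster) can be obtained from `kmeans()` in base R. The sketch below clusters a synthetic one-dimensional vector `rates` into three groups; the data, the seed and the `nstart = 25` setting are assumptions for illustration and are not taken from the study.

```r
set.seed(42)
# Synthetic rates drawn around three different levels.
rates <- c(rnorm(40, 25, 2), rnorm(40, 32, 2), rnorm(20, 40, 2))

# k-means with a pre-specified number of clusters; several random starts
# reduce the chance of stopping at a poor local optimum.
fit <- kmeans(rates, centers = 3, nstart = 25)

cluster_summary <- data.frame(Clusters = seq_along(fit$size),
                              Centers  = as.numeric(fit$centers),
                              Size     = fit$size,
                              withinss = fit$withinss)
print(cluster_summary)
```

Cluster labels are arbitrary, so the ordering of the centres can differ between runs, as it does between Table 8 and Table 9.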
Independent Sample T-test
Prior to conducting the independent samples t-test on the two populations, the rates of
prescriptions of schizophrenia and related psychosis for 2010-11 and 2015-16, a normality test
was conducted to check whether the populations meet the assumption of being normally
distributed. The Shapiro-Wilk test was used to test the normality assumption for the two
populations.
Hypothesis Test:
Null Hypothesis: The two populations follow a normal distribution.
Alternative Hypothesis: The two populations do not follow a normal distribution.
The results of the test of the normality assumption are given below in Table 10:
Shapiro-Wilk Test Results. The p-values for the first and second dependent variables,
0.0006783 and 0.01153 respectively, are both less than the 0.05 level of significance.
Therefore, we reject the null hypothesis and conclude that the two populations do not follow a
normal distribution.
Table 10: Shapiro-Wilk Test Results
Shapiro-Wilk normality test
data: Data$R2010_11
W = 0.9818, p-value = 0.0006783
Shapiro-Wilk normality test
data: Data$R2015_16
W = 0.98776, p-value = 0.01153
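The decision rule for output of this kind can be sketched with `shapiro.test()` on synthetic data. The test takes normality as its null hypothesis, so a p-value below the chosen significance level is evidence against normality. The vector `skewed` below is a hypothetical, deliberately non-normal sample.

```r
set.seed(7)
skewed <- rexp(300, rate = 1 / 30)   # strongly right-skewed synthetic sample

result <- shapiro.test(skewed)
result$statistic                     # the W statistic
result$p.value                       # probability under the normality null

# Reject normality at the 5% level when the p-value is below 0.05.
non_normal <- result$p.value < 0.05
non_normal
```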
Mann Whitney U Test
Given that the two populations do not meet the normality assumption for the independent
samples t-test, a non-parametric approach has to be applied to evaluate whether there exists a
significant difference between the two populations, in this case the Mann Whitney U Test. The
Mann Whitney U Test is a non-parametric test that serves as the independent samples t-test
equivalent (Fay & Proschan, 2010).
Hypothesis Test:
Null Hypothesis: The two populations do not significantly differ from each other.
Alternative Hypothesis: The two populations significantly differ from each other.
The results of the Mann Whitney U Test are given below in Table 11: Mann Whitney
U Test Results. The p-value, 0.7196, is greater than the 0.05 level of significance. Therefore,
we fail to reject the null hypothesis and conclude that there is no statistically significant
difference between the two populations: rates of prescriptions of schizophrenia and related
psychosis for 2010-11 and 2015-16.
Table 11: Mann Whitney U Test Results
Wilcoxon rank sum test with continuity correction
data: Data$R2010_11 and Data$R2015_16
W = 46679, p-value = 0.7196
alternative hypothesis: true location shift is not equal to 0
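Output in this form comes from `wilcox.test()`, whose null hypothesis is a location shift of zero between the two samples. The sketch below runs the test on two synthetic samples and turns the p-value into a decision; `rate_a` and `rate_b` are hypothetical vectors drawn from the same distribution, not the study's data.

```r
set.seed(3)
rate_a <- rnorm(300, mean = 31, sd = 6.5)   # synthetic "period 1" rates
rate_b <- rnorm(300, mean = 31, sd = 6.5)   # synthetic "period 2" rates

# Two-sided Wilcoxon rank sum (Mann-Whitney U) test.
test <- wilcox.test(rate_a, rate_b)
test$statistic   # the W statistic
test$p.value

# Null hypothesis: no location shift between the two samples.
decision <- if (test$p.value < 0.05) "reject H0" else "fail to reject H0"
decision
```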
Prediction Analysis
The results of the multiple linear regression for the prediction analysis are given below in
Table 12: Multiple Linear Regression Summary Statistics. The adjusted R-squared value, a
measure of model fit, is 0.1906, meaning that the model explains about 19% of the variance in
the 2015/16 rates after adjusting for the number of predictors. This value is too low for the
model to explain the relationships between the variables or to be used for prediction of the
second dependent variable. The coefficients for Male and Total Population are reported as NA
because these variables are exact linear combinations of the other population counts.
Table 12: Multiple Linear Regression Summary Statistics
Call:
lm(formula = R2015_16 ~ ., data = train.set[, c(-1, -3, -4, -5,
-6)])
Residuals:
Min 1Q Median 3Q Max
-14.2800 -4.1714 -0.4652 4.0812 16.3236
Coefficients: (2 not defined because of singularities)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.7246596 1.4383050 19.276 < 0.0000000000000002 ***
RegionEast Midlands -0.3291196 1.6809547 -0.196 0.84501
RegionLondon -1.2385624 2.1884614 -0.566 0.57218
RegionNorth East 3.6145257 2.3952527 1.509 0.13316
RegionNorth West 5.8475711 1.7215610 3.397 0.00085 ***
RegionSouth East -1.5731884 1.5374270 -1.023 0.30765
RegionSouth West -0.7885603 1.8031985 -0.437 0.66244
RegionWest Midlands 2.6876966 2.0578783 1.306 0.19331
RegionYorkshire and The Humber -0.7592285 2.2860821 -0.332 0.74022
`0 to 15` 0.0004185 0.0001748 2.394 0.01776 *
`16 plus` 0.0003074 0.0001680 1.830 0.06899 .
Female -0.0006063 0.0003175 -1.910 0.05787 .
Male NA NA NA NA
`Total Population` NA NA NA NA
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.059 on 169 degrees of freedom
Multiple R-squared: 0.24, Adjusted R-squared: 0.1906
F-statistic: 4.853 on 11 and 169 DF, p-value: 0.000001691
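The note "not defined because of singularities" in output like this appears when a predictor is an exact linear combination of others, in which case `lm()` drops it and reports its coefficient as NA. The sketch below reproduces the effect on synthetic data; the variables `female`, `male`, `total` and `rate` are hypothetical, with `total` constructed as an exact sum.

```r
set.seed(11)
n <- 150
female <- round(runif(n, 20000, 150000))
male   <- round(runif(n, 20000, 150000))
total  <- female + male                  # exact linear combination of the two
rate   <- 28 + 0.00002 * female + rnorm(n, sd = 6)

fit <- lm(rate ~ female + male + total)
coef(fit)                                # the coefficient for total is NA
summary(fit)$adj.r.squared               # adjusted R-squared of the fitted model
```

Dropping the redundant predictor before fitting gives the same fitted values and a cleaner coefficient table.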
Conclusion
The results of the analysis in this study reveal that neither of the dependent variables
relating to rates of prescriptions for schizophrenia and other related psychosis in the United
Kingdom is significantly affected by region, population, proportion of gender or proportion of
age group. However, the correlations between the numerical independent variables and the rates
of prescriptions increase from about 0.21-0.23 in 2010/11 to about 0.28-0.30 in 2015/16,
suggesting a growing influence of the numerical independent variables. The results also show
that there does not appear to be a significant difference between the rates of prescriptions for
schizophrenia and other related psychosis in the United Kingdom in 2010/11 and 2015/16. The
evaluation of the possibility of predicting the rates of prescriptions for schizophrenia and
other related psychosis in the United Kingdom indicates that, with the current independent
variables, a multiple linear regression model is not suitable for prediction.
References
Arif, M., & Mujtaba, G. (2015). A survey: data warehouse architecture. International journal of
hybrid information technology, 8(5), 349-356.
Barbara, I., & Susan, D. (2014). Introductory Statistics (1st ed.). New York: OpenStax CNX.
Beibei, L., Bo, L., Weiwei, L., & Ying, Z. (2017). Performance analysis of clustering algorithm
under two kinds of big data architecture. Journal of High Speed Networks, 23(2017), 49-57.
Cortes, C., & Mohri, M. (2014). Domain Adaptation and Sample Bias Correction Theory and
Algorithm for Regression. Theoretical Computer Science, 13(1), 103-126.
Daie, P., & Li, S. (2016). Managing product variety through configuration of preassembled
vanilla boxes using hierarchical clustering. International Journal of Production
Research, 54(18), 5468-5479.
Danielle, F. M. (2019). Mapping harmspots: An exploration of the spatial distribution of crime
harm. Applied Geography, 109(1), 1-23.
Everitt, B. S., & Skrondal, A. (2010). Cambridge Dictionary of Statistics (4th ed.). London:
Cambridge University Press.
Fay, M. P., & Proschan, M. A. (2010). Wilcoxon-Mann-Whitney or t-test? On Assumptions for
Hypothesis Tests and Multiple Interpretations of Decision Rules. Statistics Surveys, 4(1), 1-39.
Howitt, D., & Cramer, D. (2010). Introduction to Descriptive Statistics in Psychology (5th ed.).
New York: Prentice Hall.
Jaulin, L. (2010). Probabilistic set-membership approach for robust regression. Journal of
Statistical Theory and Practice, 5(1), 1-14.
Keller, G. (2015). Statistics for Management and Economics, Abbreviated (1st ed.). New York:
Cengage Learning.
Liu, Q., & Denxiao, R. (2015). Research on The Structure of Public Fiscal Expenditures Based
on the Cluster Analysis Methods. Modern Economy, 6(6), 1-10.
Malki, A. A., & Rizk, M. A. (2016). Hybrid Genetic Algorithm with K-Means for Clustering
Problems. Open Journal of Optimization, 5(2), 1-4.
Martin, S.-E., & Marie, S. (2018). A narrative meta-synthesis of how people with schizophrenia
experience facilitators and barriers in using antipsychotic medication: Implications for
healthcare professionals. International Journal of Nursing Studies, 85(1), 7-18.
Sakr, G. E., Elhajj, I. H., Mitri, G., & Wejinya, U. (2010). Artificial Intelligence for Forest Fires
Prediction. International Conference on Advance Intelligence Mechatronics (pp. 1311-1316).
Montreal, Canada: IEEE/ASME.
Schubert, E., Zimek, A., & Kriegel, H. P. (2012). Local outlier detection reconsidered: A
generalized view on locality with applications to spatial, video, and network outlier
detection. Data Mining and Knowledge Discovery, 28(1), 190-237.
Zimek, A., Schubert, E., & Kriegel, H. P. (2012). A survey on unsupervised outlier detection in
high-dimensional numerical data. Statistical Analysis and Data Mining, 5(5), 363-387.
Appendix: R Script
#Loading Packages
library(plyr)
library(tidyverse)
library(readxl)
library(naniar)
library(ggplot2)
library(gridExtra)
library(GGally)
library(maps)
library(mapdata)
library(maptools)
library(rgdal)
library(ggmap)
library(rgeos)
library(broom)
library(sf)
#====
#Loading Data
Schizophrenia_Data <- read_excel("D:/FileStorage/Docs/Data Analysis Project.xlsx",
sheet = "Schizophrenia")
Demographic_Data_by_Sex <- read_excel("D:/FileStorage/Docs/Data Analysis Project.xlsx",
sheet = "demographic data by Sex", range = "A4:E395")
Demographic_Data_by_Age <- read_excel("D:/FileStorage/Docs/Data Analysis Project.xlsx",
sheet = "demographic data by Age", range = "A4:E395")
#====
#Data Preparation
Schizophrenia_Data <- Schizophrenia_Data[order(Schizophrenia_Data$Area),]
Demographic_Data_by_Age <-
Demographic_Data_by_Age[order(Demographic_Data_by_Age$lad2014_name),]
Demographic_Data_by_Sex <-
Demographic_Data_by_Sex[order(Demographic_Data_by_Sex$lad2014_name),]
# Keep only the districts that appear in the schizophrenia data and in the
# demographic tables. Vectorised filtering replaces the row-deleting loops,
# which skipped rows because indices shift when a row is removed mid-loop.
Demographic_Data_by_Age <-
Demographic_Data_by_Age[Demographic_Data_by_Age$lad2014_name %in% Schizophrenia_Data$Area,]
Demographic_Data_by_Sex <-
Demographic_Data_by_Sex[Demographic_Data_by_Sex$lad2014_name %in% Schizophrenia_Data$Area,]
Schizophrenia_Data <-
Schizophrenia_Data[Schizophrenia_Data$Area %in% Demographic_Data_by_Age$lad2014_name,]
# cbind() matches rows by position, so this assumes the three filtered tables
# contain the same districts in the same sorted order (they are sorted above).
Data <-
cbind(Schizophrenia_Data,Demographic_Data_by_Age[,c(3,4)],Demographic_Data_by_Sex[,c(3,4,5)])
colnames(Data)[12] <- c("Total Population")
Data$Region <- factor(Data$Region)
#Checking for Missingness
knitr::kable(miss_var_summary(Data))
#Outliers
par(mfrow = c(1,2))
is_outlier <- function(x) {
return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}
#R2010_11
DataN <- Data %>% tibble::rownames_to_column(var="outlier") %>%
mutate(is_outlier=ifelse(is_outlier(R2010_11), R2010_11, as.numeric(NA)))
DataN$outlier[which(is.na(DataN$is_outlier))] <- as.numeric(NA)
for(i in 1:nrow(DataN))
{
if (is.na(DataN$outlier[i]) == F)
{
print(DataN$outlier[i])
}
}
P1 <- ggplot(DataN) +
aes(x = "", y = R2010_11) +
geom_boxplot(fill = "#0c4c8a") +
geom_text(aes(label=outlier),na.rm=TRUE,nudge_y=0.05) +
labs(title = "Boxplot", subtitle = "2010/11",
caption = "United Kingdom") +
theme_minimal()
#R2015_16
DataN <- Data %>% tibble::rownames_to_column(var="outlier") %>%
mutate(is_outlier=ifelse(is_outlier(R2015_16), R2015_16, as.numeric(NA)))
DataN$outlier[which(is.na(DataN$is_outlier))] <- as.numeric(NA)
for(i in 1:nrow(DataN))
{
if (is.na(DataN$outlier[i]) == F)
{
print(DataN$outlier[i])
}
}
P2 <- ggplot(DataN) +
aes(x = "", y = R2015_16) +
geom_boxplot(fill = "#b4de2c") +
geom_text(aes(label=outlier),na.rm=TRUE,nudge_y=0.05) +
labs(title = "Boxplot", subtitle = "2015/16 ",
caption = "United Kingdom") +
theme_minimal()
grid.arrange(P1,P2, nrow = 1,
top = "Multiplots for rate of prescriptions for Schizophrenia and related psychosis")
#0 to 15
DataN <- Data %>% tibble::rownames_to_column(var="outlier") %>%
mutate(is_outlier=ifelse(is_outlier(`0 to 15`), `0 to 15`, as.numeric(NA)))
DataN$outlier[which(is.na(DataN$is_outlier))] <- as.numeric(NA)
for(i in 1:nrow(DataN))
{
if (is.na(DataN$outlier[i]) == F)
{
print(DataN$outlier[i])
}
}
P3 <- ggplot(DataN) +
aes(x = "", y = `0 to 15`) +
geom_boxplot(fill = "#cb181d") +
geom_text(aes(label=outlier),na.rm=TRUE,nudge_y=0.05) +
labs(title = "Boxplot", subtitle = "0 to 15 Age Group", caption = "United Kingdom") +
theme_minimal()
#16 plus
DataN <- Data %>% tibble::rownames_to_column(var="outlier") %>%
mutate(is_outlier=ifelse(is_outlier(`16 plus`), `16 plus`, as.numeric(NA)))
DataN$outlier[which(is.na(DataN$is_outlier))] <- as.numeric(NA)
#Print the row IDs of the flagged outliers
for(id in DataN$outlier[!is.na(DataN$outlier)]) print(id)
P4 <- ggplot(DataN) +
aes(x = "", y = `16 plus`) +
geom_boxplot(fill = "#4daf4a") +
geom_text(aes(label=outlier),na.rm=TRUE,nudge_y=0.05) +
labs(title = "Boxplot", subtitle = "16 plus Age Group", caption = "United Kingdom") +
theme_minimal()
grid.arrange(P3,P4, nrow = 1, top = "Multiplots for Age Groups")
#Female
DataN <- Data %>% tibble::rownames_to_column(var="outlier") %>%
mutate(is_outlier=ifelse(is_outlier(`Female`), `Female`, as.numeric(NA)))
DataN$outlier[which(is.na(DataN$is_outlier))] <- as.numeric(NA)
#Print the row IDs of the flagged outliers
for(id in DataN$outlier[!is.na(DataN$outlier)]) print(id)
P5 <- ggplot(DataN) +
aes(x = "", y = Female) +
geom_boxplot(fill = "#ffea46") +
geom_text(aes(label=outlier),na.rm=TRUE,nudge_y=0.05) +
labs(title = "Boxplot", subtitle = "Gender: Female", caption = "United Kingdom") +
theme_minimal()
#Male
DataN <- Data %>% tibble::rownames_to_column(var="outlier") %>%
mutate(is_outlier=ifelse(is_outlier(`Male`), `Male`, as.numeric(NA)))
DataN$outlier[which(is.na(DataN$is_outlier))] <- as.numeric(NA)
#Print the row IDs of the flagged outliers
for(id in DataN$outlier[!is.na(DataN$outlier)]) print(id)
P6 <- ggplot(DataN) +
aes(x = "", y = Male) +
geom_boxplot(fill = "#666666") +
geom_text(aes(label=outlier),na.rm=TRUE,nudge_y=0.05) +
labs(title = "Boxplot", subtitle = "Gender: Male", caption = "United Kingdom") +
theme_minimal()
grid.arrange(P5,P6, nrow = 1, top = "Multiplots for Gender")
#Total Population
DataN <- Data %>% tibble::rownames_to_column(var="outlier") %>%
mutate(is_outlier=ifelse(is_outlier(`Total Population`), `Total Population`, as.numeric(NA)))
DataN$outlier[which(is.na(DataN$is_outlier))] <- as.numeric(NA)
#Print the row IDs of the flagged outliers
for(id in DataN$outlier[!is.na(DataN$outlier)]) print(id)
P7 <- ggplot(DataN) +
aes(x = "", y = `Total Population`) +
geom_boxplot(fill = "#ff0055") +
geom_text(aes(label=outlier),na.rm=TRUE,nudge_y=0.05) +
labs(title = "Boxplot", subtitle = "Total Population", caption = "United Kingdom") +
theme_minimal()
P7
#Removing Entries with Outliers
Data <- Data[c(-10,-19,-22,-28,-42,-53,-64,-66,-70,-123,-142,-146,-152,-157,-179,-188,-209,-227,-289,-310),]
#====
#Descriptive Statistics
#Region Summary
knitr::kable(summary(Data$Region))
#Numeric Data Variables Measures of Central Tendency
knitr::kable(summary(Data[,6:9]))
knitr::kable(summary(Data[,10:12]))
#Numeric Data Variables Measures of Spread
options(scipen = 999)
VarVector <- c(var(Data$R2010_11),var(Data$R2015_16),var(Data$`0 to 15`),var(Data$`16 plus`),
var(Data$Female),var(Data$Male),var(Data$`Total Population`))
#Take the square root before rounding so the standard deviations do not inherit rounding error
sdVector <- round(sqrt(VarVector),2)
VarVector <- round(VarVector,2)
Measures_of_Spread <- cbind(colnames(Data[,c(6:12)]),VarVector,sdVector)
knitr::kable(Measures_of_Spread)
#====
#Inferential Analysis
#Spatial Analysis
#Importing Local Authority/Districts shapefile for the United Kingdom
UKMap <- readOGR(dsn="C:/Users/user/Downloads/Local_Administrative_Units_Level_1_January_2018_Generalised_Clipped_Boundaries_in_United_Kingdom",
layer="Local_Administrative_Units_Level_1_January_2018_Generalised_Clipped_Boundaries_in_United_Kingdom")
#Data Preparation
mapdata <- tidy(UKMap)
#Testing Map
ggT <- ggplot() + geom_polygon(data = mapdata, aes(x = long, y = lat, group = group), color =
"#FFFFFF", size = 0.25)
ggT <- ggT + coord_fixed(1) #Keeps aspect Ratio
print(ggT)
#Creating Dataset to add to the mapdata
NewData <- Data[,c(4,6,7)]
NewData <- NewData[order(NewData$LA_Code14),]
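#Simulating 97 placeholder observations so the vector lengths match the number of areas in the map data
#(these values are random draws around the sample mean, not observed rates)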
set.seed(10)
NR2010_2011 <- c(NewData$R2010_11,rnorm(n = 97, mean = mean(NewData$R2010_11)))
NR2015_2016 <- c(NewData$R2015_16,rnorm(n = 97, mean = mean(NewData$R2015_16)))
NR2010_2011_Data <- data.frame(id=unique(mapdata$id), R2010_2011 = NR2010_2011)
NR2015_2016_Data <- data.frame(id=unique(mapdata$id), R2015_2016 = NR2015_2016)
#Join new data with mapdata
df1 <- join(mapdata, NR2010_2011_Data, by="id")
df2 <- join(mapdata, NR2015_2016_Data, by="id")
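#Sketch (an alternative, not the method used above): instead of padding the rates with
#simulated values, areas without data can be left as NA by joining on a shared area code.
#The data frames toy_map and toy_area_rates and their column names are hypothetical; the
#shapefile attribute that actually holds the local authority code would need to be checked.
toy_map <- data.frame(id = c("0", "1", "2", "3"),
                      code = c("A01", "A02", "A03", "A04"),
                      stringsAsFactors = FALSE)
toy_area_rates <- data.frame(code = c("A01", "A03"), rate = c(8.2, 11.5),
                             stringsAsFactors = FALSE)
#all.x = TRUE keeps every map area; areas with no matching code get rate = NA
toy_joined <- merge(toy_map, toy_area_rates, by = "code", all.x = TRUE)
print(toy_joined)
#Areas whose fill value is NA are drawn in the na.value colour of the fill scale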
#Plotting for R2010_2011
gg <- ggplot() + geom_polygon(data = df1, aes(x = long, y = lat, group = group, fill = R2010_2011),
color = "#FFFFFF", size = 0.25)
gg <- gg + scale_fill_gradient2(low = "blue", mid = "red", high = "yellow", na.value = "white")
gg <- gg + coord_fixed(1)
gg <- gg + theme_minimal()
gg <- gg + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
legend.position = 'none')
gg <- gg + theme(axis.title.x=element_blank(), axis.text.x = element_blank(), axis.ticks.x =
element_blank())
gg <- gg + theme(axis.title.y=element_blank(), axis.text.y = element_blank(), axis.ticks.y =
element_blank())
gg <- gg + labs(title = " ",
subtitle = "2010/11",
caption = "United Kingdom")
#Plotting for R2015_2016
gg1 <- ggplot() + geom_polygon(data = df2, aes(x = long, y = lat, group = group, fill = R2015_2016),
color = "#FFFFFF", size = 0.25)
gg1 <- gg1 + scale_fill_gradient2(low = "blue", mid = "red", high = "yellow", na.value = "white")
gg1 <- gg1 + coord_fixed(1)
gg1 <- gg1 + theme_minimal()
gg1 <- gg1 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
legend.position = 'none')
gg1 <- gg1 + theme(axis.title.x=element_blank(), axis.text.x = element_blank(), axis.ticks.x =
element_blank())
gg1 <- gg1 + theme(axis.title.y=element_blank(), axis.text.y = element_blank(), axis.ticks.y =
element_blank())
gg1 <- gg1 + labs(title = " ",
subtitle = "2015/16",
caption = "United Kingdom")
grid.arrange(gg,gg1,nrow = 1,
top = "Spatial Analysis for rate of prescriptions for Schizophrenia and related psychosis per year per thousand population")
#Correlation
ggpairs(Data, columns =c(6,8:12), aes(colour="temp"))
knitr::kable(cor(Data[,c(6,8:12)]))
ggpairs(Data, columns =7:12, aes(colour="temp"))
knitr::kable(cor(Data[,7:12]))
#Clustering
#According to R2010_11
km <- Data %>% select(c(R2010_11)) %>% kmeans(centers = 3)
kmSummary <- as.data.frame(km$centers)
kmSummary$Clusters <- c(1:3)
kmSummary <- kmSummary[,c(2,1)]
kmSummary$Size <- km$size
kmSummary$WithinSS <- km$withinss
colnames(kmSummary)[2] <- c("Centers")
knitr::kable(kmSummary)
Data_R2010_11_Cluster <- data.frame(Data[,c(2,6,7)], cluster = factor(km$cluster))
ggplot(Data_R2010_11_Cluster, aes(x = R2010_11, y = R2015_16, color = cluster, shape = Region)) +
geom_point() +
labs(title = "KMeans Clustering", subtitle = "2010/11 rate of prescriptions for Schizophrenia and related psychosis per year per thousand population",
caption = "United Kingdom")
#According to R2015_16
km1 <- Data %>% select(c(R2015_16)) %>% kmeans(centers = 3)
km1Summary <- as.data.frame(km1$centers)
km1Summary$Clusters <- c(1:3)
km1Summary <- km1Summary[,c(2,1)]
km1Summary$Size <- km1$size
km1Summary$WithinSS <- km1$withinss
colnames(km1Summary)[2] <- c("Centers")
knitr::kable(km1Summary)
Data_R2015_16_Cluster <- data.frame(Data[,c(2,6,7)], cluster = factor(km1$cluster))
ggplot(Data_R2015_16_Cluster, aes(x = R2015_16, y = R2010_11, color = cluster, shape = Region)) +
geom_point() +
labs(title = "KMeans Clustering", subtitle = "2015/16 rate of prescriptions for Schizophrenia and related psychosis per year per thousand population",
caption = "United Kingdom")
#Hypothesis Testing
#Shapiro-Wilk normality tests (checking the normality assumption of the independent-samples t-test)
shapiro.test(Data$R2010_11)
shapiro.test(Data$R2015_16)
#Non-Parametric Test: Mann-Whitney U Test
wilcox.test(Data$R2010_11, Data$R2015_16)
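#Sketch (an assumption about the design, not the test used above): if each row holds the
#same local authority in both periods, the two samples are paired, and a Wilcoxon
#signed-rank test (paired = TRUE) is a possible alternative. The toy vectors below are
#made up for illustration only.
set.seed(1)
toy_2010 <- rnorm(30, mean = 10, sd = 2)
toy_2015 <- toy_2010 + rnorm(30, mean = 0.5, sd = 1)
toy_paired <- wilcox.test(toy_2010, toy_2015, paired = TRUE)
print(toy_paired$p.value)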
#Multiple Linear Regression
#Using 2015-16 Data
#Partitioning the Data into Train (60%) and Test sets (40%)
set.seed(20)
train.index <- sample(c(1:nrow(Data)), 0.6*nrow(Data))
train.set <- Data[train.index, ]
test.set <- Data[-train.index, ]
Model <- lm(R2015_16 ~., data = train.set[,c(-1,-3,-4,-5,-6)])
summary(Model)
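#Sketch (not part of the original script): the test set created above is not used in this
#listing, so one possible follow-up is to predict on the held-out rows and compute the
#root mean squared error. The toy data frame and its columns x1, x2 and y are made up so
#that the example runs on its own.
set.seed(20)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- 2 + 0.5 * toy$x1 - 0.3 * toy$x2 + rnorm(100, sd = 0.5)
toy.index <- sample(seq_len(nrow(toy)), 0.6 * nrow(toy))
toy.fit <- lm(y ~ ., data = toy[toy.index, ])
#Predict on the 40% of rows that were not used to fit the model
toy.pred <- predict(toy.fit, newdata = toy[-toy.index, ])
toy.rmse <- sqrt(mean((toy[-toy.index, "y"] - toy.pred)^2))
print(toy.rmse)
#With the objects above, the analogous call would be predict(Model, newdata = test.set)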