Assignment 2: STA2300 Data Analysis, Semester 1, 2019


Added on  2022/12/19

Homework Assignment
AI Summary
This assignment solution for STA2300, a data analysis course, addresses a range of statistical concepts using a provided dataset. The solution begins with probability calculations using contingency tables, focusing on gender and cardiac disease. It proceeds to analyze patient height distributions using histograms, calculating descriptive statistics such as mean, median, standard deviation, and IQR, and discusses the appropriateness of each based on the distribution's skewness. The assignment then delves into experimental design, examining the impact of cranberry juice on cholesterol levels and identifying confounding variables. Further analysis includes calculating z-scores, probabilities, and interpreting correlation coefficients from a scatterplot of hemoglobin and cardiac index. Finally, it explores the binomial distribution, calculating probabilities and assessing its applicability, and concludes with regression analysis, providing the regression equation and interpreting the R-squared value. The student has effectively utilized SPSS to derive the necessary statistics and provide detailed explanations for each question asked in the assignment brief.
STATISTICS
STUDENT ID:
Question 1
(a) The requisite contingency table as obtained from SPSS based on the data provided is shown
below.
(b) The objective is to determine the proportion of patients who are female and have experienced
‘Coronary Artery Disease’.
Favorable cases, i.e. female patients with ‘Coronary Artery Disease’ = 15
Total cases, i.e. patient count = 146
Requisite probability = (15/146) = 0.1027
(c) The objective is to determine the proportion of males among the patients who have
experienced ‘Silent Ischemia’.
Favorable cases, i.e. male patients with ‘Silent Ischemia’ = 13
Total cases, i.e. patients experiencing ‘Silent Ischemia’ = 20
Requisite probability = 13/20 = 0.65
(d) In order to determine whether there is any association between type of cardiac arrest and
gender, two types of cardiac arrest, i.e. “coronary artery disease” and “arrhythmia”, have been
considered. The incidence of these two types of cardiac arrest is indicated in the conditional
table below.
Patients who have suffered from arrhythmia = 58
Proportion of males amongst the patients with arrhythmia = (29/58) = 50%
Proportion of females amongst the patients with arrhythmia = (29/58) = 50%
Patients who have suffered from coronary artery disease = 44
Proportion of males amongst the patients with coronary artery disease = (29/44) = 65.91%
Proportion of females amongst the patients with coronary artery disease = (15/44) = 34.09%
The above calculations highlight that gender and type of cardiac arrest tend to be associated,
since the conditional proportions of each gender differ across the two forms of cardiac arrest
(50% male for arrhythmia against 65.91% male for coronary artery disease).
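The proportions in parts (b) to (d) can be rechecked with a short Python sketch. It uses only the counts quoted above; the variable and function names are illustrative assumptions and are not part of the SPSS output.

```python
# Counts quoted in the solution (taken from the SPSS contingency table).
total_patients = 146
female_cad = 15            # females with coronary artery disease
male_cad = 29              # males with coronary artery disease
male_silent_ischemia = 13
silent_ischemia_total = 20
male_arrhythmia = 29
female_arrhythmia = 29


def proportion(part, whole):
    """Return part / whole rounded to four decimal places."""
    return round(part / whole, 4)


# (b) females with coronary artery disease out of all patients
p_female_cad = proportion(female_cad, total_patients)
# (c) males among the silent ischemia patients
p_male_given_si = proportion(male_silent_ischemia, silent_ischemia_total)
# (d) males within each of the two conditions being compared
p_male_given_arr = proportion(male_arrhythmia, male_arrhythmia + female_arrhythmia)
p_male_given_cad = proportion(male_cad, male_cad + female_cad)

print(p_female_cad, p_male_given_si, p_male_given_arr, p_male_given_cad)
```

If gender and condition were unrelated, the last two values would be roughly equal; the gap between them is what the argument in part (d) relies on.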
Question 2
(a) The following histogram indicates the distribution of the patients' heights.
(b) The histogram is asymmetrical, as the highest frequency does not lie at the centre.
Further, the data are negatively skewed, so the distribution of patient heights cannot be
regarded as normal. The presence of skew also makes the median a more suitable measure of
central tendency than the mean. The dispersion in patient height, as estimated from the
histogram, appears moderate.
(c) The relevant variable is patient height for which the following computations have been
performed using SPSS.
Sample size = 146
Mean = 163.95 cm
Standard deviation = 14.46 cm
(d) The relevant variable is patient height for which the following computations have been
performed using SPSS.
Median = 165 cm
First Quartile = 160 cm
Third Quartile = 171 cm
IQR = 11 cm
(e) The choice of measures of centre and spread is dictated by the shape of the distribution,
which in the given case is non-normal owing to the presence of skew. As a result, the appropriate
measure of central tendency is the median rather than the mean, since the mean is affected by
outlying values while the median is not. For spread, the suitable measure is the IQR, as
conventional measures such as the standard deviation are influenced by extreme values.
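The robustness argument in part (e) can be illustrated with a minimal sketch. The heights below are hypothetical values chosen for illustration (they are not the assignment data), and the quartiles come from Python's `statistics.quantiles`, whose default method may differ slightly from the one SPSS uses.

```python
import statistics

# Hypothetical heights in cm (NOT the assignment data); 120 is an unusually low value.
heights = [120, 158, 160, 162, 165, 166, 168, 170, 171, 174]


def summarize(values):
    """Return mean, median, sample standard deviation and IQR for a list of values."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }


with_outlier = summarize(heights)
without_outlier = summarize(heights[1:])  # same data with the low value dropped

for name in ("mean", "median", "sd", "iqr"):
    shift = abs(with_outlier[name] - without_outlier[name])
    print(f"{name}: changes by {shift:.2f} when the low value is removed")
```

In this toy sample, the mean and standard deviation move far more than the median and IQR when the single low value is removed, which is the reason for preferring the latter pair under skew.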
Question 3
(a) This is an experimental study, as the independent variable is the type of cranberry juice,
which is under the control of the researcher. By varying it, the cholesterol and antioxidant
levels are recorded so that the underlying impact may be quantified.
(b) i) The HDL or “good” cholesterol is the response variable for the given study. It tends to
highlight the underlying risk with regard to various heart diseases.
ii) The cranberry juice type is the factor, which is considered in two variants. One of these
contains corn syrup while the other contains artificial sweetener. The factor levels correspond
to the eight-ounce servings which are given in the various months.
iii) The given study has a sample size of 19, which includes 8 men and 11 women.
(c) The given study has not complied with the four key principles of experimental design.
One of the pivotal principles is randomization, which requires treatments to be assigned to
subjects at random in order to limit bias. This is not adhered to in the given study, as the
treatment for each group was not chosen randomly, owing to which bias can creep in.
Another principle is replication, as per which it should be possible to replicate the results of
the experiment. The given study design does not permit this either, as there is no control over
exercise and diet, which would have a significant impact. The next principle is the presence of
local control so that variation can be controlled, which is also not adhered to in the given
study.
(d) A confounding variable is an external variable which has an impact on both the dependent
and the independent variable and can therefore alter the relationship exhibited between the
input variable and the response variable. It is therefore essential to restrict the impact of a
confounding variable by controlling it. A relevant confounding variable is the dietary habits of
the subjects, which would not only impact the HDL changes but could also influence the amount
of cranberry juice consumed and its effect.
Question 4
a) “Resting heart rate” of marathon runners is the relevant variable of interest. The
measurement unit is beats per minute or bpm.
b) Resting heart rate mean value = 58 bpm
Resting heart rate standard deviation value = 4 bpm
For X = 67, the z value computation is shown below.
Z value = (67-58)/4 = 2.25
The corresponding P(z≤2.25) = 0.9878 (as determined from the relevant Z table)
Therefore, probability of resting heart rate exceeding 67 bpm = 1 - 0.9878 = 0.0122
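The table lookup in part (b) can be cross-checked with Python's standard-library `statistics.NormalDist`. This is only a sketch under the stated assumption that resting heart rate is normally distributed with mean 58 bpm and standard deviation 4 bpm; the variable names are illustrative.

```python
from statistics import NormalDist

# Assumed model from the question: resting heart rate ~ Normal(mean 58, sd 4).
resting_hr = NormalDist(mu=58, sigma=4)

# Standardize the cut-off of 67 bpm.
z = (67 - resting_hr.mean) / resting_hr.stdev

# Upper-tail probability: P(X > 67) = 1 - P(X <= 67).
p_above_67 = 1 - resting_hr.cdf(67)

print(f"z = {z:.2f}, P(X > 67) = {p_above_67:.4f}")
```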
c) Resting heart rate mean value = 58 bpm
Resting heart rate standard deviation value = 4 bpm
For X1 = 55 bpm, the computation of Z value is shown below.
Z1 = (55-58)/4 = -0.75
The corresponding P(z≤-0.75) = 0.2266 (as determined from the relevant Z table)
For X2 = 65 bpm, the computation of Z value is shown below.
Z2 = (65-58)/4 = 1.75
The corresponding P(z≤1.75) = 0.9599 (as determined from the relevant Z table)
Probability that resting heart rate would lie between 55 bpm and 65 bpm = 0.9599 - 0.2266 =
0.7333
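The interval probability in part (c) is the difference of two cumulative probabilities. A minimal sketch, again assuming a normal model with mean 58 bpm and standard deviation 4 bpm and using illustrative variable names:

```python
from statistics import NormalDist

# Assumed model from the question: resting heart rate ~ Normal(mean 58, sd 4).
resting_hr = NormalDist(mu=58, sigma=4)

# P(55 <= X <= 65) = P(X <= 65) - P(X <= 55)
p_below_65 = resting_hr.cdf(65)
p_below_55 = resting_hr.cdf(55)
p_between = p_below_65 - p_below_55

print(f"P(55 <= X <= 65) = {p_between:.4f}")
```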
d) Since the lowest 2% ought to be considered, the cumulative probability is 0.02.
Using the z tables, the z value corresponding to this probability is -2.054. Let the underlying
resting heart rate for this probability be X bpm.
Applying the requisite formula for the Z value, we get the following:
-2.054 = (X-58)/4
On solving, X = 49.78 bpm
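Part (d) runs the table lookup in reverse: it starts from a probability and recovers the cut-off. A sketch using `NormalDist.inv_cdf`, under the same assumed normal model (mean 58 bpm, standard deviation 4 bpm):

```python
from statistics import NormalDist

# z value below which 2% of a standard normal distribution lies.
z_low = NormalDist().inv_cdf(0.02)

# Convert back to the heart-rate scale: X = mean + z * sd.
mean_hr, sd_hr = 58, 4
cutoff_bpm = mean_hr + z_low * sd_hr

print(f"z = {z_low:.3f}, cut-off = {cutoff_bpm:.2f} bpm")
```

The small difference from a hand calculation comes only from rounding z to three decimals in the table.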
Question 5
(a) The included variables are hemoglobin level and cardiac index. Since both are measured
in numerical terms, they are quantitative variables.
(b) The scatterplot indicating the association between the two variables, i.e. hemoglobin and
cardiac index, is shown below.
(c) As evident from the scatterplot above, a non-linear relationship exists between the variables.
Further, considering the negative slope, it can be concluded that the relationship between the
variables is negative. Also, considering the scattered nature of the plot, it can be noted
that the relationship is weak to moderate in strength, with many outliers present.
(d) The appropriate statistic is the correlation coefficient, as both variables are numerical
rather than categorical.
From the above output, the correlation coefficient comes out as -0.497, which implies a
relationship of moderate strength and negative direction between the given variables.
(e) Regression equation
The equation of regression line is represented below.
The requisite scatterplot along with best fit line is highlighted below.
(f) The objective is to find the cardiac index given the value of hemoglobin as 112 g/100 ml.
Since the predictive power of the regression model is low, as represented by an R² of 0.247,
the estimate above may not be accurate and could have a large residual from the actual value.
(g) R² = 0.247. This indicates that 24.7% of the variation in cardiac index is accounted for
by corresponding changes in the independent variable, i.e. hemoglobin.
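In a simple linear regression with a single predictor, R² equals the square of the correlation coefficient, so the two reported figures can be checked against each other. The sketch below assumes the correlation of magnitude 0.497 from part (d) carries a negative sign, as the stated direction of the relationship implies.

```python
# Correlation reported in part (d); the negative sign is assumed from the
# stated direction of the relationship.
r = -0.497

# For one predictor, the coefficient of determination is r squared.
r_squared = r ** 2

print(f"R^2 = {r_squared:.3f} ({r_squared:.1%} of variation explained)")
```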
Question 6