BUS708: Statistics and Data Analysis Report - Trimester 2, 2019

Added on 2022/11/17

AI Summary

This report presents a comprehensive statistical analysis of two datasets. The first dataset, sourced from Kaggle, focuses on Google Play apps, examining the distribution of free versus paid apps, outlier removal, and comparisons between app types (Tools, Games, Communication). It includes hypothesis testing to determine the proportion of free apps and the overall price of paid apps, along with ANOVA to assess price differences between app categories. The second dataset, collected through surveys, investigates social media preferences across different countries using a Chi-Square test. The report also explores the relationship between app ratings and reviews through regression analysis. Statistical tools like t-tests, ANOVA, and Chi-Square tests are used to draw inferences and conclusions about the datasets. The analysis also includes the use of plots to visualize the data and draw insights from the data.

Running head: STATISTICS AND DATA ANALYSIS
STATISTICS AND DATA ANALYSIS
Name of the Student:
Name of the University:
Author’s Note:

Table of Contents
Introduction
Distribution of Free and Paid apps
Outlier Removal
Comparing Types of Apps
Regression between Rating and Review
Variability in preferences of social media apps
References

Introduction:
Two data sets are provided. The first contains various statistics about Google Play apps,
and the second records students from different countries and the social media sites they
prefer to use. The first data set is secondary data, as it was collected by Kaggle, and its
variables can be subdivided into 7 categorical variables and 5 quantitative variables.
Dataset 2 is primary data, as it was collected by interviewing friends. Its variables are
country and type of mobile application, both of which are categorical.
The objective of the report is to study these statistics and make inferences about the
variables of interest. Statistical tools are used to estimate answers to relevant questions
that might guide an organization towards better decisions.
As more and more of our lives are situated online, it has become imperative for
business organizations to realign their strategies for capturing market share. Online apps
are used by most people in their day-to-day lives, and it is important to know which
aspects of the market a business should focus on. App developers want to know the
impact of their apps on people, and the data collected can offer statistically significant
results on key questions (Martin, Sarro and Harman 2016).
Distribution of Free and Paid apps:
Frequency
Application Type    Frequency    Percentage
Free                3738         93.45%
Paid                262          6.55%
Grand Total         4000         100%
[Pie chart: Mobile Application; Free 93%, Paid 7%]

The data set shows that, of the 4000 apps sampled, approximately 93% are free and 7%
are paid.
Constructing a 95% confidence interval with a z critical value of 1.96, we estimate that
the overall proportion of free apps lies between 92.68% and 94.22%.
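The interval above can be reproduced from the frequency table. The sketch below assumes the usual normal-approximation (Wald) interval for a proportion, which the report does not name explicitly; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

# Counts taken from the frequency table in the report.
n = 4000
free = 3738

p_hat = free / n                          # sample proportion of free apps
z = NormalDist().inv_cdf(0.975)           # two-sided 95% critical value, about 1.96
se = sqrt(p_hat * (1 - p_hat) / n)        # standard error based on the sample proportion

# Assumed method: Wald interval, p_hat +/- z * se.
lower, upper = p_hat - z * se, p_hat + z * se
print(f"95% CI for the proportion of free apps: {lower:.2%} to {upper:.2%}")
```

Under this assumption the bounds round to the 92.68% and 94.22% quoted in the text.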
To test whether apps overall are more often free than paid, a hypothesis test is conducted
with the null hypothesis that the overall proportion of free apps is at most 50%.
H0: Proportion of free apps ≤ 50%.
Ha: Proportion of free apps > 50%.
In the sample given, the percentage of free apps is 93.45%. The hypothesis test gave the
following results.
Hypothesis Test for π (Proportion)
Hypotheses
  Null Hypothesis           π ≤ 50%
  Alternative Hypothesis    π > 50%
  Test Type                 Upper
Level of significance
  α                         0.05
Critical Region
  Critical Value (s)        1.9600
Sample Data
  Sample Size               4000
  Count of 'Successes'      3738
  Sample proportion, p      93.45%
  Standard Error            0.79%
  Z Sample Statistic        54.9604
  p-value                   0.000
Decision
  Null Hypothesis is rejected.
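The Z statistic is far above the tabulated critical value of 1.96 (and above 1.645, the
one-tailed critical value at α = 0.05). Thus, there is significant evidence that more than
50% of apps overall are free.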
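The Z statistic in the table can be checked by hand. This is a minimal sketch, assuming the standard one-proportion z-test in which the standard error is computed under the null value of 50%; that assumption is consistent with the 0.79% standard error reported above.

```python
from math import sqrt
from statistics import NormalDist

n = 4000          # sample size from the report
free = 3738       # count of free apps ("successes")
pi0 = 0.50        # hypothesized proportion under H0

p_hat = free / n
se0 = sqrt(pi0 * (1 - pi0) / n)           # standard error under H0, about 0.79%
z_stat = (p_hat - pi0) / se0              # test statistic

# Upper-tailed test: the p-value is the area to the right of z_stat.
p_value = 1 - NormalDist().cdf(z_stat)
critical_one_tailed = NormalDist().inv_cdf(0.95)   # about 1.645 at alpha = 0.05

print(f"z = {z_stat:.4f}, p-value = {p_value:.3f}")
print("Reject H0" if z_stat > critical_one_tailed else "Fail to reject H0")
```

The decision is the same whether the one-tailed value (about 1.645) or the tabulated 1.96 is used, because the statistic is so large.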

Outlier Removal
Summary Statistics
Statistic             Value
Sample Size           232
Mean                  3.574
Standard Deviation    2.343
Minimum               0.99
Q1                    1.99
Median                2.99
Q3                    4.99
Maximum               10
After outlier removal, the descriptive statistics above were found. The new mean is much
lower than the previous mean of 12.517. However, as the mean is still higher than the
median, the distribution is right skewed.
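The report does not state which rule was used to remove outliers. As an illustration only, the sketch below assumes the common 1.5 × IQR fence rule applied once; the price list is hypothetical and is not taken from the Kaggle data set.

```python
from statistics import mean, median, quantiles

def remove_outliers_once(prices):
    """One pass of the 1.5 * IQR rule (an assumed method, not confirmed by the report)."""
    q1, _, q3 = quantiles(prices, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [p for p in prices if low <= p <= high]

# Hypothetical paid-app prices in dollars, with one extreme value.
sample_prices = [0.99, 1.99, 2.99, 2.99, 4.99, 9.99, 79.99]
cleaned = remove_outliers_once(sample_prices)

print("before:", round(mean(sample_prices), 3), "after:", round(mean(cleaned), 3))
# A mean above the median is the sign of right skew used in the text.
print("right skewed:", mean(cleaned) > median(cleaned))
```

A single extreme price pulls the mean up sharply, which is the same effect the report describes when the mean falls from 12.517 to 3.574.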
A box plot is generated after an iteration of outlier removal:
[Box plot: Price Distribution of Paid Apps (After an Iteration of outliers removal); axis: Price (Dollars)]
Next, the claim that the overall mean price of paid apps is less than $3.6 is tested using
a left-tailed t test.
Hypothesis Test for μ (Mean)
Hypotheses
  Null Hypothesis               μ >= 3.6
  Alternative Hypothesis        μ < 3.6
  Test Type                     Lower
Level of significance
  α                             0.05
Critical Region
  Degrees of Freedom            3999
  Critical Value (s)            1.96
Sample Data
  Sample Standard Deviation     2.343
  Sample Mean                   3.574
  Sample Size                   4000
  Standard Error of the Mean    0.0370
  t Sample Statistic            -0.7018
  p-value                       0.2414
Decision
  Null Hypothesis can’t be rejected.
With these null and alternative hypotheses, the t statistic lies about 0.7 standard errors
below the hypothesized mean. The corresponding p-value is 0.2414, which is higher than
the standard significance level of 0.05. Thus the null hypothesis can’t be rejected, and
there is no statistically significant evidence that the overall price of paid apps is less
than $3.6.
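The t statistic can be recomputed from the summary statistics. Note that the test output lists a sample size of 4000, while the summary statistics list 232 paid apps after outlier removal, so the sketch below evaluates both. The p-value uses a normal approximation to the t distribution, which is an assumption made here to stay within the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

def t_statistic(sample_mean, sample_sd, n, mu0):
    """One-sample t statistic computed from summary statistics."""
    standard_error = sample_sd / sqrt(n)
    return (sample_mean - mu0) / standard_error

mean_price, sd_price, mu0 = 3.574, 2.343, 3.6    # values from the report

t_reported_n = t_statistic(mean_price, sd_price, 4000, mu0)  # n used in the test output
t_paid_n = t_statistic(mean_price, sd_price, 232, mu0)       # n from the summary statistics

# Left-tailed p-value, approximating the t distribution by the standard normal.
p_reported_n = NormalDist().cdf(t_reported_n)
p_paid_n = NormalDist().cdf(t_paid_n)

print(f"n = 4000: t = {t_reported_n:.4f}, approx p = {p_reported_n:.4f}")
print(f"n = 232:  t = {t_paid_n:.4f}, approx p = {p_paid_n:.4f}")
```

With either sample size the approximate p-value stays well above 0.05, so the decision not to reject the null hypothesis does not change.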
Comparing Types of Apps:
Statistics            TOOLS    GAME     COMMUNICATION
Sample Size           30       31       8
Mean                  3.502    3.7      2.115
Standard Deviation    2.721    3.298    1.157
Minimum               0.99     0.99     0.99
Q1                    1.97     1.24     0.99
Median                2.99     2.99     1.99
Q3                    4.99     4.99     2.99
Maximum               14.99    17.99    3.99
The statistics for three different types of apps (Tools, Game and Communication) are
calculated, and box plots are drawn to understand the distribution of paid prices for these
apps.
The mean is higher than the median for all three types, and hence the distributions are
right skewed. The sample also contains far fewer paid Communication apps (8) than Tools
(30) or Game (31) apps.
[Box plots: Price of the Mobile Application for given Category (Communication, Game or Tools); axis: Category]
ANOVA is run in Excel to check whether there is a significant difference in mean price
between the groups. The following results were found:
SUMMARY
Groups           Count    Sum       Average     Variance
Tools            30       105.06    3.502       7.40503
Games            31       114.69    3.699677    10.87957
Communication    8        16.92     2.115       1.339286

ANOVA
Source of Variation    SS          df    MS          F          P-value     F crit
Between Groups         16.24382    2     8.121912    0.97373    0.383031    3.135918
Within Groups          550.508     66    8.34103
Total                  566.7518    68
From the ANOVA, the F value (0.974) is lower than the critical F value (3.136) and the
p-value (0.383) is above 0.05, so the result is not statistically significant. It cannot be
said that there is a significant difference between the mean paid prices of the groups.
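The F value can be rebuilt from the SUMMARY table alone, which makes the link between the two tables explicit. The sketch below is a one-way ANOVA computed from group counts, means and variances; the critical value is copied from the Excel output rather than recomputed.

```python
# Group summaries from the Excel SUMMARY table: (count, average, variance).
groups = {
    "Tools": (30, 3.502, 7.40503),
    "Games": (31, 3.699677, 10.87957),
    "Communication": (8, 2.115, 1.339286),
}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * m for n, m, _ in groups.values()) / n_total

# Between-groups and within-groups sums of squares.
ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups.values())
ss_within = sum((n - 1) * var for n, _, var in groups.values())

df_between = len(groups) - 1          # 2
df_within = n_total - len(groups)     # 66

f_value = (ss_between / df_between) / (ss_within / df_within)
f_crit = 3.135918                     # critical value as reported by Excel

print(f"SS between = {ss_between:.4f}, SS within = {ss_within:.3f}, F = {f_value:.5f}")
print("significant" if f_value > f_crit else "not significant")
```

The between-groups variation per degree of freedom is about the same size as the within-groups variation, which is why F is close to 1.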
Regression between Rating and Review:
[Scatterplot: Rating Vs Review of Mobile Applications; x-axis: Rating (0.5 to 5.5), y-axis: Reviews (0 to 90,000,000)]
A scatterplot of app rating against number of reviews is drawn in Excel, and there is no
apparent relationship between the two variables.
Further a regression analysis is also done, and the results are tabulated below:
Regression Statistics
Multiple R           0.064684532
R Square             0.004184089
Adjusted R Square    0.00393501
Standard Error       0.525156688
Observations         4000

ANOVA
              df      SS             MS          F           Significance F
Regression    1       4.632787783    4.632788    16.79827    4.23961E-05
Residual      3998    1102.60661     0.27579
Total         3999    1107.239398

             Coefficients    Standard Error    t Stat      P-value     Lower 95%      Upper 95%      Lower 95.0%    Upper 95.0%
Intercept    4.186050268     0.008420115       497.1488    0           4.169542148    4.202558387    4.169542148    4.202558387
Reviews      1.11208E-08     2.71333E-09       4.098569    4.24E-05    5.80113E-09    1.64404E-08    5.80113E-09    1.64404E-08
From the regression analysis, the R value (the coefficient of correlation) is 0.065, which
indicates a very weak positive relationship between the two variables.
The R² value, the coefficient of determination, measures how much of the variability of
the dependent variable can be explained by the variability of the independent variable.
Here the R² value is 0.004, which is very low and indicative of the weak relationship
between the two variables.
Although the slope is statistically significant (p-value 4.24E-05), reviews explain only
about 0.4% of the variation in rating, so it is not possible to usefully predict the rating
of an app from its number of reviews.
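The figures in the regression output are linked by simple identities, which can be checked directly. The sketch below uses only values from the Excel output; the `predicted_rating` helper is hypothetical and is included to show how little the fitted line moves with the number of reviews.

```python
from math import sqrt

# Values copied from the Excel regression output.
ss_regression, ss_residual, ss_total = 4.632787783, 1102.60661, 1107.239398
df_regression, df_residual = 1, 3998
intercept, slope, slope_se = 4.186050268, 1.11208e-08, 2.71333e-09

r_squared = ss_regression / ss_total                 # coefficient of determination
multiple_r = sqrt(r_squared)                         # correlation for a single predictor
f_value = (ss_regression / df_regression) / (ss_residual / df_residual)
t_slope = slope / slope_se                           # t statistic for the Reviews slope

def predicted_rating(reviews):
    """Fitted rating for a given number of reviews (hypothetical helper)."""
    return intercept + slope * reviews

print(f"R^2 = {r_squared:.6f}, R = {multiple_r:.4f}, F = {f_value:.3f}, t = {t_slope:.4f}")
print(f"0 reviews: {predicted_rating(0):.3f}, 1,000,000 reviews: {predicted_rating(1_000_000):.3f}")
```

A million additional reviews raises the fitted rating by only about 0.01, which illustrates why a statistically significant slope can still be of little practical use for prediction.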

Variability in preferences of social media apps:
[Bar chart: share of communication apps by country (China, India, Nepal); axis: 0% to 100%; legend: WhatsApp, Wechat, Viber, Instagram, Facebook Messenger]
The chart shows the proportion of students from each of the three countries (India, Nepal
and China) who use the different communication apps.
To test whether students' preferences of communication app differ significantly between
countries, we use a Chi-Square test in Excel.
Count of Mobile Application (Row Labels: country; Column Labels: app)
Row Labels    Facebook Messenger    Instagram    Viber    Wechat    WhatsApp    Grand Total
China 3 2 5
India 2 1 2 5
Nepal 1 1 3 5
Grand Total 1 3 3 4 4 15
The observed frequencies of the sample are tabulated above, and the Chi-Square expected
frequency table is also created:
Expected Values
Countries      Facebook Messenger    Instagram    Viber    Wechat       WhatsApp     Grand Total
China          0.333333333           1            1        1.3333333    1.3333333    5
India          0.333333333           1            1        1.3333333    1.3333333    5
Nepal          0.333333333           1            1        1.3333333    1.3333333    5
Grand Total    1                     3            3        4            4            15
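The expected values follow from the row and column totals alone: each cell is (row total × column total) / grand total. The sketch below reproduces the table from those totals. It does not recompute the test statistic, because that would also need the individual observed cell counts.

```python
# Marginal totals taken from the observed frequency table.
row_totals = {"China": 5, "India": 5, "Nepal": 5}
col_totals = {
    "Facebook Messenger": 1,
    "Instagram": 3,
    "Viber": 3,
    "Wechat": 4,
    "WhatsApp": 4,
}
grand_total = sum(row_totals.values())

# Expected count under independence: row total * column total / grand total.
expected = {
    country: {app: r * c / grand_total for app, c in col_totals.items()}
    for country, r in row_totals.items()
}

degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)
cells_below_5 = sum(1 for row in expected.values() for v in row.values() if v < 5)

for country, row in expected.items():
    print(country, {app: round(v, 4) for app, v in row.items()})
print("df =", degrees_of_freedom, "| cells with expected count below 5:", cells_below_5)
```

Every expected count here is below 5, a situation in which the Chi-Square approximation is usually considered unreliable.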

Taking the significance level as 0.05, the Chi-Square test returns a p-value of 0.33, which
is higher than the significance level. Thus the null hypothesis can’t be rejected, and there
is no statistically significant evidence of a difference in communication app preferences
between the students of different countries. With only 15 responses, every expected
frequency is below 5, so this result should be treated with caution.

References:
Martin, W., Sarro, F. and Harman, M., 2016, November. Causal impact analysis for app
releases in Google Play. In Proceedings of the 2016 24th ACM SIGSOFT International
Symposium on Foundations of Software Engineering (pp. 435-446). ACM.
Siegel, A., 2016. Practical business statistics. Academic Press.
Stine, R. and Foster, D., 2014. Statistics for Business: Decision Making and. Addison-Wesley
SOFTWARE-JMP.
Winston, W., 2016. Microsoft Excel data analysis and business modeling. Microsoft Press.