Data Analysis & Visualisation - Pre-processing, Analysis, Descriptive Statistics, Visualization

Verified

Added on  2023/06/05

|33
|4485
|128
AI Summary
This report on Data Analysis & Visualisation covers pre-processing of data in Excel, analyzing data, descriptive statistics, and visualization of data. It provides insights on sales, profits, and product categories of a superstore. The report also includes a survey on ice cream flavors and reveals that 47.5% of students like vanilla flavor.

Contribute Materials

Your contribution can guide someone’s learning journey. Share your documents today.
Document Page
Analysis and
Visualization

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.
Document Page
Table of Contents
PART 1............................................................................................................................................1
Pre – processing of data in Excel.................................................................................................1
Analyzing the data.......................................................................................................................8
Descriptive statistics..................................................................................................................10
Visualization of data..................................................................................................................11
PART 2..........................................................................................................................................13
1. How many students like vanilla flavour of ice cream...........................................................13
2. How many students are male and female?............................................................................14
3. Mean and Median of participants who like chocolate and strawberry flavor ice cream.......14
2.4 Cluster analysis example- K means clustering....................................................................16
2.5 Most common data mining and text mining methods.........................................................22
2.6 Advantages and disadvantages of using SPSS and Excel...................................................23
REFERENCES..............................................................................................................................25
APPENDIX....................................................................................................................................26
Document Page
PART 1
Pre – processing of data in Excel
Microsoft – Excel is considered to be a great tool for the purpose of pre – processing &
handling of structured data. Pre – processing of data is one of the step in data analysis and data
mining process which allows for transforming raw data into a format which can be analyzed and
understood through machine learning & computers (Kandpal, 2021). It is one of the component
of data preparation where processing of raw data is performed to make is suitable for several
another procedures pertaining to data processing. In other words, data preprocessing involves
data manipulation prior to using it for any other processes or procedures in order to enhance the
performance of data analysis tasks.
For instance, in the case of superstore, the application of filter in excel facilitates grouping of
sales and profit data on the basis of years from which they are relevant to. With the application
of filter on the order data column, selection criteria is as the “year” like 2009, 2010, 2011 and
2012 in the present case and accordingly, the excel gives the results associated with the particular
year selected. It facilitates summing up of the profits and sales for that particular year and
accordingly, the same procedure has been used for rest of the year and the results of total profits
and sales are obtained as displayed below.
Year Numbe
r of
order
placed
Sales Profit
2009 897 175406
1
152252
2010 816 131886
7
132154.
9
2011 856 147335
5
161414.
1
2012 879 160155
2
130967
Filter function in excel: Through filter function, the analyst can filter the set of data on the basis
of defined criteria (López-Zambrano and et.al., 2018). For the application of filter over a range
of data, the following steps has been used with respect to the superstore.
1
Document Page
1. select the column of the data which is needed to be filtered. Like in case of superstore data set,
the column headed as “order date” has been selected.
2. Then on the Home tab itself, in the editing group, click on the Sort & Filter option and from
the drop down list, the filter option has been selected which results in application of filter on all
the column head of the sheet.
3. On the column head, the down arrow is clicked to select the criteria for the required data set.
In case of superstore data set, the year 2009 is set for getting the sales, profit and number of
orders placed data for the year. Accordingly, the sum of the sales and profits relevant to 2009
was entered in the table corresponding to the year 2009.
While conducting preprocessing of data, filter function allows displaying of that data meeting the
specified criteria. Therefore, preprocessing of data facilitates gaining of precise knowledge of the
data set prior to conducting any analysis and drawing conclusions about it.
Further, Pivot table has been used as a tool for preprocessing of data in order to gain
understanding of the movement in the sales and profits figure over the years with respect to the
superstore.
Pivot table: It is usually regarded as the most interactive way for the quick summarization of
large amount of data. It is used for the analysis of numerical data in a detailed manner which
helps in answering several questions regarding the business performance in the number of years
2

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.
Document Page
of different basis and aspects (Guerrero, 2018). Also, it is a very powerful tool meant for
analyzing, calculating and summarizing the data patterns & trends can be identified and
comparisons can be established among different year’s sales, profit and product performance. It
works differently which is based on the type of platform where it has been used in excel.
The steps to be followed for creating the pivot table for the superstore data set are as follows:
1. Selection of all the columns and rows for which the pivot table is required to be created.
2. By clicking on insert tab and in the tables group, there would be a pivot table option.
3. Click on the pivot table option and the dialog box would be displayed showing the table range
as the cell reference for the entire data set.
3
Document Page
4. Click on new worksheet to get the pivot table results on the new sheet and then click OK. This
would create a pivot table template on the new worksheet.
5. On the pivot table fields at the right hand side, drag the order date in filters quadrant, sales and
profits in columns quadrant and product category in row quadrant.
4
Document Page
6. To see the sales and profits with respect to the product category sold in 2009, the down arrow
on order date option would be clicked and 2009 would be set as the criteria, which in turn will
display the required results.
5

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
Document Page
Accordingly, the following results for the year 2009 have been obtained.
6
Document Page
By selecting all in the order date, the following results have been generated with respect to the
sales and profits generated from all the three product categories over the years.
Order Date (All)
Row Labels
Sum of
Sales
Sum of
Profit
Furniture
5178590.54
2 117433.03
Office
Supplies 3752762.1 518021.43
Technology
5984248.18
2 886313.52
Grand
Total
14915600.8
2
1521767.9
8
Another function in excel which has been used for preprocessing of data is SUMIFS.
SUMIFS: This is one of the best function in excel through which values can be added up to meet
multiple criteria. Through this function, sum of values can be obtained if the corresponding cells
satisfy the criteria specified in terms of numbers, text and dates (Cleff, 2019). The syntax used
for applying this function is as follows:
=SUMIFS (sum_range, range1, criteria1, [range2], [criteria2], ...)
7
Document Page
In the case of superstore data, the SUMIFS function has been applied to determine the sales and
profits figures over the years with respect to the superstore.
The formula used for the same is as follows:
= SUMIFS (Sales data, order date range, “>=” &DATE (year, month, date), order date range,
“<=” &DATE (year, month, date)
Further, for the determination of reduction in sales and profits over the years the formula used is
as follows:
= (Sales for the current year – Sales of the previous year) / Sales for the current year * 100
Accordingly, the results for the change in sales and profits over the years generated are as
follows:
Year Sales Profit change in sales Change in profits
2009 1754061 152253
2010 1318867 132154.9 -24.81% -13.20%
2011 1473355 161414.1 11.71% 22.14%
2012 1601552 130967 8.70% -18.86%
8

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.
Document Page
Analyzing the data
Microsoft excel is regarded as one of the best application for conducting analysis of data (Kaula
and Kaula, 2018). This is possible through several tools equipped within it such functions, pivot
table, etc. One of such function is filtering option which is being to obtain the data based on the
specified conditions. Accordingly, the sales, profits and number of orders generated data for the
four years have been identified in this report, which is as follows:
Through the above table, it can be analyzed that number of orders placed has been declined from
2009 to 2010 while the same has continuously increased in the next two years that is, 2011 and
2012. The same is true for sales and profit where from 2009 to 2010, the sales have declined
significantly and then again have increased in 2011and 2012 while profits have increased from
2010 to 2011 and then again decreases in 2012.
SUMIFS: Further, the SUMIFS function have generate the following results with respect to sales
and profit of the superstore:
Through the above table, it can be seen that sales have declined from 2009 to 2010 by 24.81%
and then increased by 11.71% and 8.70% respectively in 2011 and 2012. Further, on looking at
the profits, it has been analyzed that it has been decline by 13.20% from 2009 to 2010 while
subsequently in 2011, it has increased by 22.14% and in 2012 decreased by 18.86%.
9
Document Page
Pivot table: Through this tool, the sales and profits with respect to different product category
have been determined with respect to four years, which are as follows:
Through the above table, it has been analyzed that the sales of technology is highest among all
the products while the sales of office supplies is lowest among all the product category. Further,
the highly profit generating product category is identified as technology while the lowest profits
have been generated from furniture over the four years.
Descriptive statistics
Sales Profit
Mean 1775.878 Mean 181.1844
Standard Deviation 3585.051 Standard Deviation 1196.653
Minimum 2.24 Minimum -14140.7
Maximum 89061.05 Maximum 27220.69
Sum 14915601 Sum 1521768
Through the above table, it has been analyzed that average sales and profits over the four years
was 1775 and 181 respectively and standard deviation is identified as 3585 and 1196
respectively. Due to a very high standard deviation, it has been analyzed that there is a huge
difference among the sales order placed during the four-year period (Evans and Evans, 2020).
The maximum and minimum sales generated over the years is 89061 and 2.24 respectively while
the maximum and minimum profits for the four – year period is identified as -14140 and 27220.
The negative profits indicate that superstore has losses to the extent of 14140. The total sales and
profits generated over the years are identified as 14915601 and 1521768 respectively.
Regression Statistics
10
Document Page
Multiple R 0.58196
R Square 0.33867
8
Adjusted R
Square
0.33859
9
Standard
Error
973.197
Observations 8399
The above regression statistics shows that the degree of relationship between sales and profits of
the superstore is 0.58 which that there is moderate relationship between sales and profits (Zhou,
2022). The dependent variable taken here is profits while the sales are taken as independent
variable where the r-square value of 0.3386 depicts that 33.86% in independent variable can be
explained through the changes taking place in dependent variable.
Visualization of data
Visualization refers to the graphical representation made with respect to the given set of data in
order to show the patterns and trends in the variable of interests over the years. In excel there are
several types of charts that are being used for the purpose of visualization of data (Kronthaler,
2022). Through this, it is very easy for the analyst to determine whether there is an increase,
decrease or constancy within the movement of variables. With respect to the super store the
following visualization have been done.
Graphical representation of sales & profit based on product category
11

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
Document Page
Furniture Office Supplies Technology
0
1000000
2000000
3000000
4000000
5000000
6000000
7000000
Sum of Sales
Sum of Profit
Through the above chart, it has been identified that the sales of technology is highest
while the sales of office supplies is lowest. Further, the profits generated through technology is
highest while the profits generated through furniture is lowest.
Graphical representation of sales and profits over the years
2009 2010 2011 2012
0
200000
400000
600000
800000
1000000
1200000
1400000
1600000
1800000
2000000
Change in sales and profits of the Superstore
Sales Profit
The chart is showing that sales have decreased from 2009 to 2010, then increases from 2010 to
2011 and 2012. On the other hand, the profits have first declined from 2009 to 2010, then
increases from 2010 to 2011 and again declined from 2011 to 2012.
12
Document Page
PART 2
1. How many students like vanilla flavour of ice cream
icecreamflavour
Frequency Percent Valid Percent Cumulative
Percent
Valid
Vanilla 95 47.5 47.5 47.5
Chocolate 47 23.5 23.5 71.0
Strawberry 58 29.0 29.0 100.0
Total 200 100.0 100.0
From the results, it can be analysed that 95 students like Vanilla flavor of ice-cream. The results
show that maximum number of students prefer vanilla flavor to chocolate and strawberry.
13
Document Page
2. How many students are male and female?
Gender
Frequency Percent Valid
Percent
Cumulative
Percent
Valid
Male 91 45.5 45.5 45.5
Female 109 54.5 54.5 100.0
Total 200 100.0 100.0
From the above results, it was found that put of 200 students 91 are males while 109 students are
females. Hence, it can be analysed that majority of the students are females.
3. Mean and Median of participants who like chocolate and strawberry flavor ice cream.
Statistics
icecreamflavour
14

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.
Document Page
N Valid 105
Missing 0
Mean 2.55
Median 3.00
icecreamflavour
Frequency Percent Valid
Percent
Cumulative
Percent
Valid
Chocolate 47 44.8 44.8 44.8
Strawberr
y 58 55.2 55.2 100.0
Total 105 100.0 100.0
The results show that mean and median of participants who like chocolate and strawberry flavor
of ice cream is 2.55 and 3, respectively. From this it can be analysed that on an average, students
prefer strawberry flavor of ice – cream.
15
Document Page
2.4 Cluster analysis example- K means clustering
Following are the steps used for application of K- means cluster to the provided dataset in SPSS:
i) Go to Analyse Classify K- means Cluster.
16
Document Page
ii) The K- means cluster dialogue box will appear. Select the variables and move them to the
‘Variable ’ box.
17

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
Document Page
iii) Click on the ‘Iterate’ tab. The maximum iterations value by default is 10. Click on
tab.
18
Document Page
iv) Click on ‘Save’ button. Tick on cluster membership and click on ‘Continue’.
19
Document Page
v) Click on ‘Options’ tab and tick on all the three options ‘Initial cluster centers’, ‘Annova table’
and ‘Cluster information for each case’. Then click on ‘Continue’.
At last, click on .
Results
Initial Cluster Centers
Cluster
1 2 3
Gender 1 1 0
icecreamflavour 2 1 1
video 29 55 72
puzzle 26 71 31
20

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.
Document Page
Iteration Historya
Iteration Change in Cluster Centers
1 2 3
1 16.287 12.810 17.590
2 3.070 1.812 1.038
3 1.793 .968 .329
4 1.400 .798 .376
5 1.097 .683 .472
6 .407 .115 .616
7 .000 .000 .000
a. Convergence achieved due to no or small change in cluster
centers. The maximum absolute coordinate change for any
center is .000. The current iteration is 7. The minimum
distance between initial centers is 43.313.
The iteration table indicates the progress of the clustering process. The early iterations show a lot
of shift in the cluster centers. By the fifth iteration they have settled down and the last 2
iterations are minor adjustments.
Final Cluster Centers
Cluster
1 2 3
Gender 1 1 0
icecreamflavour 2 2 2
video 41 58 56
puzzle 46 61 39
The final clusters reflect the characteristics of each cluster.
Students in cluster 1 like chocolate ice cream flavor and are more interested in puzzles.
Students in cluster 2 also prefer chocolate ice- cream and tend to play puzzles more than
the other clusters.
Students in cluster 3 are those that prefer playing video games.
Distances between Final Cluster Centers
Cluster 1 2 3
1 22.077 16.063
21
Document Page
2 22.077 21.380
3 16.063 21.380
From the above table it can be analysed that cluster 1 and 2 are most different.
Cluster 2 is approximately equally different to cluster 1 and 3.
ANOVA
Cluster Error F Sig.
Mean Square df Mean Square df
Gender .403 2 .248 197 1.628 .199
icecreamflavour 1.703 2 .725 197 2.351 .098
video 5889.952 2 39.226 197 150.153 .000
puzzle 7430.226 2 40.994 197 181.253 .000
The F tests should be used only for descriptive purposes because the clusters have been chosen
to maximize the differences among cases in different clusters. The observed significance levels
are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster
means are equal.
Number of Cases in each
Cluster
Cluster
1 67.000
2 101.000
3 32.000
Valid 200.000
Missing .000
From the above results it can be analysed that among the 3 clusters, a large number of cases are
assigned to the cluster 2.
2.5 Most common data mining and text mining methods
Data and text mining is the process which assists in deriving high quality information
from a piece of text. It involves gathering unstructured data and identification and removal of
anomalies. The most common Text Mining methods are as follows:
Categorization- This is a process which involved gathering text documents and
performing analysis so that right indexes could be uncovered for each document. It is also
22
Document Page
known as Natural Language Processing. This also involves the use of co- referencing
method for extraction of relevant synonyms and abbreviations.
Clustering- This is considered to be one of the most crucial techniques of text mining.
With the help of this, one can identify intrinsic structures in textual information.
Following this, these structures can be organized into relevant subgroups or clusters. It is
a challenge to form meaningful clusters from that textual data which is unlabelled. But,
clustering is a useful tool which helps in distribution of data.
The most common data mining methods are as follows:
Associations- This method comprises of rules which assist in determining correlations
and associations between various points in a dataset. By using it unique and interesting
relationships can be discovered between variables. Association can be of two types,
single- dimensional and multi- dimensional.
Data cleaning- This involves various processes that aim at organizing data and
eliminating duplicate or corrupted data. Data cleaning method helps in harvesting the
most useful information for analysis. There are various methods for data cleaning. These
include verifying the data, converting data types, removal of irrelevant data, elimination
of duplicate data points, removal or errors and completion of missing values.
Data visualization- This method comprises of translating data into graphic form so that it
can be interpreted in a better way. There are many methods using which data can be
presented visually. These include comparison charts, Maps, density plots etc.
2.6 Advantages and disadvantages of using SPSS and Excel
The advantages and disadvantages of SPSS and Excel can be presented as follows:
SPSS Excel
Advantages It is quick and easy to learn. It
provides an easier and quicker access
to basic functions such as frequencies
and descriptive. It also provides a
wide range of charts to choose from.
It provides the best format for
storing data,. Large amounts of data
can be stored in an appropriate
format using Excel.
It can be used for handling large It helps in performing various types
23

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
Document Page
amounts of data. It has built in data
manipulation tools such as recoding.
of calculations on data. It is also
helpful in transforming and
cleaning data. For example, filter
can be applied to the data to suit the
requirements of the analysis or
calculations being performed.
It has a greater user interface. It provides ease of data
visualization with the use of
various types of charts and graphs.
Disadvantage It is an expensive tool. There is lack of control and security
in excel.
Its functionality is limited. The excel spreadsheet may be
prone to error.
REFERENCES
Journals and Books
24
Document Page
Cleff, T., 2019. Applied statistics and multivariate data analysis for business and economics: A
modern approach using SPSS, Stata, and Excel. Springer.
Evans, J. S. and Evans, I. R., 2020. Structure analysis from powder diffraction data: Rietveld
refinement in excel. Journal of Chemical Education, 98(2), pp.495-505.
Guerrero, H., 2018. Excel data analysis: Modeling and simulation. Springer.
Kandpal, N., 2021. Web Log Mining of Server Log Data of ‘Counselling Website’Using
Microsoft Excel and Implementation of Preprocessing Algorithm. Turkish Journal of
Computer and Mathematics Education (TURCOMAT), 12(12), pp.1402-1415.
Kaula, R. and Kaula, R., 2018. Analyze data with excel power pivot: A tutorial. International
Journal of Information, Business and Management, 10(4), pp.102-116.
Kronthaler, F., 2022. Statistics Applied With Excel: Data Analysis Is (Not) an Art. Springer
Berlin.
López-Zambrano, J., and et.al., 2018, July. A tool for preprocessing moodle data sets.
In Proceedings of the 11th International Conference on Educational Data Mining,
Buffalo, NY, USA (pp. 15-18).
Zhou, H., 2022. PivotTable Data Analysis. In Mastering Excel Through Projects (pp. 169-201).
Apress, Berkeley, CA.
25
Document Page
APPENDIX
Cluster Membership
Case Number Cluster Distance
1 2 11.404
2 2 5.323
3 3 8.804
4 2 6.779
5 2 4.928
6 2 5.333
7 2 4.928
8 1 10.458
9 2 9.707
10 1 10.156
11 2 4.928
12 2 5.333
13 2 10.908
14 3 6.783
15 1 14.076
26

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.
Document Page
16 2 9.088
17 2 9.149
18 2 4.775
19 2 3.058
20 3 7.177
21 3 10.536
22 2 15.223
23 2 5.527
24 2 3.442
25 1 2.280
26 1 15.377
27 2 6.357
28 3 7.089
29 1 2.563
30 3 1.805
31 1 7.660
32 2 6.332
33 2 13.269
34 3 18.412
35 2 3.442
36 2 6.357
37 2 9.862
38 2 9.874
39 1 11.428
40 1 5.302
41 1 4.957
42 2 9.707
43 2 10.099
44 3 5.874
45 2 7.566
46 3 14.933
47 1 8.085
48 2 11.014
49 3 7.177
50 1 7.660
51 2 4.836
52 1 8.014
53 3 8.733
54 2 10.782
27
Document Page
55 2 12.470
56 1 15.734
57 1 16.365
58 3 6.690
59 3 3.001
60 1 5.676
61 2 1.167
62 2 12.480
63 1 7.126
64 1 11.442
65 1 15.711
66 3 14.535
67 2 6.138
68 3 19.099
69 3 10.465
70 1 10.653
71 1 5.302
72 2 1.265
73 2 7.550
74 2 17.105
75 2 9.707
76 3 13.537
77 2 8.880
78 2 7.566
79 1 2.280
80 1 9.943
81 2 5.549
82 3 6.365
83 2 9.493
84 2 7.000
85 1 6.033
86 2 7.566
87 2 6.779
88 1 9.943
89 1 16.804
90 2 5.769
91 1 5.980
92 3 12.319
93 2 11.272
28
Document Page
94 1 3.458
95 1 7.627
96 2 10.969
97 2 9.488
98 2 15.303
99 1 5.637
100 2 3.426
101 2 12.475
102 2 8.296
103 1 9.636
104 1 10.214
105 2 10.216
106 1 9.863
107 2 9.143
108 3 7.902
109 2 9.082
110 2 10.422
111 2 5.739
112 1 5.254
113 2 13.003
114 2 4.819
115 2 6.149
116 1 5.344
117 3 6.797
118 2 5.482
119 2 4.670
120 1 5.254
121 1 10.131
122 2 9.130
123 1 10.041
124 1 6.956
125 1 7.112
126 2 1.221
127 1 10.243
128 3 5.933
129 1 6.078
130 1 23.555
131 2 9.130
132 2 6.012
29

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
Document Page
133 1 10.146
134 1 5.964
135 2 14.803
136 2 10.714
137 3 7.103
138 1 10.243
139 1 5.344
140 2 13.836
141 1 5.631
142 1 10.719
143 1 5.283
144 1 6.016
145 2 7.262
146 2 9.866
147 3 8.453
148 1 5.658
149 1 5.348
150 1 8.813
151 1 1.074
152 2 4.788
153 1 10.146
154 2 4.917
155 2 5.461
156 2 10.777
157 2 3.948
158 2 10.848
159 1 5.344
160 2 6.348
161 2 4.826
162 2 7.362
163 3 11.245
164 2 3.259
165 2 3.001
166 1 15.257
167 2 3.391
168 3 9.284
169 3 1.987
170 3 3.598
171 3 6.723
30
Document Page
172 2 10.848
173 1 12.479
174 2 10.228
175 1 7.766
176 2 6.578
177 2 10.777
178 3 3.232
179 3 4.353
180 2 9.819
181 1 10.146
182 1 5.283
183 2 5.345
184 2 7.869
185 2 9.500
186 2 10.368
187 2 10.819
188 1 10.278
189 2 3.040
190 2 7.201
191 1 5.314
192 3 6.016
193 2 5.517
194 1 8.671
195 2 9.130
196 1 9.863
197 1 5.152
198 2 9.143
199 2 1.120
200 2 4.941
31
1 out of 33
circle_padding
hide_on_mobile
zoom_out_icon
[object Object]

Your All-in-One AI-Powered Toolkit for Academic Success.

Available 24*7 on WhatsApp / Email

[object Object]