This article discusses data pre-processing, analysing & visualisation techniques with solved examples. It covers the decline in sales & profits of Superstore, sales by region & product, and student preferences for ice cream flavors. The article provides expert guidance on Desklib for students and professionals.
Contribute Materials
Your contribution can guide someone’s learning journey. Share your
documents today.
Analysis & Visualisation
Secure Best Marks with AI Grader
Need help grading? Try our AI Grader for instant feedback on your assignments.
Table of Contents Part 1................................................................................................................................................3 Data Pre-Processing, Analysing & Visualisation........................................................................3 PART 2..........................................................................................................................................11 2.1 Students which like vanilla flavoured ice-cream.................................................................11 2.2 Students that are male and female.......................................................................................12 2.3 Mean and Median of participants who like chocolate and strawberry flavour ice cream.. .13 2.4 Cluster analysis example- K means clustering....................................................................15 2.5 Text mining and data mining methods that are used in businesses.....................................22 2.6 Advantages and disadvantages of using Excel and SPSS...................................................23 REFERENCES................................................................................................................................1
Part 1 Data Pre-Processing, Analysing & Visualisation Data Pre – processing The term data pre – processing is the term used for referring to the process by which data is manipulated before it is to be used to work upon. The purpose for which data processing is done is to ensure that the processing of the data will give results that are enhanced. In data mining data pre – processing is considered as one of the most important step. Data pre- processing is done through following a systematic step wise procedure. The steps involved are namely data cleaning, data transformation and data reduction (Tang, Yuan and Zhu, 2020). Every data can have various parts that are missing and not related to the information that is required to be assessed. To handle such data, the process of data cleaning is undertaken. The situation of missing data is where when there are some parts of the data that are missing from the complete data. There are two main ways that are followed for the purpose of dealing with such missing data known as ignoring the tuples and second is filling the data values that are missing. After cleaning the data in data processing the next step is noisy data. There is data some that cannot be understand or processed with the usage of machines such a data is considered as noisy data. The faults in data collection are the responsible elements for the creation of noisy data. For handling such a data during the data pre – processing methods like binning method, regression and clustering are adopted. Further after the data cleaning step the step of data transformation is followed (Al-Taie, Kadry and Lucas, 2019). For the generation of valuable or desired results from the present set of data it needs to be transformed into the appropriate form of data. There are number of ways that are followed for the particular task of data transformation. Few examples are normalization, attribute selection, discretization and concept of hierarchy generation. Data reduction is the next step. This step is dedicated especially for easing the process of data analysation. The various steps followed for this are data cube aggregation, attribute subset selection, numerosity reduction and dimensionality reduction.
The given dataset of superstore has also been pre- processed for the purpose of further analysing and visualising of such data. The data is pre- processed using the filter feature present in excels’ sort and filter option. The reason for the selection of this particular feature of excel is that this feature will help in analysation of data of several years to determine the decline in sales of the concerned store (Data Preprocessing, Analysis & Visualization, 2022). Using this feature on the column of order date only the data of a particular year can be viewed at a time and rest of the data that is irrelevant or better say the data that is not required or necessary at one time can be hided. YearSalesProfit 20091754061152252 20101318867132154.9 20111473355161414.1 20121601552130967 By using the filter option, the above table is created in the excel. The table is the representation of the total sales and profit data for the year 2009, 2010, 2011 & 2011. Using this option in excel the data can be copied, formatted and also printed without the need of arranging it in ascending or descending format or moving the data to any other location. The process of data pre-processing is an essential element as it helps in generation of results that are reliable in a precise format.
Paraphrase This Document
Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
The available data has been pre – processed by using SUMIFS function and Pivot Tables in addition to the filter option.SUMIFS functionis a basic function of excel that is use widely for the data pre – processing. This function is used for getting the sum of specific range of values. In the current case this function is for getting the sum total of the sales & profits of Superstore specifically for each of the years starting from 2019 to 2012. “SUMIFS (Sales rows, date rows, “>=” &DATE (year, month, date), Date rows, “<=” &DATE (year, month, date)”, this formula is applied in the excel for the application of SUMIFS function. After the application of SUMIFS function the next task is to find the decline in profits over the years. This step is done by applying the formula as dividing the result of deduction of previous year’s sales from that of the current year by the sales of current year multiplied by 100. Pivot Tableis the next function of excel that is applied over the processed set of data. It is one of the powerful excel function or tool by the utilization of which the Excel gives opportunity to its users in form of performing calculations, summarization of data and analyse it on the basis of establishing comparison of the resultant data or by identification of patterns or
trends that the data represents. It is an interactive approach using which data in huge quantities can be summarized swiftly (Alshdaifat and et.al, 2021). There are three pivot tables that are created from the given set of data. The results of the SUMIFS function along with the pivot table function will be analysed in the data analysation. Data Analysation SUMIFS Function YearSalesProfit Declinein sales Declinein profit 20091754061152253 20101318867132154.9-24.81%-13.20% 20111473355161414.111.71%22.14% 201216015521309678.70%-18.86% In the data pre – processing the way by which the SUMIFS function is applied is explained. Above is the tabulated representation of the results that are generated with the help of using this function in excel. The first column of this table represents the year, the next two columns are showing the total sales and profit that have been generated by the Superstore for each of these year. Decline in sales and decline in profits are the attributes that have calculated by the using the formula (Sales of the current year – Sales of the previous year) / Sales of the previous year. This formula represents the results in the number format but for better analysing of the results the percentageformisconsideredasmorepreferableformat.Excelprovidestheoptionof representing the data into percentage format. For this on the home tab General display of values was changed to percent format. From the data it is clear that the sales of the Superstore in year 2010 experienced a fall of nearly 25% which is a huge decline. Further for the years 2011 & 2012 the sales increased around 12% & 9% respectively from the previous year. The previous year is 2010 for 2011 and 2011 for the year 2012. It is clear that out of the three years in two years’ super store has experienced declining profits. The decline in profits for the year 2010 is 13.20% and around 19% for the year 2012. The year 2011 was good for the company as its profits were increased from the previous year. The sales of the store for year 2012 is showing an increasing trend whereas for the same period
the profits are showing declining pattern it means that the store expenses need to controlled for the better performance. Pivot Tables Order Date(Multiple Items) Row LabelsSum of Sales Sum of Profit Furniture1472671.72461804.53 Office Supplies1035399.64177646.27 Technology1701825.482194645.22 Grand Total4209896.846434096.02 The above pivot table represents the total sales and profits that are generated by the Superstore for the years 2009, 2010, 2011 and 2012 grouped by the type of product that are furniture, office supplies and technology. It is clear from the above data that the maximum of store’s sale is for the product of technology followed by furniture and office supplies. The store is earning maximum of its profits from the technology, and office supplies and furniture follows it. these results indicate that the total profit generated by the Superstore by selling office supplies is more than furniture but the sales of furniture is more than that of office supplies. This can be interpreted as selling office supplies is more profitable for the store. Hence it can consider to increase its focus over the sales of office supplies. Row LabelsSum of Sales Atlantic2014248.204 North Carolina116376.4835 Northwest Territories83817.746 Ontario3780242.063 Prarie2837304.602 Quebec1510195.08 West3597549.276 Yukon975867.371 Grand Total14915600.82 This pivot table is showing the total of sales for all the years of Superstore grouped by the region. This data is extracted as it can be very helpful for the store to get knowledge of the regions in which it is more popular (Wasserman, 2018). The popularity of the store in specific regions can be determined by the total of sales for the years. The maximum number of store’s sales is in the region of Ontario and the least sales there in the region of Northwest Territories.
Secure Best Marks with AI Grader
Need help grading? Try our AI Grader for instant feedback on your assignments.
the second most popular region context of store’s sales is West followed by Prarie, Atlantic, Quebec, Yukon, North Carolina. Row Labels Sum of Profit Atlantic238960.66 North Carolina2841.11 Northwest Territories8307.05 Ontario439214.57 Prarie321160.12 Quebec140426.65 West297008.61 Yukon73849.21 Grand Total1521767.98 The above table represents the total amount of three years profits earned by the Superstore clubbed by the region. The analysation of is resultant data is helpful for the company in knowing which area generates maximum of the store’s profit. Ontario region gives most of the profits to the store. This region is followed by Prarie, West, Atlantic, Quebec, Yukon, Northwest Territories & North Carolina. The store should focus on the profitable regions to further enhance profits that are earned from such regions (Belotto, 2018). And the worst profit generating regions can be considered so as to whether continuing the operations in those regions. Data Visualisation The pie chart is representing the sales of the Superstore for the four years. Out of the collective sales of all the years 26% is from the year 2012, 29% from 2009, 21% from the year
2010 and 24% for the year 2011. Maximum is from year 2009 and minimum is from the year 2010. The above bar diagram shows the profit earned by the store for all the years. The maximum profits are from the year 2011 followed by 2009, 2012 and 2010. The decline of profits and sales is presented through this chart. The sales declined in the year 2010 and increased for the years 2011 & 2012 (Weissgerber and et.al, 2019). While the profits declined for year 2010 & 2012. In the year 2011 profits rise from the previous year. Pivot Tables
The chart shows the total sales and profit generated by Superstore for all the years collectectively by products. Maximum selling and profit eaning product is technology based. This graph is representing the sales by different regions in all the years. Maximum sales are done in the region of West (Peck, Ayuso and El-Etr, 2019). There is not much difference between Ontario and West region. Worst performing region is of Northwest Territories.
Paraphrase This Document
Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
The maximum profit earning region for all the years collectively is Ontario. North Carolina generated least profits. PART 2 2.1 Students which like vanilla flavoured ice-cream Ice_cream FrequencyPercentValid PercentCumulative Percent Valid Vanilla9547.547.547.5 Chocolate4723.523.571.0 Strawberry5829.029.0100.0 Total200100.0100.0
Interpretation:Fromthe above table it has been analyzed that 95 out of 200 students like vanilla flavor of ice cream. The table further shows that 47.5% of the students lie in the category of the students which prefer vanilla over other flavors of ice cream. The students while like vanilla are in majority in comparisons with the other students which like chocolate and vanilla. 2.2 Students that are male and female Gender FrequencyPercentValid PercentCumulative Percent Valid Male9145.545.545.5 Female10954.554.5100.0 Total200100.0100.0
Interpretation:From the above frequency table, it has been found that there are 91 students out of 200 who are male. However, the percentage of females in students is 54.5% which shows that are they are in majority in the high school. 2.3 Mean and Median of participants who like chocolate and strawberry flavour ice cream. Statistics Ice cream flavour NValid105 Missing0 Mean2.55 Median3.00
Secure Best Marks with AI Grader
Need help grading? Try our AI Grader for instant feedback on your assignments.
Ice cream flavour FrequencyPercentValid Percent Cumulative Percent Valid Chocolate4744.844.844.8 Strawberr y5855.255.2100.0 Total105100.0100.0 Interpretation:From the above table it has been analyzed that the average number of students like strawberry flavour of ice cream. The median value of (3.00) depicts that 50% of students like strawberry flavoured ice cream in comparison with other flavours.
2.4 Cluster analysis example- K means clustering Below is the process of using K-means cluster for analysing the provided data set in SPSS: i) Press AnalyseClassifyK- means Cluster.
ii) Open the K-means cluster dialog box and select the variables for shifting them to the Variable box.
Paraphrase This Document
Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
iii) Press the Iterate tab and select the maximum iteration value which is 10 by default. Then click ontab.
iv) Click the save button and select cluster membership then press continue.
v) Press the options tab and select the three options ‘Annova table’, ‘Initial cluster centers’ and ‘Cluster information for each case’. At last, click on the continue option. At last, click on. Results
Secure Best Marks with AI Grader
Need help grading? Try our AI Grader for instant feedback on your assignments.
Initial Cluster Centers Cluster 123 Gender110 Ice cream flavour211 video295572 puzzle267131 Iteration Historya IterationChange in Cluster Centers 123 116.28712.81017.590 23.0701.8121.038 31.793.968.329 41.400.798.376 51.097.683.472 6.407.115.616 7.000.000.000 a. Convergence achieved due to no or small change in cluster centers. The maximum absolute coordinate change for any center is .000. The current iteration is 7. The minimum distance between initial centers is 43.313. Interpretation:Theabove K-Means cluster iteration table has analysed that a significant shift in the initial cluster centres. However, coming to the fifth cluster the shift has started to reduce and the situation is stabilised. At last, the final cluster is showing fewer adjustments which signifies that they have settled down. Final Cluster Centers Cluster 123 Gender110 Ice cream flavour222
Video415856 Puzzle466139 Interpretation:From the above table it has been analysed that majority of the students are females in the first cluster. However, maximum students like chocolate ice cream flavour in the 3 cluster. Further it has been found that 41 students like to play video games in the first cluster which is the least in comparison with other clusters. At last, second cluster has the highest number of students (61) that like to play puzzles rather that other games. Distances between Final Cluster Centers Cluster123 122.07716.063 222.07721.380 316.06321.380 Interpretation:From the above table, it has been found that the final cluster centres of cluster 1 and 3 have more variation. In comparison with these two, 2 cluster has less variation in its final cluster centres. ANOVA ClusterErrorFSig. Mean SquaredfMean Squaredf Gender.4032.2481971.628.199 Ice cream flavour1.7032.7251972.351.098 video5889.952239.226197150.153.000 puzzle7430.226240.994197181.253.000 The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal.
Number of Cases in each Cluster Cluster 167.000 2101.000 332.000 Valid200.000 Missing.000 Interpretation:From the above Anova table, it has been found that cluster 2 has the highest number of cases which is followed by cluster 1 being at the second and cluster 3 has least cases with it. 2.5 Text mining and data mining methods that are used in businesses Data mining is a process of analysing large data for determining specific trends which may have a significant impact on the overall performance of the businesses.Clusteringis a commonly used data mining method which is useful for businesses to group different data points on the basis of their characteristics (Regin, Rajest and Singh, 2021). For example, businesses in the retail industry use the clustering models for finding the segment of customers who purchase a particular product or service. Another data mining method isassociationwhich is used for determining the relationships between different variables that are present in databases. For example, electronics businesses use this method for recommending accessories which are specific to the digital devices which are purchased by customers(Gibert and et.al., 2018). Text mining is an aspect which is beneficial for extracting crucial data from standardised language text data.Term-based methodis effective for inspecting a document and analyse the key terms which have a unique identity. This is essential for reducing confusion among the users and enhancing their overall experiences(Gola and et.al., 2018).Phrase-based methodhelps in anticipating documents on the basis of phrases which is more effective and less complex. Businesses are able to reduce confusion with the help of this text mining method and resolve its issues. 2.6 Advantages and disadvantages of using Excel and SPSS BasisExcelSPSS AdvantagesExcel is beneficial forTheresultscanbe
Paraphrase This Document
Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
organizinglargedata into easy charts, tables and spreadsheets. Various formulas are available in Excel which are helpful in improvingcalculating complex data sets(Becker and Bröcker, 2021). generatedinSPSS withinamatterof seconds. Large amount of data can be handled in this softwarewhichis convenientfor analysing big data sets. DisadvantagesItrequiresalotof practice and training to learntheconceptsof Excel. It is costly to procure a professional version of Excelwhichcan handle large amount of data. Representationof chartsandtablesis complex in comparison withExcelwhichis slightlydifficultfor interpretation (Purwantoandet.al., 2021). Thisisanexpensive software which is not easilyaccessiblefor use.
REFERENCES Books and Journals Alshdaifat, E. A. And et.al, 2021. The effect of preprocessing techniques, applied to numeric features, on classification algorithms’ performance.Data.6(2). p.11. Al-Taie, M. Z., Kadry, S. and Lucas, J. P., 2019. Online data preprocessing: a case study approach.International Journal of Electrical and Computer Engineering.9(4). p.2620. Assarroudi, A. and et.al, 2018. Directedqualitativecontent analysis: the descriptionand elaboration of its underpinning methods and data analysis process.Journal of Research in Nursing. 23(1). pp.42-55. Becker, M. and Bröcker, J.O., 2021. Business Model Canvas-Overview of the main advantages and disadvantages. Belotto, M. J., 2018. Data analysis methods for qualitative research: Managing the challenges of coding, interrater reliability, and thematic analysis.The Qualitative Report.23(11). pp.2622-2633. Gibert, K. and et.al., 2018. Which method to use? An assessment of data mining methods in Environmental Data Science.Environmental modelling & software,110, pp.3-27. Gola,J.andet.al.,2018.Advancedmicrostructureclassificationbydatamining methods.Computational Materials Science.148. pp.324-335. Peck, E. M., Ayuso, S. E. and El-Etr, O., 2019, May. Data is personal: Attitudes and perceptions of data visualization in rural pennsylvania. InProceedings of the 2019 CHI Conference on Human Factors in Computing Systems(pp. 1-12). Purwanto, A. and et.al., 2021. Education Research Quantitative Analysis for Little Respondents: Comparing of Lisrel, Tetrad, GSCA, Amos, SmartPLS, WarpPLS, and SPSS.Jurnal Studi Guru Dan Pembelajaran,4(2). Regin, R., Rajest, S.S. and Singh, B., 2021. Spatial data mining methods databases and statistics point of views.Innovations in Information and CommunicationTechnology Series, pp.103-109. Tang, S., Yuan, S. and Zhu, Y., 2020. Data preprocessing techniques in convolutional neural networkbasedonfaultdiagnosistowardsrotatingmachinery.IEEEAccess.8. pp.149487-149496. Wang, C. and et.al, 2019. Visualization by example.Proceedings of the ACM on Programming Languages.4(POPL). pp.1-28. Wasserman,L.,2018.Topologicaldataanalysis.AnnualReviewofStatisticsandIts Application.5. pp.501-532. Weissgerber, T. L. And et.al, 2019. Reveal, don’t conceal: transforming data visualization to improve transparency.Circulation.140(18). pp.1506-1518. Online 1