This report explores the current trends in data warehousing, business intelligence, and data mining. It analyzes the decline in sales using Microsoft Excel and Weka. The report also discusses the workings of Weka and common data mining methods.
Contribute Materials
Your contribution can guide someone’s learning journey. Share your
documents today.
DATA HANDLING AND BUSINESS INTELLIGENCE
Secure Best Marks with AI Grader
Need help grading? Try our AI Grader for instant feedback on your assignments.
Contents Contents...........................................................................................................................................2 INTRODUCTION...........................................................................................................................1 Part 1................................................................................................................................................1 Current trends in data warehousing, business intelligence and data mining...............................1 Determining the decline in sales over the years while evaluating the use of Excel for pre- processing the data, analysing the data and visualising the data.................................................2 Part 2................................................................................................................................................7 Workings of Weka with explanation of results...........................................................................7 Common data mining methods....................................................................................................9 Advantages/disadvantages of Weka over Excel........................................................................10 CONCLUSION..............................................................................................................................11 REFERENCES..............................................................................................................................12
INTRODUCTION Data handling and business intelligence are the branches of data analysis in which decisions are made by using a technology driven process where the most important asset to an organisation is their data(Ataman, Kulick and Sim, 2011). The present report is developed with the aim of gain an understanding of using statistical analysis tools such as Excel and Weka. For this aim, this report is divided into two sections. The first section of this report will include current trends in the world of data mining and warehousing. Along with it, this section will include analysis of Superstore data in which use of Microsoft Excel will be analysed including pre processing, analysis and visualisation of data. Second section of this report will include the analysis of Audi dealership data. This data will be evaluated using the software application of Weka. This application and data will be used to analyse certain patterns and behaviour of data. This section will also include analysis of data mining methods that are used in real life. Lastly, this section will include advantages and disadvantages of Weka over Microsoft Excel. Part 1 Current trends in data warehousing, business intelligence and data mining Different trends in data warehousing Complex Data Marts Will Define the Future Business Models - With changing time period, needs and requirement of information is also changing. So, Data Marts will be using business models with new specification and focused area zones.Moreover, in in this modern time, everyone is focused towards flow of speed so Data Marts will improve the speed of functionality of their data higher scale which will also leads to higher and better efficiency. Column-based Storage is on the Rise - Another trend which is rising is retrieve of data by adopting column based storage. This is because row based storage tools time to get or processed the credentials. But with add of column it will be easier to retrieve data analytics. Mixed Workloads Are Becoming Common - Further comes the mixed workloads amongst data warehousing. In relevance with Data warehousing they includes several types of workloads like operational BI, data mining etc.Therefore, they as these workloads occurs new causes and issues also rises.So, mixed workloads are in trends with new structure (El-Sappagh and et.al, 2011). 1
Paraphrase This Document
Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
Trends in business intelligence The rise of explainable IT- This helps in managing information and data related to artificial intelligence. Data management enhancement -This is involved in managing the data effectively. There is need of evaluating the clustering of data for making good results. Determining the decline in sales over the years while evaluating the use of Excel for pre- processing the data, analysing the data and visualising the data Microsoft Excel is a software application which is used to storing the data and analysing it with the help of functions, formulas and tools. The current data set of Superstore data is analysed using three steps of data analysis which are data pre-processing, data analysis and data visualisation. All these steps are analysed below to determine the decrease of sales of Superstore over years: Data pre processing This step is dedicated to process the raw data so that it can be used for further analysis. This process includes few tasks such as data cleaning, data transforming and data reduction. The data set of superstore which is provided has two types of data variables; one of such type has numeric variables and other category has string variables that includes text information. It must be noted that non numeric variables or string variables are hard to be analysed in Excel as they are first required to be coded in numbers. Considering such limitation, all numeric data is first given priority to be used for the analysis of sales of Superstore. The superstore data is first cleaned by using the option of “SORT”. This option can be used via Data >> Sort >> Sort by >> Order date. By using this option, entire data was sorted by order date in which the orders which were dated first were shown first in data set(Held, Moriarty and Richardson, 2019). When the entire data set was sorted and cleaned, the data was then reduced in another worksheet of same Excel file. All numeric data variables were considered which are Order Quantity, Discount, Unit Price, Shipping Cost and Sales. After the data cleaning and data reduction process was done, the reduced data was then transformed using PIVOT table. A PIVOT table is an advanced Excel function which was used to transformed the data into rows and columns where all four years are presented in rows and all variables are presented in columns with their sum calculated(Hertrich and Mayrhofer, 2016). 2
By the above process of data pre processing which has done by using Excel, it can be evaluated that Excel application can be effectively used to pre process a dataset. Data analysis and data visualisation The second and third step in the process of coming to a conclusion for superstore are data analysis and data visualisation. In the step of data analysis, the Excel function of SUM() will be used and for data visualisation, Excel tables and graphs will be used. The aim of analysing the Superstore data is to determine the decrease in sales due to which first variable analysed here is “Sales”(Walkenbach, 2019). The data gained from PIVOT table is used to generate a graph attached below: The above graph provides a clear overview that the sales of Superstore are decreasing in year 2010 and 2011 but in year 2012, sales of this year starts to enhance again. The maximum sales of Superstore were in the year of 2009 which decreased in 2010 and then decreased even more in 2011 and then slightly increased in 2012; So, it can be said that Sales of Superstore are in a decreasing trend. Once it is estimated that the sales are decreasing, it is now important to know why the sales of this company are decreasing(Team, 2016). For determining the reasons behind the decrease of sales of Superstore, all numeric variables of the data set will be analyzed to understand the pattern or trend of such variables and any variable which is reflecting similar trend as of sales will be considered as the factor due to which sales of this company are decreasing. The evidence of such variables is given below in the form of tables and graphs. RowSum of OrderSum ofSum of UnitSum ofSum of 3
LabelsQuantityDiscountPriceShipping CostSales 200954508105.39232830.9828481.764209896.85 201054379105.81162467.5927354.263560087.04 201151413101.67159653.1124939.853429944.98 201254480104.32195467.5527055.173715671.95 The above graph is prepared for order quantity, the values of such variables is showing similar pattern as sales as order quantity was at its maximum in 2009 which decreased till 2011 and then again increased in 2012. So, it can be said that order quantity is a viable factor or reason due to which sales of superstore are decreasing. 4
Secure Best Marks with AI Grader
Need help grading? Try our AI Grader for instant feedback on your assignments.
Another numeric variable is analyzed here which is discount, this variable is somewhere showing a similar pattern as the discount in 2011 was at its minimum and such aspect is similar in both discount and sales. The above graph is for unit price variable and the trend of this variable is highly similar to the pattern of sales. The unit price variable is the variable reflecting unit price of the products which are marketed and sold by Superstore and such prices were maximum in year 2009 and were minimum in 2011 just like sales of this company. 5
The last numeric variable analyzed in this case is shipping cost. This variable is also showing similar pattern as sales which is the decreasing pattern till 2011 and then slightly increasing in 2012. By above analysis, it can be seen that somewhere all four variables are showing similar trends to sales, so in order to determine most important factor impacting sales of Superstore, the Excel tool of correlation is used. Correlation helps in determining that whether two variables are related or not and if they are related then what is the level of their relationship. Order QuantityDiscountUnit PriceShipping CostSales Order Quantity1 Discount0.9339764261 Unit Price0.5722873590.4445883951 Shipping Cost0.9167144380.9169131870.7640935441 Sales0.6078583430.5540200740.9765453470.8375090531 The above correlation table is representing correlation coefficients for relationship among each variable. The order quantity has a significant relationship with sales as it is showing correlation coefficient of 0.6, discount is showing 0.55 correlation coefficient, unit price has 0.97 and shipping cost has0.83. Thisanalysisevidentlyproves thatunit price hasstrongest relationship with sales and unit price is the factor due to which sales of superstore are decreasing. Superstore has decreased their unit prices of products in 2011 due to which, sales of this company have also observed to decreased. 6
The above analysis evidently concludes that Microsoft Excel can be effectively used for pre processing, analysis and visualisation of data as this software application is successfully used for determining the decrease of sales in superstore and the reasons assisting this decrease. Part 2 Workings of Weka with explanation of results Weka is a statistical software application which is used for objectives such as clustering and data analysis(Zubi and Mahmmud, 2013). This software is used here for analysing the Audi dealership data and method of clustering used is K means clustering. The Audi dealership data has various variables and the data is in the form of binary codes in which 0 represents negative and 1 represents positive. The results of clustering are attached below: === Run information === Scheme:weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -S 10 Relation:audidealership2 Instances:100 Attributes:8 Dealership Showroom InternetSearch RS7 A4 TT Financing Purchase Test mode:evaluate on training data === Model and evaluation on training set === 7
Paraphrase This Document
Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
kMeans ====== Number of iterations: 6 Within cluster sum of squared errors: 160.2980769230769 Missing values globally replaced with mean/mode Cluster centroids: Cluster# AttributeFull Data01 (100)(48)(52) ================================================= Dealership0.540.33330.7308 Showroom0.640.66670.6154 Internet Search0.390.43750.3462 RS70.530.29170.75 A40.550.81250.3077 TT0.50.58330.4231 Financing0.60.33330.8462 Purchase0.380.04170.6923 Time taken to build model (full training data) : 0.02 seconds === Model and evaluation on training set === Clustered Instances 048 ( 48%) 8
152 ( 52%) From the above screenshots, it can be seen that out of 100 people 54 has entered into dealership store and 46 have returned from outside. There are two clusters in this analysis, the first cluster is 0 with overall mean of 48 and second cluster is 1 with overall mean of 52. Each variable screenshot is reflecting certain step; the last step is of purchase which is showing that only 38 out of 100 people has made to the final step of purchase of a car(Kallner, 2016). To analyze the reason that why only 38 people have purchase the car, the screenshot of car models is considered. Most of the people has showed their interest in A4 version of car which implies Audi dealership should market and advertise this version of car as it will help them to increase their sales(Bernard and Senjayawati, 2019). Common data mining methods Data mining is defined as the practise which is used for evaluation and analysis of pre existing databases used in generation of new data and information. It is very essential for completing tasks and projects effectively. Data mining is known as the extraction of useful information from raw data available within the company.Various companies use this for 9
conversion of raw data into useful information. There are following types of data mining methods - Decision trees –This is defined as useful method in which there is development of root nodes, leaf node and the branches. In this, every node is providing information about results of previous node. There is a tree structure formed in this method and every raw data is converted into finished information. There are different organisations using this method for conversion. This is helpful for creating and managing financial data within company (Thusoo and et.al, 2010). Clustering -It is another data mining method in which data is divided into smaller groups. These are called as chunks and they are solved effectively for making desired results. There is distribution of data and this is managed for developing the similarities and managing the data effectively. This method is used for analysing new trends of clusters within business. Correlation –It is known as the analysis of relationship in which two variables are selected. There is analysis of correlation between these variables and the result is analysed. Clustering is done within the area in which there are various frameworks used for analysing correlation among twoormorevariables.Thisismanagedfordevelopinggoodresultsanddetermining determinantsof correlation properly.It is an important method used by companies for evaluating the business activities and sales. Tracking sequential patterns –This is known as the most important method or technique that is used for managing the work and analysing importance of data mining. There is need to develop and manage series of activities in order to recognise transactional data. It is an objective of data mining that different activities are managed properly. This is an important method used by large organisations. It is best suited for retail sector companies to analyse the situations of real world. This method is useful for evaluating the scene of companies about the sales and develop effective strategies (Kimball and et.al, 2008). Advantages/disadvantages of Weka over Excel Weka and Microsoft Excel are both statistical software and both are appropriate for certain conditions of data. Weka is a software which is usually used for clustering and Excel is an application which is used not only used for data analysis but is also used for pre-processing the raw data and visualising the findings from the data. Advantages and disadvantages of Weka over excel are analysed below: 10
Secure Best Marks with AI Grader
Need help grading? Try our AI Grader for instant feedback on your assignments.
Advantages Weka saves both time and money at calculation Viruses generally attach to Excel files but not in Weka It provides different built-in 'visualization' panels which is not in case of Excel. Disadvantages Weka can only handle small datasets unlike Excel WEKA is an open source data and does not contain high information due to this many big organisations cannot use this system software and cannot make trust upon this. CONCLUSION From the above report, it has been concluded that business intelligence and data handling is the future of business analytics as these help in analysing the data and further leads to effective decision making. It has been also concluded from above report that Microsoft Excel software is better than WEKA when the dataset is large and WEKA is appropriate for smaller and binary data. 11
REFERENCES Books and Journals Ataman, K., Kulick, G. and Sim, T., 2011. Teaching decision tree classification using Microsoft Excel.INFORMS Transactions on Education.11(3). pp.123-131. Bernard, M. and Senjayawati, E., 2019. Developing the Students’ Ability in Understanding MathematicsandSelf-confidencewithVBAforExcel.JRAMathEdu(Journalof Research and Advances in Mathematics Education).1(1). pp.45-56. El-Sappagh, S.H.A., and et.al, 2011. A proposed model for data warehouse ETL processes. Journal of King Saud University-Computer and Information Sciences, .23(2). pp.91- 104. Held, B., Moriarty, B. and Richardson, T., 2019.Microsoft Excel Functions and Formulas. Stylus Publishing, LLC. Hertrich, S. and Mayrhofer, U., 2016. Audi and the Chinese Market: A Success Story. InMarket Entry in China(pp. 37-44). Springer, Cham. Kallner,A.,2016.AstudyofsimulatednormalprobabilityfunctionsusingMicrosoft Excel.Accreditation and Quality Assurance.21(4). pp.271-276. Kimball, R., and et.al, 2008.The data warehouse lifecycle toolkit. John Wiley & Sons. Team, R.C., 2016. foreign: read data stored by Minitab, S, SAS, SPSS, Stata, Systat, Weka, dBase.R package version 0.8-67. Thusoo, A., and et.al, 2010, March. Hive-a petabyte scale data warehouse using hadoop. In2010 IEEE 26th international conference on data engineering (ICDE 2010), .(pp. 996-1005). IEEE. Walkenbach, J., 2019. Microsoft? Excel? 2016 bible. Zubi, Z.S. and Mahmmud, A.A., 2013. Using data mining techniques to analyze crime patterns in the libyan national crime data. InProceedings of the 1st WSEAS Interna-tional Conference on Image Processing and Pattern Recognition(pp. 79-85). 12