Data Handling Report: Excel, Weka, and Data Mining Methods
VerifiedAdded on 2023/01/12
|19
|3921
|77
Report
AI Summary
This report delves into data handling, data warehousing, and data mining techniques. It begins by evaluating the use of Microsoft Excel for analyzing the 'Superstore' dataset, covering data preprocessing, analysis, and visualization using pivot tables, charts, and graphs to identify trends in sales and profit. The report then explores the workings of Weka for clustering using the 'audidealership' dataset, specifically employing k-means clustering to understand customer behavior. Common data mining methods are explained, and a comparison of the advantages and disadvantages of Weka over Excel is provided. The analysis reveals insights into the decline in sales and profits of Superstore, highlighting the impact of discounts, unit prices, shipping costs, and order quantity. The Weka analysis of the audidealership data provides insights into customer interactions, from dealership visits to final purchases, helping to identify areas for improvement in customer engagement and sales strategies.

Data handling
Paraphrase This Document
Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser

Contents
INTRODUCTION...........................................................................................................................1
Part 1................................................................................................................................................1
Evaluating the use of Excel by using the Superstore data...........................................................1
Part 2................................................................................................................................................5
Workings of Weka Clustering.....................................................................................................5
Explaining the most common data mining methods..................................................................13
Discussing the advantages/disadvantages of Weka over Excel.................................................13
CONCLUSION..............................................................................................................................13
REFERENCES..............................................................................................................................14
INTRODUCTION...........................................................................................................................1
Part 1................................................................................................................................................1
Evaluating the use of Excel by using the Superstore data...........................................................1
Part 2................................................................................................................................................5
Workings of Weka Clustering.....................................................................................................5
Explaining the most common data mining methods..................................................................13
Discussing the advantages/disadvantages of Weka over Excel.................................................13
CONCLUSION..............................................................................................................................13
REFERENCES..............................................................................................................................14

INTRODUCTION
The concept of data handling is related with processing, analysing and then presenting the
data so that numerical information can be understood by everyone. This concept is more like a
process in which the pre recorded data is analysed using certain tools and techniques so that
meaningful insights can be gained from that data(Shmueli, Patel and Bruce, 2011). The main aim
of this report is to evaluate the current trends in data warehousing along with building an
understanding regarding principles of predictive analytic software. For this purpose, two data
sets are used in this report which is “Superstore’ and “audidealership”.
In this report, the first data set is used to pre process, analyse and then visualise the data
using analytical software of Microsoft Excel. This process is done by using Excel functions of
Pivot tables, charts and graphs. Using the second dataset of “audidealership”, clustering is done
using analytic software of Weka. Various data mining methods are also analysed along with
evaluation of merits and demerits of Weka over Excel.
Part 1
Evaluating the use of Excel by using the Superstore data
Current trends in data warehousing, business intelligence and data mining
Data warehousing the process of storing the data into warehouses where the security of
the data can be ensured. The aim behind this concept is to combine all the data collected from
different sources in one place so that it can easily be analysed for the purpose of gaining data
insights and decision making (IşıK, Jones and Sidorova, 2013). There are various current trends
in the field of data warehousing which are database management system, agile development
methodologies, data streaming and consolidation with business intelligence organisations. It is
difficult for organisations to manage and store their data due to which they hire firms which are
specialised in BI.
In the field of business intelligence, current trends which are shaping the future are usage
of models such as SaaS and SOA. Also the software of Web 2.0 based visualisation is a current
trend in BI.
Data mining is different from the concept of data warehousing. In this concept, data is
extracted from different sources and then organised for better usage (Minelli, Chambers and
1
The concept of data handling is related with processing, analysing and then presenting the
data so that numerical information can be understood by everyone. This concept is more like a
process in which the pre recorded data is analysed using certain tools and techniques so that
meaningful insights can be gained from that data(Shmueli, Patel and Bruce, 2011). The main aim
of this report is to evaluate the current trends in data warehousing along with building an
understanding regarding principles of predictive analytic software. For this purpose, two data
sets are used in this report which is “Superstore’ and “audidealership”.
In this report, the first data set is used to pre process, analyse and then visualise the data
using analytical software of Microsoft Excel. This process is done by using Excel functions of
Pivot tables, charts and graphs. Using the second dataset of “audidealership”, clustering is done
using analytic software of Weka. Various data mining methods are also analysed along with
evaluation of merits and demerits of Weka over Excel.
Part 1
Evaluating the use of Excel by using the Superstore data
Current trends in data warehousing, business intelligence and data mining
Data warehousing the process of storing the data into warehouses where the security of
the data can be ensured. The aim behind this concept is to combine all the data collected from
different sources in one place so that it can easily be analysed for the purpose of gaining data
insights and decision making (IşıK, Jones and Sidorova, 2013). There are various current trends
in the field of data warehousing which are database management system, agile development
methodologies, data streaming and consolidation with business intelligence organisations. It is
difficult for organisations to manage and store their data due to which they hire firms which are
specialised in BI.
In the field of business intelligence, current trends which are shaping the future are usage
of models such as SaaS and SOA. Also the software of Web 2.0 based visualisation is a current
trend in BI.
Data mining is different from the concept of data warehousing. In this concept, data is
extracted from different sources and then organised for better usage (Minelli, Chambers and
1
⊘ This is a preview!⊘
Do you want full access?
Subscribe today to unlock all pages.

Trusted by 1+ million students worldwide

Dhiraj, 2013). This concept is based on complex algorithms due to which, it requires specialised
skills to mine a data. Current trends in the field of data mining are sequential and time series
mining in which data of cyclical and seasonal trends can be mined.
Pre processing the data
Superstore is an organisation which is facing the decline in their sales and profit over the
years. In order to determine the reason behind this decline, the data of Superstore is used. The
first step of evaluation is the pre processing the data.
Data pre processing allows cleaning and transforming the data in such a way that it can
be used for effective analysis and visualisation. For the data provided, the information is first
cleaned using identifying the missing values. In Microsoft Excel application, all the data is
selected and then empty cells are navigated. The Excel function which applied is “Shift key +
F4”. From this process, it has been identified that there are various missing values in the column
of “product base margin”. Once the missing the values are identified, these cells are filled with
their column average (Talati, McRobbie and Watt, 2012).
The next stage of data pre processing is data transformation. In this the data will be
transformed into normal. For this all the numerical values (sales, discount, profit, unit price,
shipping cost and product base margin) in data set are formatted to two decimal points.
The third and last stage of data pre processing is data reduction. The data of superstore
worksheet is reduced in another worksheet by the way of Pivot table. In this table only those
variables are selected which can impact the superstore’s sales and profit. These variables are
order date, Sum of Sales, Sum of Discount, Sum of Profit, Sum of Unit Price, Sum of Shipping
Cost, sum of Product Base Margin, individual order priority, shipment mode, region, customer
segment, product category and lastly product container.
Analysing and visualising the data
Sum of
Profit
Sum of
Sales
Sum of
Discount
Sum of
Shipping Cost
Sum of
Unit Price
Sum of Order
Quantity
2009 434096 4209897 105.39 28481.8 232831 54508
2010 364917 3560087 105.81 27354.3 162468 54379
2011 380311 3429945 101.67 24939.9 159653 51413
2012 342444 3715672 104.32 27055.2 195468 54480
2
skills to mine a data. Current trends in the field of data mining are sequential and time series
mining in which data of cyclical and seasonal trends can be mined.
Pre processing the data
Superstore is an organisation which is facing the decline in their sales and profit over the
years. In order to determine the reason behind this decline, the data of Superstore is used. The
first step of evaluation is the pre processing the data.
Data pre processing allows cleaning and transforming the data in such a way that it can
be used for effective analysis and visualisation. For the data provided, the information is first
cleaned using identifying the missing values. In Microsoft Excel application, all the data is
selected and then empty cells are navigated. The Excel function which applied is “Shift key +
F4”. From this process, it has been identified that there are various missing values in the column
of “product base margin”. Once the missing the values are identified, these cells are filled with
their column average (Talati, McRobbie and Watt, 2012).
The next stage of data pre processing is data transformation. In this the data will be
transformed into normal. For this all the numerical values (sales, discount, profit, unit price,
shipping cost and product base margin) in data set are formatted to two decimal points.
The third and last stage of data pre processing is data reduction. The data of superstore
worksheet is reduced in another worksheet by the way of Pivot table. In this table only those
variables are selected which can impact the superstore’s sales and profit. These variables are
order date, Sum of Sales, Sum of Discount, Sum of Profit, Sum of Unit Price, Sum of Shipping
Cost, sum of Product Base Margin, individual order priority, shipment mode, region, customer
segment, product category and lastly product container.
Analysing and visualising the data
Sum of
Profit
Sum of
Sales
Sum of
Discount
Sum of
Shipping Cost
Sum of
Unit Price
Sum of Order
Quantity
2009 434096 4209897 105.39 28481.8 232831 54508
2010 364917 3560087 105.81 27354.3 162468 54379
2011 380311 3429945 101.67 24939.9 159653 51413
2012 342444 3715672 104.32 27055.2 195468 54480
2
Paraphrase This Document
Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser

Analysing a data is a complex procedure which includes tools such a descriptive statistics
and data analysis. First sum of all the sales and profit are calculated using sum function of Excel
which is then presented in a graph below:
The above graph is evidence that sales and profit of Superstore are declining. As it can be
seen in above graph that blue and red line are declining at every point of the year. In order to
analyse the reason behind this decline, sum of discount, unit price and order price is also
calculated. For each of these four variables, graphs are developed as follows:
The above graph of sum of disocunt shows the similar trend as sales as in both the cases
the value in 2011 is minimum which means due to lower disocunt allowed, the sales of
superstore are declining continously.
3
and data analysis. First sum of all the sales and profit are calculated using sum function of Excel
which is then presented in a graph below:
The above graph is evidence that sales and profit of Superstore are declining. As it can be
seen in above graph that blue and red line are declining at every point of the year. In order to
analyse the reason behind this decline, sum of discount, unit price and order price is also
calculated. For each of these four variables, graphs are developed as follows:
The above graph of sum of disocunt shows the similar trend as sales as in both the cases
the value in 2011 is minimum which means due to lower disocunt allowed, the sales of
superstore are declining continously.
3

Shipping costs is the another variable which is presenmting the similar pattern as sales as
in both the cases values are minimum in 2011 which implies due to low shipping costs incurred
by the company, their sales are declining.
Unit price is the value of a single product which is sold by an organisation. These unit
prices are analysed by using Sum function. It has been seen that due to decreasing unit prices of
products in 2011, the sales of this company are declining.
4
in both the cases values are minimum in 2011 which implies due to low shipping costs incurred
by the company, their sales are declining.
Unit price is the value of a single product which is sold by an organisation. These unit
prices are analysed by using Sum function. It has been seen that due to decreasing unit prices of
products in 2011, the sales of this company are declining.
4
⊘ This is a preview!⊘
Do you want full access?
Subscribe today to unlock all pages.

Trusted by 1+ million students worldwide

The above variable is sum of order quantity which is also a reason of declining sales of
superstore.
None of the variables presents the similar pattern as profit of superstore due to which data
analysis technique of correlation is used which is considered as more effective technique of
analysing the relationship between two variables. Relationship between profit (column 1), unit
price (column 2) and shipping costs (column 3) are analysed. The results of correlationship are
presented below:
Column 1 Column 2
Colum
n 3
Column 1 1
Column 2 -0.00885 1
Column 3 -0.02136
0.23995937
5 1
From the above analysis, it has been analysed that column 1 and 2 that is profit and price
has a significant relationship with each other but this relationship is negative which means with
decreasing unit price of a product, the overall profit of the company increases. Column 1 and 3
which are profit and shipping costs also a negative significant relationship with each other which
states with the increasing shipping costs of the company, their profit decreases.
Conclusion on determining the decline in sales/profits over the years
The analysis and visualisation which have done above shows those sales of the superstore
are declining due to sum of discount, unit price, shipping cost and order quantity. The reason due
to which profit of this company is decreasing is shipping costs. Due to increasing shipping costs
of this company, the profit of this company is declining.
Part 2
2.1 Workings of Weka Clustering
Weka is a statistical program which helps in running the algorithms. WEKA is known as
Waikato Environment for Knowledge Analysis that helps an investigator to mine the data which
is pre recorded. This application allows to analyse present data set in such a way that it can
provide additional information and assits in decision making (Sharda and et.al., 2014).
Weka is analytical software which helps in pre processing and analysing the data. For the
data set of “audidealership”, this software is used to perform clusters (Zheng, Fader and
5
superstore.
None of the variables presents the similar pattern as profit of superstore due to which data
analysis technique of correlation is used which is considered as more effective technique of
analysing the relationship between two variables. Relationship between profit (column 1), unit
price (column 2) and shipping costs (column 3) are analysed. The results of correlationship are
presented below:
Column 1 Column 2
Colum
n 3
Column 1 1
Column 2 -0.00885 1
Column 3 -0.02136
0.23995937
5 1
From the above analysis, it has been analysed that column 1 and 2 that is profit and price
has a significant relationship with each other but this relationship is negative which means with
decreasing unit price of a product, the overall profit of the company increases. Column 1 and 3
which are profit and shipping costs also a negative significant relationship with each other which
states with the increasing shipping costs of the company, their profit decreases.
Conclusion on determining the decline in sales/profits over the years
The analysis and visualisation which have done above shows those sales of the superstore
are declining due to sum of discount, unit price, shipping cost and order quantity. The reason due
to which profit of this company is decreasing is shipping costs. Due to increasing shipping costs
of this company, the profit of this company is declining.
Part 2
2.1 Workings of Weka Clustering
Weka is a statistical program which helps in running the algorithms. WEKA is known as
Waikato Environment for Knowledge Analysis that helps an investigator to mine the data which
is pre recorded. This application allows to analyse present data set in such a way that it can
provide additional information and assits in decision making (Sharda and et.al., 2014).
Weka is analytical software which helps in pre processing and analysing the data. For the
data set of “audidealership”, this software is used to perform clusters (Zheng, Fader and
5
Paraphrase This Document
Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser

Padmanabhan, 2012). The method of clustering which has been used is k means clustering and
number of groups is selected as 2 which are “0 cluster” and “1 cluster”.
=== Run information ===
Scheme:weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -
I 500 -S 10
Relation: audidealership2
Instances: 100
Attributes: 8
Dealership
Showroom
InternetSearch
RS7
A4
TT
Financing
Purchase
Test mode:evaluate on training data
=== Model and evaluation on training set ===
kMeans
======
Number of iterations: 6
Within cluster sum of squared errors: 160.2980769230769
Missing values globally replaced with mean/mode
Cluster centroids:
6
number of groups is selected as 2 which are “0 cluster” and “1 cluster”.
=== Run information ===
Scheme:weka.clusterers.SimpleKMeans -N 2 -A "weka.core.EuclideanDistance -R first-last" -
I 500 -S 10
Relation: audidealership2
Instances: 100
Attributes: 8
Dealership
Showroom
InternetSearch
RS7
A4
TT
Financing
Purchase
Test mode:evaluate on training data
=== Model and evaluation on training set ===
kMeans
======
Number of iterations: 6
Within cluster sum of squared errors: 160.2980769230769
Missing values globally replaced with mean/mode
Cluster centroids:
6

Cluster#
Attribute Full Data 0 1
(100) (48) (52)
=================================================
Dealership 0.54 0.3333 0.7308
Showroom 0.64 0.6667 0.6154
InternetSearch 0.39 0.4375 0.3462
RS7 0.53 0.2917 0.75
A4 0.55 0.8125 0.3077
TT 0.5 0.5833 0.4231
Financing 0.6 0.3333 0.8462
Purchase 0.38 0.0417 0.6923
Time taken to build model (full training data) : 0.02 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 48 ( 48%)
1 52 ( 52%)
7
Attribute Full Data 0 1
(100) (48) (52)
=================================================
Dealership 0.54 0.3333 0.7308
Showroom 0.64 0.6667 0.6154
InternetSearch 0.39 0.4375 0.3462
RS7 0.53 0.2917 0.75
A4 0.55 0.8125 0.3077
TT 0.5 0.5833 0.4231
Financing 0.6 0.3333 0.8462
Purchase 0.38 0.0417 0.6923
Time taken to build model (full training data) : 0.02 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 48 ( 48%)
1 52 ( 52%)
7
⊘ This is a preview!⊘
Do you want full access?
Subscribe today to unlock all pages.

Trusted by 1+ million students worldwide

Using Weka, two clusters are formed; the cluster 0 has 84% instances and cluster 1 has
52% instances. The data set of audidealership has various variables, the first variable among
them is delearship. If the data cell shows 0 that means people has not even walked into
delaership and if the data cell shows 1 then people has walked thoeugh the dealership. From the
above results and graph, it has been seen that out of 100 people only 54 has walked through the
delearship.
8
52% instances. The data set of audidealership has various variables, the first variable among
them is delearship. If the data cell shows 0 that means people has not even walked into
delaership and if the data cell shows 1 then people has walked thoeugh the dealership. From the
above results and graph, it has been seen that out of 100 people only 54 has walked through the
delearship.
8
Paraphrase This Document
Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser

After walking into the delaership, next step is to walk into showroom. Out of 52 people
who has walked into dealership, only 64% among them has walked into showroom and rest has
returned.
9
who has walked into dealership, only 64% among them has walked into showroom and rest has
returned.
9

Among the total people who has walked into showroom, only 39% has done an internet
search for reseraching about the cars and their features.
Out of all the people who has entered into showroom, only 53% among them has
reviewed RS7 as their purchasing option. Rest 47% has not even looked at this option of vehicle.
10
search for reseraching about the cars and their features.
Out of all the people who has entered into showroom, only 53% among them has
reviewed RS7 as their purchasing option. Rest 47% has not even looked at this option of vehicle.
10
⊘ This is a preview!⊘
Do you want full access?
Subscribe today to unlock all pages.

Trusted by 1+ million students worldwide
1 out of 19
Related Documents

Your All-in-One AI-Powered Toolkit for Academic Success.
+13062052269
info@desklib.com
Available 24*7 on WhatsApp / Email
Unlock your academic potential
Copyright © 2020–2025 A2Z Services. All Rights Reserved. Developed and managed by ZUCOL.