Data Mining Assignment: Association Rule, Clustering Analysis

Added on 2020/03/16

AI Summary

This assignment solution analyzes a cosmetic data set using association rule mining and clustering techniques. The student utilizes XLMiner to perform association rule analysis, exploring the impact of confidence levels on rule generation and identifying redundant rules. The solution then applies hierarchical clustering, using dendrograms to determine optimal cluster numbers and discusses the importance of data normalization. Finally, it compares hierarchical clustering with K-means clustering, analyzes cluster characteristics, and suggests targeted offers for each cluster based on flight and non-flight transaction data. The document references several academic sources to support the analysis.

Data Mining
Student Id and Name

Business Case Analysis
1. Association rule
The Cosmetic small.xls data set has been used for the association rule analysis, which was
performed with the XLMiner analytical tool.
The XLMiner default settings, including a minimum confidence of 50%, have been used, and the
output is shown below.
(i) The first three rules in the output are discussed below (Zaki, 2000).
Association Rule 1
The first rule (row ID 1) has a confidence of 100%. This means that, in this data set, every
transaction that includes a brush also includes nail polish.
Association Rule 2
The second rule (row ID 2) has a confidence of 63.218%. This means that, in this data set,
63.218% of the transactions that include nail polish also include a brush.
Association Rule 3
The third rule (row ID 3) has a confidence of 59.195%. This means that, in this data set,
59.195% of the transactions that include nail polish also include bronzer (Rouse, 2004).
(i) A rule is redundant when the support it expresses is what would already be predicted
from a rule acting as its ancestor. An apt example is the second rule, which is redundant
when compared with rule one, because the support levels of the two rules are exactly the
same. Similarly, redundancy is observed between rule 16 and rule 17, where the latter is
the redundant rule.
The utility of the derived rules depends on the support and confidence associated with each
rule. Support reflects how significant (frequent) an item combination is, while confidence
reflects the conditional probability of the consequent given the antecedent. Combining both,
useful information can be derived (Zaki, 2000).
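To make the two measures concrete, the following is a minimal Python sketch of how support and confidence could be computed by hand. The five baskets are hypothetical toy data, not the Cosmetic data set, and the function names `support` and `confidence` are chosen here for illustration; XLMiner's internal computation is not shown in this report.

```python
# Toy baskets (hypothetical, NOT the Cosmetic data): each set is one transaction.
transactions = [
    {"brush", "nail polish"},
    {"nail polish", "bronzer"},
    {"brush", "nail polish", "bronzer"},
    {"nail polish"},
    {"bronzer"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

# The two directions of a rule share one support but can differ in confidence.
pair_support = support({"brush", "nail polish"}, transactions)
brush_to_polish = confidence({"brush"}, {"nail polish"}, transactions)
polish_to_brush = confidence({"nail polish"}, {"brush"}, transactions)
print(pair_support, brush_to_polish, polish_to_brush)
```

In this toy data both directions of the brush and nail polish rule have the same support, while the confidence depends on which item is the antecedent.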
(ii) The impact of raising the minimum confidence percentage input in XLMiner from 50% to
75% is discussed below.
The new XLMiner output (with confidence percentage as 75%) is shown below:
Impact
• The list of rules has been reduced from 9 to 1 because of the increase in the minimum
confidence percentage.
• The only remaining rule has a confidence of 100%: a person who purchases a brush also
purchases nail polish.
Reason
XLMiner has eliminated the rules whose confidence percentage is lower than the input minimum
confidence percentage. This is because XLMiner displays only those association rules whose
confidence is not lower than the input minimum confidence percentage (Rouse, 2004).
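The filtering effect can be sketched in a few lines of Python. The three confidence values are the ones quoted for rules 1 to 3 in this report; the remaining rules are not listed here, so they are left out. The dictionary layout, the name `filter_rules`, and the use of "greater than or equal to" as the cut-off test are assumptions made for illustration.

```python
# Confidence values quoted in this report for rules 1-3; the record layout
# and the >= comparison are assumptions, not XLMiner's documented behavior.
rules = [
    {"id": 1, "antecedent": "brush", "consequent": "nail polish", "confidence": 100.0},
    {"id": 2, "antecedent": "nail polish", "consequent": "brush", "confidence": 63.218},
    {"id": 3, "antecedent": "nail polish", "consequent": "bronzer", "confidence": 59.195},
]

def filter_rules(rules, min_confidence):
    """Keep only the rules whose confidence reaches the minimum threshold."""
    return [rule for rule in rules if rule["confidence"] >= min_confidence]

kept_at_50 = [rule["id"] for rule in filter_rules(rules, 50)]
kept_at_75 = [rule["id"] for rule in filter_rules(rules, 75)]
print(kept_at_50)  # [1, 2, 3]
print(kept_at_75)  # [1]
```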

Question 2
(a) A dendrogram is used to find the clusters. XLMiner is used to perform hierarchical
clustering on the set of variables and to draw the dendrogram.
Decision rule
The number of clusters is the number of branches crossed by a horizontal line drawn across
the dendrogram at the threshold distance.
Conclusion
Based on the above dendrogram, three clusters are crossed by the horizontal line drawn at a
threshold distance of 1000 (the threshold distance of 1000 is an assumption). Many smaller
sub-clusters also lie below the threshold distance, but they are not considered in the
present case.
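The decision rule can be illustrated with a small Python sketch that merges clusters bottom-up and stops once the closest pair is farther apart than the threshold, which corresponds to cutting the dendrogram at that height. Single linkage is assumed here because the report does not state which linkage XLMiner used; the points, the threshold of 3.0, and the function name `cut_at_threshold` are hypothetical.

```python
import math

def cut_at_threshold(points, threshold):
    """Agglomerative clustering (single linkage, an assumption): merge the two
    closest clusters until the closest pair is farther apart than `threshold`."""
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Single linkage: distance between the two nearest members.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > 1:
        # Find the closest pair of clusters and their distance.
        d, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break  # every remaining merge would happen above the cut line
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical 2-D points forming three well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 0)]
clusters = cut_at_threshold(points, threshold=3.0)
print(len(clusters))  # 3
```

Raising the threshold far enough merges everything into one cluster, while lowering it exposes the smaller sub-clusters.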
(b) The following difficulties arise when the data are not normalized before hierarchical
clustering is performed.
• The accuracy of the result is reduced.
• The distances between centroids are not represented accurately, which leads to incorrect
cluster formation.
• Variables with significantly high magnitudes dominate the distance measure and undermine
the variables measured on smaller scales.
Hence, the data set should be normalized before any clustering operation is applied through
XLMiner.
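The scale problem can be seen in a short Python sketch that compares Euclidean distances before and after z-score standardization. The three customers and their values are hypothetical, and z-scores are only one common normalization; the report does not say which method XLMiner applies.

```python
import math
import statistics

def z_scores(values):
    """Standardize a column to mean 0 and (population) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical customers: balance of miles and flight transactions last year.
balance = [20000, 21000, 80000]
flights = [1, 12, 2]

raw = list(zip(balance, flights))
scaled = list(zip(z_scores(balance), z_scores(flights)))

# Customers 0 and 1 have similar balances but very different flight activity;
# customers 0 and 2 have similar flight activity but very different balances.
raw_01 = math.dist(raw[0], raw[1])
raw_02 = math.dist(raw[0], raw[2])
scaled_01 = math.dist(scaled[0], scaled[1])
scaled_02 = math.dist(scaled[0], scaled[2])

print(raw_01, raw_02)        # raw distances are driven almost entirely by balance
print(scaled_01, scaled_02)  # after scaling, both variables contribute
```

On the raw values the flight-activity difference between customers 0 and 1 is practically invisible, whereas after standardization it counts about as much as the balance difference between customers 0 and 2.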
(c) Each cluster label needs to reflect the cluster features observed in the relevant
hierarchical clustering output, which is shown below.
Output (Cluster 1)
The noteworthy aspects derived from the above output are as follows.
• During the last year, both flight-based and non-flight transactions were minimal, hinting
at limited spending power.
• The balance of miles that can be used as award points is the lowest among the three
clusters.
Consequently, the appropriate label is “Middle Class Flyers”.
Output (Cluster 2)
The noteworthy aspects derived from the above output are as follows (Huang, 2014).
• During the last year, both flight-based and non-flight-based transactions were quite
healthy.
• The balance of miles that can be used as award points is quite high, coupled with high
values of the various miles variables (CC1, CC2 & CC3).
Consequently, the appropriate label is “High Networth Flyers”.
Output (Cluster 3)
The noteworthy aspects derived from the above output are as follows (Berkhin, 2015).
• During the last year, flight-based transactions were minimal, but this is not the case
for non-flight-based transactions.
• The balance of miles that can be used as award points is also quite healthy.
Consequently, the appropriate label is “Infrequent Flyers”.
(d) The XLMiner output for K-means clustering, used for comparison, is given below
(Huang, 1998).
The first observation is that the number of clusters is three, the same as in hierarchical
clustering. However, the most significant difference appears in the labeling of these
clusters (Berkhin, 2015).
For example, in cluster 2 the balance of miles is the lowest among the clusters, as are the
non-flight and flight transactions. This segment would therefore be “Middle Class Flyers”,
in contrast with the “High Networth Flyers” label derived under hierarchical clustering.
Thus, K-means clustering does not reproduce the same pattern (Huang, 2014).
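For readers unfamiliar with the procedure behind this output, the following is a minimal Python sketch of K-means (Lloyd's algorithm) with three clusters. The six two-dimensional points, the starting centroids, and the function name `k_means` are hypothetical; XLMiner's own initialization and options are not described in this report.

```python
import math

def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update until the
    centroids stop moving (assumes max_iter >= 1)."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(centroids[k])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical normalized (flight transactions, non-flight transactions) values.
points = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.9), (1.0, 0.8), (0.1, 0.9), (0.2, 1.0)]
labels, centroids = k_means(points, centroids=[points[0], points[2], points[4]])
print(labels)  # [0, 0, 1, 1, 2, 2]
print(centroids)
```

The cluster numbers returned by such a procedure depend on the starting centroids, so segment labels are read off the centroid values rather than the cluster number itself.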
(e) The respective offers for the targeted clusters are as follows.
Reference
Berkhin, P. (2015). Survey of Clustering Data Mining Techniques. Accrue Software, Inc. 123-47.
https://www.cc.gatech.edu/~isbell/reading/papers/berkhin02survey.pdf
Huang, Z. (1998). Extensions to the k-means Algorithm for Clustering Large Data Sets with
Categorical Values. Vol (2), 283-304.
https://link.springer.com/article/10.1023%2FA%3A1009769707641?LI=true
Huang, Z. (2014). Clustering Large Data Sets with Mixed Numeric and Categorical Values.
CSIRO Mathematical and Information Sciences. 16(2), 45-78.
https://grid.cs.gsu.edu/~wkim/index_files/papers/kprototype.pdf
Rouse, M. (2004). Association Rules in Data Mining. Retrieved from
http://searchbusinessanalytics.techtarget.com/definition/association-rules-in-data-mining
Shmueli, G., Bruce, C.P., Yahav, I., Patel, R. N., Kenneth, C., & Lichtendahl, J. (2016). Data
Mining For Business Analytics: Concepts Techniques and Application (2nd ed.). London:
John Wiley & Sons.
Zaki, M. J. (2000). Generating non-redundant association rules. In: Proceedings of the ACM
SIGKDD, pp. 34–43.