Data Mining Assignment on Association Rules and Clustering Techniques

Added on 2020/05/11

AI Summary

This document presents a detailed solution to a data mining assignment, addressing association rules and clustering techniques using XL Miner. The solution begins by explaining association rules, rule redundancy, and the interpretation of XL Miner output. It then delves into hierarchical clustering, discussing cluster count determination, the impact of raw data, and cluster labeling based on centroid distances. The solution also compares K-means clustering output with hierarchical clustering to assess the consistency of cluster classifications. Finally, it proposes targeted offers for specific clusters, based on the insights gained from the clustering analysis. The assignment emphasizes practical application and interpretation of data mining concepts.

Data Mining

Question 1
The objective is to derive the association rules from the given data on purchases of cosmetic
items. The tool used is XLMiner, and the following output has been obtained.
(i) The first three rules are briefly explained below (Rouse, 2004).
• Association rule 1: A customer who purchases brushes also purchases nail polish, with a
confidence (conditional probability) of 1.
• Association rule 2: A customer who purchases nail polish also purchases brushes, with a
confidence of 0.6322.
• Association rule 3: A customer who purchases nail polish also purchases bronzer, with a
confidence of 0.6322.
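The confidence values quoted above are conditional probabilities, and a rule and its reverse need not have the same confidence. The following is a minimal sketch of that calculation, assuming a small hypothetical set of baskets (the assignment's actual data set is not reproduced here, and the function name `confidence` is illustrative only).

```python
# Hypothetical baskets for illustration; not the assignment's data.
transactions = [
    {"brushes", "nail polish"},
    {"nail polish", "bronzer"},
    {"brushes", "nail polish", "bronzer"},
    {"nail polish"},
    {"bronzer"},
]


def confidence(antecedent, consequent, baskets):
    """Share of baskets containing the antecedent that also contain the consequent."""
    with_antecedent = [b for b in baskets if antecedent <= b]
    if not with_antecedent:
        return 0.0
    with_both = [b for b in with_antecedent if consequent <= b]
    return len(with_both) / len(with_antecedent)


# The same pair of items gives different confidence in each direction.
brushes_to_polish = confidence({"brushes"}, {"nail polish"}, transactions)
polish_to_brushes = confidence({"nail polish"}, {"brushes"}, transactions)
print(brushes_to_polish, polish_to_brushes)
```

In this made-up data every basket with brushes also contains nail polish, while only half of the nail polish baskets contain brushes, which mirrors the asymmetry between rules 1 and 2.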
(ii) Rule redundancy is a common issue when working with association rules. Once redundant
rules are identified, they should be eliminated from the output so that only the remainder
is used. Before proceeding, it is essential to understand when a particular rule is termed
redundant. Redundancy arises only when a given rule is compared with another rule that is
its immediate ancestor, i.e. the preceding rule. If the support of the rule in question is
obtainable from the ancestor, the rule is redundant (Zaki, 2000).
This can be illustrated with association rule 2. Rule 2 and its ancestor, rule 1, have the
same support, and the only visible difference is the confidence level, on which the
ancestor is superior. Hence, rule 2 can be eliminated on account of redundancy
(Abramowics, 2013).
The utility of the rules can be assessed properly when they are interpreted as a group
rather than as individual relations. When all the rules are viewed together, a unified and
complete picture tends to emerge. At times, however, an individual rule may be of
interest. In such cases, the key aspects are the confidence, support and lift ratio of that
rule. Support measures how often the items in the rule are purchased together, while the
lift ratio indicates how much stronger the rule is than would be expected if the items were
purchased independently, and hence its importance (Ana, 2014).
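To make the relationship between support, confidence and lift concrete, the sketch below computes all three for a rule and its reverse. It assumes a hypothetical set of baskets and illustrative helper names (`support`, `rule_metrics`); it is not XLMiner's implementation.

```python
# Hypothetical baskets for illustration; not the assignment's data.
baskets = [
    {"brushes", "nail polish"},
    {"nail polish", "bronzer"},
    {"brushes", "nail polish", "bronzer"},
    {"nail polish"},
    {"bronzer"},
]


def support(items, baskets):
    """Fraction of baskets that contain every item in `items`."""
    return sum(1 for b in baskets if items <= b) / len(baskets)


def rule_metrics(antecedent, consequent, baskets):
    """Support, confidence and lift of the rule antecedent -> consequent.

    Assumes both sides occur in at least one basket, so no division by zero.
    """
    joint = support(antecedent | consequent, baskets)
    conf = joint / support(antecedent, baskets)
    lift = conf / support(consequent, baskets)
    return {"support": joint, "confidence": conf, "lift": lift}


rule_a = rule_metrics({"brushes"}, {"nail polish"}, baskets)
rule_b = rule_metrics({"nail polish"}, {"brushes"}, baskets)
print(rule_a)
print(rule_b)
```

Both rules are built from the same pair of items, so they share the same support and lift and differ only in confidence, which is the situation described for rules 1 and 2.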
(iii) As per the given information, there has been an increase in the minimum confidence level
and the impact of the same needs to be studied considering the output available below.
Output Interpretation: A key difference between the above output and the original output is the
number of rules displayed. Of the 9 rules originally available, only 1 is now visible. This is
because the other eight failed to meet the minimum confidence threshold, which has been set at
75%. A major implication of setting the confidence level this high is that rules which are
significant may be discarded. It is therefore advisable to keep the confidence level at a value
that is reasonable for the task at hand (Shumueli et al., 2016).
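The effect of a minimum confidence threshold can be sketched as a simple filter. The example below assumes only the three rules whose confidence values are quoted in part (i); the other six rules from the original output are not reproduced here, and `filter_rules` is a hypothetical helper.

```python
# The three rules described in part (i), with their quoted confidence values.
rules = [
    ("brushes -> nail polish", 1.0),
    ("nail polish -> brushes", 0.6322),
    ("nail polish -> bronzer", 0.6322),
]


def filter_rules(rules, min_confidence):
    """Keep only the rules whose confidence meets the threshold."""
    return [(name, conf) for name, conf in rules if conf >= min_confidence]


kept_at_75 = filter_rules(rules, 0.75)
kept_at_50 = filter_rules(rules, 0.50)
print(kept_at_75)
print(len(kept_at_50))
```

Lowering the threshold in this sketch to 0.50 retains all three rules, which shows how sensitive the size of the rule set is to the chosen cutoff.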
Question 2
(a) The objective is to comment on the number of clusters in the hierarchical clustering output.
One hint in this regard is the input value of 3 clusters, but the dendrogram obtained is also
critical.
XLMiner output - Dendrogram
Setting the cutoff distance below 1000 but above 950 leads to the formation of three
clusters. The clustering output likewise arranges the observations into three clusters,
namely 1, 2 and 3 (Ragsdale, 2014).
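The link between a cutoff distance and the number of clusters can be sketched as follows. The merge heights are hypothetical (the actual dendrogram values are not available in the text), and the sketch assumes a linkage whose merge distances never decrease from one merge to the next.

```python
def clusters_at_cutoff(merge_distances, cutoff):
    """Number of clusters left when only merges at or below the cutoff are kept.

    n observations produce n - 1 merges; each merge kept reduces the
    cluster count by one.
    """
    n_observations = len(merge_distances) + 1
    merges_kept = sum(1 for d in merge_distances if d <= cutoff)
    return n_observations - merges_kept


# Hypothetical merge heights for 7 observations, for illustration only.
merge_distances = [120.0, 310.0, 480.0, 940.0, 1010.0, 1400.0]

for cutoff in (900.0, 975.0, 1200.0):
    print(cutoff, clusters_at_cutoff(merge_distances, cutoff))
```

With these made-up heights, any cutoff between 940 and 1010 leaves three clusters, in the same way that a cutoff between 950 and 1000 does for the dendrogram discussed above.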
(b) The use of raw data has adverse implications for the hierarchical clustering process. This is
most apparent in distance determination, which is dominated by the variables measured on the
largest scales and therefore does not reflect the true similarity between observations. Owing to
these distorted distances, there is a chance that the clusters obtained would not be reliable,
and the accuracy of the output is compromised. This is primarily due to the effect of scale, to
which hierarchical clustering is especially sensitive. Hence, it is good practice to normalize
the data rather than use raw data for cluster formation with hierarchical clustering (Ana, 2014).
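The scale effect can be illustrated with a short sketch that compares Euclidean distances before and after z-score standardization. The three records and their two variables (a miles balance and a count of flight transactions) are hypothetical, and z-scores are only one of several possible normalizations.

```python
import math
from statistics import mean, pstdev

# Hypothetical records: (miles balance, flight transactions in the last year).
records = [
    (20000.0, 1.0),
    (21000.0, 12.0),
    (60000.0, 2.0),
]


def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def z_scores(rows):
    """Standardize each column to mean 0 and standard deviation 1.

    Assumes no column is constant, otherwise the division would fail.
    """
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stdevs = [pstdev(c) for c in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, means, stdevs)) for row in rows]


scaled = z_scores(records)
raw_01 = euclidean(records[0], records[1])
raw_02 = euclidean(records[0], records[2])
std_01 = euclidean(scaled[0], scaled[1])
std_02 = euclidean(scaled[0], scaled[2])
print(raw_01, raw_02)
print(std_01, std_02)
```

On the raw values the miles balance swamps the flight count, so the first two records look almost identical even though their flight activity differs sharply. After standardization both variables contribute on a comparable footing.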
(c) Based on the underlying centroid distances, the objective here is to outline an appropriate
label for each of the three clusters obtained in the output. For this, a sample of each
cluster has been listed and its common features highlighted to facilitate labeling
(Shumueli et al., 2016).
OUTPUT (CLUSTER 1)
One key observation is that, for all the entries shown, the qual_miles variable has a value of
zero. The number of flight transactions in the preceding year is also quite low.
Additionally, the balance of bonus miles that can lead to reward points is comparatively on the
lower end when compared to the other clusters.
The above observations justify the label of “Middle Class Flyers”.
OUTPUT (CLUSTER 2)
The annual flight transactions in the preceding year have been quite high, with numbers reaching
double digits. Additionally, the balance of bonus miles that can lead to reward points is
comparatively on the higher end when compared to the other clusters.
The above observations justify the label of “High Networth Flyers”.
OUTPUT (CLUSTER 3)
The annual flight transactions in the preceding year have been quite low with most entries being
zero. However, the non-flight bonus transactions are comparatively on the higher end when
compared to the other clusters.
The above observations justify the label of “Non-frequent Flyers”.
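Labels of this kind are typically read off per-cluster averages. The sketch below computes such a profile, assuming hypothetical rows and column names (only qual_miles is named in the text; bonus_miles, bonus_trans and flight_trans are illustrative stand-ins, and the values are made up to echo the patterns described above).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (cluster id, qual_miles, bonus_miles, bonus_trans, flight_trans).
rows = [
    (1, 0, 5000, 4, 1),
    (1, 0, 7000, 6, 0),
    (2, 1500, 60000, 15, 12),
    (2, 900, 80000, 17, 10),
    (3, 0, 20000, 25, 0),
    (3, 0, 24000, 27, 0),
]


def cluster_means(rows):
    """Average of every feature column, computed separately for each cluster id."""
    grouped = defaultdict(list)
    for cluster_id, *features in rows:
        grouped[cluster_id].append(features)
    return {
        cluster_id: tuple(mean(col) for col in zip(*members))
        for cluster_id, members in grouped.items()
    }


profiles = cluster_means(rows)
for cluster_id in sorted(profiles):
    print(cluster_id, profiles[cluster_id])
```

Comparing the rows of such a profile side by side is what supports statements like "comparatively on the higher end when compared to the other clusters".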
(d) K-Means Clustering Output
In light of the above output, it is necessary to compare the output of the two clustering
techniques to determine whether they present the same picture. To reach a conclusion, the
cluster classifications obtained from the two techniques are compared; if they match, the
underlying picture is also the same (Ana, 2014).
• Cluster 1 (K-Means Clustering)
The apparent observations are the highest balance of bonus miles and a very high frequency of
annual flight transactions. This evidence seems sufficient to indicate that the group under
consideration is “High Networth Flyers”.
The corresponding label for Cluster 1 under hierarchical clustering was “Middle Class
Flyers”. Hence, without proceeding to compare the other clusters, the current evidence is
sufficient to show that the pictures of the clusters presented by the two clustering
techniques are not the same and are in fact quite different.
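Cluster numbers are assigned arbitrarily by each procedure, so a number-by-number comparison can be complemented by a cross-tabulation of memberships. The sketch below assumes hypothetical cluster assignments for eight observations; the actual assignments from the two outputs are not reproduced here.

```python
from collections import Counter

# Hypothetical cluster ids for the same eight observations under each technique.
hierarchical = [1, 1, 1, 2, 2, 3, 3, 3]
kmeans = [2, 2, 2, 1, 1, 3, 3, 1]


def cross_tab(labels_a, labels_b):
    """Count how many observations fall into each pair of cluster ids."""
    return Counter(zip(labels_a, labels_b))


table = cross_tab(hierarchical, kmeans)
for (h_id, k_id), count in sorted(table.items()):
    print(f"hierarchical {h_id} / k-means {k_id}: {count}")
```

In this made-up example, hierarchical cluster 1 coincides with k-means cluster 2, so identical numbers need not denote the same group, and a table of this kind shows which clusters actually correspond.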
(e) Target Cluster – Cluster 3
Offer rolled out: Bonus points usage in a 70:30 ratio in favor of flight-based transactions would
lead to incremental reward points.
Rationale: Enhancing the flight-based transactions.
Target Cluster – Cluster 2
Offer rolled out: For transactions conducted through the use of the frequent flyer cards,
higher reward points as a % of bonus miles can be given, provided the annual transactions
exceed 8.
Rationale: Enhance overall group loyalty.
References
Rouse, M. (2004). Association Rules in Data Mining. Retrieved from
http://searchbusinessanalytics.techtarget.com/definition/association-rules-in-data-mining
Zaki, M. J. (2000). Generating non-redundant association rules. In: Proceedings of the ACM
SIGKDD, pp. 34–43.
Abramowics, W. (2013). Business Information Systems Workshops: BIS 2013 International
Workshops (5th ed.). New York: Springer.
Ragsdale, C. (2014). Spreadsheet Modeling and Decision Analysis: A Practical Introduction to
Business Analytics (7th ed.). London: Cengage Learning.
Shumueli, G., Bruce, C. P., Yahav, I., Patel, R. N., Kenneth, C., & Lichtendahl, J. (2016). Data
Mining for Business Analytics: Concepts, Techniques and Applications (2nd ed.). London:
John Wiley & Sons.
Ana, A. (2014). Integration of Data Mining in Business Intelligence Systems (4th ed.). Sydney:
IGI Global.