Analysis of Association Rules and Clustering Techniques in Data

Added on 2020/05/08

AI Summary

The task involves utilizing XL Miner to analyze association rules within a dataset concerning cosmetic purchases, highlighting confidence levels and redundancy in these rules. The study focuses on how different confidence thresholds impact rule generation and interpretation. Additionally, the assignment evaluates hierarchical clustering techniques on airline data, emphasizing the necessity of normalization for accurate results. It discusses the labeling of customer segments based on characteristics such as spending behavior and flight transaction frequency. Furthermore, it contrasts cluster labels derived from hierarchical methods with those obtained via K-means clustering, revealing differing insights into customer segmentation.

Data Mining
Student Id and Name
[Pick the date]

Question 1
XL Miner has been used as the appropriate tool for highlighting the association rules in the
context of the given data on cosmetic purchases. The relevant data is indicated in the form of a
screenshot obtained from XL Miner desktop version.
(i) Rule #1: The buying of brushes would be followed by the buying of nail polish with a
100% confidence level.
Rule #2: The buying of nail polish would be followed by the buying of brushes with a
63.22% confidence level.
Rule #3: The buying of nail polish would be followed by the buying of bronzer with a
59.20% confidence level.
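The confidence figures quoted in these rules follow from simple counts over the purchase records. A minimal sketch of the calculation is given below, assuming a small set of hypothetical baskets (not the assignment's cosmetics data) and illustrative function names `support` and `rule_metrics`.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift ratio for the rule antecedent -> consequent."""
    supp_a = support(transactions, antecedent)
    supp_c = support(transactions, consequent)
    supp_both = support(transactions, set(antecedent) | set(consequent))
    confidence = supp_both / supp_a   # assumes the antecedent occurs at least once
    lift = confidence / supp_c        # assumes the consequent occurs at least once
    return {"support": supp_both, "confidence": confidence, "lift": lift}

# Hypothetical baskets, for illustration only
baskets = [
    {"brushes", "nail polish"},
    {"brushes", "nail polish", "bronzer"},
    {"nail polish", "bronzer"},
    {"nail polish"},
    {"bronzer"},
]

forward = rule_metrics(baskets, {"brushes"}, {"nail polish"})
reverse = rule_metrics(baskets, {"nail polish"}, {"brushes"})
```

In this toy data every basket with brushes also contains nail polish, so the forward rule has 100% confidence, while the reverse rule has a lower confidence because nail polish is also bought without brushes.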
(ii) Definition: Rule redundancy tends to occur when the support and lift ratio of a rule are
already anticipated fairly accurately by another rule covering the same items, so that the
rule adds little new information (Zaki, 2000).
Application: Rule #2 offers an apt example of redundancy in association rules. The lift
ratios of rule #1 and rule #2 do not differ at all, while the confidence level is higher for
rule #1 than for rule #2. Thus, rule #2 may be regarded as redundant on account of
rule #1.
Utility: Even though the association rules can be interpreted in isolation, as has been done
in part (i), their actual utility is realised when a collective view is taken of the given
rules, subject to the support and confidence constraints that the researcher may wish to
impose. These constraints would essentially be set based on the underlying task at hand
(Liebowitz, 2015).
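The redundancy check described under Application can be sketched as a small filter: a rule is flagged when another rule covers the same items with the same lift ratio and a higher confidence. The confidence values below are the ones quoted in part (i); the lift value is a hypothetical placeholder, since the actual figure appears only in the XL Miner screenshot, and the function name `redundant_rules` is illustrative.

```python
def redundant_rules(rules, tol=1e-9):
    """Ids of rules that share items and lift with a higher-confidence rule."""
    flagged = []
    for rule in rules:
        items = rule["antecedent"] | rule["consequent"]
        for other in rules:
            if other is rule:
                continue
            same_items = items == (other["antecedent"] | other["consequent"])
            same_lift = abs(rule["lift"] - other["lift"]) <= tol
            if same_items and same_lift and other["confidence"] > rule["confidence"]:
                flagged.append(rule["id"])
                break
    return flagged

rules = [
    # lift = 1.5 is a hypothetical placeholder, not a value from the output
    {"id": 1, "antecedent": {"brushes"}, "consequent": {"nail polish"},
     "confidence": 1.0, "lift": 1.5},
    {"id": 2, "antecedent": {"nail polish"}, "consequent": {"brushes"},
     "confidence": 0.6322, "lift": 1.5},
]

flagged = redundant_rules(rules)  # rule #2 is flagged on account of rule #1
```

Since the lift ratio is symmetric in the antecedent and consequent, a rule and its reverse always share the same lift, which is why rule #1 and rule #2 cannot differ on this measure.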
(iii) The task is to understand the impact of raising the minimum confidence level to 75%,
using XL Miner as an enabling tool. The relevant output in this context is shown
below.
Interpretation: The most obvious observation when the above output is compared with the
previous output is that it contains significantly fewer rules, as just one rule is highlighted.
This is explained by the higher minimum confidence level, which acts as a threshold that each
rule must meet in order to be displayed in the output. In this case, only one rule is able to
cross that threshold. Caution is therefore needed when choosing a higher minimum confidence
level, as it tends to screen out certain rules which are otherwise significant (based on lift
ratio) but do not appear in the output. This could potentially misguide the research
conclusions drawn from the given data (Ana, 2014).
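The threshold effect can be illustrated with the three confidence values quoted in part (i). This is only a sketch: the actual XL Miner output may list more rules, the 50% starting threshold is an assumption, and `rules_meeting` is an illustrative helper name.

```python
# Confidence levels (in %) of the three rules quoted in part (i)
rule_confidence = {"Rule #1": 100.0, "Rule #2": 63.22, "Rule #3": 59.20}

def rules_meeting(confidences, min_confidence):
    """Names of the rules whose confidence reaches the minimum threshold."""
    return [name for name, conf in confidences.items() if conf >= min_confidence]

at_50 = rules_meeting(rule_confidence, 50.0)  # assumed lower threshold
at_75 = rules_meeting(rule_confidence, 75.0)  # threshold used in part (iii)
```

With these values, all three rules pass at 50% but only Rule #1 survives at 75%, which is in line with the single rule reported above.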
Question 2
(a) The hierarchical clustering output for the given airlines data is outlined in the form
of a dendrogram.
The total cluster count for the given clustering is three, which arises from the above
dendrogram along with the input provided while running the clustering. In a dendrogram, a
particular cutoff distance has to be defined, and a line drawn at that distance leads to the
formation of three clusters based on the intersection points.
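The idea of cutting a dendrogram at a chosen distance can be sketched in plain Python. The example below assumes single-linkage merging and six hypothetical two-dimensional points; neither the linkage method nor the data is taken from the XL Miner run, and the function names are illustrative.

```python
import math

def agglomerate(points):
    """Single-linkage merge history as (distance, cluster_a, cluster_b) tuples."""
    clusters = [frozenset([i]) for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        d, a, b = min(
            ((linkage(a, b), a, b)
             for idx, a in enumerate(clusters) for b in clusters[idx + 1:]),
            key=lambda m: m[0],
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((d, a, b))
    return merges

def cut(points, merges, cutoff):
    """Clusters that remain when only merges below the cutoff are applied."""
    clusters = [frozenset([i]) for i in range(len(points))]
    for d, a, b in merges:
        if d >= cutoff:
            break  # single-linkage merge distances never decrease
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

# Hypothetical points forming three well-separated pairs
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0), (10, 1)]
clusters = cut(points, agglomerate(points), cutoff=3.0)
```

Here the three pairs merge at distance 1, while the next merges happen above distance 6, so any cutoff between those values yields three clusters.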
(b) The failure to normalise the data before hierarchical clustering can lead to several
problems, highlighted as follows (Shmueli et al., 2016).
• The distances that are calculated tend to be unreliable, as they are distorted by the
underlying differences in scale between variables.
• The accuracy of the measure is also lower, as the cluster formation and overall
output are driven by the variables with higher raw values.
The net effect of the above two is that the end utility of the process tends to be compromised.
Thus, it is always considered better if data normalisation is performed as part of the
clustering process.
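The effect of scale on distances can be seen in a small sketch. It assumes z-score normalisation (one common choice, not necessarily the method XL Miner applies) and three hypothetical customers described by bonus miles balance and flight transactions.

```python
import math
from statistics import mean, pstdev

def zscore_columns(rows):
    """Rescale each column to mean 0 and standard deviation 1.

    Assumes no column is constant (a zero standard deviation would fail).
    """
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, sds))
        for row in rows
    ]

def nearest_to_first(rows):
    """Index of the row closest (Euclidean distance) to rows[0]."""
    return min(range(1, len(rows)), key=lambda i: math.dist(rows[0], rows[i]))

# Hypothetical customers: (bonus miles balance, flight transactions)
customers = [
    (20000, 1),   # A: rarely flies
    (20500, 12),  # B: flies often
    (21000, 2),   # C: rarely flies
]

raw_neighbour = nearest_to_first(customers)                     # index 1 (B)
scaled_neighbour = nearest_to_first(zscore_columns(customers))  # index 2 (C)
```

On the raw data, customer A appears closest to B because the miles balance dominates the distance. After normalisation, A is closest to C, the other infrequent flyer.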
(c) Cluster 1(XL Miner Output)
The labeling needs to be carried out by paying attention to a handful of customers who fall in
cluster one. A key observation is the low-to-moderate spending, especially when compared
with other clusters. Various parameters are quite low, such as bonus miles balance, flight
transactions and non-flight bonus transactions, along with qual_miles, which is
consistently zero. Hence, the customer segment may be represented by the label “Middle Class
Flyers”.
Cluster 2 (XL Miner Output)
The labeling needs to be carried out by paying attention to a handful of customers who fall in
cluster two. A key observation is the high spending, especially when compared with other
clusters. Various parameters are quite high, such as bonus miles balance, flight
transactions and non-flight bonus transactions. Hence, the customer segment may be
represented by the label “High Networth Flyers”.
Cluster 3 (XL Miner Output)
The labeling needs to be carried out by paying attention to a handful of customers who fall in
cluster three. An interesting observation is that the flight transactions made by these
customers in the past year are quite low, yet all of the customers in this segment seem to have
significantly high non-flight bonus transactions. Hence, the customer segment may be
represented by the label “Non-Frequent Flyers”.
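Labels of this kind can be supported by comparing the average of each variable across clusters rather than inspecting a handful of customers alone. The sketch below assumes hypothetical records and field names (`bonus_miles`, `flight_trans`, `qual_miles`) chosen to resemble the variables mentioned above; they are not values from the airlines data.

```python
from collections import defaultdict
from statistics import mean

def cluster_profiles(records, labels):
    """Mean of every variable within each cluster."""
    grouped = defaultdict(list)
    for record, label in zip(records, labels):
        grouped[label].append(record)
    return {
        label: {field: mean(r[field] for r in members) for field in members[0]}
        for label, members in grouped.items()
    }

# Hypothetical customers and cluster assignments, for illustration only
records = [
    {"bonus_miles": 5000, "flight_trans": 1, "qual_miles": 0},
    {"bonus_miles": 7000, "flight_trans": 3, "qual_miles": 0},
    {"bonus_miles": 60000, "flight_trans": 12, "qual_miles": 1500},
    {"bonus_miles": 80000, "flight_trans": 16, "qual_miles": 2500},
]
labels = [1, 1, 2, 2]

profiles = cluster_profiles(records, labels)
```

A cluster whose averages are low on every variable would then suggest a label along the lines of “Middle Class Flyers”, while one with high averages would suggest “High Networth Flyers”.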
(d) Relevant Output (K Means Clustering)
Comparison: The core issue here is to draw a parallel between the two clustering techniques and
understand whether the cluster formation gives rise to similar or comparable results. To this
end, each of the clusters is compared to see if the label yielded is the same (Ragsdale,
2014).
Implementation: To compare Cluster 1 across the two techniques, the first step is to assign a
label based on the K-means clustering output indicated above. Cluster 1 would capture “High
Networth Flyers”, which is evident from the high flight transactions in the past year.
Supporting evidence also comes in the form of a superior bonus miles balance along with
top-class status eligibility, represented by the variable qual_miles.
This conclusion does not match that derived from hierarchical clustering in part (c) above,
where cluster 1 represents the “Middle Class Flyers” segment, and hence the differences are
obvious in this case.
Conclusion: The two clustering techniques employed on the airlines data lead to different
cluster labels, and hence the patterns obtained in the two cases are different.
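One way to check whether differing labels reflect different groupings or only different numbering is to cross-tabulate the two sets of assignments, since the cluster numbers produced by each run are arbitrary. The sketch below uses hypothetical assignments for seven customers and illustrative function names; it is not derived from the XL Miner output.

```python
from collections import Counter

def cross_tab(labels_a, labels_b):
    """Count of customers for every (cluster in A, cluster in B) pair."""
    return Counter(zip(labels_a, labels_b))

def best_match(labels_a, labels_b):
    """For each cluster in A, the cluster in B that shares the most customers."""
    table = cross_tab(labels_a, labels_b)
    match = {}
    for (a, b), count in table.items():
        if a not in match or count > table[(a, match[a])]:
            match[a] = b
    return match

# Hypothetical cluster assignments for the same seven customers
hierarchical = [1, 1, 1, 2, 2, 3, 3]
k_means = [2, 2, 2, 1, 1, 3, 3]

table = cross_tab(hierarchical, k_means)
match = best_match(hierarchical, k_means)
```

In this toy case the memberships agree completely and only the numbers of clusters 1 and 2 are swapped, so a cluster-by-number comparison alone would overstate the difference between the two techniques.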
(e) Offers to target segments
References
Ana, A. (2014). Integration of Data Mining in Business Intelligence Systems (4th ed.). Sydney:
IGI Global.
Liebowitz, J. (2015). Business Analytics: An Introduction (2nd ed.). New York: CRC Press.
Ragsdale, C. (2014). Spreadsheet Modeling and Decision Analysis: A Practical Introduction to
Business Analytics (7th ed.). London: Cengage Learning.
Shmueli, G., Bruce, P. C., Yahav, I., Patel, N. R., & Lichtendahl, K. C., Jr. (2016). Data
Mining for Business Analytics: Concepts, Techniques, and Applications (2nd ed.). London:
John Wiley & Sons.
Zaki, M. J. (2000). Generating non-redundant association rules. In: Proceedings of the ACM
SIGKDD, pp. 34–43.