Data Mining Techniques: Association and Clustering Analysis Project

Added on 2020/05/11

AI Summary

The homework assignment explores the use of association rules within data mining. It discusses the identification and redundancy of these rules based on confidence levels and lift ratios, emphasizing the importance of collective utility analysis over individual rule assessment. The task also involves examining hierarchical and k-means clustering methods applied to airline data, highlighting discrepancies between cluster formations and discussing the impact of non-normalization on distance computations. Key aspects include evaluating clusters based on flight transactions and bonus miles, leading to labels like 'Middle Class Flyers' and 'Non-frequent Flyers'. References from literature support the methodologies discussed.

Data Mining

Question 1
(i) Rule #1: A purchase of brushes by the customer tends to ensure the purchase of nail polish,
with an underlying conditional probability (confidence) of 1.
Rule #2: A purchase of nail polish by the customer tends to ensure the purchase of brushes,
with an underlying conditional probability of 0.6322.
Rule #3: A purchase of nail polish by the customer tends to ensure the purchase of bronzer,
with an underlying conditional probability of 0.5920.
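The conditional probability quoted for each rule is its confidence: the share of transactions containing the antecedent that also contain the consequent. A minimal sketch of that calculation follows. The four transactions are hypothetical and are not the assignment's data set, and the helper names `support` and `confidence` are chosen here purely for illustration.

```python
from typing import FrozenSet, List

Itemset = FrozenSet[str]

# Hypothetical toy transactions (NOT the data behind the rules above).
transactions: List[Itemset] = [
    frozenset({"brushes", "nail polish"}),
    frozenset({"nail polish", "bronzer"}),
    frozenset({"nail polish"}),
    frozenset({"bronzer"}),
]


def support(itemset: Itemset, data: List[Itemset]) -> float:
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in data if itemset <= t)
    return hits / len(data)


def confidence(antecedent: Itemset, consequent: Itemset, data: List[Itemset]) -> float:
    """Estimate of P(consequent | antecedent) from the transactions."""
    return support(antecedent | consequent, data) / support(antecedent, data)


brushes = frozenset({"brushes"})
polish = frozenset({"nail polish"})

# Every toy basket with brushes also has nail polish, so this confidence is 1.0.
conf_brushes_to_polish = confidence(brushes, polish, transactions)
# Only one of the three nail polish baskets has brushes, so this is one third.
conf_polish_to_brushes = confidence(polish, brushes, transactions)
```

Swapping antecedent and consequent changes the confidence, which is why two rules over the same pair of items can carry different conditional probabilities.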
(ii) Rule redundancy is often observed in association rules. It arises when the support level
of a rule can be predicted from the rule just preceding it, so that the rule adds no new
information. Rule 2 in the output indicated above illustrates the concept. Careful observation
shows that the lift ratio is the same for Rule 1 and Rule 2. Rule 2 is therefore rendered
redundant, as Rule 1 enjoys the higher confidence level (Liebowitz, 2015).
In relation to utility analysis, a critical point is that association rules must be considered
collectively and not individually, as otherwise the complete picture may not emerge. It is
also noteworthy that, for an individual association rule, two critical measures arise:
confidence (the confidence level) and strength of association (the lift ratio) (Zaki, 2000).
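The observation that two rules can share a lift ratio while differing in confidence can be checked numerically. The sketch below uses the standard definition of lift, the rule's confidence divided by the overall frequency of the consequent. The counts are hypothetical and the function names are illustrative only.

```python
import math


def rule_confidence(n_antecedent: int, n_both: int) -> float:
    """Confidence of antecedent -> consequent from raw transaction counts."""
    return n_both / n_antecedent


def rule_lift(n_total: int, n_antecedent: int, n_consequent: int, n_both: int) -> float:
    """Lift = confidence divided by the benchmark frequency of the consequent."""
    benchmark = n_consequent / n_total
    return rule_confidence(n_antecedent, n_both) / benchmark


# Hypothetical counts: 1000 baskets, 100 contain A, 250 contain B, 80 contain both.
n_total, n_a, n_b, n_ab = 1000, 100, 250, 80

conf_a_to_b = rule_confidence(n_a, n_ab)   # 80 / 100
conf_b_to_a = rule_confidence(n_b, n_ab)   # 80 / 250
lift_a_to_b = rule_lift(n_total, n_a, n_b, n_ab)
lift_b_to_a = rule_lift(n_total, n_b, n_a, n_ab)

# Lift does not depend on the direction of the rule; confidence does.
same_lift = math.isclose(lift_a_to_b, lift_b_to_a)
```

Because lift is symmetric in the two itemsets, a rule and its reverse always share the same lift, so confidence is the measure left to separate them.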
(iii) The revised association rule output, produced with a higher minimum confidence level of
75%, is shown below.
Interpretation: The output above clearly shows a reduction in the number of rules listed. There
were nine rules in the original case but only one in the current case. This is because all but
one of the association rules fail to meet the higher confidence threshold. The minimum
confidence level must nonetheless be kept at a reasonable level, or else association rules with
a high lift ratio would be rejected for lack of the desired confidence (Ana, 2014).
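The effect of raising the minimum confidence can be sketched as a simple filter over the rule list. Only the three confidence values quoted in Question 1(i) are used here, since the other six rules of the original output are not reproduced in this document. The `Rule` type and the `filter_rules` helper are hypothetical names.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    name: str
    confidence: float


# Confidence values as quoted in Question 1(i); the remaining rules are unknown.
rules: List[Rule] = [
    Rule("Rule 1", 1.0),
    Rule("Rule 2", 0.6322),
    Rule("Rule 3", 0.5920),
]


def filter_rules(candidates: List[Rule], min_confidence: float) -> List[Rule]:
    """Keep only the rules whose confidence meets the threshold."""
    return [r for r in candidates if r.confidence >= min_confidence]


# With a 75% threshold only the rule with confidence 1.0 survives.
kept = filter_rules(rules, 0.75)
```

A rule dropped by this filter may still have a high lift ratio, which is the trade-off the interpretation above warns about.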
Question 2
(a) (Output) Hierarchical Clustering
Considering the above output in the form of a dendrogram for the airline data, together with the
inputs used during the clustering process, it would be fair to say that three clusters are
formed here.
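As a rough illustration of how an agglomerative procedure arrives at a fixed number of clusters, the sketch below repeatedly merges the two closest clusters until three remain. Single linkage and Euclidean distance are assumptions made for this example, since the document does not state which settings were used, and the six points are hypothetical rather than airline records.

```python
from itertools import combinations
from math import dist
from typing import List, Tuple

Point = Tuple[float, float]


def single_linkage(a: List[Point], b: List[Point]) -> float:
    """Distance between two clusters = distance of their closest pair of points."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points: List[Point], n_clusters: int) -> List[List[Point]]:
    """Merge the two closest clusters until only `n_clusters` remain."""
    clusters = [[p] for p in points]  # start with every point on its own
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Hypothetical, already comparable 2-D points forming three obvious groups.
points: List[Point] = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0), (21, 0)]
three_clusters = agglomerate(points, 3)
```

Stopping at three clusters here plays the role of cutting the dendrogram at the height where three branches remain.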
(b) Non-normalisation is a significant issue since it can cause inaccuracy in the distance
computation between the centroids of clusters. As a result, the clusters identified could be
inaccurate and the utility of the process could be severely diminished. Further, the scale
effect tends to distort the results, because variables measured on larger scales dominate the
distance. It is therefore advisable that the normalisation option be ticked while performing
the hierarchical clustering process (Shmueli et al., 2016).
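The scale effect can be seen on a small example. The sketch below compares Euclidean distances before and after z-score normalisation. The three customers and their two variables (a mileage balance and a count of flight transactions) are hypothetical, and z-scoring with the population standard deviation is one common choice of normalisation rather than necessarily the one the software applies.

```python
from math import dist
from statistics import mean, pstdev
from typing import List, Tuple

# Hypothetical customers: (balance in miles, flight transactions in last 12 months).
customers: List[Tuple[float, float]] = [
    (20000.0, 0.0),   # customer A
    (21000.0, 12.0),  # customer B
    (60000.0, 1.0),   # customer C
]


def z_scores(rows: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Rescale every column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stdevs = [pstdev(c) for c in columns]
    return [
        tuple((v - m) / s for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Raw distances are driven almost entirely by the balance column.
raw_ab = dist(customers[0], customers[1])
raw_ac = dist(customers[0], customers[2])

# After scaling, the gap in flight transactions counts as much as the gap in balance.
scaled = z_scores(customers)
scaled_ab = dist(scaled[0], scaled[1])
scaled_ac = dist(scaled[0], scaled[2])
```

On the raw values customer A looks far closer to B than to C, purely because balances run into the tens of thousands. After normalisation the ordering reverses in this toy data, which is the kind of distortion the normalisation option is meant to prevent.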
(c) Cluster 1 (Sample)
Key aspects: i) Flight transactions in the last 12 months are abysmally low
ii) Non-flight bonus transactions are also the lowest among the three clusters
Label: “Middle Class Flyers”
Cluster 2 (Sample)
Key aspects: i) Flight transactions in the last 12 months are quite high
ii) Non-flight bonus transactions are also moderately high in the last 12 months
iii) The balance of bonus miles for the last 12 months is maintained at a high level
Label: “Middle Class Flyers”
Cluster 3 (Sample)
Key aspects: i) Flight transactions in the last 12 months are very low
ii) However, non-flight bonus transactions are the highest, exceeding those of cluster 2 as well
Label: “Non-frequent Flyers”
(d) K Means Clustering
Using the above output, the clustering outputs of the two techniques need to be compared in
order to judge whether both lead to the same picture of cluster formation. To facilitate this,
it makes sense to compare the corresponding clusters from the two clustering techniques
(Ana, 2014).
The comparison process can be kicked off with cluster 1.
The end result is that the outputs produced by the two clustering methods do not match, as the
cluster labelling tends to differ.
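One way to make such a comparison concrete is to cross-tabulate the two sets of cluster assignments, since cluster numbers are arbitrary and cluster 1 in one method need not correspond to cluster 1 in the other. The sketch below is only an illustration. The eight assignments are hypothetical, and `best_match_agreement` is a simple purity-style measure chosen for this example, not a method taken from the assignment.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical cluster assignments for the same eight customers.
hierarchical: List[int] = [1, 1, 1, 2, 2, 3, 3, 3]
k_means: List[int] = [2, 2, 1, 1, 1, 3, 3, 2]


def cross_tab(a: List[int], b: List[int]) -> Dict[Tuple[int, int], int]:
    """Count how many records fall into each (cluster in a, cluster in b) cell."""
    return Counter(zip(a, b))


def best_match_agreement(a: List[int], b: List[int]) -> float:
    """For each cluster in `a`, take its largest overlap with any cluster in `b`,
    then report the share of all records covered by those best overlaps."""
    table = cross_tab(a, b)
    matched = 0
    for label in set(a):
        matched += max(n for (row, _), n in table.items() if row == label)
    return matched / len(a)


table = cross_tab(hierarchical, k_means)
agreement = best_match_agreement(hierarchical, k_means)
```

An agreement of 1.0 would mean every cluster from the first method maps into a single cluster of the second. Lower values indicate that the two techniques split the records differently, as concluded above.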
(e) Offers (Identified Cluster)
References
Ana, A. (2014) Integration of Data Mining in Business Intelligence System (4th ed.). Sydney:
IGA Global
Liebowitz, J. (2015) Business Analytics: An Introduction (2nd ed.). New York: CRC Press.
Shmueli, G., Bruce, C.P., Yahav, I., Patel, R. N., Kenneth, C., & Lichtendahl, J. (2016) Data
Mining For Business Analytics: Concepts Techniques and Application (2nd ed.). London:
John Wiley & Sons.
Zaki, M.J. (2000) Generating non-redundant association rules. In: Proceedings of the ACM
SIGKDD, pp. 34–43.