Comprehensive Report on Enterprise Business Intelligence Techniques

ENTERPRISE BUSINESS INTELLIGENCE
Table of Contents
Task – 1 Manual Knowledge Discovery from the datasets
1. Apriori algorithm
2. Decision tree Induction Algorithm
3. Apriori Algorithm and FP Growth Algorithm
4. Decision Tree
Task 2 – Knowledge Discovery from the given datasets
1. Construction of Association Rules
2. J48 Decision Tree for the Car-dataset
References
Task – 1 Manual Knowledge Discovery from the datasets
1. Apriori algorithm
Generation of association rules is divided into two steps:
1. First, a minimum support threshold is applied to find all frequent item sets in the database.
2. Second, these frequent item sets and a minimum confidence constraint are used to form rules.
Finding all frequent item sets is difficult because it involves searching all possible item combinations (Li, 2010). The second step is straightforward; it is the first step that needs the most attention. The set of possible item sets is the power set of I and has size 2^n - 1 (excluding the empty set), so it grows exponentially in the number of items n in I (Tank, 2012). An efficient search therefore exploits the downward-closure property, also known as anti-monotonicity (Motoda et al., n.d.). In general, the algorithm uses breadth-first search (BFS) together with a tree structure to count candidate item sets efficiently (Schuyler, 2001).
Apriori principle
If an item set is frequent, then all of its subsets must also be frequent. Conversely, if an item set is infrequent, then all of its supersets must also be infrequent.
The Apriori principle holds because of the following property of support:
- the support of an item set never exceeds the support of any of its subsets
- this is also called the anti-monotone property of support.
Candidate rules are generated by combining two rules that share the same prefix in the rule consequent:
- Join: join(CD => AB, BD => AC) generates the candidate rule D => ABC.
- Prune: the candidate D => ABC is pruned if one of its subset rules, such as AD => BC, does not have sufficient confidence.
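To make the join step concrete, a minimal Python sketch is shown below (not from the original report): a rule is represented as an (antecedent, consequent) pair of frozensets, and joining two rules over the same frequent item set merges their consequents.

def join_rules(rule1, rule2):
    """Join two rules over the same frequent item set,
    e.g. join(CD => AB, BD => AC) yields the candidate rule D => ABC."""
    antecedent1, consequent1 = rule1
    antecedent2, consequent2 = rule2
    itemset = antecedent1 | consequent1          # both rules cover the same frequent item set
    new_consequent = consequent1 | consequent2   # AB united with AC gives ABC
    new_antecedent = itemset - new_consequent    # ABCD minus ABC gives D
    return new_antecedent, new_consequent

# Example from the text: join(CD => AB, BD => AC) produces D => ABC.
r1 = (frozenset("CD"), frozenset("AB"))
r2 = (frozenset("BD"), frozenset("AC"))
print(join_rules(r1, r2))  # (frozenset({'D'}), frozenset({'A', 'B', 'C'}))

# Pruning: the candidate D => ABC would be kept only if every rule with a smaller
# consequent over the same item set (such as AD => BC) already met the minimum confidence.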
Comparison of the Apriori and FP-Growth algorithms
Apriori
Techniques
It generates and tests candidate item sets: singletons, pairs, triples and so on.
Runtime
Candidate generation is very slow, and runtime grows exponentially with the number of distinct items.
Memory Usage
Apriori stores the candidate singletons, pairs, triples, etc.
Parallelizability
Candidate generation is highly parallelizable.
FP-Growth
Techniques
Items, ordered by frequency, are inserted into a frequent-pattern tree.
Runtime
Runtime grows linearly with the number of items and transactions.
Memory Usage
It stores a compact, compressed version of the database.
Parallelizability
The data in the tree are highly interdependent, since every path hangs off the root node, which makes parallelization difficult.
2. Decision tree Induction Algorithm
A decision tree is a tree-structured model in which each branch node represents a choice between a number of alternatives (De Ville and Neville, 2013), and each leaf node represents a decision or classification (Blobel, Hasman and Zvárová, 2013).
The algorithm operates on a set of training objects, denoted C.
If all objects in C belong to the same class P, create a leaf node labelled P and terminate; otherwise choose an attribute F and create a decision node for it.
Partition the training objects in C into subsets according to the values of F.
Apply the algorithm recursively to each subset of C.
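A minimal Python sketch of this recursion is given below (an illustration, not the report's own code; the attribute-selection function is left abstract and would normally apply the information-gain test described later, and training objects are assumed to be dictionaries with a "class" key).

from collections import Counter

def build_tree(objects, attributes, choose_attribute):
    """Recursive decision tree induction over a training set C (here 'objects',
    a list of dictionaries mapping attribute names to values, plus a 'class' key)."""
    classes = [obj["class"] for obj in objects]
    # If every object belongs to the same class P, create a leaf node P and terminate.
    if len(set(classes)) == 1:
        return classes[0]
    # If no attributes remain, fall back to the majority class.
    if not attributes:
        return Counter(classes).most_common(1)[0][0]
    # Otherwise choose an attribute F and create a decision node for it.
    best = choose_attribute(objects, attributes)
    node = {"attribute": best, "branches": {}}
    # Partition the objects by the values of F and recurse on each subset.
    for value in {obj[best] for obj in objects}:
        subset = [obj for obj in objects if obj[best] == value]
        remaining = [a for a in attributes if a != best]
        node["branches"][value] = build_tree(subset, remaining, choose_attribute)
    return node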
Converting from trees to rules
Simple way: generate one rule for each leaf.
C4.5rules: greedily prune conditions from each rule if this decreases its estimated error.
This can produce duplicate rules, so check for them at the end.
Then:
- consider each class in turn
- examine the rules for that class
- find a "good" subset of rules (guided by MDL)
- order the subsets to avoid conflicts
- finally, remove a rule if doing so reduces the error on the training data.
C4.5rules is extremely slow for large, noisy datasets. The commercial version, C5.0rules, uses a different technique and is much faster and a little more accurate.
C4.5 has two parameters:
- the confidence value (default 25%): lower values give heavier pruning
- the minimum number of objects in the two most popular branches.
Classification rules
The general procedure is divide-and-conquer (Goetz, 2011). Approaches differ in:
- the search technique (e.g. greedy search, beam search)
- the test selection criterion (e.g. accuracy)
- the pruning method (e.g. MDL, hold-out set)
- the stopping criterion (e.g. minimum accuracy)
- the post-processing step
- and whether a decision list is built or one rule set per class.
Advantages
Decision trees offer several advantages for examining alternatives:
Graphic. Decision alternatives, chance events and possible outcomes can be laid out schematically. The diagrammatic form is especially useful for sequential decisions and for dependencies between outcomes.
Efficient. Complex alternatives can be set out quickly and clearly, and a decision tree is easy to revise as new data become available.
Decision trees handle both numerical and nominal attributes.
The decision tree representation is rich enough to express any discrete-value classifier.
Decision trees are capable of handling large datasets that contain errors.
Decision trees are capable of handling datasets with missing values.
Decision trees are a nonparametric method: they make no assumptions about the distribution of the attribute space or the structure of the classifier.
Problems with decision tree algorithms
Most algorithms, such as ID3 and C4.5, require the target attribute to have only discrete values.
The divide-and-conquer strategy tends to perform well only if a few highly relevant attributes exist.
The greedy nature of decision tree construction is a further drawback worth mentioning, because it makes the tree sensitive to noise and irrelevant attributes in the training set.
Decision Tree Applications
Business Management
In recent years many companies have built their own databases to improve their customer services (Bramer, 2017). Decision trees are therefore an appropriate way to extract useful knowledge from such databases. In particular, decision tree techniques are widely used in CRM (Customer Relationship Management) and fraud detection (Grabczewski, 2016).
Customer Relationship Management
A widely used technique for managing customer relationships is to examine how individuals interact with online services (Neves-Silva, Jain and Howlett, 2015). Such an analysis is mainly achieved by storing and analysing individual usage information, and then making recommendations based on the extracted information (Kotu and Deshpande, n.d.).
Engineering
Another useful application domain for decision trees is engineering (Koh and Rountree, 2010). In particular, decision trees are widely used in energy-consumption analysis and fault detection.
3. Apriori Algorithm and FP Growth Algorithm
Pseudocode for the Apriori Algorithm
Procedure to find the association rules from the CustomerTrans data; a Python sketch of the same level-wise procedure is given after Step 2.
Step 1
Item Support
A 5
B 4
C 4
D 1
E 3
F 2
G 5
H 2
A, B, C, D, E, F, G and H denote the items in a supermarket. Support is the number of transactions in which the item appears. Let the minimum support be 3 and the minimum confidence be 70%.
Step 2
Item Support
{A,B} 2
{A,C} 3
{A,D} 0
{A,E} 3
{A,F} 1
{A,G} 4
{A,H} 1
{B,C} 2
{B,D} 1
{B,E} 0
{B,F} 2
{B,G} 4
{B,H} 2
{C,D} 1
{C,E} 2
{C,F} 1
{C,G} 3
{C,H} 0
{D,E} 0
{D,F} 1
{D,G} 1
{D,H} 0
{E,F} 0
{E,G} 2
{E,H} 0
{F,G} 2
{F,H} 1
{G,H} 2
The table above lists each candidate item pair and its support. Pairs whose support is below the minimum support value are discarded and are not considered in the next step. The algorithm does not need another iteration: it terminates at the second step, because no 3-item set reaches the minimum support of 3.
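As promised above, the level-wise counting can be sketched in a few lines of Python (a sketch only, using the seven CustomerTrans transactions listed in the FP-tree section below and the minimum support of 3 chosen above).

from itertools import combinations

transactions = [
    {"A", "B", "F", "G", "H"},
    {"B", "C", "D", "F", "G"},
    {"A", "E", "C"},
    {"A", "C", "E", "G"},
    {"A", "E", "G"},
    {"A", "B", "C", "G"},
    {"B", "G", "H"},
]
MIN_SUPPORT = 3

def support(itemset):
    """Number of transactions that contain every item of the item set."""
    return sum(1 for t in transactions if itemset <= t)

# Level 1: frequent single items.
all_items = sorted({item for t in transactions for item in t})
frequent_items = [i for i in all_items if support({i}) >= MIN_SUPPORT]

# Level 2: candidate pairs built only from frequent single items (Apriori principle),
# kept when their support reaches the minimum support.
for pair in combinations(frequent_items, 2):
    if support(set(pair)) >= MIN_SUPPORT:
        print(pair, support(set(pair)))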
Association Rules
Customers who bought item A also bought item C.
Customers who bought item A also bought item E.
Customers who bought item A also bought item G.
Customers who bought item B also bought item G.
Customers who bought item C also bought item G.
Strong Association Rules
Item A is purchased together with G in 4 out of 7 transactions.
Item B is also purchased together with G in 4 out of 7 transactions.
FP Growth Algorithm
FP Tree Construction
Transaction ID Item
1 A,B,F,G,H
2 B,C,D,F,G
3 A,E,C
4 A,C,E,G
5 A,E,G
6 A,B,C,G
7 B,G,H
After reading the transactions one by one, the FP-Growth tree is built up as shown in the following steps. The FP-Growth tree is then used to find the frequent item sets.
Step 1
After reading the Transaction ID (TID) 1, the tree is as follows
[Figure: FP-tree after reading TID 1]
Step 2
After TID 1 is completed, TID 2 is read; the tree after TID 2 is shown below.
[Figure: FP-tree after reading TID 2]
Step 3
After reading the TID 3, the tree is shown below.
[Figure: FP-tree after reading TID 3]
Step 4
After reading the transaction ID 4, the tree is represented as below.
[Figure: FP-tree after reading TID 4]
Step 5
After reading the TID 5, the FP tree is shown below.
[Figure: FP-tree after reading TID 5]
Step 6
The TID 6 is completed and the result is represented below.
[Figure: FP-tree after reading TID 6]
Step 7
After reading the last transaction, the final tree is as follows
[Figure: final FP-tree after reading TID 7]
Thus, the FP-Growth tree is constructed step by step from the conditional branches. The paths containing A, B and G are then examined: whenever A or B is present, item G appears on the same path. It follows that A is associated with G and that B is associated with G.
For example, let A be Butter and B be Fruit Jam; then G plays the role of Bread. If item A is purchased, item G is also purchased, and if B is purchased, G is purchased as well. In other words, whenever Butter or Jam is purchased, Bread is purchased too.
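A minimal Python sketch of the node insertion that builds such a tree is given below (illustrative only; a full FP-Growth implementation would first reorder each transaction's items by descending overall frequency, which is omitted here).

class FPNode:
    """One FP-tree node: an item label, a count, and children keyed by item."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def insert_transaction(root, items):
    """Walk down from the root, incrementing counts along the shared prefix and
    creating new child nodes where the transaction's path diverges from the tree."""
    node = root
    for item in items:
        if item not in node.children:
            node.children[item] = FPNode(item)
        node = node.children[item]
        node.count += 1

# Building the tree transaction by transaction, as in Steps 1-7 above.
root = FPNode("null")
for tid_items in ["ABFGH", "BCDFG", "AEC", "ACEG", "AEG", "ABCG", "BGH"]:
    insert_transaction(root, tid_items)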
4. Decision Tree
The decision tree for the given traffic accident data is created manually. The dataset is shown below.
[Table: traffic accident data]
The decision tree is shown below.
[Figure: decision tree with nodes Traffic Violation (Disobey traffic signal (DT), Disobey stop sign (DS), Exceed speed limit (E)), Driver Condition (Sober, Alcohol-impaired), Weather Condition (Good, Bad) and Seat Belt (Yes, No)]
Entropy
a)
Crash Severity
Severe  Minor
10      6
b)
Entropy(Crash Severity) = Entropy(10, 6)
= Entropy(0.625, 0.375)
= -(0.625 log2 0.625) - (0.375 log2 0.375)
= 0.954
E(Crash Severity, Traffic Violation) = 0.72
Information Gain
a) Entropy calculation
Entropy(Crash Severity) = Entropy(10, 6)
= Entropy(0.625, 0.375)
= -(0.625 log2 0.625) - (0.375 log2 0.375)
= 0.954
Weather Condition
                  Crash Severity
                  Severe  Minor
Good              6       1
Bad               5       4
E(Crash Severity, Weather Condition) = (7/16) x Entropy(6, 1) + (9/16) x Entropy(5, 4) = 0.816
Gain = 0.954 - 0.816 = 0.138
Driver Condition
                  Crash Severity
                  Severe  Minor
Alcohol-impaired  5       2
Sober             5       4
E(Crash Severity, Driver Condition) = (7/16) x Entropy(5, 2) + (9/16) x Entropy(5, 4) = 0.935
Gain = 0.954 - 0.935 = 0.019
Seat Belt
                  Crash Severity
                  Severe  Minor
Yes               5       1
No                5       5
E(Crash Severity, Seat Belt) = (6/16) x Entropy(5, 1) + (10/16) x Entropy(5, 5) = 0.869
Gain = 0.954 - 0.869 = 0.086
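The entropy and gain figures above can be checked with a short Python snippet (a verification sketch based on the count tables above, not part of the original report).

from math import log2

def entropy(*counts):
    """Entropy of a class distribution given raw counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def info_gain(parent, splits):
    """Information gain = parent entropy minus the weighted entropy after the split."""
    total = sum(parent)
    remainder = sum(sum(s) / total * entropy(*s) for s in splits)
    return entropy(*parent) - remainder

print(round(entropy(10, 6), 3))                        # 0.954
print(round(info_gain((10, 6), [(6, 1), (5, 4)]), 3))  # Weather Condition: 0.138
print(round(info_gain((10, 6), [(5, 2), (5, 4)]), 3))  # Driver Condition: 0.019
print(round(info_gain((10, 6), [(5, 1), (5, 5)]), 3))  # Seat Belt: 0.086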
The information gain for Traffic Violation is calculated using the formula below.
G(Crash Severity, Traffic Violation) = E(Crash Severity) - E(Crash Severity, Traffic Violation)
= 0.954 - 0.72
= 0.234
Task 2 – Knowledge Discovery from the given datasets
1. Construction of Association Rules
Data Pre-processing
The dataset "voteM.arff" is loaded into the Weka tool in order to find association rules using the Apriori and FP-Growth algorithms.
Association rule mining using the Apriori algorithm
In the Weka Explorer, after the ARFF file is imported, the Associate tab is chosen to mine association rules with the selected algorithm. In the screenshot above, the Apriori algorithm is configured with metricType set to Confidence, minMetric set to 0.9 and numRules set to 15. The output, consisting of 15 rules with a minimum confidence of 0.9, is shown below.
Association rule mining with confidence 0.9
Association rule mining with lift 1.5
In the GenericObjectEditor, metricType is set to Lift and numRules to 15. The output of association rule mining with metricType Lift and minMetric value 1.5 is shown below.
Association Output
Association Rule Mining using FP Growth Algorithm
Report on the construction of association rules
Association rule mining is a data mining function that finds how frequently items or item sets occur together; the relationships between such co-occurring items are expressed as association rules. In this assignment two algorithms are used to mine the association rules, namely the Apriori and FP-Growth algorithms, and the Weka tool is used to find the rules. Association rule mining is first performed with the Apriori algorithm using confidence as the metric with a minimum value of 0.9, and the top 15 association rules are mined. The Apriori algorithm is then used to find the top 15 association rules with lift of 1.5 as the minimum metric value.
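For readers who want to reproduce the same mining outside Weka, a hedged sketch using the mlxtend library is shown below; the pickle file name and the support threshold are assumptions, since the report itself only used the Weka GUI on voteM.arff.

import pandas as pd
from mlxtend.frequent_patterns import apriori, fpgrowth, association_rules

# Assumed pre-processing: voteM.arff converted to a one-hot (True/False) DataFrame,
# one column per attribute=value pair, one row per instance, and saved beforehand.
df = pd.read_pickle("voteM_onehot.pkl")  # hypothetical file name

# Apriori: frequent item sets, then rules with confidence >= 0.9 (top 15 kept).
frequent = apriori(df, min_support=0.3, use_colnames=True)  # support threshold is an assumption
rules_conf = association_rules(frequent, metric="confidence", min_threshold=0.9)
print(rules_conf.sort_values("confidence", ascending=False).head(15))

# FP-Growth: the same frequent item sets, then rules ranked by lift >= 1.5.
frequent_fp = fpgrowth(df, min_support=0.3, use_colnames=True)
rules_lift = association_rules(frequent_fp, metric="lift", min_threshold=1.5)
print(rules_lift.sort_values("lift", ascending=False).head(15))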
Findings
Generated sets of large item sets
Size of set of large item sets L(1) : 28
Size of set of large item sets L(2) : 47
Size of set of large item sets L(3) : 20
Size of set of large item sets L(4) : 1
The FP-Growth algorithm is then used to mine the association rules: 21 rules are mined and the top 15 are displayed.
Finding
Top 15 association rules
2. J48 Decision Tree for the Car-dataset
[Screenshots: J48 decision tree output in Weka]
Report on running the J48 program in Weka
The screenshots above show the output of decision tree generation using J48 in the Weka tool. A J48 pruned tree is generated from the attributes buying, maint, doors, persons, lug_boot, safety and the class values, and it covers all the attribute possibilities. Buying represents the buying price of the car, maint the maintenance cost, doors the number of doors, persons the number of persons the car can carry, lug_boot the size of the luggage boot and safety the estimated safety of the car. The attribute values are: buying - v-high, high, med, low; maint - v-high, high, med, low; doors - 2, 3, 4, 5-more; persons - 2, 4, more; lug_boot - small, med, big; and safety - low, med, high. A decision tree is generated with the J48 program using all of these attributes.
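J48 is Weka's implementation of C4.5. As a rough analogue outside Weka (not what the report actually ran), a scikit-learn decision tree on the one-hot-encoded car attributes could be sketched as follows; the CSV file name is assumed, and scikit-learn uses CART with the entropy criterion rather than C4.5 itself.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Assumed CSV export of the car dataset with the attributes named in the report.
cars = pd.read_csv("car.csv")  # columns: buying, maint, doors, persons, lug_boot, safety, class
X = pd.get_dummies(cars.drop(columns=["class"]))  # one-hot encode the nominal attributes
y = cars["class"]

# The entropy criterion mirrors the information-gain test behind C4.5/J48.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
scores = cross_val_score(tree, X, y, cv=10)  # 10-fold cross-validation, as in Weka
print(f"mean accuracy: {scores.mean():.3f}")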
Findings from running the J48 program
1. Number of leaves: 131
2. Size of the tree: 182
3. Time taken to build the model: 0.08 seconds
4. Time taken to test the model: 0.02 seconds
5. Correctly classified instances: 1664 of 1728 (about 96.3%)
6. Incorrectly classified instances: 64 (about 3.7%)
7. Kappa statistic: 0.9198
8. Mean absolute error, root mean squared error, relative absolute error and root relative squared error: 0.0248, 0.1114, 10.8411% and 32.9501% respectively.
References
Blobel, B., Hasman, A. and Zvárová, J. (2013). Data and knowledge for medical decision
support. Amsterdam: IOS Press.
Bramer, M. (2017). Principles of data mining. London: Springer.
De Ville, B. and Neville, P. (2013). Decision trees for analytics. Cary, N.C.: SAS Institute.
Goetz, T. (2011). Decision tree. New York: Rodale.
Grabczewski, K. (2016). Meta-learning in decision tree induction. Springer International Publishing.
Koh, Y. and Rountree, N. (2010). Rare association rule mining and knowledge discovery. Hershey, PA: IGI Global.
Kotu, V. and Deshpande, B. (n.d.). Predictive analytics and data mining.
Li, S. (2010). Higher order association rule mining.
Motoda, H., Wang, W., Yao, M., Zaïane, O., Cao, L. and Wu, Z. (n.d.). Advanced data mining
and applications.
Neves-Silva, R., Jain, L. and Howlett, R. (2015). Intelligent Decision Technologies. Cham:
Springer International Publishing.
Schuyler, J. (2001). Risk and decision analysis in projects. Newtown Square, Pa.: Project
Management Institute.
Tank, D. (2012). Real-Time Business Intelligence & Frequent Pattern Mining Algorithm.
Saarbrücken: LAP LAMBERT Academic Publishing.