Data Mining and Data Warehousing Assignment Solution - IT 446

Verified

Added on 2022/08/10

AI Summary

This assignment solution for IT 446, Data Mining and Data Warehousing, addresses several key concepts in the field. The solution begins by comparing and contrasting star and snowflake schemas, explaining their structures, advantages, and disadvantages, including discussions on normalization and query performance. It then delves into view materialization, explaining its purpose and comparing no-materialization, full materialization, and partial materialization approaches, emphasizing the trade-offs between storage usage and query response time. The solution also outlines the differences between the Multi-Way Array Aggregation method and the BUC method in the context of data cube computation, highlighting their distinct approaches to array partitioning and top-down construction. Finally, the solution applies the Apriori algorithm to a transaction database, identifying frequent itemsets with a minimum support of 2, providing a step-by-step analysis of the algorithm's application.

Data Mining and Data Warehousing
IT 446
Student Details:
Name: ###
CRN: ###
ID: ###
Instructions:
You must submit two separate copies (one Word file and one PDF file) using the Assignment Template on
Blackboard via the allocated folder. These files must not be in compressed format.
It is your responsibility to check and make sure that you have uploaded both the correct files.
Zero mark will be given if you try to bypass the SafeAssign (e.g. misspell words, remove spaces between words,
hide characters, use different character sets or languages other than English or any kind of manipulation).
Email submission will not be accepted.
You are advised to make your work clear and well-presented. This includes filling your information on the cover
page.
You must use this template, failing which will result in zero mark.
You MUST show all your work, and text must not be converted into an image, unless specified otherwise by the
question.
Late submission will result in ZERO mark.
The work should be your own, copying from students or other resources will result in ZERO mark.
Use Times New Roman font for all your answers.

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser

Pg. 1 Question TwoQuestion Two
Question One
Compare and contrast Star and Snowflakes schema of data warehousing model.
Star Schema: In terms of star schema, one fact table can remain in the center of
star and other linked dimension tables around it. This schema is called star
schema as the structure is similar to a star (Sanchez, 2016). Star schema is one
of the simplest type of schema in data warehousing. It has another name called
star join schema. This schema is suitable for querying large data sets.
Example of Star Schema
Learning
Outcome(s):1
LO1: Explain
different data
mining tasks,
problems and the
algorithms most
appropriate for
addressing them.
2 Marks

Pg. 2 Question TwoQuestion Two
Star schema represents every dimension using single dimension table. Attributes of
dimension should be represented within dimension table. Each dimension table should
be connected to fact table using foreign key. Dimension tables should not have
connection between each other (Sidi et al., 2016). Fact table should consist of measure
and key. Star schema is well known for optimal disk space and easily understandable
concepts. Normalization concepts should not be applied to dimension tables. Star
schema has wide support by business intelligence tools.
Snowflake Schema: Snowflake is the extended version of star schema. This is because
snowflake includes additional dimensions within star schema. The same reason is
applied to naming this schema, its structure is similar to a snowflake. Snowflake tables
are created by applying normalization concepts (Benjelloun, El Merouani & El Amin,
2017). For this reason, the dimension tables are decomposed into smaller tables which
add additional dimension tables into schema.
Snowflake Schema
The biggest feature of snowflake schema is lesser usage of disk space than star
schema. Dimension implementation is comparatively easier. Query performance is

Pg. 3 Question TwoQuestion Two
slower because as number of relations between dimensions is more (Mohammed,
2019). Effort to maintain snowflake schema is comparatively more.
Difference between Star Schema and Snowflake:
Star Schema Snowflake
Star schema shows the hierarchical
representation of dimensions through
dimension tables
Hierarchies are separated into dimension
tables
Each dimension table individually
connect to fact table (Sidi et al., 2016)
Dimension tables has connection with
other dimension tables
Single join is used for establishing
connection between fact table and
distinct dimension table
Multiple joins are used to establish
connection between dimensions and fact
table
Database design is very simple and easy
to understand
Database design is comparatively
complex and need idea of normalization
to understand (Mohammed, 2019)
Data redundancy level is extremely high Low data redundancy is key feature
Aggregated data is put into individual
dimension table
Data is separated into multiple
dimension tables

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser

Pg. 4 Question TwoQuestion Two
Question Two
What is view materialization? Why and under what condition partial view/lattice
of cuboid materialization is preferred over no-materialization and full
materialization?
Materialization can be referred to a copy of a target master from distinct point in time.
In general it is the database object which consist of outcome of a query is called
materialize view. The procedure of generating a materialized view is known as
materialization. Materialization is referred to the form of caching the outcome of
query. This process is similar to value memorization of a method within functional
languages. Materialized views are supported by major relational database management
systems such as Oracle, SQL server, PostgreSQL, MySQL and many more.
No materialization process does not pre-compute anything related to “nonbase”
cuboids. The outcome of this process is exclusive multidimensional computing that
sums on–the-fly. This make the whole process extremely slow. Full materialization, on
the other hand, pre-calculate entire available cuboids. Full cube is referred to as the
result of computation which generate a lattice. This option demands large quantity of
memory space in terms of recording all of the pre-calculated cuboids. Partial
materialization calculates a define subset of enter collection of potential cuboids
(Sohrabi & Ghods, 2016). This leads to possibility that, system can compute only
those subsets that meets specific user criteria. Partial materialization establishes a
perfect balance between the storage usage and response time. In order to process
partial materialization of cuboids three factors must be considered as important such as
identification of cuboids subset to materialize, materialize cuboids exploitation during
processing of query and modification of materialization of cuboids efficiently.
From the above discussion it is clear that partial materialize view is much faster and
storage efficient in comparison two other options.
Learning
Outcome(s):2
LO2: Apply and
evaluate data
mining algorithms
with respect to
problems they are
specifically
designed for.
1 Marks

Pg. 5 Question TwoQuestion Two
Question Three
List two differences between Multi-Way Array Aggregation method and BUC
method.
The multi array aggregation or multiway array aggregation method is used for
computing an entire data cube through utilizing multidimensional array to be assumed
its foundation data structure. Multiway array aggregation is a usual MOLAP approach
(Gulisano et al., 2017). On the other hand, BCU is used for computing iceberg cubes
and sparse cubes. Multiway array uses multidimensional array but BCU utilizes apex
cuboid to generate the cube. This way BCU can share costs of data partitioning.
Through this order processing, BCU can trim at the time of construction, through
utilizing Apriori Algorithm.
Multiway array aggregation make partition of array into pieces. A chunk or piece can
be assumed to be small enough to be fitted within the available memory of cube
computation. In simple words, chunking is the process of dividing an n-dimension
array into small n-dimension pieces. On the other hand, BCU is used for top down
construction, in spite of its naming convention (Hamoud, Hashim & Awadh, 2018).
BCU is used for exploration for the computation of a 3-D data cube.
Learning
Outcome(s):1
LO1: Explain
different data
mining tasks,
problems and the
algorithms most
appropriate for
addressing them
1.5 Marks

Pg. 6 Question TwoQuestion Two
Question Four
Consider the database containing transaction data as shown in the table below.
Apply Apriori algorithm and find the frequent itemsets where min-sup=2.
TID Items Bought
1 Diaper, Bread, Juice
2 Eggs, Bread, Pasta
3 Diaper, Eggs, Bread, Pasta
4 Eggs, Pasta
5 Rice
Above table represents the transaction database that has many collection of
transactions. First transaction in transaction database represents item set of Diaper,
Bread and Juice. It has to be understood that an item cannot appear twice in the same
transaction as well as item is thought to be sorted by lexicographical order within the
transaction table (Yuan, 2017). Total five transactions has been carried out as per
information provided in the table. Items available for transaction are diaper, bread,
juice, eggs, pasta and rice. If Apriori is applied to above table then following result is
generated.
The concern is about output of Apriori algorithm. Apriori algorithm is considered to be
an algorithm that discovers set of items following frequently within the transaction
database. A frequent item set is referred to as the item set occurring within at least
minsup transactions. Minsup is the parameter provide by the user.
Learning
Outcome(s):2
LO2: Apply and
evaluate data
mining algorithms
with respect to
problems they are
specifically
designed for
1.5 Marks

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser

Pg. 7 Question TwoQuestion Two
Item Sets support
{Diaper} 2
{Bread} 3
{Juice} 1
{Eggs} 3
{Pasta} 3
{Rice} 1
{Diaper, Bread} 2
{Eggs, Bread} 2
{Eggs, Pasta} 3
{Eggs, Bread, Pasta} 2
From the above table it is found that {Diaper}, {Diaper, Bread}, {Eggs, Bread} and
{Eggs, Bread, Pasta} are the item sets that has min-sup = 2.

Pg. 8 Question TwoQuestion Two
Bibliography:
Benjelloun, M., El Merouani, M., & El Amin, A. A. (2017). Using Snowflake Schema
and Bitmap Index for Big Data Warehouse Volume. International Journal of
Computer Applications, 180(8), 30-32.
Gulisano, V., Nikolakopoulos, Y., Cederman, D., Papatriantafilou, M., & Tsigas, P.
(2017). Efficient data streaming multiway aggregation through concurrent algorithmic
designs and new abstract data types. ACM Transactions on Parallel Computing
(TOPC), 4(2), 1-28.
Hamoud, A., Hashim, A. S., & Awadh, W. A. (2018). Clinical Data Warehouse: A
Review. Iraqi Journal for Computers and Informatics, 44(2).
Mohammed, K. I. (2019). Data Warehouse Design and Implementation Based on Star
Schema vs. Snowflake Schema.
Sanchez, J. (2016). A Review of Star Schema Benchmark. arXiv preprint
arXiv:1606.00295.
Sidi, E., El Merouani, M., Amin, E., & Abdelouarit, A. (2016). Star Schema
Advantages on Data Warehouse: Using Bitmap Index and Partitioned Fact
Tables. International Journal of Computer Applications, 975, 8887.
Sohrabi, M. K., & Ghods, V. (2016). Materialized View Selection for a Data
Warehouse Using Frequent Itemset Mining. Jcp, 11(2), 140-148.
Yuan, X. (2017). An improved Apriori algorithm for mining association rules. In AIP
conference proceedings (Vol. 1820, No. 1, p. 080005). AIP Publishing LLC.