Data Warehousing and Mining: Techniques, Drivers, and Benefits Report


Added on 2022/08/27

AI Summary

This report delves into the core concepts of data warehousing and data mining, exploring their significance and practical applications. It begins by examining the major drivers and benefits of data warehousing, considering different architectural approaches like Star and Snowflake schemas, and cloud-native solutions. The report then analyzes various data mining techniques, including statistical and artificial intelligence methods, with a focus on supervised and unsupervised learning. Furthermore, it discusses the application of these techniques to address analytical challenges, particularly those related to data security and fraud detection. The report also details the use of Tableau Desktop Software for data analysis, highlighting its features and visualization capabilities. Finally, the report addresses the Logi Analytics Maturity Model, providing a framework for assessing an organization's data analysis capabilities. The content includes visualizations created using Tableau and references relevant academic sources.

Name
Economic sources
Lecturer
Date


surname1
Drivers and benefits of data warehousing
A data warehouse is a relational database that stores data and serves it through queries similar
to ordinary SQL. Big data often does not follow a proper database structure, so tools such as
Hive or Spark SQL are needed to view the data, in Hive's case through Hive-specific queries.
All of the data loaded into a data warehouse is intended for analytics and reporting.
Star Schema: In this schema each dimension is connected directly to the fact table at the
centre, which gives the diagram its star-like shape. Its distinguishing feature is that the
dimension tables are not normalized. Because they are not normalized, queries are easier to
write (fewer INNER JOINs are needed to drill through the dimensions).
• This type of schema is generally preferred when the dimensions have fewer rows.
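
The single-join pattern described above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names (fact_orders, dim_product, dim_date), the columns and the amounts are assumptions made for the example, not data from this report.

```python
import sqlite3

# Hypothetical star schema: one fact table, each dimension joined to it directly.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    category TEXT              -- stored on the dimension itself (not normalized)
);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, order_date TEXT);
CREATE TABLE fact_orders (
    order_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id INTEGER REFERENCES dim_date(date_id),
    amount REAL
);
INSERT INTO dim_product VALUES (1, 'Apple', 'Fruit'), (2, 'Carrots', 'Vegetables');
INSERT INTO dim_date VALUES (1, '2019-01-06'), (2, '2019-01-07');
INSERT INTO fact_orders VALUES (1, 1, 1, 100.0), (2, 2, 1, 50.0), (3, 1, 2, 25.0);
""")

# One INNER JOIN is enough to report amounts by category.
star_rows = conn.execute("""
SELECT p.category, SUM(f.amount)
FROM fact_orders AS f
JOIN dim_product AS p ON p.product_id = f.product_id
GROUP BY p.category
ORDER BY p.category
""").fetchall()
print(star_rows)
```

Because the category is stored on the product dimension, the word 'Fruit' is repeated on every fruit row; that repetition is the redundancy a star schema accepts in exchange for simpler queries.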

Snowflake Schema: One fact table is connected to many dimensions, but some dimensions are
linked to each other, that is, normalized, to reduce data redundancy (less redundancy = easier
to maintain and change data). It gets its name from the intricate way in which the dimensions
are connected to each other. The main difference from the star schema is that it is normalized.
This is done to ensure the integrity of the data and reduce redundancy. As a result, querying
the data takes more time and effort than with a star schema, because more joins are needed. An
additional advantage over a star schema is that it requires less storage space.
• A snowflake schema is preferred when the dimensions have many rows.
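
The extra join that normalization introduces can be sketched in the same way. Again this is a minimal illustration using sqlite3; the tables dim_category, dim_product and fact_orders and their contents are hypothetical.

```python
import sqlite3

# Hypothetical snowflake schema: the product dimension is normalized, so the
# category lives in its own table and is referenced by key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    product_name TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_orders (
    order_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    amount REAL
);
INSERT INTO dim_category VALUES (1, 'Fruit'), (2, 'Vegetables');
INSERT INTO dim_product VALUES (1, 'Apple', 1), (2, 'Banana', 1), (3, 'Carrots', 2);
INSERT INTO fact_orders VALUES (1, 1, 100.0), (2, 2, 40.0), (3, 3, 50.0);
""")

# Two joins are now needed to reach the category name from the fact table.
snowflake_rows = conn.execute("""
SELECT c.category_name, SUM(f.amount)
FROM fact_orders AS f
JOIN dim_product AS p ON p.product_id = f.product_id
JOIN dim_category AS c ON c.category_id = p.category_id
GROUP BY c.category_name
ORDER BY c.category_name
""").fetchall()
print(snowflake_rows)
```

Each category name is stored once, which saves space and makes renaming a category a one-row update, at the cost of a longer query.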

Cloud-native data warehouses: three options for moving a mission-critical data warehouse to
the cloud.
• AWS Redshift
• Google BigQuery
• Panoply
Cloud-based ETL tools: these help pull data from cloud sources and transform it for loading
into the data warehouse without heavy planning.
• Stitch
• Blendo
• Fivetran
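
What such tools automate is the extract, transform, load (ETL) sequence. The sketch below shows the three steps by hand using only the Python standard library. It does not use the API of any of the tools listed above; the CSV content, the function names and the orders table are all assumptions made for illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system (note the stray spaces and bad value).
RAW_CSV = """order_id,product,amount
1, Apple ,100.50
2,Banana,not_a_number
3,Carrots,50
"""

def extract(text):
    """Extract: parse the raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim names, cast types, and skip rows with an invalid amount."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # drop the row rather than load bad data
        clean.append((int(row["order_id"]), row["product"].strip(), amount))
    return clean

def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, product TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW_CSV)))
loaded = conn.execute(
    "SELECT order_id, product, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Skipping invalid rows is one possible policy; a real pipeline might instead route them to a separate table for review.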

The major advantage of data warehousing is that data retrieval is faster within a data
warehouse, which helps the organization recover its data after a hacking incident. This
supports data security and guards against data loss, thereby building confidence in the
organization’s system. It also helps address the challenge of storing and analyzing the
available data. The main disadvantage is compatibility with existing systems, which may
require the company to modify the database system already in place. The most challenging
analytical problem is research, because it requires a lot of time and resources; without
resources, the research cannot be carried out. It is also challenging because the researcher
comes into contact with different people and things in different environments, which can pose
threats or risks to the researcher’s life.
Data mining techniques
Traditional methods of data analysis have long been used for fraud detection, but they are
time consuming. These methods require complex, lengthy investigations that draw on domain
knowledge from different fields, such as economics and business practices. Nowadays machine
learning and data mining help to overcome the drawbacks of these detailed traditional
methods.
The techniques used for fraud detection fall into two primary classes:
• Statistical Techniques
• Artificial Intelligence
Statistical data analysis techniques comprise data processing techniques for filling in
missing or incorrect data and for error correction, whereas artificial intelligence techniques
comprise data mining or machine learning techniques that are used to classify, cluster and
segment the data. These techniques are used to find associations among different types of data
that may reveal interesting patterns, including those which may be related to fraud.
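
As a small illustration of the statistical side, filling in missing values can be sketched as median imputation. The function name impute_median and the sample amounts are assumptions chosen for the example; the median is only one of several reasonable fill strategies.

```python
from statistics import median

def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = median(observed)
    return [fill if v is None else v for v in values]

# Illustrative order amounts with two missing entries.
amounts = [2500, None, 8400, 7000, None, 2600]
filled = impute_median(amounts)
print(filled)
```

The median is used here rather than the mean because it is less affected by extreme values, which matters when the unusual records are exactly the ones of interest.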
Machine learning can be classified into:
• Supervised Learning
• Unsupervised Learning
In supervised learning, a random sample of data is taken and manually classified as either
'fraudulent' or 'non-fraudulent', and a model is trained on these labelled records, whereas
unsupervised methods don't make use of labelled records. These methods help control the
analytical problems identified in week 1, such as data security, the attraction of stored data
for hackers, and the storing and analyzing of data.
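
The difference between the two approaches can be sketched in a few lines of plain Python. Both methods shown here are deliberately simple stand-ins chosen for illustration (a nearest-mean classifier for the supervised case and a z-score outlier check for the unsupervised case); the transaction amounts and function names are hypothetical.

```python
from statistics import mean, pstdev

def fit_nearest_mean(samples):
    """Supervised: learn one mean amount per label from labelled records."""
    by_label = {}
    for amount, label in samples:
        by_label.setdefault(label, []).append(amount)
    return {label: mean(values) for label, values in by_label.items()}

def predict(model, amount):
    """Assign the label whose learned mean is closest to the amount."""
    return min(model, key=lambda label: abs(amount - model[label]))

def flag_outliers(amounts, threshold=2.0):
    """Unsupervised: flag amounts far from the mean, using no labels at all."""
    mu = mean(amounts)
    sigma = pstdev(amounts)
    if sigma == 0:
        return []
    return [a for a in amounts if abs(a - mu) / sigma > threshold]

# Supervised: a manually labelled sample is required.
labelled = [
    (20, "non-fraudulent"), (35, "non-fraudulent"), (50, "non-fraudulent"),
    (900, "fraudulent"), (1200, "fraudulent"),
]
model = fit_nearest_mean(labelled)
prediction_small = predict(model, 40)
prediction_large = predict(model, 1000)

# Unsupervised: only the raw amounts are available.
unlabelled = [20, 35, 50, 40, 30, 25, 45, 1000]
outliers = flag_outliers(unlabelled)
print(prediction_small, prediction_large, outliers)
```

The supervised model can only be as good as its manual labels, while the unsupervised check needs no labelling effort but merely marks records as unusual, which is not the same as proving fraud.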
In this project, Tableau Desktop software was used to analyze the data from the Week 1
assignment. The software is commercial, but a 14-day free trial is available. It provides
robust, effective and easy-to-use utilities for analyzing data. From a personal perspective,
Tableau surpasses some of the other software available in the market today. One of the
features that makes it outstanding is its variety of data stores, which means that users can
analyze data that comes bundled with the software. One of the most popular data stores that
Tableau offers is Kaggle, a Google subsidiary and probably the largest online data store.
Additionally, the software provides an option for connecting to remote data sources, with
support for a variety of sources such as MS Excel, MySQL, MariaDB and Google Spreadsheets.
The other feature that makes Tableau a preferred choice is its easy analysis utility with
drag and drop. Tableau allows the user to define queries, the components of a graph, or even
filters by dragging elements to fit requirements.

The most fundamental data mining techniques include classification and clustering.
Classification assigns each record to one of a set of predefined categories, so that
conclusions can be drawn from the category a record falls into. It is often applied to
populations, for example classifying people into low-income and high-income earners. This
also helps in allocating resources according to social class, i.e. the rich, the middle class
and the poor. Clustering groups people or chunks of data according to their characteristics,
without predefined categories, which helps in designing and assigning what is to be allocated
according to their similarities.
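
The contrast between the two techniques can be sketched in plain Python. The income cutoffs, the sample incomes and the function names are assumptions made for illustration, and the clustering shown is a simple one-dimensional k-means, which is one common clustering method among many.

```python
def classify_income(income, low_cutoff=30_000, high_cutoff=100_000):
    """Classification: assign a predefined category using fixed (hypothetical) cutoffs."""
    if income < low_cutoff:
        return "low-income"
    if income < high_cutoff:
        return "middle-income"
    return "high-income"

def kmeans_1d(values, k, iterations=20):
    """Clustering: group similar values without predefined categories (1-D k-means)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centroids across the sorted data.
    step = max(1, (len(ordered) - 1) // max(1, k - 1))
    centroids = [float(ordered[min(i * step, len(ordered) - 1)]) for i in range(k)]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Recompute each centroid; keep the old one if its cluster is empty.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters

incomes = [12_000, 15_000, 18_000, 95_000, 105_000, 110_000]
labels = [classify_income(x) for x in incomes]
groups = kmeans_1d(incomes, k=2)
print(labels)
print(groups)
```

Note that the two results need not agree: the fixed cutoffs place 95,000 in the middle-income class, while the clustering groups it with the two highest incomes because it is closest to them.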
The following are some of the visualizations that I created using Tableau.
[Bar chart: Sum of Order ID by Country. Horizontal axis: Apple, Banana, Beans, Broccoli,
Carrots and Orange, grouped as Fruit or Vegetables, with order dates from 1/6/2019 to
1/11/2019. Vertical axis: 0 to 12000.]
Figure 1: A simple bar graph

[Bar chart: Order ID by Country. Horizontal axis: Apple, Banana, Beans, Broccoli, Carrots and
Orange, grouped as Fruit or Vegetables, with order dates from 1/6/2019 to 1/11/2019. Vertical
axis: 0 to 12000. Data labels, in the order extracted: 2500, 600, 8400, 7000, 2600, 8300,
10000, 4300, 3600. Legend: Total, Order ID.]
Figure 2: A detailed bar graph
[Line chart. Horizontal axis: Apple, Banana, Beans, Broccoli, Carrots and Orange, grouped as
Fruit or Vegetables, with order dates from 1/6/2019 to 1/11/2019. Vertical axis: 0 to 12000.
Data labels, in the order extracted: 2500, 600, 8400, 7000, 2600, 8300, 10000, 4300, 3600.
Legend: Total, Order ID.]
Figure 3: Comparative Line graph

[Radar chart: Orders, showing Amount, Date and Country. Nine spokes numbered 1 to 9, with
order dates from 1/6/2019 to 1/11/2019. Radial scale: $0 to $50,000. Data labels, in the order
extracted: $4,300, $8,300, $600, $8,400, $2,600, $3,600, $10,000, $7,000, $2,500.]
Figure 14: A radar Chart
References
Adamson, C. (2010). Star Schema The Complete Reference. McGraw Hill Professional.
Bergeron, B. (2013). Developing a Data Warehouse for the Healthcare Enterprise: Lessons
from the Trenches. HIMSS.
Bulusu, L. (2012). Open Source Data Warehousing and Business Intelligence. CRC Press.
Ganapathi, P., & Shanmugapriya, D. (2019). Handbook of Research on Machine and Deep
Learning Applications for Cyber Security. IGI Global.
Jensen, C. S., Pedersen, T. B., & Thomsen, C. (2010). Multidimensional Databases and
Data Warehousing. Morgan & Claypool Publishers.

Linoff, G. S., & Berry, M. J. (2011). Data Mining Techniques: For Marketing, Sales, and
Customer Relationship Management. John Wiley & Sons.
Thielscher, M., & Zhang, D. (2013). AI 2012: Advances in Artificial Intelligence: 25th
International Australasian Joint Conference, Sydney, Australia, December 4-7, 2012,
Proceedings. Springer.
Vetter, S., Lu, H., Olejniczak, M., & IBM Redbooks. (2018). Enterprise Data Warehouse
Optimization with Hadoop on IBM Power Systems Servers. IBM Redbooks.