ETL Process Project: Data Warehousing and Business Intelligence

ETL
Steel L.C. Database
Introduction
The ETL process is vital for data warehousing because data tables are built from staging areas into data marts, cubes, or summary tables. The ETL procedure completes a data warehousing project; without the ETL method, there would be no way to use data for BI or to derive significant insights from all the data gathered by companies. "The challenge in data warehouse environments is to integrate, rearrange as well as consolidate large volumes of data over many systems, thereby providing a new unified information base for business intelligence" (Lane, 2005).
Analysis and Answers to the Four Questions
The ETL process has three steps: 1. Extract, 2. Transform, and 3. Load. Extraction consists of identifying valuable data and extracting it from several sources. In some cases the useful data is not identified immediately, and information is obtained that was not initially desirable. In the second step, the extracted data is transformed into a structured format that may be predefined by the data warehouse (DW). Loading is when the structured information is loaded into the appropriate tables in the DW.
The four categories of ETL technologies are sophisticated, enabler, simpler, and rudimentary, as defined by Solomon (2006). All of these technologies have advantages and disadvantages that should be weighed carefully before using them on a DW project. For example, a sophisticated technology will have a high cost and an above-average learning curve, but it does have good metadata capabilities. A combination of enabler, simpler, and rudimentary techniques can be bought at a low cost, but the DWAs will need to maintain metadata diligently.
Examples of ETL technologies are Informatica (sophisticated), Data Junction (enabler), Transact-SQL (simpler), and Java and COBOL (rudimentary) (Solomon, 2006). Today the available ETL tools have improved, and nearly free options are available, including Apache NiFi, Jasper, and Pentaho Data Integration. Commercial choices include Confluent, Oracle Data Integrator, SAS ETL, and Alooma.
ETL Tools suggests, "Begin with defining the business requirements, then consider the technical aspects, and finally, choose the right ETL tool" (n.d.).
ETL stands for extraction, transformation, and loading of data. As the name suggests, these three important words together describe a process: to build a data warehouse you need to extract data from spreadsheets or other data sources. The process is not confined to this; ETL also plays a vital role in updating real-time data warehousing efforts, depending upon the design of the ETL (Berliantara, 2017).
There are three steps in the ETL process, which are as follows: Extraction, Transformation, and Loading.
Extraction
The extraction step of the ETL process connects to the source data, collects the necessary data, and then selects the data for further analytical processing. The main aim of this step is to collect data from different sources.
Transformation
Transformation is the second step of the ETL process. The primary objective of this step is to transform the extracted data into a specified standard format by applying specific rules or functions.
Loading
The next phase of the ETL process is loading, which involves transferring the transformed data into the target database or data warehouse.
The most common operations used in these steps are filtering, sorting, standardizing, and so on.
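The sketch below ties the three steps together in Python. It is a minimal illustration only: the source file sales.csv, its columns (store, week, upc, price, qty), and the SQLite file standing in for the warehouse are all assumptions made for the example.

```python
import sqlite3
import pandas as pd

# --- Extract: collect data from a source (hypothetical sales.csv) ---
raw = pd.read_csv("sales.csv")                          # assumed columns: store, week, upc, price, qty

# --- Transform: filter, standardize, and derive values for the DW format ---
clean = raw.dropna(subset=["store", "upc"])             # drop rows missing key identifiers
clean["upc"] = clean["upc"].astype(str).str.zfill(12)   # standardize UPC formatting
clean["sales"] = clean["price"] * clean["qty"]          # derive a calculated measure

# --- Load: import the structured rows into a target warehouse table ---
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("fact_product_sales", conn, if_exists="append", index=False)
```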
Four key considerations for ETL technologies are as follows:
Cost Control
The primary costs of ETL technologies are hardware, software, support, consulting, people, and cloud. ETL technology vendors must carry out cost and benefit analysis in order to compete in the industry (B., M. & A., 2015).
Revenue Generation
ETL technologies play a vital role in driving an organization's revenue generation.
Security and compliance
To attain superiority, extreme care must be taken with the security of data, and ETL technologies have taken extensive measures regarding the security of the data.
New initiatives
ETL technology vendors must continue to educate and re-educate themselves on technology and consumer trends in order to stay in the market.
Examples of ETL technologies are as follows:
1) Informatica – PowerCenter
2) IBM – InfoSphere Information Server
3) Oracle Data Integrator
4) Microsoft – SQL Server Integration Services (SSIS)
5) Ab Initio
6) Talend – Talend Open Studio for Data Integration
7) CloverETL
8) Pentaho Data Integration
ETL Plan
ETL stands for Extract-Transform-Load. ETL covers the process by which data is loaded from the source system into the data warehouse.
Extract
The extraction step covers data extraction from the source system and makes the data accessible for further processing. The primary purpose of the extraction step is to retrieve all the necessary data from the source system with as few resources as possible. The extraction step should be designed so that it does not adversely affect the source system in terms of performance, response time, or locking.
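A minimal sketch of a low-impact extraction is shown below, assuming a hypothetical movement source table with a week column used as a watermark; chunked reads keep the footprint on the source system small.

```python
import sqlite3
import pandas as pd

def extract_new_rows(source_conn, last_loaded_week, chunk_size=10_000):
    """Pull only rows added since the last load, in small chunks, so the
    source system is never hit with one large, long-running query."""
    # The 'movement' table and its columns are assumed for this example.
    query = "SELECT store, upc, week, price, qty FROM movement WHERE week > ?"
    return pd.read_sql_query(query, source_conn, params=(last_loaded_week,),
                             chunksize=chunk_size)

# Usage: hand each extracted chunk on to the transform step.
with sqlite3.connect("source_system.db") as src:
    for batch in extract_new_rows(src, last_loaded_week=52):
        print(len(batch), "rows extracted")
```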
Cleaning
The cleaning step is one of the most important steps for ensuring the quality of the data in the data warehouse. Cleaning ensures that identifiers are unique, null values are converted to standardized "unavailable/not provided" values, phone numbers are converted to a standardized format, and address fields are validated against each other (state/country, city/state, city/ZIP code, city/street).
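A small sketch of such rules in Python follows, assuming a hypothetical customer frame with customer_id, phone, city, and zip columns, and a lookup of known ZIP codes per city for the cross-field check.

```python
import re
import pandas as pd

def clean_customers(df: pd.DataFrame, zip_by_city: dict) -> pd.DataFrame:
    """Apply the cleaning rules above to a hypothetical customer frame."""
    df = df.drop_duplicates(subset=["customer_id"]).copy()      # identifiers must be unique
    df["phone"] = (df["phone"].fillna("")                        # nulls -> standardized value
                              .apply(lambda p: re.sub(r"\D", "", str(p)))
                              .replace("", "NOT PROVIDED"))
    df["city"] = df["city"].fillna("UNKNOWN").str.title()
    # Cross-field validation: flag rows whose ZIP disagrees with the known ZIP for the city.
    df["zip_ok"] = df.apply(
        lambda row: zip_by_city.get(row["city"]) in (None, row["zip"]), axis=1)
    return df
```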
Transform
This step applies a set of rules for converting data from source to target. It includes converting values to consistent dimensions and units so that they can be joined later, joining data from many sources, generating aggregations and surrogate keys, sorting, and deriving calculated values.
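A brief sketch of these transformations follows, using small made-up movement and store frames to show a join, a derived value, an aggregation, a surrogate key, and a sort.

```python
import pandas as pd

# Made-up inputs standing in for the movement rows and a store lookup table.
movement = pd.DataFrame({"store": [2, 2, 5], "week": [1, 1, 1],
                         "price": [1.99, 2.49, 0.99], "qty": [3, 1, 7]})
stores = pd.DataFrame({"store": [2, 5], "city": ["Chicago", "Evanston"]})

joined = movement.merge(stores, on="store", how="left")    # join data from many sources
joined["sales"] = joined["price"] * joined["qty"]          # derive a calculated value
weekly = joined.groupby(["store", "week"], as_index=False)["sales"].sum()  # aggregation
weekly["store_key"] = list(range(1, len(weekly) + 1))      # assign surrogate keys
print(weekly.sort_values("sales", ascending=False))        # sorting
```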
Load
It is essential to perform the load properly and with as few resources as possible. To make the load process more efficient, you can disable constraints and indexes before loading and re-enable them only after the load completes. Referential integrity must be maintained by the ETL tool to ensure consistency.
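The sketch below shows that pattern against a SQL Server target via pyodbc; the connection string, the index name, and the exact columns inserted are assumptions for illustration, not a definitive load routine.

```python
import pyodbc  # assumes a SQL Server target with an ODBC driver installed

conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=localhost;DATABASE=dw;Trusted_Connection=yes;")
cur = conn.cursor()

# Disable a nonclustered index and foreign-key checks before the bulk insert...
cur.execute("ALTER INDEX IX_FactProductSales_Storekey ON dbo.FactProductSales DISABLE;")
cur.execute("ALTER TABLE dbo.FactProductSales NOCHECK CONSTRAINT ALL;")

cur.executemany(
    "INSERT INTO dbo.FactProductSales (Storekey, Timekey, Sales) VALUES (?, ?, ?);",
    [(1, 1, "125.50"), (2, 1, "87.25")])     # placeholder rows

# ...then rebuild the index and re-validate constraints so referential integrity holds.
cur.execute("ALTER INDEX IX_FactProductSales_Storekey ON dbo.FactProductSales REBUILD;")
cur.execute("ALTER TABLE dbo.FactProductSales WITH CHECK CHECK CONSTRAINT ALL;")
conn.commit()
```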
i. Target Data Needed in the Data Warehouse
In our data warehouse we have two data marts: one is customer count and the other is product sales. In total, the target data warehouse needs five tables for these two data marts: three dimension tables (product dimension, store dimension, and time dimension) and two fact tables (Product Sales fact and Product Category fact).
Target Table    Columns          Data type
DIM_STORE       STORE_KEY        int
                NAME             varchar(50)
                CITY             varchar(50)
                STORE            varchar(50)

Target Table    Columns          Data type
DIM_TIME        Time_key         int
                WeekNumber       int
                Start            varchar(50)
                End              varchar(50)
                Special Events   varchar(50)

Target Table    Columns          Data type
DIM_PRODUCT     Product_key      int
                ProductID        varchar(50)
                Category         varchar(50)
                Description      varchar(50)
Target Table    Columns          Data type
Product         Category         varchar(50)
                Sales            float
                STORE            int
                WEEK             int
Target Table    Columns          Data type
Demo            NAME             varchar(50)
                CITY             varchar(50)
                STORE            int
                WORKWOM          varchar(50)

Target Table    Columns          Data type
Movement        STORE            varchar(50)
                UPC              varchar(50)
                WEEK             int
                PROFIT           float
                SALE             varchar(50)
                CATEGORY         varchar(50)
Target Table          Columns          Data type
FactProductSales      Storekey         int
                      Timekey          int
                      Category         varchar(50)
                      Sales            varchar(50)
                      CITY             varchar(50)
                      WORKWOM          varchar(50)

Target Table          Columns          Data type
FactProductCategory   PROFIT           int
                      SALE             int
                      Storekey         int
                      Timekey          int
                      Productkey       int
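As a sketch, the table definitions above could be created with DDL like the following, shown from Python with SQLite standing in for the warehouse so the example stays self-contained; only DIM_STORE and FactProductSales are shown.

```python
import sqlite3

# DDL mirroring two of the target tables above (SQLite used only to keep the sketch runnable).
ddl = """
CREATE TABLE IF NOT EXISTS DIM_STORE (
    STORE_KEY int PRIMARY KEY,
    NAME      varchar(50),
    CITY      varchar(50),
    STORE     varchar(50)
);
CREATE TABLE IF NOT EXISTS FactProductSales (
    Storekey  int REFERENCES DIM_STORE (STORE_KEY),
    Timekey   int,
    Category  varchar(50),
    Sales     varchar(50),
    CITY      varchar(50),
    WORKWOM   varchar(50)
);
"""

with sqlite3.connect("warehouse.db") as conn:
    conn.executescript(ddl)   # create the dimension and fact tables if they do not exist
```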
ii. Data Sources
Data needed               Data Source
Dominick Stores details   Dominick_Store.xls (created using Dominick's Data Manual.pdf)
Time details              Week_Decode.xls (created using Dominick's Data Manual.pdf)
Product (UPC) details     UPC<Product Acronym>.csv
Movement details          W<Product Acronym>.csv
Demographic               DEMO.csv
CCount                    CCOUNT.csv
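A sketch of reading these sources with pandas follows; <Product Acronym> is the data set's own placeholder and is left as such, and reading the .xls files assumes an engine such as xlrd is installed.

```python
import pandas as pd

product_acronym = "<Product Acronym>"   # the source's placeholder; substitute per product category

stores   = pd.read_excel("Dominick_Store.xls")          # Dominick stores details
weeks    = pd.read_excel("Week_Decode.xls")             # time details
upc      = pd.read_csv(f"UPC{product_acronym}.csv")     # product (UPC) details
movement = pd.read_csv(f"W{product_acronym}.csv")       # weekly movement details
demo     = pd.read_csv("DEMO.csv")                      # demographics
ccount   = pd.read_csv("CCOUNT.csv")                    # customer count
```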
iii. Mappings
Mapping tables for staging and data mart loads
Dimension: Store

DW Target Table                          DW Target Column   Target Data type   Source System/Table   Source Column   Transformation/Business Rule   Error Handling Rules
[601-group5-dw-area].[dbo].[Store_Dim]   Store_Id           int                                                      Surrogate key
                                         Store_number       int                Dominick's Stores     Store
                                         Store_City         varchar            Dominick's Stores     City
                                         Store_Zone         int                Dominick's Stores     Zone
                                         Store_Zipcode      varchar            Dominick's Stores     Zipcode
Dimension: Time

DW Target Table               DW Target Column   Target Data type   Source System/Table   Source Column   Transformation/Business Rule   Error Handling Rules
[601-group5-dw-area].[dbo].   Time_Id            int                                                      Surrogate key
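A sketch of implementing the Store dimension mapping above in Python: rename the Dominick's Stores source columns to the Store_Dim target columns and generate the Store_Id surrogate key; the tiny sample frame stands in for the Dominick_Store.xls extract.

```python
import pandas as pd

def build_store_dim(stores: pd.DataFrame) -> pd.DataFrame:
    """Map Dominick's Stores source columns to the Store_Dim target columns."""
    dim = stores.rename(columns={"Store": "Store_number", "City": "Store_City",
                                 "Zone": "Store_Zone", "Zipcode": "Store_Zipcode"})
    dim = dim[["Store_number", "Store_City", "Store_Zone", "Store_Zipcode"]]
    dim.insert(0, "Store_Id", list(range(1, len(dim) + 1)))   # surrogate key, per the business rule
    return dim

# Usage with a small stand-in for the Dominick_Store.xls extract:
sample = pd.DataFrame({"Store": [2, 5], "City": ["Chicago", "Evanston"],
                       "Zone": [1, 2], "Zipcode": ["60614", "60201"]})
print(build_store_dim(sample))
```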