An Overview of Distributed Data Warehouses and Discretization

Added on  2024/04/03

This report provides a comprehensive overview of distributed data warehouses, including local, global, and independently evolving types. It discusses the advantages and disadvantages of distributed data warehousing, focusing on the transfer of data between local and global environments. The report also delves into data discretization, explaining its purpose in converting continuous data into discrete intervals for improved data quality and reduced running time in data mining tasks. Key steps of discretization are outlined, along with an explanation of binning as a data smoothing technique, illustrated with an example of age discretization.
The Distributed Data Warehouse
Distributed Data Warehouse
Most organizations build and maintain a single centralized data warehouse environment. This setup makes sense for many reasons:
The data in the warehouse is integrated across the corporation, and an integrated view is used only at headquarters.
The corporation operates on a centralized business model.
The volume of data in the data warehouse is such that a single centralized repository of data makes sense.
Even if data could be integrated, if it were dispersed across multiple local sites, it would be cumbersome to access.
Types of Distributed Data Warehouses
The three types of distributed data warehouses are as follows:
1. Business is distributed geographically or over multiple, differing product lines. In this case
there is what can be called a local data warehouse and a global data warehouse. The
local data warehouse represents data and processing at a remote site, and the global
data warehouse represents that part of the business that is integrated across the
business.
2. The data warehouse environment will hold a lot of data, and the volume of data will be
distributed over multiple processors. Logically there is a single data warehouse, but physically
there are many data warehouses that are all tightly related but reside on separate
processors. This configuration can be called the technologically distributed data warehouse.
3. The data warehouse environment grows up in an uncoordinated manner: first one data
warehouse appears, then another. The lack of coordination of the growth of the different
data warehouses is usually a result of political and organizational differences. This case can
be called the independently evolving distributed data warehouse.
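The technologically distributed case (one logical warehouse, many physical ones) can be illustrated with a small sketch. It assumes rows are placed on processors by hashing a key, which is only one possible placement scheme and is not stated in the source. The names `assign_node`, `partition`, `scan_all`, and the `customer_id` field are hypothetical.

```python
import zlib

def assign_node(key: str, num_nodes: int) -> int:
    """Map a row key to a node index using a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes

def partition(rows, key_field, num_nodes):
    """Physically spread rows over num_nodes separate lists (the 'processors')."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[assign_node(str(row[key_field]), num_nodes)].append(row)
    return nodes

def scan_all(nodes):
    """Logical view: read every physical partition as if it were one warehouse."""
    for node in nodes:
        yield from node

# Hypothetical sample data: 20 rows spread over 4 nodes.
rows = [{"customer_id": i, "amount": i * 10} for i in range(20)]
nodes = partition(rows, "customer_id", 4)
total = sum(row["amount"] for row in scan_all(nodes))  # query spans all nodes
```

A stable hash is used instead of Python's built-in `hash`, whose output for strings varies between interpreter runs, so a given key always lands on the same node.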
Local and Global Data Warehouses
When a corporation is spread around the world, information is needed both locally
and globally. The global needs for corporate information are met by a central data
warehouse where information is gathered. But there is also a need for a separate
data warehouse at each local organization — that is, in each country. In this case, a
distributed data warehouse is needed. Data will exist both centrally and in a
distributed manner.
A second case for a local and global distributed data warehouse occurs when a
large corporation has many lines of business. Although there may be little or no
business integration among the different vertical lines of business, there is integration at the
corporate level, at least as far as finance is concerned. The different lines of business
may not meet anywhere else but at the balance sheet, or there may be considerable
business integration, including such things as customers, products, vendors, and the
like. In this scenario, a corporate centralized data warehouse is supported by many
different data warehouses for each line of business.
In some cases, part of the data warehouse exists centrally (that is, globally), and other
parts of the data warehouse exist in a distributed manner (that is, locally).
The Local Data Warehouse
A form of data warehouse, known as a local data
warehouse, contains data that is of interest only to the
local level. There might be a local data warehouse for
Brazil, one for France, and one for Hong Kong. Or there
might be a local data warehouse for car parts,
motorcycles, and heavy trucks. Each local data
warehouse has its own technology, its own data, its own
processor, and so forth. The local data warehouse
serves the same function that any other data
warehouse serves, except that the scope of the
data warehouse is local. For example, the data
warehouse for Brazil does not have any information
about business activities in France, or the data
warehouse for car parts does not have any data
about motorcycles. In other words, the local data
warehouse contains data that is historical in nature
and is integrated within the local site. There is no
coordination of data or structure of data from one
local data warehouse to another.
Local Data Warehouse
Activity occurs at the local level
The bulk of the operational processing is done at the local site
The local site is autonomous
Each local data warehouse has its own unique architecture and data contents
The data is unique and of prime importance to that locality only
The majority of the records are local and not replicated
Any intersection of data between local data warehouses is coincidental
Local warehouses serve different technical communities
The scope of each local data warehouse is limited to the local site
Local warehouses also include historical data and are integrated only within the
local site.
Global Data Warehouse
The global data warehouse contains information that must be integrated at the
corporate level.
In many cases, this consists only of financial information.
In other cases, this may mean integration of customer information, product
information, and so on.
While a considerable amount of information will be peculiar to and useful to only
the local level, other corporate common information will need to be shared and
managed corporately.
The global data warehouse contains the data that needs to be managed
globally.
Disadvantages of Distributed Data Warehousing
How frequently will the transfer of data from the local environment to the global
environment be made? Daily? Weekly? Monthly? The rate of transfer depends
on a combination of factors. How quickly is the data needed in the global data
warehouse? How much activity has occurred at the local level? What volume of
data is being transported?
Is the transportation of the data from the local environment to the global data
warehouse across national lines legal?
What network will be used to transport the data from the local environment to
the global environment? Is the Internet safe enough? Is it reliable enough? Can
the Internet safely transport enough data? What is the backup strategy? What
safeguards are in place to determine if all of the data has been passed?
What safeguards are in place to determine whether data is being hacked during
transport from the local environment to the global environment?
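One possible safeguard for checking that all of the data has been passed is to send a small manifest (row count plus checksum) with each batch and recompute it on arrival. The sketch below is an illustration under assumptions, not a method described in the source: the functions `batch_manifest` and `verify_batch` and the row layout are hypothetical, and rows are assumed to be JSON-serializable dictionaries.

```python
import hashlib
import json

def batch_manifest(rows):
    """Return the row count and a SHA-256 digest for a list of dict rows."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys gives a canonical serialization, so the local and
        # global sides hash identical bytes for identical rows.
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return {"row_count": len(rows), "sha256": digest.hexdigest()}

def verify_batch(received_rows, sent_manifest):
    """True only if the received rows reproduce the sender's manifest."""
    return batch_manifest(received_rows) == sent_manifest

# Hypothetical batch sent from a local warehouse to the global one.
sent = [{"id": 1, "amount": 10.5}, {"id": 2, "amount": 7.0}]
manifest = batch_manifest(sent)
complete = verify_batch(sent, manifest)        # all rows arrived intact
truncated = verify_batch(sent[:1], manifest)   # one row lost in transit
```

A plain digest like this detects missing or accidentally corrupted rows. Detecting deliberate tampering would additionally require a keyed digest (for example an HMAC with a secret shared by both sites) or a signed manifest.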
Data Discretization
Discretization is the process of transferring
continuous functions, models, variables, and
equations into discrete counterparts. This process is
usually carried out as a first step toward making
them suitable for numerical evaluation and
implementation on digital computers.
Data discretization is defined as a process of
converting continuous data attribute values into a
finite set of intervals with minimal loss of information
and associating with each interval some specific
data value or conceptual labels.
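The definition above can be made concrete with a minimal sketch that converts continuous values into a finite set of intervals and attaches a conceptual label to each. It assumes equal-width intervals, which is only one of several ways to choose them. The temperature readings, the labels, and the function names `equal_width_edges` and `to_label` are hypothetical.

```python
def equal_width_edges(values, k):
    """Return k + 1 edges that split [min, max] into k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]

def to_label(value, edges, labels):
    """Map a value to the label of its interval [edges[i], edges[i + 1])."""
    for i, label in enumerate(labels):
        if value < edges[i + 1]:
            return label
    return labels[-1]  # the maximum value falls into the last interval

# Hypothetical continuous attribute: temperature readings.
temps = [12.0, 15.5, 18.0, 21.0, 24.5, 27.0, 30.0]
labels = ["low", "medium", "high"]

edges = equal_width_edges(temps, len(labels))   # [12.0, 18.0, 24.0, 30.0]
discrete = [to_label(t, edges, labels) for t in temps]
# discrete == ["low", "low", "medium", "medium", "high", "high", "high"]
```

Seven distinct numeric values are reduced to three labels. The information lost is the position of each value inside its interval.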
Why is it needed?
Improves the quality of discovered knowledge.
Makes the data easier to maintain.
Many data mining (DM) algorithms can only deal with discrete attributes and
therefore require discretized data.
Reduces the running time of various data mining tasks such as
association rule discovery, classification, and prediction.
Prepares data for further analysis, e.g., classification.
Discretization is considered a data reduction mechanism because it
reduces data from a large domain of numeric values to a small set
of categorical values.
Steps of Discretization
Step 1: Sorting the continuous values of the feature to be
discretized.
Step 2: Evaluating a cut point for splitting or adjacent
intervals for merging.
Step 3: Splitting or merging intervals of continuous values
according to some defined criterion.
Step 4: Stopping when a defined stopping criterion is met.
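The four steps can be sketched as a top-down (splitting) procedure. The sketch makes several assumptions that the source does not specify: each value has a class label, a cut point is scored by information gain (reduction in entropy), and splitting stops at a minimum gain or a maximum depth. The names `entropy`, `best_cut`, `find_cut_points`, `min_gain`, and `max_depth` are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_cut(pairs):
    """Step 2: score every boundary in sorted (value, label) pairs by information gain."""
    labels = [label for _, label in pairs]
    base = entropy(labels)
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal values
        left, right = labels[:i], labels[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between neighbours
        if best is None or gain > best[0]:
            best = (gain, cut)
    return best

def find_cut_points(values, labels, min_gain=0.1, max_depth=3):
    pairs = sorted(zip(values, labels), key=lambda p: p[0])  # Step 1: sort
    cuts = []

    def split(segment, depth):
        if depth == max_depth or len(segment) < 2:      # Step 4: stop
            return
        found = best_cut(segment)                       # Step 2: evaluate
        if found is None or found[0] < min_gain:        # Step 4: stop
            return
        cut = found[1]
        cuts.append(cut)                                # Step 3: split
        split([p for p in segment if p[0] < cut], depth + 1)
        split([p for p in segment if p[0] > cut], depth + 1)

    split(pairs, 0)
    return sorted(cuts)

# Hypothetical data: two well-separated classes yield a single cut point.
cuts = find_cut_points([1, 2, 3, 10, 11, 12], ["a", "a", "a", "b", "b", "b"])
# cuts == [6.5]
```

A bottom-up (merging) variant would start with one interval per distinct value and repeatedly merge the adjacent pair that scores best, following the same four steps.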
Binning
Binning is a data smoothing technique that helps to group a
huge number of continuous values into a smaller number of bins. For
example, if we have data about a group of students, we may want
to arrange their marks into a smaller number of mark intervals by
making bins of grades: one bin for grade A, one for grade B,
one for grade C, one for grade D, and one for grade F.
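The grade example can be sketched as follows. The grade boundaries (60, 70, 80, 90), the sample marks, and the function names are assumptions for illustration, since the source gives no cutoffs. The last function shows smoothing by bin means, one common way binning is used to smooth data: each mark is replaced by the mean of its bin.

```python
from bisect import bisect_right

# Assumed lower bounds for grades D, C, B, A; anything below 60 is F.
CUTOFFS = [60, 70, 80, 90]
GRADES = ["F", "D", "C", "B", "A"]

def grade_bin(mark):
    """Return the grade bin for a mark, e.g. 82 -> 'B'."""
    return GRADES[bisect_right(CUTOFFS, mark)]

def bin_marks(marks):
    """Group marks into one list per grade bin."""
    bins = {grade: [] for grade in GRADES}
    for mark in marks:
        bins[grade_bin(mark)].append(mark)
    return bins

def smooth_by_bin_means(marks):
    """Replace each mark with the mean of the bin it falls into."""
    bins = bin_marks(marks)
    means = {g: sum(v) / len(v) for g, v in bins.items() if v}
    return [means[grade_bin(m)] for m in marks]

# Hypothetical marks for eight students.
marks = [95, 82, 67, 58, 74, 90, 88, 45]
bins = bin_marks(marks)
# bins == {"F": [58, 45], "D": [67], "C": [74], "B": [82, 88], "A": [95, 90]}
smoothed = smooth_by_bin_means(marks)
# smoothed == [92.5, 85.0, 67.0, 51.5, 74.0, 92.5, 85.0, 51.5]
```

Eight distinct marks collapse into five bins, and after smoothing only five distinct values remain.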