MBIS623 Assignment 2: Data Modeling Issues in NCMSC Project Report

DATA MODELING: ISSUES SURROUNDING ENTITY RESOLUTION AND IDENTIFIER
MANAGEMENT ISSUES IN THE NCMSC 1
Data Modeling: Issues Surrounding Entity Resolution and Identifier Management in the NCMSC
Executive Summary
This report outlines the issues surrounding entity resolution and identifier management that may be encountered in the NCMSC project, in the context of reference and master data management (MDM), and following the rules set by the DAMA-DMBOK2. The NCMSC is a national New Zealand project aimed at managing cats through an information repository that traces cats and their owners, and at implementing measures to forestall situations where cats pose problems to the ecosystem. After reviewing the project, the following entity resolution and identifier management issues were identified: de-duplication vs fidelity enhancement, clustering vs snapping, confidences, mismatches of schema, performance, metrics, and privacy.
Table of Contents
Executive Summary
Introduction and Background
Discussion
Conclusion
References
Data Modeling: Issues Surrounding Entity Resolution and Identifier Management in the NCMSC
Introduction and Background
Data Modeling refers to the process of analyzing data objects and the relationships existing
between them and other data objects, and is usually the first step in designing databases. Data
Modeling entails first creating a conceptual model, then progressing to a logical model,
and finally, a physical schema (West, 2011). In data modeling and database systems development,
the data usually has master data. Master data refers to the uniform and consistent set of extended
attributes and identifiers that describe the core entities within a given data set (Umanath & Scamell,
2014). The management of this critical data is termed MDM (master data management). MDM
refers to the method used in the definition and management of critical data in an organization, and is
usually used in tandem with data integration to provide a single point of reference. Reference data
management and MDM provide context for transactional data; managing reference data and
master data makes it possible for organizations to better understand their operational data
as well as effectively evaluate disparately gathered data (Berson & Dubov, 2011).
Reference data management and MDM usually involve collecting non-transactional general data to
provide context for transactions, and to provide connection points for data located in different
tables, records, files, or other data storage formats. Reference data is used for categorizing and
classifying other data, while master data are the entities that offer business transaction contexts
(Allen & Dalton Cervo, 2015). In this report, the reference and MDM issues for the National Cat
Management Strategy Group (NCMSC) within the New Zealand Department of Conservation
(termed Predator Free 2050) are discussed. In performing reference and master data management,
it will be necessary to identify and merge the various NCMSC records that represent real-world
entities, a process known as Entity Resolution (ER). ER is a well-known problem in data
management and other applications, and some of the issues of ER, as well as identifier
management, that can afflict the NCMSC are discussed in this report. The discussion of reference
and MDM for the NCMSC is conducted in the context of the DAMA-DMBOK2 framework for
data management.
Discussion
The NCMSC is a nationally mandated process for de-sexing and microchipping domestic
cats during ownership transfers, as a component of an increased focus on responsible pet
ownership. The NCMSC also makes proposals on how stray cats should be managed, as well as
possible ways of enforcing cat curfews in areas of ecological sensitivity. This implies managing large
volumes of data for cats, their owners, and their locations, with each cat carrying a microchip for
automatic identification in case it strays. The DAMA-DMBOK2 framework for data management
is a comprehensive compilation of data management principles that defines best practices for IT
professionals, educators, executives, researchers, and knowledge workers. The framework is aimed
at better data management to ensure that the information systems of such organizations are well
matured. The DAMA-DMBOK2 framework gives guidelines and information on data governance,
data development, management of data architecture, database operations management,
management of data security, reference and master data management, data warehousing, and
content and document management (Mosley, Brackett, Earley & Henderson, 2010; Earley &
Henderson, 2018).
ER, as stated earlier, is a common problem in several applications (Mouloua & Mouloua,
2018); in the NCMSC project, the cat data and owner information can contain multiple entries
that represent a single physical address, although every record can differ, even slightly, such as by
having differing spellings or missing information. Identifying matching records, meaning records
representing the same location, owner, or cat, is a challenge since there are no unique identifiers
across the various record attributes, such as when the owners of cats have similar names but
different addresses, or different names for their cats. A given record entry, such as a cat's name,
may appear in different ways in every catalog or physical address, and there is a considerable
amount of guesswork in the determination of matching records (Weikum & Theobald, 2010).
Determining whether records match, for instance when one or more cats change ownership among
many people, is usually computationally 'expensive' because the process may entail finding
common maximal subsequences in two strings. How records can be merged, that is, how records
that match can be combined, is also an application-dependent process. For instance, different
owners of cats or different names of cats can appear in records that are to be merged; in some
cases, the NCMSC may wish or need to keep both of these records, while in other instances, there
may be a desire to pick just one name as the 'consolidated' name.
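To illustrate the guesswork involved, the following minimal Python sketch scores two owner strings with a longest-common-subsequence style ratio; the names and the 0.8 threshold are hypothetical illustrations, not NCMSC data or policy.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio (0.0-1.0) based on longest common matching blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Two hypothetical owner/address strings with slightly different spellings.
score = similarity("Jon Smith, 12 Kea St", "John Smith, 12 Kea Street")
is_match = score > 0.8  # the threshold is an assumption, tuned per attribute
```

In practice such a threshold would need tuning per attribute, since owner names, cat names, and addresses tolerate different amounts of variation.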
ER and identifier management (IM) in the context of the NCMSC is a generic database
problem; the approach to studying the issues surrounding its implementation is generic because the
internal details of the functions used in comparing and merging records are not considered.
Instead, the focus is on viewing those functions as 'black boxes' that the ER engine invokes
(Talburt, 2011). ER is among the reasons why MDM becomes so complex, and also the reason
why there are few out-of-the-box technical solutions available. Despite being a simple concept,
achieving it is very complex: the objective of ER, conceptually, is to recognize a specific entity
and represent it completely,
uniquely, and accurately from the perspective of the data (Fürber & Hepp, 2010). ER
is an important aspect of MDM since it addresses issues of multiple versions of the truth, and
entails techniques and processes for identifying and resolving many occurrences of entities across
several sources (Talburt, 2011). An entity is an individual, object, place, unit, or any item that
requires uniqueness within a specific domain. For instance, in the NCMSC domain, an owner can
be an individual or many people that own a cat, and so should be considered an entity. Domains,
as well as their related entities, are usually obvious to determine since they are core to the
operations of an organization (Kanchi, Sandilya, Ramkrishna, Manjrekar & Vhadgar, 2015). They
are intrinsically non-transactional, but are always associated with a specific transaction. Once
entities are determined, the next process requires finding and resolving the many occurrences and
several forms
of the entity. The steps required for this, and which are applicable to the NCMSC project, include:
- Recognizing master data sources, which entails identifying the systems that contain the master data to be gathered.
- Determining what attributes should be collected for a specific entity; this requires deciding which attributes of the master data will be held in the MDM hub.
- Defining the entity key attributes that will be used to identify an entity uniquely.
- Undertaking entity matching to link records and their associated attributes to the same entity.
- Identifying the system of record, which entails defining the best sources of given attributes. The system of record is the authoritative data source for a specified data element.
- Compiling the business data quality rules for the NCMSC project; a set of rules must be assembled for cleansing, standardizing, and ensuring the best information survives.
- Applying the business and data quality rules based on the DAMA-DMBOK2 guidelines; the previously collected rules will need to be executed.
- Assembling a golden record to create a single 'truth' version for the entity (Allen & Dalton Cervo, 2015).
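The steps above can be sketched as a minimal pipeline; the source systems, attribute names, the microchip number as entity key, and the 'longest value survives' survivorship rule are all hypothetical stand-ins for illustration, not the NCMSC's actual design.

```python
# Hypothetical master data sources, keyed on an assumed entity key attribute.
sources = {
    "vet_clinic": [{"chip_id": "900123", "cat": "Mittens", "owner": "J. Smith"}],
    "council":    [{"chip_id": "900123", "cat": "mittens", "owner": "John Smith"}],
}

KEY_ATTRIBUTE = "chip_id"  # microchip number assumed as the unique entity key

def build_golden_records(source_systems):
    """Group records by the key attribute; keep the longest value per field
    as a crude 'best information survives' survivorship rule."""
    merged = {}
    for records in source_systems.values():
        for rec in records:
            golden = merged.setdefault(rec[KEY_ATTRIBUTE], {})
            for field, value in rec.items():
                if len(str(value)) > len(str(golden.get(field, ""))):
                    golden[field] = value
    return merged

golden = build_golden_records(sources)
```

A real MDM hub would replace the length heuristic with the compiled data quality rules and a designated system of record per attribute.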
In undertaking the NCMSC project, these steps may not be followed wholly, but the ER
process is reusable and remains domain specific with respect to the classification of data, the
availability of reference data, data quality rules, business rules, entity identity, and data ownership.
For every entity, ER is very specific, but activities may overlap (Fidalgo, Alves, España, Castro &
Pastor, 2013). Data governance, as an example, has several overlapping procedures and practices
across several domains. However, efforts to govern several or multiple domains do not increase at the
same rate as efforts to solve ER for them. For each domain, ER is unique; therefore, efforts to
execute ER and IM are expected to grow linearly as more domains are added
(Riedel, Yao & McCallum, 2010).
The ER and IM problems that the NCMSC project may face are discussed below.
De-duplication vs Fidelity Enhancement
The de-duplication problem is likely to be encountered in the NCMSC project; in this
problem, there is a single record set, and attempts are made to merge the records representing the
same real-world entity (Rossetti & Bilbro, 2018). For instance, a cat owner's name may be
represented as a string in the database, but when the cat has two owners, there is an attempt to
merge the names in the cat's records. In the fidelity enhancement problem, there are two record
sets: a base set of records of interest and a newly acquired set of records (Xu, Wang & Yu, 2014).
The objective is to merge or coalesce all the new information into the base records (Snidaro,
García, Llinas & Blasch, 2016); for example, if one owner gives a cat up for adoption, the new
owner's data represents new information, and there will no doubt be attempts to add it to the base
records about the cat.
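A minimal sketch of the two problems follows; the record layout, the microchip key, and the merge rule (non-empty values win) are assumptions for illustration.

```python
# Hypothetical match rule: two records describe the same cat if the
# (assumed) microchip identifiers are equal.
same_chip = lambda a, b: a["chip_id"] == b["chip_id"]

def dedupe(records, same_entity):
    """De-duplication: one record set; merge records for the same entity."""
    merged = []
    for rec in records:
        for existing in merged:
            if same_entity(existing, rec):
                # keep existing values, fill in non-empty ones from rec
                existing.update({k: v for k, v in rec.items() if v})
                break
        else:
            merged.append(dict(rec))
    return merged

def enhance(base, new_records, same_entity):
    """Fidelity enhancement: fold newly acquired records into a base set."""
    for rec in new_records:
        for existing in base:
            if same_entity(existing, rec):
                existing.update({k: v for k, v in rec.items() if v})
                break
        else:
            base.append(dict(rec))
    return base

cats = dedupe([{"chip_id": "900123", "name": "Mittens"},
               {"chip_id": "900123", "name": ""}], same_chip)
cats = enhance(cats, [{"chip_id": "900123", "owner": "New Adopter"}], same_chip)
```

The structural difference is simply whether one set is folded into itself or a new set is folded into a trusted base.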
Clustering vs Snapping
Snapping enables two features to move so that their coordinates coincide; in structuring
data, given tolerances are specified so that features can be snapped together. Snapping is a
useful approach to aligning and cleaning up features, and it also helps in discovering errors, such
as duplications and narrow sections (Haruna, Hou, Eghan, Kpiebaareh & Tandoh, 2018). With
snapping, records are examined in a pair-wise manner and determinations made as to whether they
represent the same entity. If a pair of records represents the same entity, the two are merged into a
single record and the pair-wise comparison process continues. Clustering refers to the process of
grouping a set of objects in a way that objects within the same group (the cluster) have greater
similarities to each other than to objects in another group (cluster). Clustering is a major task in
exploratory data mining and is also used commonly in the statistical analysis of data (Xu, Wang &
Yu, 2014; Prince, 2009). In creating a database of cats, their owners, and geographic locations,
there will be a need to cluster records, and this is likely to pose a huge challenge. In clustering, all
records are evaluated and partitioned into groups that are believed to represent the same
real-world entities, and finally, every partition is merged into a single record (Teorey, Lightstone,
Nadeau & Jagadish, 2011). This represents a challenge for the project because it is difficult to
determine what factors are used to classify objects with small distances between them. As such,
clustering is likely to be a multi-objective optimization problem for the cat or their
owners' records, because the parameter settings for clustering, such as the distance functions to be
used, the expected number of clusters, and the density threshold, depend heavily on the individual
data sets and the intended use of the results. Further, given the iterative nature of clustering, which
entails knowledge discovery involving trials, errors, and failures, it will pose challenges for the
NCMSC because how the data is pre-processed must be modified (Ren et al., 2015). In addition,
the model parameters also have to be modified until the desired properties are achieved in the results.
Confidences
In putting data and various information into databases and records, specific measures are used
to establish association rules among various items. For instance, assume that the names of two
cats in a given geographic location are represented by X and Y; then, the confidence expresses the
likelihood that Y is owned by the same person when X (the cat name) is returned. Hence, an
association between two items is defined: for instance, when a cat named X is from a given
location, there is a likelihood that the same owner also owns cat Y. The confidence is the
proportion of transactions containing X in which Y also appears (Ceroni & Fisichella, 2014). In
some entity resolution scenarios, confidences must be managed effectively. For example, input
records can have a confidence value that represents the likelihood that the records are true.
Confidences are also found in snap rules, which show when there is a match between two records;
their confidences indicate the likelihood of two records representing a single real-world entity
(de Melo & Weikum, 2010). As such, while records for the NCMSC are being merged, their
confidences must also be tracked, and this represents another challenge for developing the data
for the NCMSC project.
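A small sketch of how such an association-rule confidence could be computed; the ownership transactions are hypothetical.

```python
# Each hypothetical transaction is the set of cat names tied to one owner record.
ownership_records = [
    {"X", "Y"}, {"X", "Y"}, {"X"}, {"Y"}, {"X", "Y"},
]

def confidence(records, antecedent, consequent):
    """Proportion of transactions containing the antecedent in which the
    consequent also appears."""
    with_antecedent = [r for r in records if antecedent in r]
    if not with_antecedent:
        return 0.0
    return sum(consequent in r for r in with_antecedent) / len(with_antecedent)

conf_x_y = confidence(ownership_records, "X", "Y")  # 3 of the 4 X-records contain Y
```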
Mismatches of Schema
Schema mismatches happen when the schemas in maps differ from the schema of the data read
from a table or file by a database system at connection time. In some cases of ER, discrepancies
among different sources' schemas must be dealt with, apart from resolving information
about entities (An, Hu & Song, 2010). For instance, the attribute formats and names from a given
source may fail to match those from other sources, and this will also pose challenges for the
NCMSC project.
There are four main settings for schema mismatches, and these include:
- Error: The mismatch is treated by the system as an error, and so the map is not run, enabling the problem to be assessed. This is normally the default option and enables method selection based on each case.
- Use-Map: The schema from the map is used by the system; this option should be used if there is certainty that the selected map schema is the correct one.
- Use-connection_map_by_position: The schema from the connection is used by the system, and it attempts to map fields based on their position in the schema structure.
- Use_connection_map_by_name: The schema from the connection is used by the system, and an attempt is made to generate mappings based on field names.
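The last two settings can be sketched as follows; the field names, the source row, and the synonym table are all assumptions for illustration, not a real connector's API.

```python
# Target schema expected by the (hypothetical) map, and a source row whose
# field names do not match it.
map_schema = ["owner_name", "cat_name", "chip_id"]
connection_row = {"name": "J. Smith", "cat": "Mittens", "chip": "900123"}

def map_by_position(row, target_fields):
    """Pair target fields with source values in schema order."""
    return dict(zip(target_fields, row.values()))

def map_by_name(row, target_fields, synonyms):
    """Pair target fields with source values via an assumed synonym table."""
    return {field: row.get(synonyms.get(field, field)) for field in target_fields}

by_position = map_by_position(connection_row, map_schema)
by_name = map_by_name(connection_row, map_schema,
                      {"owner_name": "name", "cat_name": "cat", "chip_id": "chip"})
```

Mapping by position silently misassigns fields if the source column order changes, which is why name-based mapping is usually safer when field names are trustworthy.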
Performance
Performance is another issue that will impact the NCMSC project; ER algorithms have to
perform very large numbers of record or field comparisons through the user-provided functions.
As such, it is critical that the number of invocations of the comparison functions be kept to a
minimum. The development of efficient ER algorithms is similar to the development of efficient
join algorithms for use in relational databases (Dominguez-Sal et al., 2010), and this is likely to be
a challenge to be solved for the cat project.
Manipulation of Confidences
There is very little understanding of how confidences ought to be manipulated in the ER
setting (Fuller, Murthy & Schafer, 2010); for example, in the NCMSC project, a record can say that
John owns cat EX and lives at address 123 with a confidence of 0.9, while Joe owns cat EYY and
lives at address 456 with a confidence of 0.7. The relevant snap rule indicates that John and Joe are
the same individual with a confidence of 0.8; so, can it be assumed that the person has two addresses
and owns two different cats? Or that the correct address is 123, since that record is associated with a
higher confidence? If the records are merged, what would be the resulting confidences?
Metrics
This is also going to be an issue in the NCMSC project; if there are two resolution schemes,
namely A and B, how can it be determined that better results are yielded by A and not by B
(Vatsalan, Christen & Verykios, 2013)? Therefore, metrics are bound to be another unique challenge
in the project, and they can be addressed through the development of metrics that quantify not only
the performance of the ER but its accuracy as well.
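Pairwise precision, recall, and F1 are common accuracy metrics for comparing resolution schemes; the sketch below scores a scheme's merged pairs against a hypothetical ground-truth set of true matches.

```python
def pairwise_metrics(predicted_pairs, true_pairs):
    """Precision/recall/F1 over the pairs a scheme decided to merge."""
    predicted, truth = set(predicted_pairs), set(true_pairs)
    tp = len(predicted & truth)                       # correctly merged pairs
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

truth = {("r1", "r2"), ("r3", "r4")}                  # assumed ground truth
scheme_a = {("r1", "r2"), ("r5", "r6")}               # one correct, one spurious
precision, recall, f1 = pairwise_metrics(scheme_a, truth)
```

Scheme A could then be compared with scheme B by computing the same three numbers for each over a labelled sample.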
Privacy
ER has a strong correlation with information privacy; if John, for example, has two records
containing his private information, where Record 1 has his name, address, and phone number while
Record 2 has his workplace and age, the total information that has 'leaked' or become public
knowledge depends on how well someone, say Dick, is able to piece together this data. If it can
be determined that the records refer to the same person, then John's phone number and
workplace are known, and this can open the door to hacking or identity theft by Dick. However, if
the records are unable to snap together, then Dick will have less information about John, and so a
smaller leak is expected. As such, a better model must be developed that ensures minimal
information leakage in the ER context, such that the leakage from releasing one fact, or the
decrease in leakage from releasing disinformation, can be quantified (Vatsalan, Christen &
Verykios, 2013).
Conclusion
Data Modeling is the process of analyzing data objects and the relationships existing
between them and other data objects. In managing data, the data usually has master data, which
refers to the uniform and consistent set of extended attributes and identifiers that describe the core
entities within a given data set. The master data is critical, and its management is termed master
data management (MDM). In evaluating the NCMSC, this report has identified various issues
surrounding the project in the context of entity resolution and identifier management, based
on the DAMA-DMBOK2. (The NCMSC is a national New Zealand project
aimed at managing cats by tracing cats and their owners and implementing measures to
forestall situations where cats pose problems to the ecosystem.) These issues include:
de-duplication vs fidelity enhancement, clustering vs snapping, confidences, mismatches of schema,
performance, metrics, and privacy. These issues ought to be resolved and considered from a holistic
perspective because ER and IM issues make reference and master data management highly complex.
References
Allen, M., & Dalton Cervo. (2015). Multi-Domain Master Data Management. Amsterdam,
Netherlands: Elsevier Science.
An, Y., Hu, X., & Song, I. (2010). Maintaining Mappings between Conceptual Models and
Relational Schemas. Journal Of Database Management, 21(3), 36-68. doi:
10.4018/jdm.2010070102
Berson, A., & Dubov, L. (2011). Master data management and data governance (1st ed.). New
York: McGraw-Hill.
Ceroni, A., & Fisichella, M. (2014). Towards an Entity–Based Automatic Event Validation. Lecture
Notes In Computer Science, 1, 605-611. doi: 10.1007/978-3-319-06028-6_64
de Melo, G., & Weikum, G. (2010). Towards Universal Multilingual Knowledge Bases. Retrieved
from https://domino.mpi-inf.mpg.de/intranet/ag5/ag5publ.nsf/0/B0A06C4D6FE8CFD4C125781A00542880/$file/demelo-mlkb-gwc2010.pdf
Dominguez-Sal, D., Urbón-Bayes, P., Giménez-Vañó, A., Gómez-Villamor, S., Martínez-Bazán, N.,
& Larriba-Pey, J. (2010). Survey of Graph Database Performance on the HPC Scalable
Graph Analysis Benchmark. Web-Age Information Management, 1, 37-48. doi: 10.1007/978-3-642-16720-1_4
Earley, S., & Henderson, D. (2018). The Data Management Body of Knowledge (2nd ed.). Bradley
Beach, New Jersey: Technics Publications.
Fidalgo, R., Alves, E., España, S., Castro, J., & Pastor, O. (2013). Functional modeling of
professional activity’s basic elements in frame of Entity-Relationship model. Journal Of
Information And Data Management, 4(3), 406–420. Retrieved from
https://pdfs.semanticscholar.org/87c1/d9ca87d9b2f8e37be3c14e412821f27fc68d.pdf
Fürber, C., & Hepp, M. (2010). Using SPARQL and SPIN for Data Quality Management on the
Semantic Web. Business Information Systems, 1, 35-46. doi: 10.1007/978-3-642-12814-1_4
Fuller, R., Murthy, U., & Schafer, B. (2010). The effects of data model representation method on
task performance. Information & Management, 47(4), 208-218. doi:
10.1016/j.im.2009.06.008
Haruna, C., Hou, M., Eghan, M., Kpiebaareh, M., & Tandoh, L. (2018). A Hybrid Data
Deduplication Approach in Entity Resolution Using Chromatic Correlation
Clustering. Communications In Computer And Information Science, 1, 153-167. doi:
10.1007/978-981-13-3095-7_12
Kanchi, S., Sandilya, S., Ramkrishna, S., Manjrekar, S., & Vhadgar, A. (2015). Challenges and
Solutions in Big Data Management -- An Overview. 2015 3Rd International Conference On
Future Internet Of Things And Cloud, 1. doi: 10.1109/ficloud.2015.121
Mosley, M., Brackett, M., Earley, S., & Henderson, D. (2010). The DAMA guide to the data
management body of knowledge. Bradley Beach, New Jersey: Technics Publications.
Mouloua, M., & Mouloua, M. (2018). Automation and Human Performance (Chapter 11). Boca
Raton: Routledge.
Prince, V. (2009). Information retrieval in biomedicine (1st ed.). Hershey, PA: Medical Information
Science Reference.
Ren, X., El-Kishky, A., Wang, C., Tao, F., Voss, C., & Han, J. (2015). ClusType. Proceedings Of
The 21Th ACM SIGKDD International Conference On Knowledge Discovery And Data
Mining - KDD '15, 1. doi: 10.1145/2783258.2783362
Riedel, S., Yao, L., & McCallum, A. (2010). Modeling Relations and Their Mentions without
Labeled Text. Machine Learning And Knowledge Discovery In Databases, 148-163. doi:
10.1007/978-3-642-15939-8_10
Rossetti, K., & Bilbro, R. (2018). Basics of Entity Resolution with Python and Dedupe - District
Data Labs - Medium. Retrieved from https://medium.com/district-data-labs/basics-of-entity-resolution-with-python-and-dedupe-bc87440b64d4
Snidaro, L., García, J., Llinas, J., & Blasch, E. (2016). Context-Enhanced Information Fusion (1st
ed., p. 180). Basel: Springer Link.
Talburt, J. (2011). Entity resolution and information quality. San Francisco CA: Morgan
Kaufmann/Elsevier.
Teorey, T., Lightstone, S., Nadeau, T., & Jagadish, H. (2011). Requirements Analysis and
Conceptual Data Modeling. Database Modeling And Design, 55-84. doi: 10.1016/b978-0-12-382020-4.00004-5
Umanath, N., & Scamell, R. (2014). Data modeling and database design (2nd ed., p. 23). Boston,
MA: Cengage Learning.
Vatsalan, D., Christen, P., & Verykios, V. (2013). A taxonomy of privacy-preserving record linkage
techniques. Information Systems, 38(6), 946-969. doi: 10.1016/j.is.2012.11.005