Data Mining Assignment 1.1 MIT4204: Business and Technical Aspects


Added on 2023/04/23

AI Summary

This assignment delves into the core concepts of data mining, beginning with a definition and an assessment of its significance in today's technological landscape. It explores the steps involved in knowledge discovery, emphasizing the evolution of database technology and its impact on data mining. The assignment then presents a business-centric example, highlighting the crucial role of data mining in modern business strategies, along with the data mining functions needed and the architecture for a data mining system within a university setting. The assignment also contrasts data warehouses and databases, providing a detailed overview of different database types, including object-oriented, spatial, text, and multimedia databases. It covers various data mining functions such as characterization, discrimination, association, classification, clustering, and data evolution analysis. Furthermore, the assignment explores primitive data mining tasks, the concept of hierarchies, and clustering techniques, along with a discussion on spatiotemporal data streams and their applications, challenges, and potential solutions.

Running head: DATA MINING
ASSIGNMENT 1.1- MIT4204-DATA MINING
Name of the Student
Name of the University
Author Note

1. What is data mining?
Data mining is the process of discovering patterns in large data sets, drawing on
statistics, database systems and machine learning. It includes analysing the data and
summarizing it to support an effective strategy. In short, data mining is the method
by which interesting knowledge is extracted from huge amounts of data.
a) Data mining is not just another hype. The demand for it has grown with the wide
availability of data sets and the need to convert this data into useful
knowledge. Thus it can be stated that data mining is a result of the evolution of
information technology.
b) Data mining is more than a simple transformation of technology developed from
databases, statistics and machine learning. It involves integrating techniques
from several disciplines rather than simply transforming them. These include
image and signal processing, spatial data analysis, neural networks, pattern
recognition, high performance computing and statistical analysis.
c) Database technology developed alongside methods for creating databases and
collecting data. This led to efficient mechanisms for data management, which
covers data storage and retrieval as well as query and transaction processing.
Many database systems offer query and transaction processing, which in turn
created a need for better data analysis. Thus it can be stated that data mining
developed in order to meet this requirement.
d) Data mining is one step in a larger process of knowledge discovery. The steps of
this process are as follows:

• Data cleaning: this stage removes noise and unwanted data.
• Data integration: this stage combines multiple data sources in order to
obtain a proper data set.
• Data selection: this stage retrieves from the database the data relevant to
the analysis.
• Data transformation: this stage transforms the data into forms that are
appropriate for mining.
• Data mining: this is the essential stage, in which methods are applied to
extract patterns from the data.
• Pattern evaluation: in this stage the interesting patterns are identified
based on some interestingness measures.
• Knowledge presentation: this stage uses visualization and knowledge
representation techniques to present the mined knowledge to the user.
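The steps above can be illustrated with a minimal sketch in Python. The records, the
field names (student, major, gpa), the GPA cut-off and the support threshold are all
hypothetical and chosen only for illustration. It shows one possible arrangement of
the stages under those assumptions, not a definitive implementation.

```python
from collections import Counter

# Hypothetical raw data from two sources; all names and values are invented.
source_a = [
    {"student": "s1", "major": "biology", "gpa": "3.6"},
    {"student": "s2", "major": "physics", "gpa": None},  # missing value
    {"student": "s3", "major": "biology", "gpa": "3.1"},
]
source_b = [
    {"student": "s4", "major": "biology", "gpa": "2.4"},
    {"student": "s5", "major": "physics", "gpa": "3.9"},
]


def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


def integrate(*sources):
    """Data integration: combine several sources into one data set."""
    return [r for source in sources for r in source]


def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: turn the GPA text into a 'high'/'low' category."""
    return [
        {"major": r["major"],
         "gpa_level": "high" if float(r["gpa"]) >= 3.0 else "low"}
        for r in records
    ]


def mine(records):
    """Data mining: count how often each (major, gpa_level) pair occurs."""
    return Counter((r["major"], r["gpa_level"]) for r in records)


def evaluate(patterns, total, min_support):
    """Pattern evaluation: keep patterns whose support meets the threshold."""
    return {p: n / total for p, n in patterns.items() if n / total >= min_support}


def present(patterns):
    """Knowledge presentation: render the patterns as readable lines."""
    return [
        f"{major}, {level} GPA: support {s:.0%}"
        for (major, level), s in sorted(patterns.items())
    ]


data = transform(
    select(integrate(clean(source_a), clean(source_b)), ["major", "gpa"])
)
interesting = evaluate(mine(data), total=len(data), min_support=0.5)
report = present(interesting)
```

Each stage is a small function, so a stage can be replaced (for example, a different
cleaning rule) without touching the others.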
2. A) Data mining plays a major role in the development of a business. For example, a
business that sells items and provides services uses data mining to gain an
advantage in the market. This type of business requires both customer
profiling and cross-market analysis. Some of this knowledge can be gathered
with the help of data query processing. However, that requires considerable manual
work from expert market analysts, who must both understand the queries and
manage a huge amount of data.
B) Several data mining architectures are available, and choosing a suitable one helps
in developing the application effectively. An architecture that is suitable for
this application needs some necessary components, which are as
follows:

A database or data warehouse, made up of the set of databases and spreadsheets that
store the information about students and courses.
A database or data warehouse server that fetches the relevant data from the system
based on the request made by the user at the time of data mining.
A knowledge base that holds the domain knowledge used to guide the search for
interesting patterns.
A data mining engine that contains a set of functional modules for performing
tasks such as classification, cluster analysis and evolution analysis.
A pattern evaluation module that works in tandem with the data mining modules and
focuses the search on interesting patterns.
A graphical user interface that gives the user an effective, interactive way to work
with the system.
c) The main differences between a data warehouse and a database are as follows. A
database is a collection of interrelated data that represents the current status of
the data. Different databases tend to have different schemas. A database system
supports ad hoc queries and online transaction processing. A data warehouse, by
contrast, is a repository of information collected from multiple sources and stored
under a unified schema. It supports decision making and data analysis. The
similarity between a data warehouse and a database is that both are repositories of
valuable information, and both give the user the ability to store persistent data.
d) Object-oriented database: this is designed based on the object-oriented programming
paradigm (OOPP). Each item of data stored in the database is referred to as an
object, and objects are organized into classes that form a class hierarchy.

Spatial database: this consists of space-related data, which may be represented in the
form of vector data. Examples of spatial databases are VLSI chip design databases and
satellite image databases.
Text database: this contains text documents, usually in the form of long paragraphs.
Multimedia database: this stores image, audio and video data and is used by
applications such as voice mail systems, the World Wide Web and many more.
World Wide Web: this provides users with rich online information services. The
data objects are linked together, which facilitates interactive access.
e) Characterization: this is a summarization of the main characteristics of a target
class. Example: the characteristics of students can be summarized by generating a
profile of the university's students, which may include GPA and a large number of
courses taken.
Discrimination: this is a comparison between the general features of a
target class and the general features of objects from a contrasting class. For
example, students with a high GPA can be compared with students with a low GPA.
Association: this discovers association rules, which show attribute values that
frequently occur together. For example:
major(Z, “biology”) ⇒ owns(Z, “computer science”) [support = 12%, confidence = 78%]
In this case Z is a variable representing a student. Support is the share of all
students who satisfy both sides of the rule, and confidence is the share of
students satisfying the left side who also satisfy the right side.
Classification: this differs from prediction in that it is used to predict the
class label of data objects.

Clustering: this analyses data objects without consulting a known class label.
Each cluster that is formed can be viewed as a class of objects. Clustering also
facilitates taxonomy formation.
Data evolution analysis: this describes objects whose behaviour changes over
time.
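The support and confidence figures attached to an association rule can be computed
directly from the data. The sketch below assumes a tiny, invented set of student
records with two hypothetical attributes (major and owns_computer). It illustrates
the two measures only, not any particular rule mining algorithm.

```python
def support_and_confidence(records, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent.

    support    = share of all records that satisfy both sides
    confidence = share of records satisfying the antecedent
                 that also satisfy the consequent
    """
    matches_left = [r for r in records if antecedent(r)]
    matches_both = [r for r in matches_left if consequent(r)]
    support = len(matches_both) / len(records)
    confidence = len(matches_both) / len(matches_left) if matches_left else 0.0
    return support, confidence


# Invented records: each student has a major and a flag for owning a computer.
students = [
    {"major": "biology", "owns_computer": True},
    {"major": "biology", "owns_computer": True},
    {"major": "biology", "owns_computer": False},
    {"major": "physics", "owns_computer": True},
    {"major": "physics", "owns_computer": False},
]

support, confidence = support_and_confidence(
    students,
    antecedent=lambda r: r["major"] == "biology",
    consequent=lambda r: r["owns_computer"],
)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
```

Here two of the five students satisfy both sides of the rule, and two of the three
biology majors own a computer, which is where the two figures come from.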
f) Discrimination differs from classification in that discrimination compares the
general features of target class data objects with the general features of
contrasting objects, whereas classification builds models that distinguish classes.
The similarity between the two is that both analyse class data objects.
Characterization differs from clustering in that characterization summarizes the
general features of a target class, whereas clustering analyses data objects
without consulting a known class label. They are similar in that both are
concerned with groups of objects.
Classification differs from prediction in that classification finds a set of models
used to predict the class label of data objects, whereas prediction estimates
missing or unavailable data values.
3. A) The data mining task primitives are as follows:
Task-relevant data: this primitive specifies the portion of the data on which data
mining is to be performed. This includes specifying the database and tables, or the
data warehouse, that hold the data. It also covers the conditions for selecting the
relevant data, the attributes relevant to the mining task and the dimensions to be
explored. This allows the user to retrieve only the data of interest.
Knowledge type to be mined: this primitive specifies the data mining functions
to be performed, such as discrimination, association, clustering, classification or
evolution analysis. The user can also be more specific by providing pattern
templates that the discovered patterns must match.
Background knowledge: this allows users to specify knowledge about the domain that
guides the mining process and helps to separate uninteresting patterns from useful
knowledge. Concept hierarchies are one common kind of background knowledge.
Pattern interestingness measures: these help the user to differentiate between
interesting and uninteresting patterns. Because data mining can generate a large
number of patterns, the measures make it easier to pick out the important
knowledge. Interestingness is measured with the help of characteristics that
include simplicity, utility, novelty and certainty.
Visualization of discovered patterns: this primitive refers to the form in which
the discovered patterns are displayed. A proper data mining system displays the
discovered patterns in forms such as rules, tables or pie charts.
B) A concept hierarchy defines a sequence of mappings between different concept
levels. The mapping is done from a set of lower-level concepts to higher-level
concepts. Concept hierarchies are useful for data mining because they allow
knowledge to be discovered at multiple levels of abstraction. The knowledge obtained
from data mining can further be generalized or specialized. Together, these
operations help users to view data from different perspectives and to gain further
information that is hidden within the relationships in the data. Knowledge can be
mined interactively at different levels of abstraction. Interactive mining uses OLAP
operations on a data cube and allows users to focus on searching for patterns,
refining their data mining requests based on the returned results. Users can
interactively view the data and discover patterns from different angles.
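A concept hierarchy can be sketched as a mapping from lower-level to higher-level
concepts. The example below assumes a hypothetical location hierarchy (city to
country) and invented sales figures. Rolling up generalizes the data to the country
level, and drilling down specializes it again.

```python
from collections import defaultdict

# Hypothetical concept hierarchy: city (lower level) -> country (higher level).
city_to_country = {
    "Vancouver": "Canada",
    "Toronto": "Canada",
    "Chicago": "USA",
    "New York": "USA",
}

# Invented facts at the city level: (city, units sold).
sales_by_city = [
    ("Vancouver", 120),
    ("Toronto", 80),
    ("Chicago", 200),
    ("New York", 150),
]


def roll_up(facts, mapping):
    """Generalize facts to the next level of the hierarchy and sum them."""
    totals = defaultdict(int)
    for low_level, value in facts:
        totals[mapping[low_level]] += value
    return dict(totals)


def drill_down(facts, mapping, high_level):
    """Specialize: list the lower-level facts behind one higher-level concept."""
    return [(low, value) for low, value in facts if mapping[low] == high_level]


sales_by_country = roll_up(sales_by_city, city_to_country)
canada_detail = drill_down(sales_by_city, city_to_country, "Canada")
```

The same facts are thus viewed at two levels of abstraction: per country after the
roll-up, and per city within one country after the drill-down.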
C) Clustering techniques: after a clustering technique is applied, the different clusters
represent the different kinds of data. Outliers are the data points that do not fall into
any cluster.
Prediction technique: if the predicted value of a data point differs greatly from its
given value, the given value can be expected to be an outlier.
The clustering technique is more reliable for detecting outliers. Because clustering is
unsupervised, there is no need to assume anything about the data distribution. Thus it can
be stated that the clustering technique provides the better solution.
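A minimal sketch of the clustering approach to outlier detection follows. It assumes
one-dimensional data and uses a very simple gap-based grouping as a stand-in for a
real clustering algorithm. The parameters `max_gap` and `min_size` are hypothetical:
values no further apart than `max_gap` join the same cluster, and members of clusters
smaller than `min_size` are reported as outliers.

```python
def gap_clusters(values, max_gap):
    """Group sorted values; a gap larger than max_gap starts a new cluster."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters


def find_outliers(values, max_gap, min_size):
    """Treat members of clusters smaller than min_size as outliers."""
    clusters = gap_clusters(values, max_gap)
    return [v for c in clusters if len(c) < min_size for v in c]


# Invented readings: two dense groups and one isolated value.
readings = [10.1, 10.4, 9.8, 10.0, 25.0, 30.2, 30.5, 29.9, 30.0]
clusters = gap_clusters(readings, max_gap=1.0)
outliers = find_outliers(readings, max_gap=1.0, min_size=2)
```

No assumption about the distribution of the readings is needed. The only inputs are
the two thresholds, which would have to be chosen for the data at hand.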
4. a) Three application examples for spatiotemporal data streams are:
Climate images obtained from satellites
Data used for describing natural phenomena
Sequences of sensor images used to identify geographical regions
b) The knowledge that can be mined depends entirely on the application. One kind of
knowledge that can be gathered is the pattern of change in the stream data. For
example, changes in humidity may reveal patterns that signal the formation of a new
typhoon.
c) Challenges:
Managing large-scale data
Some patterns occur only over a long period of time
Sometimes the sensed spatial data may not be accurate
Thus the mining method needs a high tolerance for noise

d) To sketch the method, the mining of space images is taken as the example. The aim is to
determine whether any new planet has appeared. This is an example of a change detection
problem. The image frames arrive as a sequence f1, f2, ..., ft, ft+1. The algorithm for
this is as follows:
Match the planets in ft+1 with the planets in ft
Detect any unmatched planets
If there are any, report the appearance or disappearance of a planet
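The change detection steps above can be sketched as follows. The sketch assumes that
each frame has already been reduced to a set of object identifiers (how planets are
recognized in an image is outside its scope) and that each frame is compared with the
one immediately before it. The frame contents are invented.

```python
def detect_changes(previous_frame, current_frame):
    """Compare two frames, each given as a set of object identifiers.

    Returns the objects that appeared and the objects that disappeared.
    """
    appeared = current_frame - previous_frame     # unmatched in the new frame
    disappeared = previous_frame - current_frame  # unmatched in the old frame
    return appeared, disappeared


def monitor(frames):
    """Process frames in arrival order, keeping only the previous frame."""
    reports = []
    previous = None
    for t, frame in enumerate(frames, start=1):
        if previous is not None:
            appeared, disappeared = detect_changes(previous, frame)
            if appeared or disappeared:
                reports.append((t, sorted(appeared), sorted(disappeared)))
        previous = frame
    return reports


# Invented stream of frames f1, f2, f3, f4.
frames = [
    {"p1", "p2"},
    {"p1", "p2"},
    {"p1", "p2", "p3"},  # p3 appears in f3
    {"p2", "p3"},        # p1 disappears in f4
]
reports = monitor(frames)
```

Only the previous frame is kept in memory, which suits a stream in which frames keep
arriving and cannot all be stored.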
5.
a) No coupling: the data mining system uses flat files as its source of data sets
and does not use any function of a database system. Thus it can be said that this
architecture is a poor choice.
Loose coupling: the data mining system uses some facilities of a database or data
warehouse system, such as fetching data from it, but is otherwise separate from it.
Thus the architecture takes advantage of the flexibility those systems provide.
However, scalability is difficult to achieve.
Semitight coupling: a few essential data mining primitives are implemented within the
database or data warehouse system. This enhances the performance of the data mining
system.
Tight coupling: the data mining system is fully integrated into the database or data
warehouse system, which includes data mining query processing.
b) From the architectures above it can be stated that tight coupling is the best from
a technical point of view. However, as demands change, the coupling process becomes
difficult to understand. Thus the most useful architecture is semitight
coupling, as it is a compromise between loose and tight coupling.
6.

a) This situation would infringe my rights. The organization obviously has the
liberty to track the finances of its business, but it does not have any right to
track my debit card transaction patterns.
b) Another situation that might affect people's privacy is organizations recording
the names of family members and their relationship status during registration
for various types of activities.
c) A privacy-preserving data mining technique that the bank can use is to give
customers the option to provide feedback with each transaction. The bank can then
obtain the desired results from data analysis that takes place with the consent of
the customer.
d) Tracking weather patterns and cloud formation can be of great help to
society. Alerts for incoming storms would help society to a great
extent.
7.
a) Data mining can raise issues such as discrimination. Various analyses
provide data on gender discrimination within an organization or in the field of
marketing.
b) The most challenging problem in data mining is data mining in network settings.
Yes, progress in this field can be made by using better filtering and artificial
intelligence techniques that filter out the data that compromises privacy.
c) Developing domain-specific data mining solutions should be a focus, as this
would filter out unwanted data and provide efficient results to users.

8. a) In the data mining process, different users are interested in different kinds of
knowledge, and the data used varies from user to user. The main challenges faced in data
mining are as follows. Privacy and security become difficult to maintain. Huge sets of
complex data are difficult to manage. Distributed data needs to be stored carefully. It is
also difficult to extract knowledge from huge amounts of data effectively, so it is
important to have proper and efficient data mining algorithms. Several issues also arise
from the diversity of database types. Apart from this, data mining faces issues regarding
performance.
b) Data streams: data stream analysis presents multiple challenges. First, data streams
flow continuously in and out of the mining process. The analysis system that processes the
data needs to be designed carefully so that it can work in real time and adapt to the
changing patterns that are likely to emerge in future. Apart from this, one of the major
challenges is the size of the data stream, which affects the mining process.
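One common way to cope with the size of a stream is to keep only a bounded window of
recent items rather than the whole stream. The sketch below illustrates that idea
under stated assumptions: the humidity readings are invented and the window size is a
hypothetical parameter. It computes a moving average in a single pass.

```python
from collections import deque


def sliding_averages(stream, window_size):
    """Yield the average of the most recent items in one pass, bounded memory."""
    window = deque(maxlen=window_size)  # the oldest item is dropped when full
    total = 0.0
    for value in stream:
        if len(window) == window.maxlen:
            total -= window[0]          # this item is about to be evicted
        window.append(value)
        total += value
        yield total / len(window)


# Invented readings; any iterable works, including a generator that never ends.
readings = iter([60, 62, 64, 90, 92])
averages = list(sliding_averages(readings, window_size=3))
```

Memory use depends only on `window_size`, not on how many items have arrived, and a
sudden rise in the average is one simple signal of a changing pattern.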
Bioinformatics: bioinformatics is a field that encompasses many subfields. These
subfields include molecular biology, genomics, proteomics and cheminformatics.
Each of these fields has its own properties and challenges that need to be understood
and identified. The major challenges for data mining in the field of bioinformatics
include the difficulty of analysing huge sets of data, the process required to
store the information, and many more.
