B9DA103 Data Mining: CRISP-DM Critique & Big Data Mining Analysis

Added on 2023/04/23

AI Summary

This assignment provides a comprehensive critique of the CRISP-DM model in the context of big data mining, referencing several related journal articles published after 2012. It highlights the model's shortcomings in clarity, evaluation, teamwork, and alignment with software engineering principles. The essay also presents a critical analysis of a big data mining problem domain, proposing appropriate data mining tools and techniques to meet an organization's business intelligence needs, along with measurable implementation success criteria. The analysis emphasizes the importance of addressing challenges such as data volume, data variety, and data management in realizing the potential benefits of big data mining.

Big Data Mining Process and Application 1
Big Data Mining Process and Application
Name
Institution

Big Data Mining Process and Application
Introduction
Data mining is the process of sorting through large datasets to establish relationships
and identify patterns. Computer scientists have widely used data mining tools to predict
future trends. The term data mining was coined in the 1990s, but the idea gained acceptance
in 1996, when organizations began to realize its value. In that year, the CRISP-DM model was
introduced by four leaders in the field, which encouraged the adoption of data mining.
“CRISP-DM Model: The New Blueprint for Data Mining” is an article by Colin Shearer that
organizes the data mining process into six phases. This paper has two parts. The first part
gives a critique of Shearer’s article as it applies to big data mining, including a review of
some related articles. The second part presents a critical analysis of a big data mining
problem domain.
Overview of the CRISP-DM article by Colin Shearer
Shearer’s article describes six phases of data mining. The first is the business
understanding phase, which requires understanding the project requirements from the
organization’s point of view and converting them into a data mining problem definition. In
this phase, one assesses the situation and produces the project plan. Data understanding, the
second phase, starts with initial data collection and proceeds to familiarization with the
organization’s data in order to discover initial insights and identify data quality problems.
Data preparation, the third phase, covers all the activities that construct the final dataset.
The fourth phase is modelling, in which various modelling techniques are selected and their
parameters calibrated to optimal values. The fifth phase is evaluation, which reviews the
construction of the model. The author states
that the role of this phase is to determine critically whether the essential business issues
have been sufficiently considered. The last phase is deployment, which may consist of
generating a report or of implementing the data mining process across the organization
(Shearer, 2000).
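The six phases described above can be read as an ordered sequence. The following is a minimal Python sketch of that reading and is not part of Shearer's article: the names `Phase`, `next_phase`, and `walk` are hypothetical, and the rule that a failed evaluation sends the project back to business understanding is an assumption added for illustration.

```python
from enum import Enum


class Phase(Enum):
    """The six CRISP-DM phases in the order the overview lists them."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELLING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


def next_phase(current, evaluation_passed=True):
    """Return the phase that follows `current`, or None after deployment.

    Assumption for illustration: a failed evaluation restarts the cycle
    at business understanding.
    """
    if current is Phase.EVALUATION and not evaluation_passed:
        return Phase.BUSINESS_UNDERSTANDING
    if current is Phase.DEPLOYMENT:
        return None
    return Phase(current.value + 1)


def walk(evaluation_results):
    """Visit phases in order, using one outcome per evaluation reached.

    Evaluations beyond the supplied outcomes are treated as passed.
    """
    outcomes = iter(evaluation_results)
    phase, visited = Phase.BUSINESS_UNDERSTANDING, []
    while phase is not None:
        visited.append(phase.name)
        passed = next(outcomes, True) if phase is Phase.EVALUATION else True
        phase = next_phase(phase, passed)
    return visited
```

Calling `walk([False, True])` visits the first five phases twice before reaching deployment, which is one way to picture the rework that a project incurs when evaluation finds that the business issue was not addressed.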
A review of the articles used
The first article used to critique the CRISP-DM model is by James Taylor, one of the
leading experts in analytic technology, whose achievements include building decision support
systems. In his article “Decisions Management Solutions”, Taylor begins with an overview of
the CRISP-DM model and then describes four issues with it: lack of clarity, mindless rework,
blind hand-off to IT, and failure to iterate.
The second article used in this paper is “Data Science methodology for cybersecurity
projects” by Farhad Foroughi and Peter Luksch, which discusses data mining from a
cybersecurity point of view. Specifically, the authors highlight the differences between the
Team Data Science Process (TDSP) and the CRISP-DM model, along with some of the issues with
CRISP-DM. The third article is “What is wrong with CRISP-DM, and is there an alternative?” by
Jen Stirrup. The author begins with an overview of the CRISP-DM and TDSP processes and
concludes by highlighting the issues with the CRISP-DM model.
Critique
To start with, according to Taylor (2017), Shearer’s article lacks clarity when compared
with current big data mining practices. Today, most companies, and even small businesses,
handle complex issues. Yet the article does not go into detail about business problems or
about how CRISP-DM analysis can help businesses. The team that implements a data mining
project is usually limited to business objectives, project goals, and some metrics that
measure success. This means that an appropriate data mining model ought to give a clear,
detailed analysis that can assist in big data mining.
Taylor (2017) goes on to state that if the model lacks clarity, the team has very few
options. Most of the time, data mining teams look for new data and new modelling techniques
rather than working with the organization or its business partners to re-evaluate the
business issue. Taylor also highlights that those who develop a model often do not engage IT
specialists on how the analytical needs of data mining are to be met, so the model is thrown
over the wall. This increases the cost and time of deployment and can result in a model that
never has a business impact. Lastly, the process fails to iterate: the model is not kept up
to date as business circumstances change (Taylor, 2017).
Second, the fifth phase, evaluation, is treated too lightly in the article. This phase
also needs to cover quality assurance. In addition, like the TDSP process model, the CRISP-DM
model needs to provide a dynamic framework in which the first phase not only defines the
business idea from the organization’s point of view but also identifies possible scenarios
and evaluates them, ending with a project plan for delivering the solution. The second phase,
which concentrates on data understanding, needs to perform data acquisition as well, as the
TDSP process does. This phase also has to include fact-finding and familiarization with big
data; the TDSP process is clearly outlined by Foroughi and Luksch (2018). In addition, the
modelling phase has to be verified against
the original business question; it must also add business value and be tested to
ensure that the model meets the initial business objectives (Foroughi & Luksch, 2018).
Third, the CRISP-DM model does not emphasize teamwork and collaboration throughout. The
model needs to recognize the importance of working as a team in order to meet data science
goals. Today, projects start from expectations, around which project managers plan work
cycles. Project managers therefore have to start with a short, simple model cycle so as to
get a basic model quickly, which then develops into a more complex and stronger model. The
CRISP-DM model does not take this into consideration. It focuses on a high-level view of
business goals but does not give an in-depth analysis of the data mining processes. What
organizations now look for is a model that provides a robust planning process. An example is
the TDSP process, which not only clearly outlines the data mining phases but also offers
useful guidance on standardized source control and backups, including the data mining tools
to be used.
Fourth, the analysis shows that data mining process models do not differ greatly from the
process models of software engineering (SE). They therefore ought to be structured in the
same framework as software engineering models, which in turn helps in deploying data mining
practices successfully. Doing so helps in first identifying the available data mining cycles
and then creating data mining life cycle processes and allocating resources to each of them.
An in-depth comparison of the CRISP-DM model with SE is given in Oscar’s article “A data
mining and knowledge discovery process model”. From that comparison, this paper found that
many of the processes defined in SE, which are essential to the development of any type of
data mining (DM) engineering project, are missing from the
CRISP-DM model. According to the article, this may be one of the reasons why the CRISP-DM
model has not been as effective as it ought to be. The article proposes that the tasks and
processes of CRISP-DM be taken and organized by process, as is done in software engineering
research. The parts missing from the CRISP-DM model are integral to the data mining process,
since they ensure the quality and completeness of the project and that organizational
processes are met. The article also notes that software engineering and CRISP-DM processes
may be the same while their practices and methods differ. In other areas the elements may
share the same goal but be implemented very differently. The CRISP-DM model also takes into
account only a negligible part of project management, namely the project plan of the first
phase, which confines itself to defining project milestones and deadlines. Any project that
includes data mining activities requires other project management activities as well, such as
resource, time, and budget control, and these have not been taken into account.
Lastly, in the CRISP-DM model, concept exploration is considered in the first phase.
Nevertheless, its tasks do not cover the activity extensively, as they focus on the
background of the problem and the terminology. In addition, at this phase the model produces
only a list of requirements; it does not describe a formal notation or tool, nor does it
describe how to translate the business requirements into data mining goals. This means that
the model needs to include tasks adapted from software engineering standards. The modelling
phase of the CRISP-DM model does not consider modelling procedures or methods as software
engineering does. In addition, the model does not consider knowledge importation in data
mining (Han & Kamber, 2017).
According to Jen Stirrup (2018), the CRISP-DM model is no longer maintained. Stirrup
first points to the CRISP-DM.org site, which is no longer maintained. Second, she notes that
the CRISP-DM framework has not been updated to address new technologies such as big data. She
also quotes Gregory Piatetsky, who stated that a replacement for CRISP-DM is long overdue,
and highlights that the CRISP-DM model neglects the aspect of decision making. Finally, she
sets out the advantages of the Team Data Science Process (TDSP) over the CRISP-DM model,
since TDSP aims to include big data technology as a data source (Stirrup, 2017).
Critical analysis of a Big Data mining problem domain
The big data mining problem has been highlighted by several computer engineers. As these
researchers indicate, the volume of data generated by organizations, non-profit sectors, and
public administrations has increased immeasurably. The data include textual and multimedia
content. As reported by Xhafa and Dobre (2014), the data produced by organizations every day
amounts to over 2 quintillion bytes. This has brought about the era of big data, a phenomenon
that computer engineers also refer to as the Data Deluge (Perner, 2018).
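To give a sense of scale, the daily figure quoted above can be converted into larger units. The short Python sketch below is illustrative arithmetic only: it assumes decimal (SI) units and treats "over 2 quintillion bytes" as a lower bound of exactly 2 x 10^18 bytes per day, and the helper name `bytes_to` is hypothetical.

```python
# Assumption: decimal (SI) units, so 1 exabyte = 10**18 bytes
# and 1 zettabyte = 10**21 bytes.
EXABYTE = 10**18
ZETTABYTE = 10**21

# "Over 2 quintillion bytes" per day, taken as a lower bound.
BYTES_PER_DAY = 2 * 10**18


def bytes_to(unit_size, n_bytes):
    """Express a byte count in units of `unit_size` bytes."""
    return n_bytes / unit_size


daily_exabytes = bytes_to(EXABYTE, BYTES_PER_DAY)
yearly_zettabytes = bytes_to(ZETTABYTE, BYTES_PER_DAY * 365)

print(f"{daily_exabytes:.1f} EB per day")
print(f"{yearly_zettabytes:.2f} ZB per year")
```

Under these assumptions the lower bound corresponds to 2 exabytes a day, or roughly 0.73 zettabytes over a 365-day year.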
Even though Big Data (BD) offers enormous benefits, there remains what researchers refer
to as a plethora of big data mining challenges that ought to be addressed in order to fully
realize the potential benefits of big data mining. Some of these problems are volume,
combining multiple data sets, data management, and data assertiveness.
To start with, the volume of organizational data is always exploding. Since 2000, data
has been growing in a geometric progression. According to a survey carried out by IBM, by 2020
organizations will be handling zettabytes. The problem here is how to handle such volumes of
data. Some organizations collect data from the Internet of Things, which makes their data
grow exponentially. Another issue is variety, one of the major problems with big data. The
data can be unstructured; most of the time big data incorporates many kinds of data, ranging
from XML to video to short messages. This raises the issue of data genuineness, which is not
straightforward to establish, especially when the data changes rapidly, as is the case with
big data. Volatility is another problem with big data; it refers to how long a data element
is supposed to remain valid. An organization needs to determine how long it has to store data
before the data ceases to be relevant. Visualization is another issue with big data: it uses
charts and graphs to present huge amounts of data, which is very complicated (Abbass, et al.,
2010).
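The volatility question, how long a data element stays valid, can be made concrete with a retention rule. The following is a minimal Python sketch under stated assumptions: the function `split_by_retention`, the `recorded_at` field, and the 30-day window are all hypothetical choices for illustration, not a recommendation from the cited sources.

```python
from datetime import datetime, timedelta


def split_by_retention(records, now, max_age):
    """Partition records into (still valid, expired) by age.

    A record counts as valid while its timestamp is no older than
    `max_age` relative to `now`.
    """
    cutoff = now - max_age
    valid = [r for r in records if r["recorded_at"] >= cutoff]
    expired = [r for r in records if r["recorded_at"] < cutoff]
    return valid, expired


# Invented readings, used only to exercise the function.
now = datetime(2020, 1, 31)
readings = [
    {"id": "a", "recorded_at": datetime(2020, 1, 30)},
    {"id": "b", "recorded_at": datetime(2020, 1, 1)},
    {"id": "c", "recorded_at": datetime(2019, 12, 1)},
]
valid, expired = split_by_retention(readings, now, timedelta(days=30))
```

With a 30-day window, readings `a` and `b` remain valid and `c` is expired. In practice the window would differ per data type, which is exactly the decision the organization has to make.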
Another issue related to big data is that most companies and organizations do not know
some of the basics of big data, for example what big data actually is and the infrastructure
that is needed. Without a clear understanding of these basics, an organization’s attempt to
benefit from big data is doomed to fail (Wu & Kumar, 2009).
Another issue related to data mining is heterogeneity. Data mining is a technique used to
discover relationships and unknown patterns, but big data comes in many different forms of
representation, and mining such data sets poses a great challenge and complexity for data
miners (Han & Kamber, 2017).
Data mining tools
One of the tools that can be used to address some of these issues is Apache Mahout, a
data mining tool developed by the Apache Software Foundation. The tool offers free
implementations of data mining techniques and of distributed algorithms. The tool also has a
wide range of data mining and machine learning algorithms, covering data classification,
frequent pattern mining, and clustering. The tool is also designed to offer statistical
computing and visualization graphics. Some data miners use this tool for data analysis
(Kantardzic, 2014).
Weka is another data mining tool; it can be applied either directly to a dataset or to a
database. The tool also includes data mining functions such as regression, visualization,
pre-processing, association rules, and classification. RapidMiner is another tool, developed
by the company of the same name to support the data mining phases. The tool also helps in the
optimization and validation of results. KNIME is another data mining tool, with a
user-friendly graphical interface. It is used for data transformation, data access, and
initial data investigation. Data miners have also employed it to perform predictive analysis
and data reporting (King, 2015).
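Association rules and frequent pattern mining, mentioned among the functions of these tools, rest on two simple measures: support and confidence. The sketch below is plain Python written for illustration; it is not the API of Weka, Mahout, or any other tool named here, the function names are hypothetical, and the shopping baskets are invented.

```python
from collections import Counter
from itertools import combinations


def pair_support(transactions):
    """Count how many transactions contain each unordered item pair."""
    counts = Counter()
    for basket in transactions:
        # Deduplicate and sort so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent -> consequent.

    support    = share of transactions containing both items
    confidence = share of antecedent transactions that also
                 contain the consequent
    """
    n = len(transactions)
    both = sum(1 for b in transactions if antecedent in b and consequent in b)
    ante = sum(1 for b in transactions if antecedent in b)
    support = both / n if n else 0.0
    confidence = both / ante if ante else 0.0
    return support, confidence


# Invented baskets, used only to exercise the functions.
baskets = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
    ["milk"],
]
support, confidence = rule_metrics(baskets, "bread", "milk")
```

For these baskets the rule bread -> milk has a support of 0.5 (two of four baskets) and a confidence of two thirds (two of the three baskets that contain bread). Production tools apply the same measures with pruning strategies that avoid counting every candidate.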
Other tools that organizations can adopt for data mining are Orange, DataMelt, ELKI, MOA,
and Rattle. Orange provides data miners with a rich collection of machine learning algorithms
for data pre-processing, modelling, classification, and clustering. The tool can also be used
for importing data and for dragging and dropping widgets. DataMelt is used by data miners as
a computational platform that offers scientific visualization, symbolic computation, and
statistical functions. Other features of this tool include linear regression, fuzzy
algorithms, interactive visualization with both 2D and 3D plots, analytical calculations,
curve fitting, and histograms (King, 2015). ELKI is a data mining tool licensed under AGPLv3.
It focuses mainly on cluster analysis and on collecting various data mining algorithms. Some
of the design goals of ELKI are completeness, scalability, and extensibility. The tool is
well suited to data mining, as its designers have optimized it for data mining practices.
MOA, another data mining tool commonly known as
Massive Online Analysis, is mainly used as stream data mining software; the tool has been
specifically designed to handle very large volumes of real-time data, and it also provides a
rich collection of machine learning algorithms. Stream data mining algorithms typically
require very fast computation without storing the whole dataset in memory, and they have to
get the data mining work done within a very limited time. MOA is well suited to these two
requirements. KEEL, short for Knowledge Extraction based on Evolutionary Learning, enables
one to import, edit, export, and visualize data in different file formats (Maloof, 2006).
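The two stream-mining requirements, fast updates and no need to hold the whole dataset in memory, can be illustrated with a one-pass statistic. The sketch below uses Welford's online algorithm for the running mean and variance; it is written in plain Python for illustration and is not MOA's API (MOA itself is a Java framework). The class name `RunningStats` is hypothetical.

```python
class RunningStats:
    """One-pass mean and sample variance (Welford's online algorithm).

    Only three numbers are kept in memory, however long the stream is.
    """

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the statistics in constant time."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:  # stands in for an unbounded stream
    stats.update(value)
```

Each call to `update` does a fixed amount of work and discards the value afterwards, so memory use stays constant while the stream grows. Stream classifiers and clusterers follow the same principle with richer summaries.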
These tools bring several benefits. First, they help in predicting future trends. Second,
they help in identifying customer habits. For example, by working with tools such as ELKI,
MOA, and Rattle, an organization is able to understand its customers’ behavior, as these
tools can handle information-acquisition techniques. Third, these tools help in decision
making, as they are able to provide regression reports (Zanasi, et al., 2012).
Techniques to meet an organization’s needs for business intelligence
To start with, the concept of business intelligence was first coined in 1967 by Vilensky,
an American professor. According to him, business intelligence denotes the collection and
processing of information and data. This means that business intelligence has a large impact
on the effectiveness and efficiency of an organization. Its two features are smart processing
and organizational learning, whereas its components are data marts and data sources.
The techniques used to meet an organization’s needs for business intelligence include
planning and direction, obtaining information from the database, processing information, and
analysis and production of information. The planning technique helps in deciding the starting and
subsequent appeals. The information processing technique helps in identifying and analyzing
the relationships between data elements. The analysis and production technique assists in the
production of diagrams, reports, and charts.
The benefit of these techniques is that they help an organization manage its customers
better. Other benefits are in-depth market analysis, higher levels of customer satisfaction,
identification of loyal customers, easier decision making, and early detection of risks.
Conclusion
From this paper, it is evident that CRISP-DM considers planning through to knowledge
deployment on the client side but disregards the software side, which manipulates the data.
The model also does not provide operation and support processes, which are a requirement in
current data mining practice: when a model is delivered, it is important to provide the
client with technical assistance and not only with results monitoring. From the first
section, this paper concludes that CRISP-DM remains a potential research area. In addition,
it is evident that we have entered the era of big data, which is characterized by
heterogeneous and diverse data sources. These data sources are also distributed,
decentralized, complex, and evolving. This means that without implementing the required data
mining tools, organizations will never reap the benefits of big data.
References
Abbass, H. A., Sarker, R. A. & Newton, C. S., 2010. Data mining: a heuristic approach. 2nd ed.
New York: Idea Group.
Foroughi, F. & Luksch, P., 2018. Data Science methodology in Cyber-security. Data Mining, I(1),
pp. 7-12.
Han, J. & Kamber, M., 2017. Data mining: concepts and techniques. 1st ed. Chicago: Elsevier
Press.
Kantardzic, M., 2014. Data mining: concepts, models, methods, and algorithms. Chicago: IEEE
Press.
King, R. S., 2015. Cluster analysis and data mining: an introduction. Data Mining, I(2),
pp. 79-134.
Maloof, M. A., 2006. Machine learning and data mining for computer... London: Springer.