Data Mining Software Evaluation: Criteria, Tools, and Analysis

Added on  2022/09/18

Report
AI Summary
This report offers a comprehensive evaluation of data mining software, with a specific focus on SAS Enterprise Miner (EM) and SPSS. The report begins with an overview of Decision Tree (DT) induction, detailing the algorithm's benefits, limitations, and practical applications in business. It explains DT growth, pruning techniques, and the importance of addressing overfitting. The core of the report centers on the evaluation criteria for data mining software, including functionality, usability, performance, and ancillary support. The report then provides a description of the features of EM and SPSS, followed by a comparative analysis of both software in relation to the specified evaluation criteria. The report concludes by providing insights into the best data mining software and its selection based on the criteria discussed.
Running head: EVALUATION CRITERIA FOR DATA MINING SOFTWARES. 1
EVALUATION CRITERIA FOR DATA MINING SOFTWARES.
Student’s Name
Professor’s Name
Institutional Affiliation
Date Due
Executive summary
Applying data mining algorithms requires powerful software tools. Because data mining results
are sensitive to the tool used, selecting and evaluating the available software demands
considerable effort. As technology evolves, many packages have been modified to meet users'
needs, and the number of packages available for data mining keeps rising, so settling on one
or two tools that satisfy a given user is a challenging task. Statistics is one of the fields
that relies heavily on data mining tools and software. Statistics involves collecting a wide
variety of data, which must be simplified before it can be interpreted. Complex statistical
data are synthesized, analyzed, and interpreted using SAS Enterprise Miner (EM) and the
Statistical Package for the Social Sciences (SPSS). Each of these packages has its strengths
and weaknesses; therefore, they need to be evaluated. The most effective evaluation criteria
for data mining software include, but are not limited to, performance, functionality,
usability, and ancillary support. These criteria are explained in this paper. SAS EM and SPSS
rely heavily on Decision Tree (DT) algorithms to execute their statistical functions.
Contents
Chapter 1: Introduction
Objectives of the study
Limitations of the study
Chapter 2: Overview of DT induction
Tree pruning
Important DTs algorithm
Chapter 3: Evaluation criteria
Functionality
Ancillary task support
Usability
Performance
Chapter 4: Description of the DT induction software
EM features for Data mining process
Chapter 5: Description of SPSS
The core functions of SPSS
The benefits of using SPSS
Chapter 6: Comparative Analysis in terms of Relevant Criteria
Chapter 7: Conclusion
References
Chapter 1: Introduction.
Data mining traces its roots to artificial intelligence, statistics, machine learning, and
database research. Advances in these fields have led to the invention of many data mining
tools. Mainframe programs were developed for statistical analysis, and a variety of
standalone, server-based, and web-based software is now used to analyze statistical data
sets. Data mining is central to knowledge discovery in databases (KDD). Data mining involves
the application of particular algorithms that produce a particular enumeration of models (or
patterns) across the data. KDD is the nontrivial procedure of identifying novel, valid,
potentially useful, and ultimately understandable patterns in data. This definition of KDD is
often used synonymously to define data mining. Most software tools complement the KDD field,
which has transformed and grown. Thus, it is sensible to ask which data mining software is
best placed to transform the market.
Meanwhile, it is a challenging task for business users to settle on the data mining software
that best meets their budget and utility needs. Choosing the wrong data mining software
wastes time, money, and personnel resources and can produce spurious results (Dušanka et al.,
2017). To mitigate these risks, there is a need to understand the data mining software
evaluation criteria. From these criteria, one can make an informed decision on the right data
mining tool depending on the nature of the data. This paper evaluates SAS Enterprise Miner
(EM) against the Statistical Package for the Social Sciences (SPSS). Both packages are
evaluated using usability, performance, functionality, and ancillary support criteria. In
solving classification and regression problems, both packages rely on Decision Tree (DT)
algorithms. A decision tree is a supervised learning algorithm that is easy to interpret
because of its tree structure. DT algorithms rely on human-readable and
understandable tree rules of the form "if … then …" to extract predictive information (Upadhyay, Pradesh
& Verma, 2019). The DT algorithms discussed in this paper include CART, C4.5, ID3, and
CHAID.
Objectives of the study
The main objectives of this paper are:
To describe the DT induction software
To evaluate SAS EM and SPSS using different evaluation criteria
To provide insight into the best data mining software.
The decision tree induction software packages covered in this paper are:
Statistical Package for the Social Sciences (SPSS)
SAS Enterprise Miner (EM)
Limitations of the study
The current study only highlights the evaluation criteria of data mining tools. Therefore,
it provides little information about the data mining software selection criteria.
The DT induction software discussed in this paper is limited to SAS EM and SPSS; therefore,
little information is provided on other data mining software.
Chapter 2: Overview of DT induction
One of the most useful supervised learning algorithms is the Decision Tree (DT). In
supervised learning, the behavior to be predicted is already known and the existing data
are already labeled. This differs from unsupervised learning, where there are no output
variables to guide the learning process and algorithms explore the data to find patterns.
Business organizations use DT algorithms to approximate customer lifetime
values and churn rates. DT algorithms are also incorporated in
autonomous vehicles, where they help recognize pedestrians (Quirynen, Berntorp &
Cairano, 2018). DT algorithms repeatedly divide data into smaller subsets based on
characteristic features until they reach sets that are small enough to be described by a
single label. DT algorithms are well suited to regression problems (where values are
predicted, for instance, property prices) and classification problems (where data are sorted
into classes, for instance, deciding whether an email is spam or not). Regression trees are
used when the target variable is continuous or quantitative (for instance, if we want to
predict the probability of rainfall). Classification trees are used when the dependent
variable is qualitative and categorical. For instance, if a doctor wants to determine the
blood group of a patient, a classification tree is the most appropriate. DT algorithms are
widely used in machine learning (ML) and have practical applications in several industries:
DT algorithms are used in medicine for the early detection of cognitive impairment
(Su et al., 2019). They are also used to predict the possible future development of
dementia.
DT algorithms are used in building chatbots, which have changed the
healthcare sector by gathering information from patients through friendly chats.
Chatbots have also transformed the customer care sector. Internet
platforms such as Google and Amazon are adopting chatbots to help manage their
customer care services (Ikedinachi et al., 2019).
DTs are trained to recognize different causes of forest loss from satellite imagery. They
can be used to predict the likely causes of forest loss, such as wildfires, large- or
small-scale agriculture, logging of tree plantations, and urbanization (Srivastava et al., 2019).
Organizations learn about customers' decision drivers and choices through sentiment
analysis conducted with DT algorithms.
DT algorithms help establish the best strategy for dealing with invasive species.
DT algorithms have also enhanced financial fraud detection by tracing
patterns in credit card transactions that match known cases of fraud.
DTs are easy to understand and interpret, which has made them popular not only in science
but also in business (Ramesh, Rajinikanth & Vasumathi, 2017). Their
application has also expanded to civil areas.
DTs are made of leaves, nodes, and branches: each leaf represents an outcome, each branch
represents a decision (or rule), and each node represents a feature (or attribute). The
depth of a tree is its number of levels, not counting the root node.
Figure 1: A DT of two levels, adapted from Zhu et al. (2018)
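The leaf, branch, and node vocabulary can be made concrete with a small sketch. The `Node` class, the feature names, and the thresholds below are hypothetical and chosen only for illustration; they are not taken from Figure 1 or from either software package.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Node:
    """Internal node: tests one feature against a threshold (hypothetical layout)."""
    feature: str
    threshold: float
    left: Union["Node", str]   # branch taken when value <= threshold
    right: Union["Node", str]  # branch taken when value > threshold


def predict(tree, row):
    """Follow branches from the root until a leaf (a plain label) is reached."""
    while isinstance(tree, Node):
        tree = tree.left if row[tree.feature] <= tree.threshold else tree.right
    return tree


def depth(tree):
    """Number of levels below the root; a lone leaf has depth 0."""
    if not isinstance(tree, Node):
        return 0
    return 1 + max(depth(tree.left), depth(tree.right))


# Illustrative two-level tree; the features and cut-offs are made up.
tree = Node("income", 50_000,
            left=Node("age", 30, left="decline", right="review"),
            right="approve")

print(predict(tree, {"income": 40_000, "age": 45}))  # review
print(depth(tree))                                   # 2
```

Each root-to-leaf path reads as one "if ... then ..." rule, for example: if income <= 50000 and age > 30 then "review".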
In handling data, a DT uses a top-down approach. It groups and labels similar observations
and develops rules that exclude dissimilar observations (Zhu et al., 2018). This process is
repeated until a certain degree of similarity is attained. DT splits can be either multiway
or binary. For binary splitting, each node is divided into two subgroups, and attempts are made to establish
the optimal partitioning. For multiway splitting, nodes are divided into multiple
subgroups, with as many partitions as there are distinct values. A notable advantage of
multiway splitting over binary splitting is that a single split exhausts all the information
in a nominal attribute. This is not true for binary splits.
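The difference between the two splitting styles can be sketched in a few lines. The helper names and the toy records are assumptions made for illustration only.

```python
from collections import defaultdict


def multiway_split(rows, attribute):
    """One subgroup per distinct value of a nominal attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row)
    return dict(groups)


def binary_split(rows, attribute, value):
    """Two subgroups: rows where the attribute equals `value`, and the rest."""
    matches = [row for row in rows if row[attribute] == value]
    others = [row for row in rows if row[attribute] != value]
    return matches, others


# Hypothetical records with a three-valued nominal attribute.
rows = [{"colour": c} for c in ["red", "green", "blue", "green", "red"]]

groups = multiway_split(rows, "colour")
matches, others = binary_split(rows, "colour", "red")
print(sorted(groups))                         # ['blue', 'green', 'red']
print(len(matches), len(others))              # 2 3
print(sorted({r["colour"] for r in others}))  # ['blue', 'green']
```

After the multiway split every subgroup is pure in `colour`, so the attribute is exhausted. After the binary split the second subgroup still mixes two values, so the same attribute may be needed again further down the tree.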
Tree pruning
The complexity of DTs rises as the number of splits increases (Rai, Devi & Guleria, 2016).
Simpler DTs are preferable because they carry a lower risk of overfitting and are
generally easier to understand and interpret. An overfitted tree models noise (irrelevant
data and information) in the training set, which degrades the performance of DT algorithms
(Jaworski, Duda, & Rutkowski, 2017).
Figure 2: Overfitting in DTs, adapted from Sim, Teh & Ismail (2017)
An overfitted model cannot replicate its training performance on new data. It captures only
the data provided upfront and breaks down when new data sets are introduced.
To avoid overfitting, the user needs to remove branches that fit the data too specifically.
The result is a DT that generalizes and works well on new data.
The trade-off is some loss of precision on the training data. The technique used to
eliminate overfitting is called pruning. Pruning removes sections of the tree that provide
little classification or predictive power and thereby reduces the size of the DT (Sim, Teh
& Ismail, 2017). Pruning aims to eliminate noise and improve the accuracy of the DT. Pruning
can be done using two different strategies:
Pre-pruning: growth of the DT branches is stopped once the information becomes
unreliable (Sim, 2019).
Post-pruning: leaf nodes are removed from a fully grown DT in order
to improve the performance of the DT (Sim, Teh & Ismail, 2018).
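As one possible illustration of post-pruning, the sketch below uses reduced-error pruning, a specific technique the text does not name: a subtree is replaced by a leaf whenever that does not increase errors on held-out validation data. The dictionary layout of the tree, the stored `majority` label, and the data are all hypothetical.

```python
def predict(tree, row):
    """Walk a nested-dict tree until a leaf label is reached."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


def errors(tree, rows):
    """Count misclassified rows."""
    return sum(predict(tree, row) != row["label"] for row in rows)


def prune(tree, validation_rows):
    """Reduced-error post-pruning, applied bottom-up."""
    if not isinstance(tree, dict):
        return tree
    f, t = tree["feature"], tree["threshold"]
    left_rows = [r for r in validation_rows if r[f] <= t]
    right_rows = [r for r in validation_rows if r[f] > t]
    candidate = dict(tree,
                     left=prune(tree["left"], left_rows),
                     right=prune(tree["right"], right_rows))
    leaf = tree["majority"]  # majority training label stored at this node
    # Keep the simpler leaf if it is at least as accurate on validation data.
    if errors(leaf, validation_rows) <= errors(candidate, validation_rows):
        return leaf
    return candidate


# Hypothetical fully grown tree whose deepest split fits noise.
tree = {"feature": "x", "threshold": 5, "majority": "no",
        "left": "no",
        "right": {"feature": "x", "threshold": 7, "majority": "yes",
                  "left": "yes", "right": "no"}}
validation = [{"x": 2, "label": "no"}, {"x": 4, "label": "no"},
              {"x": 6, "label": "yes"}, {"x": 8, "label": "yes"},
              {"x": 9, "label": "yes"}]

pruned = prune(tree, validation)
print(errors(tree, validation), errors(pruned, validation))  # 2 0
```

Pre-pruning would instead stop the tree from growing the deeper split in the first place, for example by limiting depth or requiring a minimum number of observations per node.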
Figure 3: Example of an unpruned DT, adapted from Sim (2019)
Important DTs algorithm
DTs use algorithms to split data and select features. All DTs are designed to execute the
same task: by splitting the data into subgroups, they examine all the attributes of the data
set to establish the ones that give the best possible result. DTs perform this task
recursively, splitting subgroups into smaller and smaller units until the tree
is exhausted. The splitting process heavily affects the accuracy and performance of the
tree. To manage this, DTs can employ different algorithms that differ in the shape of the
resulting tree, for instance, in the number of splits per node, when to stop
splitting, and how to conduct the splitting. Understanding which attribute to split on and
by what criterion requires insight into the main DT algorithms.
CHAID
One of the oldest DT algorithms is Chi-squared Automatic Interaction
Detection (CHAID). This algorithm produces multiway DTs (splits of more than two branches)
appropriate for regression and classification tasks. A classification tree is constructed
when the dependent variable is categorical. When building classification trees, CHAID uses
the chi-square test of independence to determine the best split at each step (Akin, Eyduran
& Reed, 2017). Chi-square tests investigate the relationship between two variables and are
applied at each step to ensure that each branch is associated with a statistically
significant predictor of the response variable. In other words, CHAID settles for the
independent variable that interacts most strongly with the dependent variable. Furthermore,
categories of a predictor that do not differ significantly with respect to the dependent
variable are merged.
For regression trees, the dependent variable is continuous. Here, instead of chi-square,
CHAID relies on the F-test to test for differences between group means. A new partition
(child node) is created if the F-test is significant, which means that the child node is
statistically dissimilar from the parent node. On the other hand, the categories are merged
into a single node if the outcomes of the F-tests between target means are not significant.
CHAID does not replace missing values; instead, they are handled as a single class, which
can be merged with other classes if need be. CHAID has no pruning function (Erener, Mutlu &
Düzgün, 2016). Additionally, it produces multiway DTs, which tend to be wider rather than
deeper. Such trees can be unrealistically short and hard to relate to real business
conditions. CHAID is not the most powerful DT algorithm, since it struggles to detect small
differences, nor is it the fastest. On the other hand, CHAID is flexible,
easy to manage, and can still be very useful.
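The chi-square statistic behind CHAID's classification splits can be computed directly from a contingency table of predictor categories against classes. This is a minimal sketch of the Pearson statistic only; it does not compute p-values, and the category merging and significance adjustments of a full CHAID implementation are omitted. The tables are made-up counts.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table.

    `table[i][j]` is the observed count for predictor category i and class j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts: rows are predictor categories, columns are classes.
strong_predictor = [[30, 10], [10, 30]]
weak_predictor = [[20, 20], [20, 20]]

print(chi_square(strong_predictor))  # 20.0
print(chi_square(weak_predictor))    # 0.0
```

For tables of the same shape, a larger statistic means a stronger association with the dependent variable, so the first predictor would be preferred as the splitting variable.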
CART
This DT algorithm produces binary classification trees or regression trees, depending on
whether the target (or dependent) variable is categorical or numeric, respectively. It
handles raw (unprocessed) data and can use the same variable more than once in different
parts of the same DT, which helps reveal complex interdependencies between sets of
variables.
For classification trees, the CART algorithm relies on a metric named Gini impurity to
choose decision points. Gini impurity indicates how good a split is by measuring how mixed
the classes are in the two groups created by the split. When all observations belong to the
same label, the classification is perfect and the Gini impurity is 0 (Lin
& Luo, 2019), which is its minimum value. The worst-case split occurs
when the observations are equally distributed among the different labels. In this case,
Gini impurity reaches its maximum of 1 - 1/k for k labels (0.5 for two labels), which
approaches 1 as the number of labels grows.
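A short sketch makes the metric concrete. It uses the standard definition, one minus the sum of squared class proportions, and the weighted average over the two child groups of a binary split; the function names and label lists are illustrative assumptions.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_gini(left, right):
    """Impurity of a binary split: size-weighted average of the two children."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


print(gini(["a", "a", "a", "a"]))           # 0.0  (pure node)
print(gini(["a", "a", "b", "b"]))           # 0.5  (two labels, evenly mixed)
print(gini(["a", "b", "c", "d"]))           # 0.75 (four labels, evenly mixed)
print(split_gini(["a", "a"], ["b", "b"]))   # 0.0  (perfect split)
```

With k equally frequent labels the impurity evaluates to 1 - 1/k, which is why the evenly mixed cases give 0.5 and 0.75. A CART-style learner would pick the candidate split with the lowest `split_gini`.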
For regression trees, CART uses the splits that minimize the Least Square Deviation (LSD):
among all possible partitions, it chooses the one with the smallest LSD.
The LSD metric is the sum of the squared distances between the predicted and observed
values. The difference between a predicted and an observed value is called a
residual. Therefore, LSD selects the parameter estimates to
minimize the sum of the squared residuals. LSD can capture more information on the
quality of the split than other metrics (Fong, Biuk-Aghai & Millham, 2018).
Besides, LSD is well suited for metric data. The main idea behind the CART algorithm is to
produce a sequence of DTs, each of which is a candidate to be the optimal tree. The optimal
tree is identified by evaluating the performance of every tree through testing. Testing
involves the use of new data sets not seen by the DT. Alternatively, the optimal tree can be
identified by cross-validation, which involves subdividing the data set into "n" folds
and performing tests on each fold.
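The least-squares criterion can be sketched for a single numeric predictor. In this minimal sketch each side of a split predicts the mean of its observations, and the chosen threshold is the one with the smallest total sum of squared residuals. The function names and the data are hypothetical, and a real implementation would search over every predictor.

```python
def sse(values):
    """Sum of squared residuals around the mean (the node's prediction)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) for the split with the smallest total SSE."""
    pairs = sorted(zip(xs, ys))
    best = (None, sse(ys))  # baseline: no split at all
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = sse(left) + sse(right)
        if score < best[1]:
            best = (threshold, score)
    return best


# Made-up data with an obvious jump between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [5, 6, 7, 20, 21, 22]
print(best_split(xs, ys))  # (6.5, 4.0)
```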
For tree selection, CART does not rely on an internal performance measure. Instead, it uses
testing or cross-validation to measure the performance of each DT. Tree selection proceeds
only after these evaluations are done.
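The fold bookkeeping behind cross-validation can be sketched as follows. The interleaved assignment of samples to folds, without shuffling, is an assumption made for brevity; fitting and scoring a candidate tree on each pair of index lists is left out.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    folds = [list(range(start, n_samples, n_folds)) for start in range(n_folds)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


for train, test in k_fold_indices(6, 3):
    # A candidate tree would be fitted on `train` and scored on `test` here.
    print(train, test)
```

Averaging the per-fold scores gives one performance estimate per candidate tree, and the tree with the best estimate is selected.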
ID3
This DT algorithm is mainly used to produce classification trees. The
Iterative Dichotomiser 3 (ID3) is not effective in building regression trees, although
techniques such as building numerical intervals can improve its performance on numeric
targets. ID3 dichotomizes (splits) data attributes to establish the most
dominant features (Wei-ming & Yu, 2018). This procedure is performed iteratively in a
top-down approach to choose the DT nodes. To select the most useful attribute for
classification, ID3 uses the information gain metric, which measures the
amount of information a feature gives about a class. ID3 maximizes this metric by splitting
first on the attribute with the highest information gain. Information gain is linked to the
concept of entropy, which measures the amount of randomness or uncertainty in the data (Yang, Guo &
Jin, 2018). For a two-class problem, entropy values range from 0 to 1: the value is 0 when
the sample is completely homogeneous and 1 when the sample is equally divided.
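Entropy and information gain can be computed in a few lines. This is a minimal sketch using base-2 logarithms; the function names and the four-row data set are assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Parent entropy minus the size-weighted entropy of the subgroups."""
    parent = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - remainder


# Hypothetical data: `outlook` determines the class, `windy` does not.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

print(entropy(["yes", "yes", "no", "no"]))       # 1.0 (equally divided)
print(entropy(["yes", "yes", "yes"]))            # 0.0 (homogeneous)
print(information_gain(rows, "outlook", "play"))  # 1.0
print(information_gain(rows, "windy", "play"))    # 0.0
```

An ID3-style learner would therefore split on `outlook` first, since it has the highest information gain.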
C4.5
C4.5 is an improved version of ID3. C4.5 can handle both categorical and continuous
attributes, although, like ID3, the trees it produces are classification trees.
Additionally, C4.5 can deal with undefined values by ignoring instances in which an
attribute value is missing. Unlike ID3, which uses information gain, C4.5 uses the gain
ratio for its splitting process. The gain ratio is an improved
version of information gain that reduces the bias toward attributes with many branches.
It takes the intrinsic information of each split into account, thus correcting the
unfair favoritism toward attributes with multiple outcomes (Damanik et al., 2019).
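The correction that the gain ratio applies can be seen in a small sketch, where the intrinsic information of a split is taken to be the entropy of the attribute's own value distribution. The function names and the data are hypothetical; the `id` column stands in for an attribute with one distinct value per row.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy in bits of a list of values."""
    n = len(values)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(values).values())


def gain_ratio(rows, attribute, target):
    """Information gain divided by the intrinsic information of the split."""
    n = len(rows)
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    gain = entropy([row[target] for row in rows]) - sum(
        len(g) / n * entropy(g) for g in groups.values())
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0


# Both attributes separate the classes perfectly, but `id` needs four branches.
rows = [
    {"id": 1, "outlook": "sunny", "play": "yes"},
    {"id": 2, "outlook": "sunny", "play": "yes"},
    {"id": 3, "outlook": "rainy", "play": "no"},
    {"id": 4, "outlook": "rainy", "play": "no"},
]

print(gain_ratio(rows, "outlook", "play"))  # 1.0
print(gain_ratio(rows, "id", "play"))       # 0.5
```

Both attributes have an information gain of 1 bit, so information gain alone cannot tell them apart. The gain ratio penalizes the many-valued `id` attribute and prefers `outlook`.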
Chapter 3: Evaluation criteria
The criteria used in evaluating the data mining software are:
Functionality
Ancillary task support
Performance
Usability
Functionality
The functionality of a data mining tool helps in assessing how effectively the software will
handle different data mining problems.
Table 1: Functionality criteria
Criteria | Description
Algorithm variety | The software should provide a variety of data mining techniques, for instance, clustering, induction, decision trees, etc.
Model validation | The tool should support model validation in addition to model creation.
Data type flexibility | The tool should handle a variety of dissimilar data types without binning.
Data sampling | The software should allow random data sampling for predictive modeling.
Reporting | The tool should give detailed as well as summarized results.
Model exporting | Once a model is validated, the software should provide ways to export it, for instance, as SQL or a C program.
Prescribed methodology | The tool should provide a stepwise methodology for data mining to avoid spurious results.
Ancillary task support
These criteria cover the options available to the user for carrying out data
transformation, cleansing, visualization, manipulation, and other tasks that support data
mining. These tasks include, but are not limited to, enrichment, selection, cleansing, data
filtering, binning of continuous data, randomizing, value substitution, deleting records,
and generating derived variables. Data sets that are truly clean and ready for mining are
rarely found; therefore, data must be fine-tuned for the model building phase (Sagar &
Saha, 2017).
Table 2: Ancillary criteria
Criteria | Description
Data cleansing | The tool should help the user correct spurious values in the data sets. This criterion also measures the ease with which a tool can execute other cleansing operations.
Value substitution | The tool should allow global substitution of one data value with another, for instance, replacing 'F' or 'E' with 0 or 1 for uniformity.
Data filtering | The tool should select data subsets as defined by the user.
Binning | To improve modeling efficiency, the tool should allow the binning of continuous data.
Randomization | The tool should allow data randomization before the model is built.
Blanks handling | Is the tool efficient in handling blanks? The tool should allow substitution of blanks with derived values, for instance, medians and averages.
Manipulation of metadata | The tool should present the user with data descriptions, types, categorical codes, and formulae for deriving attributes.
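Three of the ancillary tasks in Table 2 (value substitution, blanks handling, and binning) can be sketched in plain Python. The helper names, the code mapping, the bin edges, and the sample values are all hypothetical; in practice these operations are carried out through the data preparation facilities of the tool itself.

```python
from statistics import median


def substitute(values, mapping):
    """Global value substitution; values without a mapping are kept as-is."""
    return [mapping.get(v, v) for v in values]


def fill_blanks(values):
    """Replace blanks (None) with the median of the values that are present."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def bin_values(values, edges, labels):
    """Bin continuous values; `labels` has one more entry than `edges`."""
    binned = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v <= edge:
                binned.append(label)
                break
        else:
            binned.append(labels[-1])  # above the last edge
    return binned


codes = substitute(["F", "E", "F"], {"F": 0, "E": 1})
ages = fill_blanks([25, None, 47, 71])
groups = bin_values(ages, edges=[30, 60], labels=["young", "middle", "senior"])

print(codes)   # [0, 1, 0]
print(ages)    # [25, 47, 47, 71]
print(groups)  # ['young', 'middle', 'middle', 'senior']
```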
Usability
Usability is the accommodation of different types and levels of users without loss of
usefulness or functionality. Easy-to-use data mining tools carry the risk of misuse
(Parsaei, Rostami & Javidan, 2016). Data mining tools should be easy to learn and should
guide the user toward data mining rather than data dredging. KDD is an iterative
process: to generate more valid models, practitioners typically adjust modeling variables. A
good tool provides meaningful diagnostics that help users debug problems and improve the
output (Gupta & Khanna, 2017).
Table 3: Usability criteria
Criteria | Description
User interface | The tool should have a user interface that is easy to navigate. The interface should use graphics to present results vividly.
Learning curve | The tool should be easy to learn and easy to use correctly.
User types | Is the software designed for beginners, intermediate users, advanced users, or a combination of user types? Analysts and business users should also find the tool easy to use.
Visualization | The tool should present data and modeling results and communicate information through graphical methods.
Reporting | The software should report errors meaningfully. Error messages should effectively help users debug problems. The user should also investigate how well the tool accommodates errors or spurious model building.
Action history | The tool should keep a history of all data mining activities and allow the user to modify parts of that history and re-execute the script.
Domain variety | The tool should be applicable in a variety of business organizations. The user should consider how well the tool focuses on a single problem domain and how well it focuses on multiple domains.
Performance
From a computational perspective, the performance of a data mining tool is greatly
influenced by hardware configuration. Nevertheless, some data mining software is more
powerful and effective than others (Chen et al., 2018). The qualitative aspect of
performance is measured by how easily a tool manages data under a variety of circumstances.
Table 4: Performance criteria
Criteria | Description
Efficiency | The tool should produce timely results relative to the size of the data, the limitations of the algorithm, and other variables.
Software architecture | What type of architecture does the software use? Is it a client-server architecture or a standalone architecture? Ideally, the user should be able to choose an architecture.
Robustness | The tool should run consistently without crashing. If it does crash, at what point does it crash: near completion or at the start of the data mining? Furthermore, a good tool can be left to run unsupervised.
Data size | A good tool should be capable of handling large data sets. The user should find out whether run time grows linearly or exponentially with data size.
Platform variety | Good data mining software should run on a variety of computer platforms and, most importantly, should be useful in the real business world.
Heterogeneous data access | The tool should interface well with multiple data sources. These include the Common Object Request Broker Architecture (CORBA), Open Database Connectivity (ODBC), and relational database management systems (RDBMS).
Chapter 4: Description of the DT induction software
SAS Enterprise Miner (EM) data mining software
Through a streamlined data mining process, SAS Enterprise Miner enables users to quickly
develop predictive and descriptive models. It is a very advanced analytic tool
(Wendler & Gröttrup, 2016). Its graphical interface guides users through the five-step SAS
SEMMA approach (Sample, Explore, Modify, Model, and Assess). Users build a process flow by
selecting the right tab from Enterprise Miner's toolbar and then dragging and dropping
step-specific nodes onto a palette. SAS EM supports numerous techniques and algorithms,
including link analysis, market basket analysis, web path and sequence analysis, neural
networks, decision trees, logistic and linear regression, and time series.
EM features for Data mining process
SAS Rapid Predictive Modeler: a component of SAS Enterprise Miner that can run as an
add-on to Microsoft Excel, enabling users to perform predictive modelling from within
their Excel spreadsheets. With Enterprise Miner, data analysts
can customize models developed in Rapid Predictive Modeler.
Integration of R code: developers and analysts who rely on the R language can integrate
the transformations and models they write into an Enterprise Miner process flow.
Support for in-Hadoop and in-database scoring: scoring algorithms created in SAS
Enterprise Miner, combined with a SAS Scoring Accelerator, can be deployed and executed
within Hadoop environments or databases. Scoring accelerators are available for
DB2, Hadoop, Pivotal, Oracle, SAS Scalable Performance Data Server, IBM Netezza, and
Teradata.
Easy-to-use GUI and batch processing: users can build more models, and better-quality
models, more quickly.
Sophisticated data preparation, exploration, and summarization: with a powerful
interactive data preparation tool, SAS EM helps users develop segmentation rules, filter
outliers, and address missing values.
High-performance capabilities: SAS EM boosts performance with high-performance data
mining tools.
Model reporting, management, and comparison: with easy-to-use assessment features,
SAS EM quickly identifies the models with the best lift and the best overall ROI.
Scalable processing: with a Java client and SAS server architecture, EM can scale from
single-user systems to very large enterprise solutions.
Options for cloud deployment: users can work in a secure cloud-based environment
managed by SAS experts.
Chapter 5: Description of SPSS
SPSS stands for Statistical Package for the Social Sciences. Statisticians use SPSS to handle
complicated statistical data sets. The package was created for the statistical analysis
and management of social science data. IBM acquired SPSS in 2009 from its original developer,
SPSS Inc. SPSS is popular among users because of its English-like command language
and its straightforwardness. It is widely used by survey companies, market researchers,
healthcare researchers, educational researchers, government entities, data miners, and marketing
organizations for the processing and analysis of survey data.
The core functions of SPSS
SPSS offers four programs: the statistics program, the modeler program, the text analytics for
surveys program, and the visualization designer.
Statistics program
The statistics program provides basic statistical functions, including cross-tabulation,
frequencies, and bivariate statistics.
Modeler program
Using advanced statistical procedures, SPSS's modeler program enables researchers to build and
validate predictive models (Wang & Johnson, 2019).
Text analytics for surveys program
The text analytics for surveys program provided by SPSS helps survey administrators analyze
responses to open-ended survey questions.
Visualization designer
The visualization designer program provided by SPSS helps researchers create density charts
and radial boxplots with ease (Rogalewicz & Sika, 2016).
Apart from the four programs, SPSS also provides tools for data management. These allow
researchers to create derived data, reshape files, and select cases (Tsai, Lin, & Ke,
2016). In addition, researchers use SPSS to store a metadata dictionary.
The metadata dictionary is a central repository of information about the data, for instance its
origin, usage, format, meaning, and relationship to other data.
Several statistical methods are available in SPSS. These include:
Prediction of numerical outcomes, such as linear regression.
Bivariate statistics, including means, correlation, nonparametric tests, and analysis of
variance (ANOVA).
Prediction for identifying groups, including methodologies such as factor analysis and
cluster analysis.
Descriptive statistics, including methodologies such as descriptive ratio statistics,
cross-tabulation, and frequencies.
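To make three of the methods listed above concrete, the following is a minimal Python sketch of frequencies, cross-tabulation, and simple linear regression. It does not use SPSS or reproduce SPSS syntax; the function names and the small survey data set are hypothetical and serve only as an illustration.

```python
from collections import Counter


def frequencies(values):
    """Count how often each distinct value occurs."""
    return dict(Counter(values))


def crosstab(rows, cols):
    """Count co-occurrences of (row category, column category) pairs."""
    return dict(Counter(zip(rows, cols)))


def linear_regression(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical survey data, invented for illustration only.
gender = ["F", "M", "F", "F", "M"]
answer = ["yes", "no", "yes", "no", "no"]
hours = [1, 2, 3, 4, 5]
score = [52, 54, 56, 58, 60]

print(frequencies(answer))              # counts per answer
print(crosstab(gender, answer))         # counts per (gender, answer) pair
print(linear_regression(hours, score))  # (slope, intercept)
```

In SPSS these operations are reached through menus or generated syntax; the sketch only shows what each method computes.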
The benefits of using SPSS
The data mining process is easily visualized using the intuitive graphical interface of IBM SPSS
Modeler. Structured data (dates and numbers) and unstructured data (text) are easily accessible
from this graphical interface. These structured and unstructured items are extracted from sources
such as survey data and operational databases.
IBM SPSS, coupled with IBM Cognos 8 Business Intelligence, produces deeper insight and more
accurate predictions by utilizing all the data assets (George & Mallery, 2016).
IBM SPSS helps decision-makers by deploying model predictions and insights.
Chapter 6: Comparative Analysis in terms of Relevant Criteria
1. Purpose and usability
SPSS can be used by both statisticians and non-statisticians, since it is among the most widely
available statistical software. It has easy-to-use interfaces with simple drag-and-drop menus.
SPSS is relevant in many fields, most importantly the social sciences.
By contrast, SAS EM is built on SAS, a powerful statistical programming language. Thus, it
offers high-quality statistical functions. SAS EM is one of the leading commercial analytics
tools globally.
2. Functionality
SPSS is not the most recommended tool for processing voluminous data sets: it is best suited to
synchronizing and processing data smaller than about 100 MB. On the other hand, there is no
need to write long syntax by hand in SPSS, since the Paste button writes the code automatically.
By contrast, SAS EM uses sorting and splicing of data to process massive amounts of
data. SAS EM also offers a drag-and-drop facility to users.
3. Performance
SPSS is easy for anyone to learn because of its user-friendly interface. In other
words, users do not need to learn the syntax. The Paste function
automatically generates the syntax for the steps executed in the user interface. On the other
hand, EM relies on PROC SQL. This makes it easier for those with SQL experience to learn SAS
EM.
4. Ancillary support
In data mining, IBM SPSS is ranked 4th while SAS EM is ranked 3rd. IBM SPSS has four
reviews, whereas SAS EM has six. IBM SPSS has a rating of 8.0, while SAS Enterprise
Miner has a rating of 7.6.
Chapter 7: Conclusion
Experiments with a variety of data sets and commercial tools have produced an assessment
framework for data mining tools, along with a methodology for applying it. The framework
considers ancillary task support, performance, usability, and functionality in evaluating data
mining software. The
evaluation criteria make use of the decision matrix concept to objectify a subjective process.
Data mining software is costly and comes with a moderately steep learning curve, so it is
better to evaluate the options before settling on one. The evaluation criteria discussed in this
paper help surveyors, mathematicians, entrepreneurs, and health practitioners choose between
SAS EM and SPSS. This paper has also explained the decision tree algorithms that are relevant
in statistics. Pruning has been identified as the technique for managing and eliminating
overfitting in DT algorithms. In terms of functionality, SPSS outperforms SAS EM; thus, it is
common among statisticians. Overall, the paper helps in gaining experience with the evaluation
criteria used in the assessment of different data mining software.
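The decision matrix concept mentioned above can be illustrated with a small Python sketch. The criteria names come from this paper; the weights, the 1 to 5 scores, and the tool names "Tool A" and "Tool B" are hypothetical placeholders and are not measurements of SAS EM or SPSS.

```python
# Hypothetical criterion weights; they are assumed to sum to 1.0.
WEIGHTS = {
    "functionality": 0.4,
    "usability": 0.3,
    "performance": 0.2,
    "ancillary_support": 0.1,
}

# Hypothetical 1-5 scores for two unnamed tools (illustration only).
SCORES = {
    "Tool A": {"functionality": 4, "usability": 3,
               "performance": 5, "ancillary_support": 2},
    "Tool B": {"functionality": 3, "usability": 5,
               "performance": 4, "ancillary_support": 4},
}


def weighted_score(scores, weights):
    """Sum each criterion score multiplied by its weight."""
    return sum(weights[criterion] * scores[criterion] for criterion in weights)


def rank_tools(all_scores, weights):
    """Return (tool, total) pairs sorted from highest to lowest total."""
    totals = {tool: weighted_score(s, weights) for tool, s in all_scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for tool, total in rank_tools(SCORES, WEIGHTS):
    print(f"{tool}: {total:.2f}")
```

Changing the weights changes the ranking, which is the point of the matrix: the evaluator states priorities explicitly before comparing tools.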
References
Wendler, T., & Gröttrup, S. (2016). Data mining with SPSS modeler: theory, exercises and
solutions. Springer.
Wang, J., & Johnson, D. E. (2019). An examination of discrepancies in multiple imputation
procedures between SAS® and SPSS®. The American Statistician, 73(1), 80-88.
Rai, K., Devi, M. S., & Guleria, A. (2016). Decision tree-based algorithm for intrusion detection.
International Journal of Advanced Networking and Applications, 7(4), 2828.
Jaworski, M., Duda, P., & Rutkowski, L. (2017). New splitting criteria for decision trees in
stationary data streams. IEEE transactions on neural networks and learning
systems, 29(6), 2516-2529.
Erener, A., Mutlu, A., & Düzgün, H. S. (2016). A comparative study for landslide susceptibility
mapping using GIS-based multi-criteria decision analysis (MCDA), logistic regression
(LR), and association rule mining (ARM). Engineering Geology, 203, 45-55.
Conrad, C., Ali, N., Kešelj, V., & Gao, Q. (2016, August). ELM: An extended logic matching
method on record linkage analysis of disparate databases for profiling data mining.
In 2016 IEEE 18th Conference on Business Informatics (CBI) (Vol. 1, pp. 1-6). IEEE.
Dušanka, D., Darko, S., Srdjan, S., Marko, A., & Teodora, L. (2017). A comparison of
contemporary data mining tools. In XVII International Scientific Conference on
Industrial Systems (IS'17), Novi Sad, Serbia, October (Vol. 4, No. 6).
Chen, W., Zhang, S., Li, R., & Shahabi, H. (2018). Performance evaluation of the GIS-based
data mining techniques of the best-first decision tree, random forest, and naïve Bayes tree
for landslide susceptibility modeling. Science of the total environment, 644, 1006-1018.
Srivastava, P. K., Petropoulos, G. P., Gupta, M., Singh, S. K., Islam, T., & Loka, D. (2019).
Deriving forest fire probability maps from the fusion of visible/infrared satellite data and
geospatial data mining. Modeling Earth Systems and Environment, 5(2), 627-643.
Sagar, K., & Saha, A. (2017). A systematic review of software usability studies. International
Journal of Information Technology, 1-24.
Gupta, D., & Khanna, A. (2017). Software usability datasets. Int. J. Pure Appl. Math.
SCOPUS, 117(15), 1001-1014.
Sim, D. Y. Y., Teh, C. S., & Ismail, A. I. (2018). Improved boosted decision tree algorithms by
adaptive apriori and post-pruning for predicting obstructive sleep apnea. Advanced
Science Letters, 24(3), 1680-1684.
Sim, D. Y. Y., Teh, C. S., & Ismail, A. I. (2017). Improved boosting algorithms by pre-pruning
and associative rule mining on decision trees for predicting obstructive sleep
apnea. Advanced Science Letters, 23(11), 11593-11598.
Sim, D. Y. Y. (2019, September). Support Vector Machine Pre-pruning Approaches on Decision
Trees for Better Classification. In Proceedings of the 2019 2nd International Conference
on Electronics and Electrical Engineering Technology (pp. 30-36).
Parsaei, M. R., Rostami, S. M., & Javidan, R. (2016). A hybrid data mining approach for
intrusion detection on imbalanced NSL-KDD dataset. International Journal of Advanced
Computer Science and Applications, 7(6), 20-25.
Su, Y., Kwok, T. C., Cummings, S. R., Yip, B. H., & Cawthon, P. M. (2019). Can Classification
and Regression Tree Analysis Help Identify Clinically Meaningful Risk Groups for Hip
Fracture Prediction in Older American Men (The MrOS Cohort Study)?. JBMR
plus, 3(10), e10207.
Yang, S., Guo, J. Z., & Jin, J. W. (2018). An improved Id3 algorithm for medical data
classification. Computers & Electrical Engineering, 65, 474-487.
Wei-ming, D. U., & Yu, R. A. N. (2018). Reaserch of ID3 algorithm in decision tree. Science &
Technology Vision, (11), 61.
Damanik, I. S., Windarto, A. P., Wanto, A., Andani, S. R., & Saputra, W. (2019, August).
Decision Tree Optimization in C4.5 Algorithm Using Genetic Algorithm. In Journal of
Physics: Conference Series (Vol. 1255, No. 1, p. 012012). IOP Publishing.
Akin, M., Eyduran, E., & Reed, B. M. (2017). Use of RSM and CHAID data mining algorithm
for predicting mineral nutrition of hazelnut. Plant Cell, Tissue and Organ Culture
(PCTOC), 128(2), 303-316.
Lin, S., & Luo, W. (2019). A New Multilevel CART Algorithm for Multilevel Data with Binary
Outcomes. Multivariate behavioral research, 54(4), 578-592.
Zhu, B., Xie, G., Yuan, Y., & Duan, Y. (2018, May). Investigating Decision Tree in Churn
Prediction with Class Imbalance. In Proceedings of the International Conference on
Data Processing and Applications (pp. 11-15).
Quirynen, R., Berntorp, K., & Di Cairano, S. (2018, June). Embedded optimization algorithms
for steering in autonomous vehicles based on nonlinear model predictive control. In 2018
Annual American Control Conference (ACC) (pp. 3251-3256). IEEE.
George, D., & Mallery, P. (2016). IBM SPSS statistics 23 step by step: A simple guide and
reference. Routledge.
Upadhyay, T., Pradesh, U., & Verma, S. (2019). Data mining methodology and its application.
Fong, S., Biuk-Aghai, R. P., & Millham, R. C. (2018, February). Swarm search methods in weka
for data mining. In Proceedings of the 2018 10th International Conference on Machine
Learning and Computing (pp. 122-127).
Tsai, C. F., Lin, W. C., & Ke, S. W. (2016). Big data mining with parallel computing: A
comparison of distributed and MapReduce methodologies. Journal of Systems and
Software, 122, 83-92.
Rogalewicz, M., & Sika, R. (2016). Methodologies of knowledge discovery from data and data mining
methods in mechanical engineering. Management and production engineering review, 7(4), 97-108.
Ramesh, G., Rajinikanth, T., & Vasumathi, D. (2017). Explorative Data Visualization Using Business
Intelligence and Data Mining Techniques. Int. J. Appl. Eng. Res, 12, 14008-14013.
Ikedinachi, A. P., Misra, S., Assibong, P. A., Olu-Owolabi, E. F., Maskeliūnas, R., & Damasevicius, R.
(2019). Artificial intelligence, smart classrooms and online education in the 21st century:
Implications for human development. Journal of Cases on Information Technology
(JCIT), 21(3), 66-79.