Case Study: Data Mining and Warehousing for Student Success Analysis

Case Study #2
Data Mining & Warehousing
This case study applies classification, a data mining technique commonly used to predict student success. It works in the RapidMiner environment, which is used to describe the dataset and to evaluate the results against the collected data. The data are gathered and extracted from real records of students studying at the University of Belgrade, Serbia. The faculty offers two study programs, generally referred to as Information Systems and Management. About 366 student records are used, covering both information on graduated students and a description of their success. Several predictors are used to predict student success in the first-year examinations (Lin, Yao and Zadeh 2013). The predictions are based on eleven different attributes, and the predictors produce an output variable containing the average grade of the students considered. Depending on the predicted output variable, an appropriate study program is recommended.
The implementation can follow two processes, named ‘complex’ and ‘simple’. The simple process evaluates several classification algorithms, while the complex process adds feature selection to every algorithm. Both processes share the same data preparation, which uses the following operators:
Select Attributes: selects the predictors provided in the main dataset.
Discretize: splits the output variable into three categories, ‘Bad’, ‘Good’ and ‘Excellent’.
Label: names the output variable ‘Average Grade’.
Sub-processes: provided in the main process; these read the input variables (study program, grades, and scores of the first-year examinations) as well as the output variable, the students’ success.
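Outside RapidMiner, the discretization step above can be sketched in a few lines of Python. This is an illustrative approximation, not the study's actual process; the thresholds follow the upper limits given later in the text (8 for ‘Bad’, 9 for ‘Good’, 10 for ‘Excellent’).

```python
# Illustrative sketch of the Discretize operator: map a numeric average
# grade to one of the three categories used in the study.
def label_average_grade(avg_grade: float) -> str:
    """Return the success category for an average grade."""
    if avg_grade <= 8:
        return "Bad"
    if avg_grade <= 9:
        return "Good"
    return "Excellent"
```

For example, `label_average_grade(8.4)` falls in the second range and yields `"Good"`.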
EAU0116064 HAIZAM FAIZEEN Page | 1
In the Discretize operator, we define several ranges by the upper limits on their labels: for instance, an upper limit of 8 maps to ‘Bad’, 9 to ‘Good’ and 10 to ‘Excellent’. The data are then separated by study program, since each program may require its own classification model, established and evaluated separately. For this isolation, the ‘Multiply’ operator produces two copies of the main dataset, and the ‘Filter Examples’ operator segregates the records by study program. The filtered data are forwarded to the ‘Evaluate Algorithm’ operator, which behaves as a sub-process and contains a nested loop built with the ‘Loop’ operator (Thuraisingham 2014). Each study program is then handled by a set of inner operators; in the ‘Simple Validation’ process, the ‘Loop’ operator separates the testing and training data.
The splitting ratio is set to 0.9, and the remaining three operators compute and record the results:
Set Macros: determines the category of the study program.
Provide Macros: passes the values from ‘Set Macros’ to the ‘Log’ operator.
Log: writes the study program, performance, and algorithm to a .csv file (readable in Microsoft Excel).
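The 0.9 split and the CSV logging described above can be sketched with the Python standard library. The helper names below are assumptions for illustration, not part of the RapidMiner process.

```python
# Sketch of the Loop/Log behaviour: a 90/10 split of the examples and
# a .csv log row per (study program, algorithm, performance) result.
import csv
import random

def split_examples(rows, ratio=0.9, seed=0):
    """Shuffle the examples and split them into training and testing sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * ratio)
    return rows[:cut], rows[cut:]

def log_result(path, study_program, algorithm, performance):
    """Append one result row to a .csv file, as the Log operator does."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([study_program, algorithm, performance])

records = [{"id": i} for i in range(366)]  # 366 student records, per the text
train, test = split_examples(records)      # 329 training, 37 testing examples
```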
Training begins with the first inner operator, ‘Select Sub-Process’, which contains five classification algorithms as inner operators:
Decision Tree
Naïve Bayes
Random Forest
W-LMT
W-SimpleCart
The second inner operator is ‘Testing’, which contains ‘Apply Model’; the third is the ‘Performance’ operator. The ‘Sub-Process’ operator comprises several processes, executed one sub-process at a time, so the IT and Management programs are each treated as one comprehensive run. A drawback of the Naïve Bayes classifier is that it estimates means and variances from small amounts of data.
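The simple process can be approximated outside RapidMiner. The sketch below uses scikit-learn on a synthetic dataset (the real Belgrade records are not available here). W-LMT and W-SimpleCart are Weka classifiers with no direct scikit-learn equivalent, so a logistic regression and a depth-pruned decision tree stand in for them; this substitution is an assumption for illustration only.

```python
# Sketch of the five-algorithm comparison on synthetic data.
# Stand-ins: LogisticRegression ~ W-LMT, pruned DecisionTree ~ W-SimpleCart.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Eleven attributes and three grade classes, mimicking the study's setup.
X, y = make_classification(n_samples=366, n_features=11, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.9, random_state=0)  # 0.9 split ratio, as in the text

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "LMT stand-in": LogisticRegression(max_iter=1000),
    "SimpleCart stand-in": DecisionTreeClassifier(max_depth=4, random_state=0),
}
scores = {name: m.fit(X_train, y_train).score(X_test, y_test)
          for name, m in models.items()}
```

Each entry of `scores` holds one algorithm's test accuracy, playing the role of the ‘Performance’ operator's output.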
In the ‘Complex’ implementation, the primary aim is to let the available data drive the automated selection of both the features and the best algorithm. The data are processed in the same way as in the ‘Simple’ process described above; the difference lies in the analysis element. Overfitting can occur because of the list of optimization features available in each algorithm; it is reduced or avoided by the ‘Wrapper Split Validation’ operator, which assesses the performance of the feature-selection algorithms and also separates the testing and training data. The remaining three operators are used in the same way as in the ‘Simple’ process, and the split ratio is likewise set to 0.9.
This operator comprises three inner operators:
Feature weighting algorithm: represented by the ‘Optimize Selection’ operator.
Defining an algorithm: the ‘Algorithms’ operator defines the algorithms.
Performance: the ‘Performance’ operator determines the outcome.
The ‘Optimize Selection’ operator contains five inner operators. The first is ‘X-Validation’, used to cross-validate the five algorithms. The second is ‘Select Process’, used in the same manner as in the ‘Simple’ process described above. The output of these operators is the weights of the attributes involved in the various algorithms. These results are forwarded to ‘Model Building’, which acts as a sub-process of ‘Wrapper Split Validation’. Model building uses the ‘Select Sub-Process’ operator to append the five iterations that depend on the same value. Users can then compare the results to see whether automated selection has enhanced the classification. The configurations compared are Complex (Backward Selection), Complex (Forward Selection) and Simple, presented in two tables, one per study program. The IT tests showed no change under Complex (Backward Selection), whereas the Management tests showed higher accuracy with backward selection.
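The forward/backward wrapper selection can be sketched with scikit-learn's `SequentialFeatureSelector`, again on synthetic data. This is an approximation of the ‘Optimize Selection’ plus ‘Wrapper Split Validation’ combination, not the study's exact process; the choice of a decision tree base learner and of five retained features are assumptions.

```python
# Sketch of wrapper-style feature selection: greedy forward and backward
# search over the eleven attributes, scored by cross-validation.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=366, n_features=11, n_informative=6,
                           n_classes=3, random_state=0)

def select_features(direction):
    """Return a boolean mask of the attributes kept by greedy selection."""
    sfs = SequentialFeatureSelector(
        DecisionTreeClassifier(random_state=0),
        n_features_to_select=5, direction=direction, cv=5)
    sfs.fit(X, y)
    return sfs.get_support()

forward_mask = select_features("forward")
backward_mask = select_features("backward")
```

Comparing the accuracy of models trained on `forward_mask` and `backward_mask` against the full attribute set mirrors the Simple vs. Complex comparison reported in the two tables.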
References
Lin, T.Y., Yao, Y.Y. and Zadeh, L.A. (eds.), 2013. Data Mining, Rough Sets and Granular Computing (Vol. 95). Physica.
Thuraisingham, B., 2014. Data Mining: Technologies, Techniques, Tools, and Trends. CRC Press.