1. Executive Summary Objective To examine the factors t

   

Added on  2023-01-11

Executive Summary
Objective
To examine the factors that can be used to determine the pass rate for students in the GP (Grand
Pines) or MHS (Marble Hill School) schools, in order to support decision making at
ABC Universal Education (ABCEU).
Approach
We use a data analysis approach that incorporates three machine learning algorithms: Decision
Trees, Random Forest, and Generalized Linear Models (in this paper, a logistic regression of the
binomial family). After feature selection, the models are implemented in R and assessed by
comparing their accuracy and predictive power, as presented in the confusion matrix obtained
for each model.
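To make the evaluation step concrete, the following is a minimal sketch of how an accuracy score is read off a confusion matrix. The paper's models were built in R; this illustration uses Python, and the labels and predictions below are hypothetical, not the paper's data.

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels=("fail", "pass")):
    """Count (actual, predicted) pairs into a nested dict: matrix[actual][predicted]."""
    counts = Counter(zip(actual, predicted))
    return {a: {p: counts[(a, p)] for p in labels} for a in labels}

def accuracy(matrix):
    """Share of observations on the diagonal, i.e. predicted label equals actual label."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total

# Hypothetical pass/fail outcomes, for illustration only.
actual    = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
predicted = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail"]

matrix = confusion_matrix(actual, predicted)
print(matrix)
print(f"accuracy = {accuracy(matrix):.2%}")
```

Accuracy uses only the diagonal of the matrix. The off-diagonal cells show which kind of error a model makes, which is why the matrix is reported alongside the score.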
Results
After implementing the GLM model twice, the second model returned an accuracy score of
77.75%, while the Decision Tree model recorded an accuracy of 96.16% and the Random Forest had
an accuracy of 99.47%. We therefore chose the Random Forest as the most suitable
algorithm and used its variable importance plot to analyze the most probable factors that can be
used to determine the pass rate of students.
Conclusion
Machine learning algorithms perform differently in different situations and
depending on the original requirement of the exercise. Therefore, the choice of an algorithm should
be based on that requirement. To assess which model is optimal, performance metrics should be
defined in advance; in this paper, the confusion matrix is used.
Data Exploration and Feature Selection
Data Exploration
In machine learning, the most basic objective is to gain an understanding of the data that
is presented to the analyst. In this respect, the question of what really is in the data crops up.
Machine learning algorithms have often proven effective in offering the analyst a means
of answering such a question (Sutton, 2018). Some of the popular
data exploration methods include visual data exploration and descriptive data analytics, both of
which are used to understand factors such as the distribution of data attributes,
outliers, and normality of the data, and to determine which factors are correlated. In this paper
we explore only univariate distributions of the data to examine measures of location and
spread, asymmetry, outliers, missing data and gaps.
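As an illustration of the univariate measures listed above, here is a minimal sketch that summarises one numeric attribute: location (mean, median), spread (standard deviation), asymmetry (skewness) and the count of missing values. The paper's analysis was done in R; this Python sketch uses only the standard library, and the grade values are hypothetical.

```python
import statistics

def describe(values):
    """Summarise one numeric attribute; None marks a missing observation."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    sd = statistics.pstdev(present)  # population standard deviation
    # Moment-based skewness: negative means a longer left tail.
    skew = sum((v - mean) ** 3 for v in present) / (len(present) * sd ** 3)
    return {
        "n": len(present),
        "missing": len(values) - len(present),
        "mean": mean,
        "median": statistics.median(present),
        "sd": sd,
        "skewness": skew,
    }

# Hypothetical final grades with one missing observation.
grades = [0, 8, 10, 10, 11, 12, 13, 15, None, 18]
summary = describe(grades)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A mean below the median together with negative skewness, as in this toy sample, points to a few very low values pulling the distribution to the left.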
Descriptive
Table 1: List of data attributes
As Table 1 shows, there are 33 variables, of which only 29 are predictors and 1 is the target
variable (G3). In addition, there were no missing observations in the data, so we did not conduct
any imputation.
Correlation
Table 2: Correlation statistics
Table 2 outlines the correlation statistics between the data attributes. From the table, it can be
noted that G3, which is our target variable and contains the grades
scored by the students in the 3rd trimester, shows weak negative correlations with factors such as
age (-0.16), failures (-0.17), going out with friends i.e. goout (-0.17), weekday alcohol consumption
i.e. Dalc (-0.11), and weekend alcohol consumption i.e. Walc (-0.14). The G3 attribute also
shows a weak positive correlation with mother’s education (Medu) and father’s
education (Fedu), both having correlation coefficients of 0.20.
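The coefficients above are presumably Pearson correlations, although the preview does not state the method. Under that assumption, a minimal Python sketch of the calculation looks as follows; the paper itself used R, and the `failures` and `g3` values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of the deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical values: number of past failures and final grade for five students.
failures = [0, 0, 1, 2, 3]
g3 = [15, 12, 10, 8, 5]

r = pearson(failures, g3)
print(f"r(failures, G3) = {r:.2f}")
```

The sign gives the direction of the relationship and the magnitude its strength, so values near -0.1 to -0.2, as reported in Table 2, indicate only a weak linear association.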
Distribution of the target attribute
In table 3 below, we are presented with various descriptive statistics of the 3rd trimester grade
results.
Table 3: descriptive statistics of G3
Next, we explore the distribution of G3, and G3 in relation to factors such as age, sex, and
school.
3rd Trimester grades
Figure 1
Age and 3rd Trimester grades
Figure 2
Sex and 3rd Trimester grades
Figure 3
School and 3rd Trimester grades
Figure 4
Outliers for the target attribute
Figure 5
The graph above indicates that there are some outliers in the dataset, which are removed manually
before fitting the machine learning algorithms.
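The paper does not say which criterion was used to identify the outliers. As one possible approach, the sketch below applies the common 1.5 × IQR boxplot rule. This rule is an assumption, not the authors' stated method, and the grade values are hypothetical. It is written in Python rather than the R used in the paper.

```python
import statistics

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences at k * IQR below Q1 and above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical final grades, for illustration only.
grades = [0, 9, 10, 10, 11, 12, 12, 13, 14, 15, 20]

low, high = iqr_bounds(grades)
kept = [g for g in grades if low <= g <= high]
outliers = [g for g in grades if g < low or g > high]

print(f"fences: ({low}, {high})")
print(f"outliers: {outliers}")
print(f"kept {len(kept)} of {len(grades)} observations")
```

Recording the rule and the removed values this way makes a manual removal step reproducible.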