MIS772 Predictive Analytics Assignment A2: AQA Airline Analysis


Added on 2023/06/11

AI Summary

This report investigates the data collected by AQA to analyze airline quality using Rapidminer. The overall rating of airlines is the response variable, with other ratings (seat comfort, cabin staff, food, in-flight entertainment, value for money, and recommendation) as predictor variables. Three models—regression, neural net, and decision tree—are used to assess predictability. The analysis reveals that the decision tree model is the most effective, with 'recommendation' being the most significant driver for overall airline rating. Data exploration includes handling missing values and clustering analysis using K-means, determining that seven clusters provide optimal results. The models are optimized using cross-validation, splitting data into training and testing sets, and their performance is compared based on absolute error. The integrated solution demonstrates that the decision tree model remains the most suitable due to its lowest absolute error when predicting airline ratings.

MIS772 Predictive Analytics Assignment A2
Executive summary (one page)
In this assignment we investigate the usefulness of the data collected by AQA for analysing the quality of airlines.
RapidMiner is used to analyse the dataset. The overall rating of the airlines has been taken as the response variable.
The other quantitative ratings collected by AQA (seat comfort, cabin staff, food beverages, inflight entertainment,
value money and recommended) have been taken as the predictor variables. Three models have been used to check the
predictability of the dataset: regression, Neural Net and decision tree.
Business Problem
Aim 1: Which of the independent variables is, or are, the most important driver(s) of the overall rating of an airline?
Solution to Business Problem
From the analysis it is found that the decision tree model is best able to predict the overall rating. The absolute
error of the decision tree model is very near to 1. Moreover, “recommendation” is the most important driver of the
overall rating of an airline.
Data exploration and preparation in RapidMiner (one page)
The analysis of the dataset in RapidMiner shows that the variables are either polynominal or integer. The data
provided by AQA covers several aspects of the quality of service provided by airlines. The name of the airline
contains no missing data. Figure 2 shows that most of the travellers have flown in economy class, and the fewest
have flown first class. Missing data is present in the ratings provided by the customers. The ground service and
wifi connectivity ratings contain 39193 and 40831 missing values respectively. As such it was thought prudent not to
select these two factors.
For the present analysis the name of the airlines, the ratings (except ground service and wifi connectivity) and
recommendation were selected. The overall rating is selected as the dependent variable while the other ratings
(except ground service and wifi connectivity) and recommendation are the independent variables.
The overall rating provided to an airline has a scale from 1 to 10. The ratings on seat comfort, cabin staff, food and
beverages and inflight entertainment have a scale of 1 to 5. The recommendation of an airline is coded as either
1 or 0. All missing data was imputed with the average of the respective variable.
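The mean imputation described above can be sketched in Python. This is only an illustration of the idea, not the RapidMiner process used in the report; the function name `impute_with_mean` and the sample ratings are hypothetical.

```python
from statistics import mean

def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical seat comfort ratings on the 1 to 5 scale, with two gaps.
seat_comfort = [4, None, 2, 5, None, 3]
imputed = impute_with_mean(seat_comfort)
print(imputed)
```

Mean imputation leaves the column average unchanged but shrinks its spread, which is worth keeping in mind when reading the standard deviations reported after imputation.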
After the missing data was replaced, figure 1 shows that the average overall rating of an airline is 6.035 with a
standard deviation of 3.033. The average ratings of seat comfort, cabin staff and value money are 3.077, 3.260 and
3.158 respectively. Food and beverages and inflight entertainment ratings are found to be 2.844 and 2.295 respectively.
The overall rating is plotted as a histogram. The histogram shows that the most frequent overall rating given by
travellers across all the airlines is 6 (see figure 2). The least frequent overall rating is 4 (see
figure 2).
Figure 1: Overall statistics for ratings
Figure 2: Bar chart of travellers, histogram of overall rating
Discovering Relationships and Data Transformation in RapidMiner (one page)
Clustering is the process of segregating a dataset into groups with similar characteristics. The airlines of the world
have been clustered on the basis of the overall rating and the recommendations of the travellers. The K-means
clustering process, one of the simplest clustering methods, is used for the cluster analysis. The process of
clustering is depicted in Figure 3.
Figure 3: Process of clustering
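As a rough illustration of what the K-means step does, the following pure-Python sketch clusters a handful of (overall rating, recommendation) pairs. The data points, the starting centroids and the helper names `kmeans` and `average_distance` are hypothetical, and the distance measure is assumed to be the plain mean Euclidean distance from each point to its centroid; RapidMiner's own operator may differ in detail.

```python
import math

def kmeans(points, centroids, iterations=20):
    """Plain K-means: assign points to the nearest centroid, then move each centroid to its cluster mean."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments are stable
        centroids = new_centroids
    return centroids, clusters

def average_distance(centroids, clusters):
    """Mean Euclidean distance from every point to the centroid of its cluster."""
    total = sum(math.dist(p, c) for c, members in zip(centroids, clusters) for p in members)
    return total / sum(len(members) for members in clusters)

# Hypothetical (overall rating, recommendation) pairs.
points = [(1, 0), (2, 0), (3, 0), (8, 1), (9, 1), (10, 1)]
final_centroids, final_clusters = kmeans(points, [(1, 0), (10, 1)])
avg = average_distance(final_centroids, final_clusters)
print(final_centroids, round(avg, 3))
```

Because overall rating runs from 1 to 10 while recommendation is 0 or 1, the rating dominates the unscaled distance in this sketch.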
The number of centroids has to be chosen such that the average Euclidean distance is the least. Table 1 presents the
relation between the number of clusters and the average Euclidean distance. From the table it is found that 7 clusters
provide the best result: the distance falls to 1.863 at 7 clusters and does not improve at 8 (1.865).
Table 1: Relation of No. of clusters to average Euclidean Distance
No. of Clusters    Average Euclidean Distance
2                  3.429
3                  2.799
4                  2.310
5                  2.171
6                  1.985
7                  1.863
8                  1.865
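The choice of 7 clusters can be checked directly from the figures in Table 1. The short Python sketch below uses only the tabulated values; the variable names are illustrative.

```python
# Average Euclidean distance per number of clusters, copied from Table 1.
avg_distance = {2: 3.429, 3: 2.799, 4: 2.310, 5: 2.171, 6: 1.985, 7: 1.863, 8: 1.865}

# Number of clusters with the smallest average distance.
best_k = min(avg_distance, key=avg_distance.get)

# Change in distance when one more cluster is added (negative means improvement).
ks = sorted(avg_distance)
deltas = {k: round(avg_distance[k] - avg_distance[k - 1], 3) for k in ks[1:]}

print(best_k)
print(deltas)
```

Every step up to 7 clusters lowers the distance, while the step from 7 to 8 raises it slightly, which is the pattern the selection relies on.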
Create a Model(s) in RapidMiner (two pages / page 1)
Three models were used to investigate the relation between the overall rating and the other ratings and recommendation.
The process for the regression model is given as follows (figure 4).
Figure 4: Linear Regression model
Overall rating can be predicted through the following equation (figure 5):
Overall rating = 1.452 + 3.139*recommendation + 0.334*value_money - 0.036*inflight_entertainment +
0.085*food_beverages + 0.247*cabin_staff + 0.289*seat_comfort
Figure 5: Results of Regression Analysis
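To make the regression equation concrete, it can be written as a small Python function. The coefficients are copied from the equation in the text; the function name `predict_overall_rating` and the two example travellers are hypothetical.

```python
# Coefficients copied from the regression equation in the text.
INTERCEPT = 1.452
COEFFICIENTS = {
    "recommendation": 3.139,
    "value_money": 0.334,
    "inflight_entertainment": -0.036,
    "food_beverages": 0.085,
    "cabin_staff": 0.247,
    "seat_comfort": 0.289,
}

def predict_overall_rating(ratings):
    """Evaluate the fitted linear equation for one traveller's ratings."""
    return INTERCEPT + sum(COEFFICIENTS[name] * ratings[name] for name in COEFFICIENTS)

# Hypothetical traveller who recommends the airline and rates everything 5.
happy = {name: 5 for name in COEFFICIENTS}
happy["recommendation"] = 1

# Hypothetical traveller who does not recommend the airline and rates everything 1.
unhappy = {name: 1 for name in COEFFICIENTS}
unhappy["recommendation"] = 0

print(round(predict_overall_rating(happy), 3))
print(round(predict_overall_rating(unhappy), 3))
```

Under this equation the first traveller is predicted an overall rating of about 9.19 and the second about 2.37, both inside the 1 to 10 scale.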
From the regression model it is found that recommendation plays a vital role in the overall rating of an airline: its
coefficient (3.139) is by far the largest. On the other hand, inflight entertainment has a small negative coefficient
(-0.036). Moreover, the coefficients of the ratings are statistically significant at the 0.05 level of significance.
The process for the Neural Net is given as follows (figure 6).
Figure 6: Process of Neural Net
Create a Model(s) in RapidMiner (two pages / page 2)
The neural net shows that food beverages is highly related to the overall rating.
Figure 7: Output for Neural Net
The process for the Decision Tree is given as follows (figure 8).
Figure 8: Decision Tree process
From the decision tree model it is found that “recommendation” is the first stage in the decision tree (Figure 9).
Figure 9: Output of decision Tree
Evaluate and Improve the Model(s) in RapidMiner (three pages / page 1)
The models created above were optimised using cross validation. The data file was split into training and testing
datasets. Each model was first built on the training dataset and then evaluated on the testing dataset.
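The training and testing split behind cross validation can be sketched as follows. The report does not state the number of folds, so the values used here (10 rows, 5 folds) and the helper name `k_fold_indices` are assumptions for illustration; the sketch also keeps rows in their original order, whereas a real run would normally shuffle them first.

```python
def k_fold_indices(n_rows, k):
    """Yield (train, test) index lists; each row lands in exactly one test fold."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, test
        start += size

# Assumed sizes for illustration only.
folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

A model is fitted on each `train` list and scored on the matching `test` list, and the scores are averaged over the folds.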
The prediction of the GLM model is 5.998. The GLM model has a root mean squared error of 1.317 with a reported
variation of 0.000. The squared correlation of the model is 0.832. Thus it can be said that 83.2% of the variation in
the overall rating can be explained by the GLM model. The relative error of the model is 19.59%. Moreover, it is found
that the overall rating is most strongly supported by the recommendation. The least support in the GLM model comes
from the inflight entertainment rating of the airlines.
An equation can be developed for the prediction of the overall rating through GLM Model (Figure 10).
Figure 10: GLM Model
The equation:
Overall Rating = 0.042 + 0.557*Value_money + 0.241*Seat_comfort + 3.363*recommendation +
0.020*inflight_entertainment + 0.106*food_beverages + 0.293*cabin_staff
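One way to read the GLM equation is to multiply each coefficient by the span of its predictor, which gives the largest change that predictor can make to the predicted overall rating. The sketch below assumes every rating runs from 1 to 5 (span 4) and recommendation is 0 or 1 (span 1); the report states the scale for four of the ratings, so the span for value money is an assumption, and the variable names are illustrative.

```python
# Coefficients copied from the GLM equation in the text.
GLM_COEFFICIENTS = {
    "value_money": 0.557,
    "seat_comfort": 0.241,
    "recommendation": 3.363,
    "inflight_entertainment": 0.020,
    "food_beverages": 0.106,
    "cabin_staff": 0.293,
}

# Assumed span of each predictor: ratings 1 to 5 (span 4), recommendation 0 or 1 (span 1).
SPAN = {name: 4 for name in GLM_COEFFICIENTS}
SPAN["recommendation"] = 1

# Largest possible contribution of each predictor to the predicted overall rating.
max_effect = {name: round(coef * SPAN[name], 3) for name, coef in GLM_COEFFICIENTS.items()}
ranked = sorted(max_effect, key=max_effect.get, reverse=True)

for name in ranked:
    print(name, max_effect[name])
```

On these assumptions recommendation ranks first and inflight entertainment last, in line with the support reported for the GLM model.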
Evaluate and Improve the Model(s) in RapidMiner (three pages / page 2)
The prediction of the decision tree model is 6.431. The root mean squared error of the decision tree model is 1.325
with a relative error of 19.80%. The squared correlation of the model is 0.831. Thus 83.1% of the variation in the
overall rating can be explained by the decision tree model. The recommendation variable is the strongest supporter of
the decision tree model (Figure 11).
Figure 11: Prediction through Decision Tree
The decision tree model finds that the optimal depth at which prediction can be made is 7. Thus the tree is built to
7 levels. The root of the tree divides the data set into two parts on recommendation, below and above 0.500. When the
recommendation is below 0.500 the next attribute examined is the cabin staff rating. Similarly, when the
recommendation is above 0.500 the value money rating is examined next (Figure 12).
Figure 12: Output of Decision Tree
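The top of the tree described above amounts to a single branching rule, sketched here in Python. Only the root split and the attribute tested next are taken from the text; the deeper thresholds are not given in the report and are deliberately left out, and the function name `next_attribute` is hypothetical.

```python
def next_attribute(recommendation):
    """Return the attribute the tree examines after the root split on recommendation."""
    if recommendation < 0.5:
        return "cabin_staff"   # branch for travellers who did not recommend the airline
    return "value_money"       # branch for travellers who recommended the airline

print(next_attribute(0))
print(next_attribute(1))
```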
Evaluate and Improve the Model(s) in RapidMiner (three pages / page 3)
The prediction of the Neural Net model is 9.045. The neural net model has a root mean squared error of 1.356 with a
reported variation of 0.000. The squared correlation of the model is 0.844. Thus it can be said that 84.4% of the
variation in the overall rating can be explained by the Neural Net model. The relative error of the model is 19.22%.
Moreover, it is found that the overall rating is most strongly supported by the value money rating. The least support
in the Neural Net model comes from the recommendation of the airlines (Figure 13).
Figure 13: Neural Net Prediction
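The performance measures quoted for each model (root mean squared error, absolute error, relative error and squared correlation) can be computed as in the Python sketch below. It uses the standard textbook definitions, which are assumed here to correspond to the RapidMiner performance criteria; the function name `regression_metrics` and the sample values are hypothetical.

```python
import math

def regression_metrics(actual, predicted):
    """Standard error measures for a numeric prediction (actual values must be non-zero)."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return {
        "rmse": math.sqrt(sum(e * e for e in errors) / n),
        "absolute_error": sum(abs(e) for e in errors) / n,
        "relative_error": sum(abs(e) / abs(a) for a, e in zip(actual, errors)) / n,
        "squared_correlation": cov * cov / (var_a * var_p),
    }

# Hypothetical overall ratings and model predictions.
metrics = regression_metrics([2, 4, 6, 8], [3, 4, 5, 9])
print(metrics)
```

Lower values are better for the three error measures, while a squared correlation closer to 1 indicates a closer linear agreement between predicted and actual ratings.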
The comparison of the three models shows that the absolute error is 0.994 for the GLM model, 0.989 for the Neural Net
and 1.001 for the decision tree model. Since the absolute error of the decision tree model is very near to 1, the
decision tree model is taken as the ideal model to predict the overall ratings of the quality of airlines.
Provide an Integrated Solution in RapidMiner (one page)
A new data set was created with 10 ratings. The first 9 ratings had a rating scale of 1 to 5. Rating 10 had a scale of
1 to 2. Rating 10 was taken as the dependent variable and all the other ratings as independent variables.
In this solution we compared the absolute error of all three models: Decision Tree, regression model, and Neural Net.
In RapidMiner the process is shown as follows (Figure 14):
Figure 14: Integrated Solution
1. The Multiply operator copies the data set the required number of times.
2. Three Cross Validation operators are connected to the Multiply operator.
3. Each of the Cross Validation operators points to a model.
4. The models (Decision Tree, Regression and Neural Net) are added inside the Cross Validation operators.
5. The output of each Cross Validation operator is passed to the Vote operator.
6. The Vote operator checks the absolute error of each of the models.
7. From the output it is found that:
Model               Absolute Error
Decision Tree       0.990
Regression Model    1.149
Neural Net          1.380
Thus through the “vote” it is found that the Decision tree model is the most suitable model as it has the least
absolute error.
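The final selection step can be reproduced from the table above with a few lines of Python. Only the three reported absolute errors are used; the variable names are illustrative.

```python
# Absolute error per model, copied from the table above.
absolute_error = {"Decision Tree": 0.990, "Regression Model": 1.149, "Neural Net": 1.380}

# Rank the models from lowest to highest absolute error.
ranking = sorted(absolute_error, key=absolute_error.get)
best_model = ranking[0]

# Gap between the best model and the runner-up.
margin = round(absolute_error[ranking[1]] - absolute_error[best_model], 3)

print(best_model, margin)
```

On the new data set the decision tree leads the regression model by 0.159 in absolute error, and the Neural Net ranks last.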