
Handling Class Imbalance Problem, Over-fitting, and Multiple Regression Models in Business Analytics

   

Added on  2023-06-10

BUSINESS ANALYTICS
ISYS3375 FINAL ASSESSMENT
Student Name
[Pick the date]
SECTION A
Question 1
The class imbalance problem is a common issue in which the prior probabilities of the different
classes differ significantly. Web mining, biology, data mining, finance, telecommunications and
ecology are some of the major areas where the class imbalance problem can be found.
Various ways to handle imbalanced datasets are highlighted below:
Data level approach or Resampling techniques
This approach handles imbalanced data by resampling the original data in order to provide
balanced classes.
1. Balancing classes in the processed data by increasing the frequency of the minority class or
decreasing the frequency of the majority class.
2. Selection of an appropriate sampling method
Random under-sampling
Random over-sampling
Cluster-based over-sampling
Synthetic minority over-sampling technique (SMOTE)
Modified synthetic minority over-sampling technique (MSMOTE)
Algorithmic Ensemble Techniques
This approach improves the classification algorithms themselves rather than the data.
1. It improves the performance of single classifiers by building many two-stage classifiers from
the initial dataset and combining their predictions.
Bagging based
Boosting based (XGBoost, gradient boosting)
The MSMOTE method combined with a boosting method can be used to resolve the issues of an
imbalanced dataset. However, the appropriate model should be chosen based on the
characteristics of the imbalanced dataset at hand.
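The two simplest sampling methods listed above, random over-sampling and random under-sampling, can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the toy data are assumptions made for the example, and it does not implement SMOTE or MSMOTE, which generate synthetic rows rather than duplicating or dropping existing ones.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

def random_undersample(rows, labels, seed=0):
    """Keep a random subset of each class so that all classes match the smallest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_rows, out_labels = [], []
    for cls in counts:
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for r in rng.sample(pool, target):
            out_rows.append(r)
            out_labels.append(cls)
    return out_rows, out_labels

# Hypothetical toy data: 8 majority rows (class 0) and 2 minority rows (class 1).
rows = [[i] for i in range(10)]
labels = [0] * 8 + [1] * 2

over_rows, over_labels = random_oversample(rows, labels)
under_rows, under_labels = random_undersample(rows, labels)
print(Counter(over_labels))
print(Counter(under_labels))
```

Over-sampling keeps every original row but repeats minority rows, while under-sampling discards majority rows, so the choice between them depends on how much data can be spared.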
Question 2
Over-fitting is a pivotal concern in many business scenarios. This is because an over-fitted
model uses more attributes than required, which reduces the effectiveness of the model. For
instance, a higher-degree polynomial might fit the sample used for training with a high level of
accuracy but perform poorly when tested on data it has not seen. Hence, it is essential to avoid
over-fitting the dataset.
The main methods to avoid over-fitting are highlighted below:
1. Cross-validation
In its simplest form this is a one-round validation, in which one sample is held out as the
in-time validation set and the rest of the sample is used for training the model. To keep the
variance lower, a higher-fold cross-validation is preferred.
2. Early stopping
In this method, the number of training iterations is limited so that training stops before the
model begins to over-fit.
3. Pruning
This method is most suitable for CART models. It removes the nodes that add little predictive
power.
4. Regularization
In this method a new term, i.e. a cost term, is incorporated in the model's objective function.
The cost term forces the coefficients of many variables to approach zero, which reduces the
overall cost.
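Two of the methods above, cross-validation and regularization, can be illustrated together. The sketch below is a simplified, assumption-laden example rather than a general implementation: it uses a one-feature ridge regression without an intercept (an L2 cost term `lam * b**2`), contiguous folds, and made-up data. All function names are chosen for this example only.

```python
def kfold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def ridge_slope(x, y, lam):
    """Slope b minimising sum((y - b*x)**2) + lam * b**2 (no intercept)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + lam)

def cv_error(x, y, lam, k=5):
    """Mean squared error on held-out folds, averaged over all observations."""
    total = 0.0
    for held_out in kfold_indices(len(x), k):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        b = ridge_slope([x[i] for i in train], [y[i] for i in train], lam)
        total += sum((y[i] - b * x[i]) ** 2 for i in held_out)
    return total / len(x)

# Hypothetical data: y is roughly 2*x with a small alternating disturbance.
x = [float(i) for i in range(1, 11)]
y = [2.0 * v + (0.5 if i % 2 == 0 else -0.5) for i, v in enumerate(x)]

for lam in (0.0, 10.0, 100.0):
    print(lam, ridge_slope(x, y, lam), cv_error(x, y, lam))
```

As `lam` grows, the cost term pulls the fitted slope towards zero; comparing the cross-validated error across values of `lam` is one way to decide how strong the penalty should be.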
Question 3
Logistic regression is typically useful when the dependent variable can be represented as a
binary variable, so that it makes sense to estimate the odds ratio. Linear regression, on the
contrary, makes more sense when the dependent variable is not binary. Two examples are as
follows.
One example which would require the use of logistic regression is the approval of loans for
new customers. In this case the dependent variable is binary, as the loan may either be
approved or not. A linear regression would not serve the purpose, because with a varied set
of independent variables it cannot capture the output in binary form. Logistic regression
can do this easily and would therefore be appropriate.
Another example is passing or failing a particular exam based on independent variables such
as study time, presence on social media, lectures attended etc. In this case also, the desired
output is captured as pass or fail and hence is binary, so logistic regression would be
preferred over linear regression. Logistic regression yields values between 0 and 1, which
are essentially probabilities, and the odds of the two events can be computed from them.
This is not the case in linear regression, which predicts the value of the dependent variable
itself and not the underlying probability.
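The link between the logistic output and the odds can be shown with a short sketch. The intercept and the coefficient on study hours below are hypothetical values chosen for illustration of the pass/fail example; they are not estimated from any data.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Odds of an event with probability p, i.e. p / (1 - p)."""
    return p / (1.0 - p)

# Hypothetical coefficients (assumptions, not fitted values).
intercept, coef_hours = -4.0, 1.0

for hours in (2, 4, 6):
    p_pass = sigmoid(intercept + coef_hours * hours)
    print(hours, round(p_pass, 3), round(odds(p_pass), 3))
```

Because the odds equal the exponential of the linear predictor, each extra unit of an independent variable multiplies the odds by the exponential of its coefficient, which is why the odds ratio is the natural quantity to report.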
SECTION B
Question 1
(a) The analyst arrived at 6 as the appropriate number of clusters by comparing sheet 1-a-1-1
and sheet 1-a-2-1 of the given output. These two sheets show the output obtained when the
data is grouped into 5 clusters and 6 clusters respectively. The tables of the sum of
squared distances within clusters need to be referred to in both sheets. Cell D40 of sheet
1-a-1-1 shows that the intra-cluster sum of squared distances is 3447.02 in the case of
five clusters. In the case of six clusters it is lower, as cell D41 of sheet 1-a-2-1 gives
a value of 3188.82. Since the objective in clustering is to minimise intra-cluster
variation, six clusters would be preferred over five clusters for the given data.
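The comparison in part (a) rests on the within-cluster sum of squared distances. The sketch below shows how that quantity can be computed for a given cluster assignment and then applies simple arithmetic to the two values quoted above. The helper name and the four toy points are assumptions for illustration; the actual spreadsheet data is not reproduced here.

```python
def within_cluster_ss(points, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    clusters = {}
    for p, c in zip(points, labels):
        clusters.setdefault(c, []).append(p)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        centroid = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        total += sum((m[d] - centroid[d]) ** 2 for m in members for d in range(dim))
    return total

# Hypothetical toy points: two tight groups far apart.
points = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]]
print(within_cluster_ss(points, [0, 0, 0, 0]))  # one cluster
print(within_cluster_ss(points, [0, 0, 1, 1]))  # two clusters

# Values quoted in the text for five and six clusters.
wss_five, wss_six = 3447.02, 3188.82
reduction = wss_five - wss_six
relative = reduction / wss_five
print(round(reduction, 2), round(100 * relative, 2))
```

Since this measure generally does not increase as clusters are added, the size of the reduction from five to six clusters is worth reading alongside the raw values.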
(b) The six clusters are described below by their average characteristics.
Cluster 1 (Married elderly customers) – High-priced products (average price = $1,071) are
bought by married elderly customers (older than 55 years) who may or may not be members
and who do not use discount cards. The average product category lies between 2 and 3.