ANL307e - Performance and Robustness Evaluation of Logistic Regression

Added on 2023/04/20

AI Summary

This assignment solution discusses how to evaluate the performance and robustness of a logistic regression model applied to a breast cancer dataset from the UCI Machine Learning Repository. The logistic regression model, implemented using the 1010data function g_logreg(G;S;Y;XX;Z), predicts whether a cancer is benign or malignant based on variables such as ID number, diagnosis, radius-mean, texture-mean, and perimeter-mean. The dataset is divided into training (90%) and testing (10%) sets, and dummy variables are created for categorical columns. The model's performance is assessed using functions like score(XX;M;Z) to predict the probability of benign cancer and param(M;P;I) to obtain model coefficients. The logit of the predicted probability is calculated for visualization, and fit statistics are extracted. References to Hosmer & Lemeshow (2000) and Long & Freese (2006) are included.

QUESTION 2
(c) Discuss how to evaluate the performance and robustness of the logistic regression model
for a given dataset.
A logistic regression is performed on the Breast Cancer Wisconsin (Diagnostic) dataset to
predict whether a cancer is benign or malignant. We apply the 1010data function
g_logreg(G;S;Y;XX;Z) to this dataset, which is obtained from the UCI Machine Learning
Repository (https://www.kaggle.com/uciml/breast-cancer-wisconsin-data#data.csv). The
resulting model enables patients to know whether the cancer is benign or malignant. Logistic
regression uses the variables listed below as predictors (Hosmer & Lemeshow, 2000): ID
number, diagnosis, radius-mean, texture-mean and perimeter-mean. The response is the column
Y, which is yes if the cancer is benign. From this dataset we create dummy variables for each
of the five categorical columns. We also create a train column separating training data from
test data, with 90% of the rows used as training data and 10% as test data. We then run the
logistic regression model on the continuous variables in the original dataset together with
the dummy variables we created. The train column is passed as the second parameter (S) of the
g_logreg(G;S;Y;XX;Z) function, where it acts as a selector so that the function trains on only
the 90% of rows marked as training data. We also specify the options parameter Z to control
the convergence criteria. This produces the logistic regression model for the dataset.
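
The steps above (dummy variables, a 90/10 train selector, and a fit restricted to the training rows) can be sketched in plain Python. This is only an illustration under stated assumptions: it does not call 1010data, the gradient-based fit is a simple stand-in rather than whatever g_logreg does internally, and the data, the column names radius and group, and the helper names are all hypothetical.

```python
import math
import random

def make_dummies(values):
    """One-hot encode a categorical column, dropping the first level as baseline."""
    levels = sorted(set(values))
    return [[1.0 if v == level else 0.0 for level in levels[1:]] for v in values]

def train_selector(n_rows, train_fraction=0.9, seed=0):
    """Return a 0/1 column flagging train_fraction of the rows as training data."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    chosen = set(order[:round(n_rows * train_fraction)])
    return [1 if i in chosen else 0 for i in range(n_rows)]

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_logreg(X, y, selector, lr=0.5, max_iter=2000, tol=1e-6):
    """Fit by gradient ascent on the log-likelihood, using only rows with selector == 1.

    Returns [b0, b1, ...] where b0 is the intercept. max_iter and tol play the
    role of convergence criteria in this sketch.
    """
    rows = [([1.0] + list(x), t) for x, t, s in zip(X, y, selector) if s == 1]
    beta = [0.0] * len(rows[0][0])
    for _ in range(max_iter):
        grad = [0.0] * len(beta)
        for x, t in rows:
            p = sigmoid(sum(b * v for b, v in zip(beta, x)))
            for j, v in enumerate(x):
                grad[j] += (t - p) * v
        step = [lr * g / len(rows) for g in grad]
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Hypothetical toy data: one continuous column and one two-level categorical column.
rng = random.Random(42)
n = 200
radius = [rng.gauss(0.0, 1.0) for _ in range(n)]
group = [rng.choice(["A", "B"]) for _ in range(n)]
dummies = make_dummies(group)  # single column: 1.0 where group == "B"
X = [[r] + d for r, d in zip(radius, dummies)]
y = [1 if rng.random() < sigmoid(1.5 * r + 1.0 * d[0]) else 0
     for r, d in zip(radius, dummies)]

train = train_selector(n)          # 1 = training row, 0 = test row
beta = fit_logreg(X, y, train)     # only the training rows influence the fit
print("training rows:", sum(train), "coefficients:", beta)
```

Keeping the split as a 0/1 column, rather than physically separating the rows, mirrors the selector idea in the text: the same table can later be scored in full while the fit itself has seen only the training rows.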
We then use the score(XX;M;Z) function to predict the probability that the cancer is benign
using this logistic regression model (Long & Freese, 2006). To obtain the model coefficients
we use the param(M;P;I) function, which returns b0 (the intercept) and the coefficients b1, b2
and b3. We also calculate the logit of the predicted probability, which is very useful in
visualization, and we extract the fit statistics from the model. Comparing the predictions
and fit statistics on the 10% test data with those on the training data indicates how well the
model performs and how robust it is on data it was not trained on.
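
As a rough illustration of the scoring step, the sketch below computes a predicted probability from coefficients b0 to b3, takes its logit, and summarises fit on a few held-out rows. It is not the 1010data score or param API: the function names, the coefficient values, and the four test rows are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math

def predict_probability(x, beta):
    """Probability that Y = 1 for one row, given beta = [b0, b1, b2, ...]."""
    linear = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + math.exp(-linear))

def logit(p):
    """Log-odds of a probability; recovers the linear predictor b0 + b1*x1 + ..."""
    return math.log(p / (1.0 - p))

def fit_statistics(X, y, beta, threshold=0.5):
    """Log-likelihood, deviance and accuracy of the model on the given rows."""
    probs = [predict_probability(x, beta) for x in X]
    log_lik = sum(math.log(p) if t == 1 else math.log(1.0 - p)
                  for p, t in zip(probs, y))
    correct = sum((p >= threshold) == (t == 1) for p, t in zip(probs, y))
    return {
        "log_likelihood": log_lik,
        "deviance": -2.0 * log_lik,
        "accuracy": correct / len(y),
    }

# Hypothetical coefficients b0 (intercept), b1, b2, b3.
beta = [-0.5, 1.0, -2.0, 0.5]

# Hypothetical held-out rows (three predictors each) with their true labels.
X_test = [[2.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [1.5, 0.5, 1.0]]
y_test = [1, 0, 0, 1]

p_first = predict_probability(X_test[0], beta)
stats = fit_statistics(X_test, y_test, beta)
print("probability:", p_first, "logit:", logit(p_first))
print(stats)
```

Running the same fit_statistics function on the training rows and on the test rows gives two sets of numbers to compare; a large gap between them would suggest the model does not generalise well.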

References
Hosmer, D., & Lemeshow, S. (2000). Applied Logistic Regression (Second Edition). New York:
John Wiley & Sons, Inc.
Long, J. S., & Freese, J. (2006). Regression Models for Categorical Dependent
Variables Using Stata (Second Edition). College Station, TX: Stata Press.