Paper on Prediction of Malignant Breast Cancer

Added on 2020-05-04

Prediction of Malignant Breast Cancer 1
BREAST CANCER
PREDICTION OF MALIGNANT BREAST CANCER
Author: _____________________
ABSTRACT
This paper focuses on determining a set of variables that can be used to
predict malignant breast cancer. The data comprise 30 continuous
variables measured on 569 cases of breast cancer. In the entire dataset,
62.75% of the cases were benign and 37.25% were malignant. The data
were split into training and test datasets so that the logistic model's
predictions could be tested. Malignant breast cancer cases were found to
have higher average values of compactness and texture. Three variables
(mean radius, worst concave points and worst area) were retained in the
final model, which was as follows:
$$\text{Probability}(\text{Malignant Breast Cancer}) = \frac{1}{e^{(-4.6014 - 1.1763\,\text{RadiusMean} + 40.8538\,\text{Concave.pointsWorst} - 0.012\,\text{AreaWorst})}}$$
Keywords: Se – Standard error, sd – standard deviation, Inf – infinity
INTRODUCTION
Cancer is among the health conditions that have emerged over the last
two to three decades. It has been established that there is no definite
cure for this condition. This has led to more scientific research aimed at
developing medication or control procedures for the condition. In this
paper, breast cancer data obtained from kaggle.com are explored, and
inferential and regression techniques are employed to determine whether a
case is benign or malignant (Kaggle.com, 2016).
HYPOTHESIS
1. H0: There is no difference in average texture between benign and
malignant breast cancer
HA: Average texture is greater for malignant than for benign breast cancer
2. H0: There is no difference in average compactness between benign
and malignant breast cancer
HA: Average compactness is greater for malignant than for benign
breast cancer
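Both hypotheses are one-sided comparisons of two group means. Written formally, they might look as follows; the symbols are notation introduced here for illustration (with M for malignant and B for benign), not taken from the original paper.

```latex
\begin{aligned}
&1.\quad H_0:\ \mu_{\text{texture}}^{M} = \mu_{\text{texture}}^{B},
  \qquad H_A:\ \mu_{\text{texture}}^{M} > \mu_{\text{texture}}^{B} \\
&2.\quad H_0:\ \mu_{\text{compactness}}^{M} = \mu_{\text{compactness}}^{B},
  \qquad H_A:\ \mu_{\text{compactness}}^{M} > \mu_{\text{compactness}}^{B}
\end{aligned}
```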
DESCRIPTIVE STATISTICS
The descriptive statistics are presented as tables and plots. The plots
were produced using the ggplot2 package in R Commander. The statistics
are reported for the training dataset, which was separated from the main
dataset so that a testing dataset would remain available for model
prediction and diagnostics.
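The paper does not say how the split was made. A minimal sketch of one common approach, a seeded random partition of row indices, is given below in Python. The function name `split_indices`, the seed, and the choice of a purely random (unstratified) split are all assumptions; only the sizes come from the paper (569 cases in total, 400 in the training data).

```python
import random

def split_indices(n_rows, n_train, seed=0):
    """Randomly partition row indices 0..n_rows-1 into train and test lists."""
    rng = random.Random(seed)        # fixed seed (assumed) so the split is reproducible
    shuffled = list(range(n_rows))
    rng.shuffle(shuffled)
    train = sorted(shuffled[:n_train])
    test = sorted(shuffled[n_train:])
    return train, test

# 569 cases in total; the training data described in this section has n = 400.
train_idx, test_idx = split_indices(n_rows=569, n_train=400)
print(len(train_idx), len(test_idx))  # 400 169
```

With these sizes, 169 cases remain for testing.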
Table 1: Descriptive statistics
## vars n mean sd min max range
## radius_mean 1 400 14.32 3.58 6.98 28.11 21.13
## texture_mean 2 400 18.95 4.12 9.71 39.28 29.57
## perimeter_mean 3 400 93.32 24.65 43.79 188.50 144.71
## area_mean 4 400 673.22 357.28 143.50 2499.00 2355.50
## smoothness_mean 5 400 0.10 0.01 0.06 0.14 0.08
## compactness_mean 6 400 0.11 0.05 0.02 0.35 0.33
## concavity_mean 7 400 0.09 0.08 0.00 0.43 0.43
## concave.points_mean 8 400 0.05 0.04 0.00 0.20 0.20
## symmetry_mean 9 400 0.18 0.03 0.12 0.30 0.19
## fractal_dimension_mean 10 400 0.06 0.01 0.05 0.10 0.05
## radius_se 11 400 0.42 0.28 0.11 2.87 2.76
## texture_se 12 400 1.20 0.54 0.36 4.88 4.52
## perimeter_se 13 400 2.97 2.04 0.76 21.98 21.22
## area_se 14 400 42.28 43.62 7.23 525.60 518.37
## smoothness_se 15 400 0.01 0.00 0.00 0.03 0.03
## compactness_se 16 400 0.03 0.02 0.00 0.14 0.13
## concavity_se 17 400 0.03 0.03 0.00 0.40 0.40
## concave.points_se 18 400 0.01 0.01 0.00 0.05 0.05
## symmetry_se 19 400 0.02 0.01 0.01 0.08 0.07
## fractal_dimension_se 20 400 0.00 0.00 0.00 0.03 0.03
## radius_worst 21 400 16.60 4.96 7.93 33.13 25.20
## texture_worst 22 400 25.33 6.12 12.02 49.54 37.52
## perimeter_worst 23 400 109.43 34.40 50.41 229.30 178.89
## area_worst 24 400 917.18 583.20 185.20 3432.00 3246.80
## smoothness_worst 25 400 0.13 0.02 0.07 0.22 0.15
## compactness_worst 26 400 0.26 0.17 0.03 1.06 1.03
## concavity_worst 27 400 0.28 0.21 0.00 1.25 1.25
## concave.points_worst 28 400 0.12 0.07 0.00 0.29 0.29
## symmetry_worst 29 400 0.30 0.07 0.16 0.66 0.51
## fractal_dimension_worst 30 400 0.08 0.02 0.06 0.21 0.15
Table 1 above shows the mean, standard deviation, minimum, maximum and
range of the continuous variables in the breast cancer [train] dataset. All
of these variables are candidate predictors of whether a breast cancer case
is benign or malignant.
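For readers who want to see how the columns of Table 1 are defined, the sketch below computes the same summary for a single variable using the Python standard library. The helper `describe` and the sample values are hypothetical and are not drawn from the dataset; the standard deviation uses the sample (n - 1) form, which is also R's default.

```python
import statistics

def describe(values):
    """Return n, mean, sd, min, max and range for one numeric variable."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),   # sample sd (n - 1 denominator)
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
    }

# Hypothetical radius values, for illustration only (not taken from the dataset).
radius_sample = [12.0, 14.5, 17.0, 11.5, 20.0]
summary = describe(radius_sample)
print({name: round(value, 2) for name, value in summary.items()})
```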
Table 2: Descriptive Statistics by diagnosis
## Descriptive statistics by group
## group: Benign
## vars n mean sd min max range se
## radius_mean 1 227 12.07 1.73 6.98 16.84 9.86 0.11
## texture_mean 2 227 17.12 3.35 9.71 33.81 24.10 0.22
## perimeter_mean 3 227 77.54 11.40 43.79 108.40 64.61 0.76
## area_mean 4 227 456.77 128.52 143.50 880.20 736.70 8.53
## smoothness_mean 5 227 0.09 0.01 0.06 0.13 0.07 0.00
## compactness_mean 6 227 0.08 0.03 0.02 0.22 0.20 0.00
## concavity_mean 7 227 0.05 0.05 0.00 0.41 0.41 0.00
## concave.points_mean 8 227 0.03 0.02 0.00 0.09 0.09 0.00
## symmetry_mean 9 227 0.18 0.03 0.12 0.27 0.16 0.00
## fractal_dimension_mean 10 227 0.06 0.01 0.05 0.09 0.04 0.00
## radius_se 11 227 0.29 0.12 0.11 0.88 0.77 0.01
## texture_se 12 227 1.19 0.57 0.36 4.88 4.52 0.04
## perimeter_se 13 227 1.99 0.76 0.76 5.12 4.36 0.05
## area_se 14 227 21.02 9.05 7.23 77.11 69.88 0.60
## smoothness_se 15 227 0.01 0.00 0.00 0.02 0.02 0.00
## compactness_se 16 227 0.02 0.02 0.00 0.11 0.10 0.00
## concavity_se 17 227 0.03 0.04 0.00 0.40 0.40 0.00
## concave.points_se 18 227 0.01 0.01 0.00 0.05 0.05 0.00
## symmetry_se 19 227 0.02 0.01 0.01 0.06 0.05 0.00
## fractal_dimension_se 20 227 0.00 0.00 0.00 0.03 0.03 0.00
## radius_worst 21 227 13.26 1.92 7.93 18.22 10.29 0.13
## texture_worst 22 227 22.41 4.81 12.02 41.78 29.76 0.32
## perimeter_worst 23 227 86.08 13.08 50.41 120.30 69.89 0.87
## area_worst 24 227 548.45 155.98 185.20 1032.00 846.80 10.35
## smoothness_worst 25 227 0.12 0.02 0.07 0.17 0.10 0.00
## compactness_worst 26 227 0.18 0.09 0.03 0.58 0.56 0.01
## concavity_worst 27 227 0.16 0.15 0.00 1.25 1.25 0.01
## concave.points_worst 28 227 0.07 0.04 0.00 0.18 0.18 0.00
## symmetry_worst 29 227 0.27 0.04 0.17 0.42 0.26 0.00
## fractal_dimension_worst 30 227 0.08 0.01 0.06 0.15 0.09 0.00
## --------------------------------------------------------
## group: Malignant
## vars n mean sd min max range se
## radius_mean 1 173 17.27 3.22 10.95 28.11 17.16 0.24
## texture_mean 2 173 21.36 3.80 10.38 39.28 28.90 0.29
## perimeter_mean 3 173 114.03 21.88 71.90 188.50 116.60 1.66
## area_mean 4 173 957.23 362.54 361.60 2499.00 2137.40 27.56
## smoothness_mean 5 173 0.10 0.01 0.07 0.14 0.07 0.00
## compactness_mean 6 173 0.14 0.05 0.05 0.35 0.29 0.00
## concavity_mean 7 173 0.16 0.07 0.02 0.43 0.40 0.01
## concave.points_mean 8 173 0.09 0.03 0.02 0.20 0.18 0.00
## symmetry_mean 9 173 0.19 0.03 0.13 0.30 0.17 0.00
## fractal_dimension_mean 10 173 0.06 0.01 0.05 0.10 0.05 0.00
## radius_se 11 173 0.60 0.32 0.19 2.87 2.68 0.02
## texture_se 12 173 1.21 0.50 0.36 3.57 3.21 0.04
## perimeter_se 13 173 4.26 2.45 1.33 21.98 20.65 0.19
## area_se 14 173 70.17 54.10 13.99 525.60 511.61 4.11
## smoothness_se 15 173 0.01 0.00 0.00 0.03 0.03 0.00
## compactness_se 16 173 0.03 0.02 0.01 0.14 0.13 0.00
## concavity_se 17 173 0.04 0.02 0.01 0.14 0.13 0.00
## concave.points_se 18 173 0.01 0.01 0.01 0.04 0.04 0.00
## symmetry_se 19 173 0.02 0.01 0.01 0.08 0.07 0.00
## fractal_dimension_se 20 173 0.00 0.00 0.00 0.01 0.01 0.00
## radius_worst 21 173 20.98 4.28 12.84 33.13 20.29 0.33
## texture_worst 22 173 29.15 5.52 16.67 49.54 32.87 0.42
## perimeter_worst 23 173 140.09 29.24 85.10 229.30 144.20 2.22
## area_worst 24 173 1401.00 584.94 508.10 3432.00 2923.90 44.47
## smoothness_worst 25 173 0.15 0.02 0.09 0.22 0.13 0.00
## compactness_worst 26 173 0.38 0.17 0.05 1.06 1.01 0.01
## concavity_worst 27 173 0.44 0.17 0.02 1.10 1.08 0.01
## concave.points_worst 28 173 0.18 0.05 0.03 0.29 0.26 0.00
## symmetry_worst 29 173 0.33 0.08 0.16 0.66 0.51 0.01
## fractal_dimension_worst 30 173 0.09 0.02 0.06 0.21 0.15 0.00
Table 2 above shows the comparative descriptive statistics by
diagnosis. Comparing the group means helps identify the variables with the
largest differences, which are potential predictors in a logistic
regression for the probability that a breast cancer case is benign or
malignant. For instance, the worst area variable differs greatly between
cases diagnosed as benign (mean 548.45) and those determined to be
malignant (mean 1401.00).
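Raw differences in means are hard to compare across variables measured on different scales. One way to put them on a common footing is the standardized mean difference (Cohen's d with a pooled standard deviation). This is a technique introduced here for illustration, not one the paper reports using. The sketch below applies it to five rows copied from Table 2. Because the table values are rounded to two decimals, the results are approximate.

```python
import math

# (mean, sd, n) per group, copied from Table 2 (values rounded as printed there).
table2 = {
    "radius_mean":          {"Benign": (12.07, 1.73, 227),    "Malignant": (17.27, 3.22, 173)},
    "texture_mean":         {"Benign": (17.12, 3.35, 227),    "Malignant": (21.36, 3.80, 173)},
    "texture_se":           {"Benign": (1.19, 0.57, 227),     "Malignant": (1.21, 0.50, 173)},
    "area_worst":           {"Benign": (548.45, 155.98, 227), "Malignant": (1401.00, 584.94, 173)},
    "concave.points_worst": {"Benign": (0.07, 0.04, 227),     "Malignant": (0.18, 0.05, 173)},
}

def standardized_difference(benign, malignant):
    """Malignant minus benign mean, divided by the pooled standard deviation."""
    m_b, s_b, n_b = benign
    m_m, s_m, n_m = malignant
    pooled_var = ((n_b - 1) * s_b**2 + (n_m - 1) * s_m**2) / (n_b + n_m - 2)
    return (m_m - m_b) / math.sqrt(pooled_var)

effect = {
    name: standardized_difference(groups["Benign"], groups["Malignant"])
    for name, groups in table2.items()
}

# Largest standardized difference first.
ranked = sorted(effect, key=effect.get, reverse=True)
for name in ranked:
    print(f"{name:22s} {effect[name]:.2f}")
```

On these five rows, worst concave points, worst area and mean radius show the largest standardized differences, while the texture standard error barely differs between groups.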
Table 3: Diagnosis by proportion
Diagnosis Benign Malignant
Proportion 56.75% 43.25%
According to the train data distribution, 56.75% of the breast cancer
cases diagnosed were determined as benign while 43.25% were malignant.
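These proportions follow directly from the group sizes in Table 2. The short check below reproduces them and, as a derived estimate only, recovers the approximate full-dataset counts from the percentages quoted in the abstract. The variable names are illustrative.

```python
# Group sizes in the training data, from the n column of Table 2.
n_benign, n_malignant = 227, 173
n_train = n_benign + n_malignant

pct_benign = 100 * n_benign / n_train
pct_malignant = 100 * n_malignant / n_train
print(f"{pct_benign:.2f}% benign, {pct_malignant:.2f}% malignant")

# Approximate full-dataset counts implied by the abstract (62.75% / 37.25% of 569).
n_total = 569
full_benign = round(0.6275 * n_total)
full_malignant = round(0.3725 * n_total)
print(full_benign, full_malignant)
```

The training data therefore contains a somewhat larger share of malignant cases (43.25%) than the full dataset (37.25%).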
PLOTS
Based on the mean differences observed in the descriptive statistics,
variables with greatest differences are graphically represented in this
section.
Figure 1: Histogram of Radius, Texture, Perimeter, Area, Smoothness and
Compactness means faceted by Diagnosis
The distributions of the variables displayed above appear to differ
between the benign and malignant cases. For instance, the perimeter mean
values for the benign cases are more tightly clustered than those for the
malignant cases. In general, malignant cases show higher variance in the
radius, perimeter, area and compactness means than benign cases.
Figure 2: Histograms of Concavity mean, Concave Points mean, Radius Se,
Perimeter Se, Area Se and Radius worst, Faceted by Diagnosis
The concavity mean, concave points mean and radius worst appear to show
a greater difference in average values between the benign and malignant
groups. The other variables (the radius, perimeter and area standard
errors) have approximately similar distributions across the groups, although
their variances differ.
Figure 3: Histogram of worst values of Perimeter, Area, Compactness
and Concave Points, Faceted by Diagnosis
The averages and distributions of the worst values of the perimeter,
concave points, compactness and area differ between the benign and
malignant groups.
HYPOTHESIS TESTING
MALIGNANT TEXTURE MEAN IS GREATER THAN BENIGN
BREAST CANCER
Table 4: First hypothesis output
## Welch Two Sample t-test
##
## data: texture_mean[train$diagnosis == "Malignant"] and texture_mean[train$diagnosis == "Benign"]
## t = 11.651, df = 344.36, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
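The t statistic and degrees of freedom in Table 4 can be approximately reproduced from the group summaries in Table 2 alone. The sketch below applies Welch's formula with the Welch-Satterthwaite degrees of freedom. The function name `welch_t` is illustrative. Because the inputs are the rounded means and standard deviations printed in Table 2, the result matches the reported output only to about two significant figures.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1**2 / n1    # squared standard error of group 1 mean
    v2 = sd2**2 / n2    # squared standard error of group 2 mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df

# texture_mean rows of Table 2: malignant first, then benign.
t_stat, df = welch_t(21.36, 3.80, 173, 17.12, 3.35, 227)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

The computed values come out near t = 11.6 with roughly 344 degrees of freedom, in line with the reported t = 11.651 and df = 344.36.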
