Data Mining Project: Analyzing Titanic Dataset Survival Factors
Added on 2023/04/11
Report
Summary
This report analyzes the Titanic dataset to determine the factors that influenced passenger survival. The analysis begins with a description of the dataset, including its variables and the distribution of passengers across different classes, genders, and survival statuses. The research questions focus on identifying significant variables, understanding factors leading to survival and death, and exploring relationships between passenger characteristics and survival chances. The methodology employs regression analysis and classification techniques. Regression analysis utilizes generalized linear models, specifically logistic regression, to assess the significance of variables like passenger class, sex, age, and family size. The report also uses filtering techniques to create datasets based on survival and passenger characteristics. Classification involves decision tree modeling using the Hunt’s algorithm, incorporating data cleaning techniques to handle missing values and group variables. The findings include the identification of significant variables through regression analysis and the use of graphical representations to illustrate survival rates based on passenger class, sex, and family size. Decision tree modelling is used to analyze and classify the data. The report concludes with insights into the factors that contributed to passenger survival and death, providing a comprehensive analysis of the Titanic dataset.

Description of the dataset
The titanic datasets
The Titanic datasets contain several data frames that describe the accident the Titanic encountered on her maiden voyage. The data frame titanic3, one of these datasets, records the survival status of individual passengers. It does not contain information about the crew; it gives a census of the passengers only. The data frame holds records on 1309 passengers and 14 variables: pclass for passenger class (1 = 1st; 2 = 2nd; 3 = 3rd), survived for survival status (0 = No; 1 = Yes), name for the passenger's name, sex for gender, age for age, sibsp for the number of siblings/spouses on board, parch for the number of parents/children on board, ticket for the ticket number, fare for the passenger fare (British pounds), cabin for the cabin, embarked for the port of embarkation (C = Cherbourg; Q = Queenstown; S = Southampton), boat for the lifeboat, body for the body identification number, and home.dest for home/destination.
The data frame contains both categorical and continuous variables: pclass, sex, embarked, home.dest and survived are categorical variables with levels; sibsp and parch are counts; ticket, cabin, boat and body are identifiers; age and fare are continuous; and name is a character variable.
In the dataset, 500 passengers survived and 809 died. There were 466 females and 843 males. Passenger class 1 had 323 passengers, of whom 200 survived and 123 died; class 2 had 277 passengers, of whom 119 survived and 158 died; class 3 had 709 passengers, with 181 survivors and 528 deaths. Most class 1 passengers had cabins recorded. Some information is missing in variables such as age and body, and some passengers did not record a home destination. Most of those who survived boarded a lifeboat off the ship.
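The per-class figures above imply quite different survival rates; a quick R sketch (counts taken from the description above) makes this explicit:

```r
# Survivor/death counts per passenger class, as reported above.
surv <- data.frame(
  pclass   = c(1, 2, 3),
  survived = c(200, 119, 181),
  died     = c(123, 158, 528)
)
surv$total <- surv$survived + surv$died            # 323, 277, 709
surv$rate  <- round(surv$survived / surv$total, 2) # survival rate per class
colSums(surv[c("survived", "died")])               # 500 survived, 809 died overall
```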
Research Questions
Which variables are significant in explaining passenger survival?
What factors led to passengers' survival or death?
Is there a relationship between a passenger's sex, the number of family members on board, and the chances of survival?
Is there a relationship between the variables?
Methodology
The research questions can be addressed with regression analysis and classification of the dataset.
Regression analysis lets us find relationships among the explanatory variables. To relate the covariates pclass, sex, embarked, home.dest, sibsp, parch, ticket, cabin, boat and body to the response survived, a filtering technique can be applied. Filtering is useful for creating further datasets from a data frame with categorical variables: from titanic3 we can extract the passengers who survived and those who did not, obtaining several datasets from which to draw conclusions and make predictions.
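With dplyr, the filtering described here is one line per subset; a minimal sketch using a toy stand-in for titanic3 (the column names are the real ones, the rows are invented for illustration):

```r
library(dplyr)

# Toy stand-in for titanic3 with the same column names (rows are illustrative).
titanic3 <- data.frame(
  pclass   = c(1, 3, 1, 2),
  sex      = c("female", "male", "male", "female"),
  survived = c(1, 0, 0, 1)
)

# One data frame per survival status, as described above.
survivors     <- titanic3 %>% filter(survived == 1)
non_survivors <- titanic3 %>% filter(survived == 0)

# Further subsets by passenger characteristics, e.g. first-class females.
class1_female <- titanic3 %>% filter(pclass == 1, sex == "female")
```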
Classification involves machine learning algorithms. The training set is a collection of correctly labelled observations, while the test set is the held-out part of the data to be processed by the model. We use the training set to learn the patterns in the data and make predictions; the test set is used to evaluate the model. The labelled data is processed by a machine learning algorithm to create a model, and decision tree modelling can be used to obtain a model that fits the data well.
Regression Analysis
Generalized linear models can be applied to the data. Since the response variable is binary, logistic regression can be used to model the data and assess the significance of the explanatory variables. The covariates are pclass, sex, age, sibsp, parch, fare and embarked.
survived[i] ~ Bernoulli(p[i])
logit(p[i]) = b0 + b1*pclass + b2*sex + b3*age + b4*sibsp + b5*parch + b6*fare + b7*embarked
summary(glm(survived~factor(pclass)+factor(sex)
+age+sibsp+parch+fare+factor(embarked)))
Call:
glm(formula = survived ~ factor(pclass) + factor(sex) + age +
sibsp + parch + fare + factor(embarked))
Deviance Residuals:
Min 1Q Median 3Q Max
-1.05700 -0.24316 -0.07105 0.24038 1.02804
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.283e+00 2.791e-01 4.597 4.82e-06 ***
factor(pclass)2 -1.740e-01 4.124e-02 -4.218 2.68e-05 ***
factor(pclass)3 -3.251e-01 4.080e-02 -7.969 4.21e-15 ***
factor(sex)male -4.923e-01 2.614e-02 -18.835 < 2e-16 ***
age -5.743e-03 9.588e-04 -5.989 2.91e-09 ***
sibsp -4.807e-02 1.463e-02 -3.285 0.00105 **
parch 9.967e-03 1.614e-02 0.618 0.53692
fare 5.005e-05 2.890e-04 0.173 0.86252
factor(embarked)C -8.607e-02 2.762e-01 -0.312 0.75538
factor(embarked)Q -2.904e-01 2.811e-01 -1.033 0.30184
factor(embarked)S -1.941e-01 2.760e-01 -0.703 0.48214
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for gaussian family taken to be 0.1500737)
Null deviance: 252.52 on 1044 degrees of freedom
Residual deviance: 155.18 on 1034 degrees of freedom
(265 observations deleted due to missingness)
AIC: 996.55
Number of Fisher Scoring iterations: 2
From the analysis above, judging by the p-values, passenger class, sex, age and sibsp are significant in determining the probability of survival.
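One caveat: the glm call above omits the family argument, so R fits the model with the gaussian default (a linear probability model), which is why the output reports a dispersion parameter for the gaussian family. The logistic regression stated in the model specification would use family = binomial. A sketch on simulated data (the data frame d and its coefficients are invented purely to make the example runnable):

```r
set.seed(1)
n <- 200
# Simulated passengers with the same column names as titanic3.
d <- data.frame(
  pclass   = sample(1:3, n, replace = TRUE),
  sex      = sample(c("female", "male"), n, replace = TRUE),
  age      = runif(n, 1, 70),
  sibsp    = rpois(n, 0.5),
  parch    = rpois(n, 0.4),
  fare     = runif(n, 5, 100),
  embarked = sample(c("C", "Q", "S"), n, replace = TRUE)
)
# Survival driven mainly by class and sex, loosely mimicking the real data.
d$survived <- rbinom(n, 1, plogis(1 - 0.5 * d$pclass - 1.5 * (d$sex == "male")))

# Proper logistic fit: coefficients are on the log-odds scale.
fit <- glm(survived ~ factor(pclass) + factor(sex) + age + sibsp +
             parch + fare + factor(embarked),
           family = binomial(link = "logit"), data = d)
exp(coef(fit))   # odds ratios
```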
Through filtering, datasets by passenger class and by sex (male and female) were extracted to help determine the factors behind passenger survival. A family variable is added to each dataset, giving the total number of children, parents, siblings or spouses each passenger had on board.
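The family variable is simply the sum of the two family-count columns; a dplyr sketch (toy rows, real column names):

```r
library(dplyr)

# Toy rows with the titanic3 family-count columns (illustration only).
d <- data.frame(sibsp = c(0, 1, 1), parch = c(0, 0, 2))

# Total children, parents, siblings or spouses each passenger had on board.
d <- d %>% mutate(family = sibsp + parch)
d$family   # 0 1 3
```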
[Figure: survived by sex and family in pclass 1 — counts of survivors (1) and non-survivors (0) by sex and family size 0–5]
In this passenger class many people survived. Considering family and sex, the largest group of survivors had no family members on the ship. Females with the most family members on board (4 to 5 children, parents, siblings or spouses) survived at higher rates than those with 1 to 3. Among males, most of those with children, parents, siblings or spouses on board, and most of those without, did not survive.
[Figure: survived against sex and family in pclass 2 — counts of survivors (1) and non-survivors (0) by sex and family size 0–5]
In class 2 few passengers survived. The largest group of survivors were females, both those with family on board and those without children, parents, siblings or spouses.
[Figure: survived against sex and family in pclass 3 — counts of survivors (1) and non-survivors (0) by sex and family size 0–10]
In class 3 only a small share of the passengers survived, and more females survived than males. Those with a large number of children, parents, siblings or spouses on board did not survive.

Most females survived in all passenger classes. In class 1, both the females whose families on board comprised 1 to 2 children, parents, siblings or spouses and those without family had a high probability of surviving. In class 2 the female passengers had families of 1 to 5 children, parents, siblings or spouses, and they still survived even with large families. In passenger class 3, the females who survived had no family members on board, or only a few.
[Figure: survived against pclass and family for male — counts of survivors (1) and non-survivors (0) by class and family size]
[Figure: survived against pclass and family for female — counts of survivors (1) and non-survivors (0) by class and family size]
In passenger classes 1, 2 and 3 most males did not have family on board. Those in class 1 who did had 1 to 2 children, parents, siblings or spouses, and most of them did not survive. In class 2, male passengers had 1 to 3 children, parents, siblings or spouses, and few survived. In class 3 the largest number of survivors had no children, parents, siblings or spouses on board, while those with family in this class had 1 to 2.
Classification
Decision tree modelling
Semi-supervised learning can be used to obtain a training set and a test set. We can take another Titanic dataset and treat it as a sample; the knowledge from this sampled dataset is then applied to the titanic3 dataset to help with mining, analysis, classification and interpretation. The sampled dataset contains only five variables (pclass, sex, sibsp, age and parch) and one outcome, Survived. A sample of 56 observations is extracted by random sampling, and supervised learning is applied to obtain the training and test sets. Supervised learning calls for roughly 70% of the available data for training and 30% for testing, so the training set comprises the first 40 observations and the remaining 16 observations form the test set. The training dataset is partitioned into
different sized decision trees. The results from both the training set and the test set of the sample are compared with the dataset.
Data cleaning techniques should be applied to the sampled data: replacing missing values and labelling variables through grouping. Grouping ages tells us whether a passenger was a child, a teenager or an adult, and the levels of the sibsp and parch variables can be replaced by labels.
The following code helps with data cleaning (the case_when on Age uses >= 20 for adults so that an age of exactly 20 is not dropped):
library(dplyr)
tit1 <- tit %>%
  mutate(Age = case_when(
    Age <= 12 ~ "Child",
    Age > 12 & Age < 20 ~ "Teenage",
    Age >= 20 ~ "Adult"
  ),
  SiblingSpouse = case_when(
    SiblingSpouse == 0 ~ "no",
    SiblingSpouse > 0 ~ "yes"),
  ParentChild = case_when(
    ParentChild == 0 ~ "no",
    ParentChild > 0 ~ "yes")
  )
To obtain the training set and the test set, we extract the first 40 observations from the sample dataset as the training set and keep the remainder as the test set. Another column numbering the passengers can be created to act as a passenger id.
dtrain <- titanic %>%
filter(Id <= 40)
dtest <- titanic %>%
filter(Id > 40)
# A tibble: 16 x 7
Id Survived Pclass sex age sibs parch
<int> <int> <int> <chr> <chr> <chr> <chr>
1 41 0 1 male Adult no yes
2 42 1 2 female Adult no no
3 43 0 3 male Adult no no
4 44 1 2 female Child yes yes
5 45 0 3 male Child yes yes
6 46 0 3 male Adult no no
7 47 0 1 male Adult yes no
8 48 0 3 male Child yes yes
9 49 1 2 female Adult no no
10 50 0 3 male Teena~ no no
In the training set, 17 of the 40 passengers survived and 23 did not. There were 18 males and 22 females, and classes 1, 2 and 3 held 10, 8 and 22 passengers respectively. The numbers of children, teenagers and adults on board were 6, 8 and 26 respectively. 19 passengers had a sibling or spouse on board and 21 did not.
The algorithm
Hunt's algorithm, used to train the decision tree model, generates the tree in a top-down, divide-and-conquer fashion. Given rows of data with their class labels, it applies an attribute test to split the data into smaller subsets, greedily keeping the best split at each stage according to some threshold value.
A decision tree is a tree in which each node represents an attribute, each branch a decision, and each leaf an outcome (a categorical or continuous value). The idea is to build a tree over the entire data and arrive at a single outcome at every leaf node.
Within Hunt's algorithm we can use ID3 (Iterative Dichotomiser 3), which uses the entropy function and information gain as its metrics. ID3 chooses the best attribute as the one with the highest information gain, and can be used to draw conclusions and predict passenger survival. The entropy is calculated as:
H(S) = − Σ_{c ∈ C} P(c) log2 P(c)
where H(S) is the entropy of the training set and P(c) is the proportion of elements of class c in the training set.
There are five attributes (pclass, sex, age, sibsp and parch) and one outcome, Survived. Out of 40 passengers, 17 survived and 23 did not. The entropy for survival is given by
survived term: −(17/40) log2(17/40) = 0.5246
not-survived term: −(23/40) log2(23/40) = 0.4591
H(S) = 0.5246 + 0.4591 = 0.9837
The entropy of the training set for the survival outcome is 0.9837.
An individual's chance of survival depended on several factors.
Calculating the information gain IG(A, S) for each attribute:
IG(A, S) = H(S) − Σ_{t ∈ T} P(t) H(t)
where T is the set of subsets created by splitting the training set S on attribute A (so that S is the union of the subsets), H(S) is the entropy of set S, P(t) is the proportion of the number of elements in t to the number of elements in S, and H(t) is the entropy of subset t. The attribute with the largest information gain is used to split the set at each iteration.
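The two formulas above are easy to verify with a short R sketch; the counts used below are those of the 40-passenger training set (17 survived, 23 did not; 18 males of whom 2 survived, 22 females of whom 15 survived):

```r
# Entropy of a vector of class counts: H = -sum(p * log2(p)).
entropy <- function(counts) {
  p <- counts / sum(counts)
  -sum(p * log2(p))
}

# Information gain of a split: H(S) minus the weighted entropy of the subsets.
# `splits` is a list of (survived, not survived) count pairs per attribute value.
info_gain <- function(total, splits) {
  n <- sum(total)
  entropy(total) - sum(sapply(splits, function(s) sum(s) / n * entropy(s)))
}

H_S <- entropy(c(17, 23))                                        # ~0.9837
gain_sex <- info_gain(c(17, 23), list(male   = c(2, 16),
                                      female = c(15, 7)))        # ~0.2608
```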
Gain(S, sex)
H(male) = −(2/18) log2(2/18) − (16/18) log2(16/18) = 0.5033
H(female) = −(15/22) log2(15/22) − (7/22) log2(7/22) = 0.9024
Females' chances of surviving were higher than males', so most of the survivors were female.
Entropy information for sex
I(sex) = (18/40)·0.5033 + (22/40)·0.9024 = 0.7228
Gain(sex) = 0.9837 − 0.7228 = 0.2608
Gain(S, age)
H(Adult) = −(11/26) log2(11/26) − (15/26) log2(15/26) = 0.9829
H(Teenage) = −(4/8) log2(4/8) − (4/8) log2(4/8) = 1
H(Child) = −(2/6) log2(2/6) − (4/6) log2(4/6) = 0.9183
Teenagers had a relatively high chance of surviving, since half of them survived.
Entropy information for age
I(age) = (26/40)·0.9829 + (8/40)·1 + (6/40)·0.9183 = 0.9766
Gain(age) = 0.9837 − 0.9766 = 0.0071
Gain(S, sibsp)
H(no) = −(10/21) log2(10/21) − (11/21) log2(11/21) = 0.9984
H(yes) = −(7/19) log2(7/19) − (12/19) log2(12/19) = 0.9495
Passengers with no siblings or spouses on board had a higher chance of surviving than those who boarded the ship with family.
Entropy information for sibling/spouse
I(sibsp) = (21/40)·0.9984 + (19/40)·0.9495 = 0.9752
Gain(sibsp) = 0.9837 − 0.9752 = 0.0085
Gain(S, pclass)
H(class 1) = −(5/10) log2(5/10) − (5/10) log2(5/10) = 1
H(class 2) = −(5/8) log2(5/8) − (3/8) log2(3/8) = 0.9544
H(class 3) = −(7/22) log2(7/22) − (15/22) log2(15/22) = 0.9024

Passengers in the first class had a relatively high chance of surviving, since half of them survived. Even though 5 of the 8 passengers in class 2 survived, the chances of surviving in this class were still low.
Entropy information for class
I(pclass) = (10/40)·1 + (8/40)·0.9544 + (22/40)·0.9024 = 0.9372
Gain(pclass) = 0.9837 − 0.9372 = 0.0465
Gain(S, parch)
H(no) = −(13/30) log2(13/30) − (17/30) log2(17/30) = 0.9871
H(yes) = −(4/10) log2(4/10) − (6/10) log2(6/10) = 0.9710
Passengers with no parents or children on board had a higher probability of surviving than those with family members on board.
Entropy information for parent/child
I(parch) = (30/40)·0.9871 + (10/40)·0.9710 = 0.9831
Gain(parch) = 0.9837 − 0.9831 = 0.0006
Sex has the largest information gain (0.2608), followed by passenger class (0.0465), sibling/spouse (0.0085), age (0.0071) and parent/child (0.0006). In conclusion, sex is the factor that best explains an individual's probability of surviving the accident, and passenger class also provides information on survival chances. Hence sex and pclass are the variables that best split the tree.
The R code for the decision tree (rpart's recursive partitioning follows the same top-down splitting idea as Hunt's algorithm):
library(rpart)
outcome <- 'Survived'
selVars <- c('age', 'sex', 'sibsp', 'parch')
f <- as.formula(paste(outcome, '~', paste(selVars, collapse = ' + ')))
tmodel <- rpart(f, data = dtrain, method = "class")
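Once fitted, the tree can be checked against the held-out test set. A sketch on toy data shaped like dtrain/dtest (the data frames here are simulated so the example runs on its own; with the real split, predict would be called on dtest in exactly the same way):

```r
library(rpart)

# Simulated stand-ins for the 40/16 train/test split (illustration only).
set.seed(2)
toy <- data.frame(
  Survived = factor(rbinom(56, 1, 0.4)),
  sex      = sample(c("female", "male"), 56, replace = TRUE),
  age      = sample(c("Child", "Teenage", "Adult"), 56, replace = TRUE)
)
dtrain <- toy[1:40, ]
dtest  <- toy[41:56, ]

tmodel <- rpart(Survived ~ sex + age, data = dtrain, method = "class")

# Predicted class for each held-out passenger, and the share predicted correctly.
pred <- predict(tmodel, newdata = dtest, type = "class")
acc  <- mean(pred == dtest$Survived)
```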
