Predicting Titanic Passenger Survivability: A Data Analysis Report


Added on  2021/06/18

AI Summary
This report presents a comprehensive analysis of the Titanic disaster, focusing on predicting passenger survivability based on various factors. The study explores the impact of ticket class, gender, age, embarkation port, and name prefixes on survival rates. Exploratory data analysis is conducted to identify significant predictors, followed by the development and deployment of a logistic regression model. The model incorporates interactions between key variables such as class, gender, and name prefixes. The results indicate that ticket class, sex, age, and embarkation port significantly influence survival probability. For example, passengers in the first class had a higher chance of survival, while males were less likely to survive. The report concludes with predictions for the test dataset, providing insights into the survival rates of different passenger groups. This analysis offers a valuable understanding of the factors that determined survival during the Titanic disaster and showcases the application of data analysis and predictive modeling techniques.
Titanic Survivability Prediction
Name
Course Number
Date
Faculty Name
Introduction
The Titanic tragedy was one of the deadliest and most devastating events in modern history.
Models have been developed to predict the probability that an individual aboard the passenger
liner would have survived, considering factors such as class, gender and age, among others.
Many machine learning and predictive methods have been tried in an effort to build the model
with the highest predictive power for survivability in the incident. This paper therefore
focuses on developing a model to predict the probability that an individual would have
survived the accident, given factors that affected the victims differently.
The passenger liner was divided into three classes, with first class at the top, second class
in the middle and third class at the bottom. This layout suggests that people in third class
were more likely to die than those in the other classes. However, it is important to test this
hypothesis against the data.
It has been documented that most people died because there were not enough lifeboats, so many
people who could otherwise have survived were lost. Because women and children were given
priority, the scarcity of lifeboats exposed men more than the other groups. In addition, this
effect is likely to have varied by class. It could be hypothesized that men in first class
were more chivalrous than those in second and third classes, so the survival patterns of men
and women would differ between classes. In an ideal situation, men and women in third class
would have struggled in the same manner to save their lives. These dynamics make it possible
to predict survivability: although survival was partly due to chance, they can explain it with
some level of confidence. Exploratory data analysis will be conducted to identify
the possible significant predictors of survivability. These variables will then be used to
fit the model and predict the probability of survival in the test dataset (Friendly, 2012).
Methods
Data Preparation
The Titanic data has several missing values, but for the predictive model we decide to leave
them in place because most of them affect variables that might not be used in the model, such
as the cabin number. Because of the high number of missing values in the cabin variable, it
will not be used in the model, so there is no need to remove its missing values. Several
variables will be transformed to allow better analysis. For instance, the Pclass variable,
which denotes the ticket class of the individual, will be converted into a factor to allow
categorical analysis. Similarly, the Survived variable is converted into a categorical
variable. These transformations will be applied to the entire dataset because the test and
train sets should be identical except for the dependent variable. Duplicates will be checked
using the name variable, and any repeated entries based on the name will be removed. Below is
the data dictionary for the variables to be used in the model.
Table 1: Data dictionary
Variable   Definition                                      Key
survival   Survival                                        0 = No, 1 = Yes
pclass     Ticket class                                    1 = 1st (Upper), 2 = 2nd (Middle), 3 = 3rd (Lower)
sex        Sex
Age        Age in years
sibsp      Number of siblings/spouses aboard the Titanic
parch      Number of parents/children aboard the Titanic
fare       Passenger fare
embarked   Port of Embarkation                             C = Cherbourg, Q = Queenstown, S = Southampton
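The two preparation steps described above, treating ticket class as a categorical variable and
checking for repeated names, can be sketched as follows. This is a minimal illustration in
Python rather than the R used for the analysis. The function names and the sample names are
hypothetical, and first class is assumed to be the reference level, which is consistent with
the Pclass2 and Pclass3 terms that appear in the fitted model.

```python
from collections import Counter


def dummy_code_pclass(pclass):
    """Code ticket class as two indicators, with first class as the reference level."""
    return {"Pclass2": int(pclass == 2), "Pclass3": int(pclass == 3)}


def duplicated_names(names):
    """Return every name that occurs more than once, in first-seen order."""
    counts = Counter(names)
    return [name for name in counts if counts[name] > 1]


# Hypothetical passenger names, used only to exercise the duplicate check.
sample_names = ["Smith, Mr. John", "Doe, Miss. Jane", "Smith, Mr. John"]

print(dummy_code_pclass(3))
print(duplicated_names(sample_names))
```

A factor with three levels needs only two indicator columns, because first class is
represented by both indicators being zero.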
Data Sampling
The data has already been split into train and test datasets to allow fitting and testing of
the model. The two sets will be combined so that the data transformations described in the
Data Preparation section above are applied to both. Using row binding in R, the train set is
placed above the test set. The first 891 rows are then used for training the model and the
last 418 rows for testing (Aczel & Sounderpandian, 2008).
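The combine-then-split step can be illustrated with a short sketch. It is written in Python
rather than R, and the rows are hypothetical stand-ins: only the row counts (891 and 418) come
from the report, and the function name is an assumption made for illustration.

```python
def combine_and_split(train_rows, test_rows):
    """Stack the train rows above the test rows, then recover both parts by position."""
    combined = list(train_rows) + list(test_rows)  # analogous to row binding in R
    n_train = len(train_rows)
    return combined, combined[:n_train], combined[n_train:]


# Hypothetical stand-in rows; the outcome is unknown (None) in the test set.
train = [{"Survived": 0} for _ in range(891)]
test = [{"Survived": None} for _ in range(418)]

combined, train_part, test_part = combine_and_split(train, test)
print(len(combined), len(train_part), len(test_part))
```

Because the train rows are placed first, slicing by position returns exactly the original
train and test rows after any shared transformations have been applied to the combined data.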
Building the model
According to the training dataset, 342 (38.38%) people survived and 549 (61.62%) died.
24.24% were in first class, 20.65% in second class and 55.11% in third class.
Figure 1: A stacked chart of Ticket class by survival
In Figure 1, survival rates were not homogeneous across the three classes, with more deaths
in third class than in second and first class. Ticket class will be used in the model because
the variation in survival rates across classes can explain a substantial part of the
variation in survivability.
Figure 2: Bar plot of title by Class and survivability
Individuals with the title “Mr” were more likely to die in all three classes. Passengers
titled Mrs, Miss and Master were more likely to die in third class than in second or first.
Figure 3: A plot of sex by Class and Survivability
More males than females died in every class, and the proportion of females who died fell
sharply from third class to first class.
Figure 4: Plots of Age and survivor by class
Older men in all classes were more likely to die than women. However, women in third class
were about as likely to die as the men.
Figure 5: Misses and class by survival
Model Deployment
Based on the data exploration performed in this paper, the best model includes ticket class,
sex, age, an indicator for passengers who embarked at Cherbourg, indicators for the name
prefixes “Mr.” and “Miss.”, and the interaction between “Mr.” and ticket class. The model
output is shown below.
## Call:
## glm(formula = Survived ~ Pclass + Sex + Age + Cherbourg + Mr *
## Pclass + Miss, family = "binomial", data = data.combined[1:891,
## ])
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7267 -0.5948 -0.3919 0.4420 2.7783
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.413578 0.668546 6.602 4.06e-11 ***
## Pclass2 -0.798375 0.514949 -1.550 0.12105
## Pclass3 -3.240434 0.484136 -6.693 2.18e-11 ***
## Sexmale -2.212197 0.452587 -4.888 1.02e-06 ***
## Age -0.039344 0.009028 -4.358 1.31e-05 ***
## Cherbourg 0.602874 0.273361 2.205 0.02742 *
## Mr -1.347515 0.505354 -2.666 0.00767 **
## Miss -0.642062 0.374143 -1.716 0.08615 .
## Pclass2:Mr -1.454355 0.693869 -2.096 0.03608 *
## Pclass3:Mr 1.452154 0.536542 2.707 0.00680 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 964.52 on 713 degrees of freedom
## Residual deviance: 598.46 on 704 degrees of freedom
## (177 observations deleted due to missingness)
## AIC: 618.46
## Number of Fisher Scoring iterations: 5
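All terms in the model are statistically significant at the 5% level except Pclass2 (second
class compared with first class, p = 0.121) and Miss (p = 0.086). Because 177 of the 891
training rows were dropped for missing values, the model was fitted on 714 passengers.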
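To show how the coefficient estimates in the model output above translate into a survival
probability, the sketch below evaluates the fitted logistic equation for two passengers. It
is written in Python rather than R. The coefficients are copied from the output, while the
function name, its arguments and the two example passengers are hypothetical.

```python
import math

# Coefficient estimates copied from the model output.
COEF = {
    "(Intercept)": 4.413578,
    "Pclass2": -0.798375,
    "Pclass3": -3.240434,
    "Sexmale": -2.212197,
    "Age": -0.039344,
    "Cherbourg": 0.602874,
    "Mr": -1.347515,
    "Miss": -0.642062,
    "Pclass2:Mr": -1.454355,
    "Pclass3:Mr": 1.452154,
}


def survival_probability(pclass, male, age, cherbourg, mr, miss):
    """Apply the inverse logit to the linear predictor for one passenger."""
    x = {
        "(Intercept)": 1,
        "Pclass2": int(pclass == 2),
        "Pclass3": int(pclass == 3),
        "Sexmale": int(male),
        "Age": age,
        "Cherbourg": int(cherbourg),
        "Mr": int(mr),
        "Miss": int(miss),
        "Pclass2:Mr": int(pclass == 2 and mr),
        "Pclass3:Mr": int(pclass == 3 and mr),
    }
    linear_predictor = sum(COEF[name] * x[name] for name in COEF)
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical passengers, both aged 30 and not embarking at Cherbourg.
p_first_class_mrs = survival_probability(1, False, 30, False, False, False)
p_third_class_mr = survival_probability(3, True, 30, False, True, False)
print(round(p_first_class_mrs, 3), round(p_third_class_mr, 3))
```

Under these assumptions the first-class woman has a predicted survival probability of roughly
0.96 and the third-class man roughly 0.11.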
##                      OR       2.5 %       97.5 %
## (Intercept) 82.56432240 23.81985244 330.60925240
## Pclass2      0.45005968  0.15780025   1.21054007
## Pclass3      0.03914691  0.01417240   0.09569679
## Sexmale      0.10945990  0.04397148   0.26088598
## Age          0.96142006  0.94412173   0.97820425
## Cherbourg    1.82736332  1.06714173   3.12224439
## Mr           0.25988530  0.09344233   0.68709905
## Miss         0.52620644  0.25033234   1.09056495
## Pclass2:Mr   0.23355092  0.05775232   0.89872360
## Pclass3:Mr   4.27230712  1.54432344  12.79926951
The table above reports the exponentiated model coefficients (odds ratios) with their 95%
confidence intervals, which allows easier interpretation of the model. Each odds ratio is
interpreted with the other variables in the model held constant. Because the model contains
an interaction between “Mr.” and ticket class, the class effects apply to passengers without
the “Mr.” prefix and the “Mr.” effect applies to first class. Among passengers without the
“Mr.” prefix, those in second class had about 55% lower odds of survival than those in first
class, although this difference is not statistically significant, and those in third class
had about 96% lower odds. Males had about 89% lower odds of survival, having controlled for
class, age, embarkation at Cherbourg and the “Mr.” and “Miss.” name prefixes. Each additional
year of age reduces the odds of surviving by around 4%, and those who boarded at Cherbourg
had about 83% higher odds of survival. In first class, passengers with the name prefix “Mr.”
had about 74% lower odds of survival. Those with the prefix “Miss.” had about 47% lower odds,
although this effect is not significant at the 5% level. The interaction terms modify the
class effect for passengers with the “Mr.” prefix: the second-class odds ratio is multiplied
by 0.23 and the third-class odds ratio by 4.27. The gap between third and first class was
therefore smaller for “Mr.” passengers than for others (0.039 × 4.27 ≈ 0.17), while the gap
between second and first class was larger (0.45 × 0.23 ≈ 0.11) (El-Masri, 2013; D. Hosmer,
Lemeshow, & Sturdivant, 2013).
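The odds ratios in the table above are the exponentials of the model coefficients, and a
percentage change in the odds follows from (odds ratio - 1) × 100. When a term is involved in
an interaction, the relevant coefficients are added before exponentiating. The sketch below
reproduces this arithmetic in Python rather than R, using coefficients copied from the model
output. The function and variable names are hypothetical.

```python
import math


def odds_ratio(beta):
    """Exponentiate a log-odds coefficient to obtain an odds ratio."""
    return math.exp(beta)


def percent_change_in_odds(beta):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return (math.exp(beta) - 1.0) * 100.0


# Coefficients copied from the model output.
b_pclass2, b_pclass3 = -0.798375, -3.240434
b_pclass2_mr, b_pclass3_mr = -1.454355, 1.452154

# Class effect for passengers without the "Mr." prefix (main effects only).
or_third_vs_first_other = odds_ratio(b_pclass3)

# Class effect for "Mr." passengers: add the interaction before exponentiating.
or_second_vs_first_mr = odds_ratio(b_pclass2 + b_pclass2_mr)
or_third_vs_first_mr = odds_ratio(b_pclass3 + b_pclass3_mr)

print(round(or_third_vs_first_other, 3))
print(round(or_second_vs_first_mr, 3), round(or_third_vs_first_mr, 3))
print(round(percent_change_in_odds(-0.039344), 1))  # effect of one extra year of age
```

The interaction odds ratio on its own is a ratio of two odds ratios, so it describes how much
the class effect differs for “Mr.” passengers rather than their odds relative to first class.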
Model Prediction
In the test dataset, 33.63% (112) of passengers were predicted to have survived and 66.37%
(221) to have died. Of the predicted survivors, 16% (18) were men and 84% (94) were women.
15% of the predicted survivors were from third class, 26.6% from second class and 56.25%
from first class (D. W. Hosmer, Lemeshow, & Sturdivant, 2013).
References
Aczel, A. D., & Sounderpandian, J. (2008). Complete: Business Statistics. ASSOCIATION OF
BUSINESS INFORMATION …. Retrieved from
http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:McGraw-Hill/Irwin+=%3E?#1
El-Masri, M. M. (2013). Odds ratio, 109(6), 14. https://doi.org/10.1054/ebog.2000.0196
Friendly, M. (2012). Visualizing Categorical Data: Data, Stories, and Pictures. Mosaic A Journal
For The Interdisciplinary Study Of Literature, 1–9. Retrieved from
papers2://publication/uuid/D6901171-8BDA-4D99-A6C5-A25D0A9672BD
Hosmer, D., Lemeshow, S., & Sturdivant, R. X. (2013). Model-Building Strategies and Methods
for Logistic Regression. In Applied Logistic Regression (pp. 89–151).
https://doi.org/10.1002/0471722146.ch4
Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression Third
Edition. Applied Logistic Regression. https://doi.org/10.1002/0471722146