Titanic: Machine Learning, Data Analysis, and Survival Prediction

Added on 2019/09/22

AI Summary

This project focuses on analyzing the Titanic dataset to predict passenger survival using machine learning techniques. The analysis begins with data preprocessing, including outlier removal, and then employs a decision tree model for classification. The project uses cross-validation to assess the model's performance. The analysis aims to identify the features responsible for survival or death, which is a supervised classification problem. The project includes the use of data analysis tools for a user-friendly approach to the analysis, which helps to perform the analysis without going into the deep core of mathematics. The project provides details on the data, the features, and the approach used to create the prediction. The project predicts the survival of the passengers based on the given features in the dataset.

Data:
Title:
Titanic: Machine Learning from Disaster
Link:
https://www.kaggle.com/c/titanic/data
Overview:
The data has been split into two groups:
 training set (train.csv)
 test set (test.csv)
The training set should be used to build your machine learning models.
The test set should be used to see how well your model performs on unseen data. For the test set,
we do not provide the ground truth for each passenger. It is your job to predict these outcomes.
For each passenger in the test set, use the model you trained to predict whether or not they
survived the sinking of the Titanic.
Data Dictionary
Variable   Definition                                   Key
survival   Survival                                     0 = No, 1 = Yes
Pclass     Ticket class                                 1 = 1st, 2 = 2nd, 3 = 3rd
Sex        Sex
Age        Age in years
Sibsp      # of siblings / spouses aboard the Titanic
Parch      # of parents / children aboard the Titanic
Ticket     Ticket number
Fare       Passenger fare
Cabin      Cabin number
embarked   Port of Embarkation                          C = Cherbourg, Q = Queenstown, S = Southampton
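The coded values in the Key column can be turned back into readable labels with simple lookup tables. The following is a minimal sketch, not part of the original submission: the function name decode_passenger and the sample record are hypothetical, and the field names follow the data dictionary as printed, so the column headers in the CSV files may be spelled differently.

```python
# Lookup tables copied from the "Key" column of the data dictionary.
SURVIVAL_KEY = {0: "No", 1: "Yes"}
PCLASS_KEY = {1: "1st", 2: "2nd", 3: "3rd"}
EMBARKED_KEY = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}


def decode_passenger(row):
    """Return a copy of `row` with coded fields replaced by readable labels.

    `row` is a hypothetical dict keyed by the dictionary's variable names.
    Codes that do not appear in a key table are left unchanged.
    """
    decoded = dict(row)
    for field, key in (("survival", SURVIVAL_KEY),
                       ("Pclass", PCLASS_KEY),
                       ("embarked", EMBARKED_KEY)):
        if field in decoded:
            decoded[field] = key.get(decoded[field], decoded[field])
    return decoded


# Made-up record for illustration only; it is not taken from train.csv.
example = {"survival": 1, "Pclass": 3, "Sex": "female", "embarked": "Q"}
print(decode_passenger(example))
```

Keeping the codes in the data and decoding only for display leaves the numeric columns usable by a model while still giving readable reports.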
Variable Notes
pclass: A proxy for socio-economic status (SES)
1st = Upper
2nd = Middle
3rd = Lower
age: Age is fractional if less than 1. If the age is estimated, it is in the form xx.5
sibsp: The dataset defines family relations in this way...
Sibling = brother, sister, stepbrother, stepsister
Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way...
Parent = mother, father
Child = daughter, son, stepdaughter, stepson
Some children travelled only with a nanny, therefore parch=0 for them.
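The age rules in the notes above can be expressed as a small check. This is only a sketch under stated assumptions: the function name describe_age and its return labels are hypothetical, and treating a missing age as its own case is an added guard that the notes do not mention.

```python
def describe_age(age):
    """Classify an age value using the rules in the variable notes.

    - below 1: fractional age of an infant
    - xx.5 (one year or older): estimated age
    - anything else: age as recorded
    """
    if age is None:
        # Assumption: a missing value is passed in as None.
        return "missing"
    if age < 1:
        return "infant (fractional age)"
    if age % 1 == 0.5:
        return "estimated"
    return "recorded"


for value in (0.42, 28.5, 30, None):
    print(value, "->", describe_age(value))
```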
Question 2:
Please check the twbx file.

Question 3:
a) The aim is to find out whether any specific features were responsible for the
survival or death of an individual in the Titanic disaster.
b) This is a supervised classification problem, since the training data already carries the
survival label for each passenger; the attributes are defined in the data overview.
c) Decision tree
d) We need to remove the outliers before making predictions on the dataset.
e) The model will be assessed using cross-validation, where the data is divided into two
groups: training and testing data.
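Steps d) and e) can be sketched in plain Python. The answers do not say which outlier rule or how many folds were used, so the 1.5 x IQR fence and the fold counts below are assumptions chosen for illustration, as are the function names and the made-up fares. The decision tree itself is left out; it would be trained on each training split and scored on the matching test split.

```python
import random
import statistics


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences placed k * IQR beyond the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


def remove_outliers(rows, field, k=1.5):
    """Keep only the rows whose `field` value lies inside the IQR fences."""
    low, high = iqr_bounds([row[field] for row in rows], k)
    return [row for row in rows if low <= row[field] <= high]


def k_fold_indices(n_rows, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each row is tested exactly once."""
    order = list(range(n_rows))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Made-up fares for illustration only; 500.0 plays the role of the outlier.
rows = [{"Fare": fare}
        for fare in (7.0, 8.0, 8.5, 10.0, 13.0, 15.0, 26.0, 30.0, 500.0)]
kept = remove_outliers(rows, "Fare")
print(len(rows), "rows before,", len(kept), "rows after outlier removal")

for train_idx, test_idx in k_fold_indices(len(kept), n_folds=4):
    print(len(train_idx), "training rows /", len(test_idx), "testing rows")
```

Rotating the test fold means every passenger is used for testing once and for training in all other rounds, which gives a steadier performance estimate than a single training/testing split.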
Question 4:
Please check the (.rmb) file.
Question 5:
Screenshots can be taken from the (.rmb) file.
Question 6:
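Confusion matrix (columns = actual class, rows = predicted class):

                  Actual Yes   Actual No
Prediction Yes       333          107
Prediction No        165          701

Accuracy = (333 + 701) / (333 + 107 + 165 + 701) = 1034 / 1306 = 79.17 %
Recall (class No) = 701 / (701 + 107) = 86.76 %
Precision (class No) = 701 / (701 + 165) = 80.95 %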
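The three figures can be recomputed from the four cell counts. This is a minimal sketch, assuming the matrix is read with predictions on the rows and actual classes on the columns. That reading is the one that matches the recall and precision fractions as written, which means both describe the "No" (did not survive) class. The variable and function names are illustrative.

```python
# Cell counts from the confusion matrix (rows = prediction, columns = actual).
pred_yes_actual_yes = 333
pred_yes_actual_no = 107
pred_no_actual_yes = 165
pred_no_actual_no = 701


def precision_recall(tp, fp, fn):
    """Return (precision, recall) for one class from its TP, FP and FN counts."""
    return tp / (tp + fp), tp / (tp + fn)


total = (pred_yes_actual_yes + pred_yes_actual_no
         + pred_no_actual_yes + pred_no_actual_no)

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = (pred_yes_actual_yes + pred_no_actual_no) / total

# For class "No": a false positive is predicted No but actually Yes,
# a false negative is predicted Yes but actually No.
precision_no, recall_no = precision_recall(
    tp=pred_no_actual_no, fp=pred_no_actual_yes, fn=pred_yes_actual_no)

print(f"Accuracy:       {accuracy:.2%}")
print(f"Recall (No):    {recall_no:.2%}")
print(f"Precision (No): {precision_no:.2%}")
```

Calling precision_recall with the roles swapped (tp=333, fp=107, fn=165) gives the same two measures for the "Yes" class.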
Question 7:
The data analysis tools were helpful in performing the analysis, mainly because one can
carry it out without ever going into the underlying mathematics and can simply use a
user-friendly graphical interface instead.
Question 8:
Please check the .rwb file