Titanic Survival

The goal of this project is to train a Binary Classifier to determine whether or not someone survives the infamous Titanic shipwreck, given some relevant features. You can find a copy of the dataset used here.

Required Packages

  • Jupyter: Jupyter metapackage. Installs all the Jupyter components in one go.
  • Numpy: Fundamental package for array computing in Python.
  • Pandas: Powerful data structures for data analysis, time series, and statistics.
  • Scikit-Learn: A set of Python modules for machine learning and data mining.
  • Matplotlib: Python plotting package.
  • Seaborn: Statistical data visualization.
  • Tabulate: Pretty-print tabular data.
  • Kagglehub: Access Kaggle resources anywhere.

Installing the Packages

After cloning the repository, switch to your workspace directory and run the following command.

```bash
python3 -m pip install -r requirements.txt
```

Downloading the Dataset

I use a dataset obtained from Kaggle. The notebook downloads the CSV file containing the data we will work with into your workspace directory under the "datasets" folder.
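A minimal sketch of that download step, assuming the kagglehub package and an illustrative dataset handle (the exact handle used in the notebook may differ):

```python
import kagglehub

# Download the Titanic dataset from Kaggle; kagglehub caches the files
# locally and returns the path to the downloaded directory.
path = kagglehub.dataset_download("yasserh/titanic-dataset")  # handle is illustrative
print("Dataset downloaded to:", path)
```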

Exploratory Data Analysis (EDA)

Before anything else, I get a feel for the dataset we are working with: I inspect its shape, columns (features), and other statistical information like the mean and standard deviation.

During Target Variable Analysis, I inspect the dataset for any class imbalance, which could impact the metrics we evaluate our model against.

I also look for any missing data and explore possible relationships each feature could have with survival.
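As a rough sketch, these checks boil down to a few pandas calls (the CSV path and filename are assumptions):

```python
import pandas as pd

df = pd.read_csv("datasets/titanic.csv")  # path/filename assumed

print(df.shape)                                     # rows x columns
print(df.describe())                                # mean, std, quartiles, etc.
print(df["Survived"].value_counts(normalize=True))  # check for class imbalance
print(df.isnull().sum())                            # missing values per feature
```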

Most Promising Indicators of Survival

| Selected Metric | Plot |
| --- | --- |
| Survival by Gender | survival_by_gender |
| Survival by Passenger Class | survival_by_p_class |
| Survival by Age | survival_by_age |

Correlation Based on Numerical Features

|  | Survived | Pclass | Age | SibSp | Parch | Fare |
| --- | --- | --- | --- | --- | --- | --- |
| Survived | 1 | -0.338481 | -0.0649104 | -0.0353225 | 0.0816294 | 0.257307 |
| Pclass | -0.338481 | 1 | -0.339898 | 0.0830814 | 0.0184427 | -0.5495 |
| Age | -0.0649104 | -0.339898 | 1 | -0.233296 | -0.172482 | 0.0966884 |
| SibSp | -0.0353225 | 0.0830814 | -0.233296 | 1 | 0.414838 | 0.159651 |
| Parch | 0.0816294 | 0.0184427 | -0.172482 | 0.414838 | 1 | 0.216225 |
| Fare | 0.257307 | -0.5495 | 0.0966884 | 0.159651 | 0.216225 | 1 |
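A table like this can be reproduced with pandas, assuming the df from the EDA sketch above:

```python
# Pearson correlation across the numerical features
num_cols = ["Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]
corr = df[num_cols].corr()

# Sort features by their correlation with survival
print(corr["Survived"].sort_values(ascending=False))
```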

EDA Observations

From our analysis, we can see some patterns hinting at survival, most especially:

  • Gender: Females were twice as likely to survive as males.
  • Passenger Class: Passengers in the higher-ranking classes (1 and 2) were more likely to survive.
  • Age: A good majority of individuals between the ages of 20 and 40 survived.
  • Status (Countess, Lady, etc.): People with higher status generally had better chances of survival.
  • Had Cabin: Passengers with a recorded cabin also had much higher chances of survival.

From the data, survival seems to depend heavily on gender, wealth, and age bracket.

Feature Engineering and Preprocessing Pipeline

Now that we've determined which features are of interest, I engineer additional features such as FamilySize and Title, and drop others that don't seem to have any correlation (or importance), such as Name and Ticket, as sketched below.
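A minimal sketch of these transformations (the Title regex is one common approach, not necessarily the exact one used in the notebook):

```python
# FamilySize: siblings/spouses + parents/children + the passenger themself
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Title: extract the honorific ("Mr", "Mrs", "Countess", ...) from Name
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Drop features with no apparent predictive value
df = df.drop(columns=["Name", "Ticket"])
```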

A number of preprocessing steps are still needed before we can feed the data to our models. For numerical data, we want to impute any missing values; from our analysis, "Age" is one such feature with missing values.

I use the median as the imputation strategy. The other thing to take into account is standard scaling, which can potentially improve our model's performance.

For categorical data like Sex and Title, we use the most frequent value as the imputation strategy, then pass the result through a One Hot Encoder to derive the relevant extra features.
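Put together, the preprocessing described above maps naturally onto a scikit-learn ColumnTransformer; the exact column lists here are assumptions:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_features = ["Age", "Fare", "SibSp", "Parch", "FamilySize"]  # assumed
cat_features = ["Sex", "Title", "Pclass"]                       # assumed

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # e.g. missing Age values
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, num_features),
    ("cat", categorical, cat_features),
])
```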

Model Experimentation

1/ Logistic Classifier

First of all, I trained a Logistic Regression Classifier and got fairly good results without any Hyperparameter Tuning.

*Plots: precision-recall curve (pr_curve_logistic) and confusion matrix (cm_logistic). Average precision of 85%; low false positives.*

To give it credit, this model performs pretty well, although the number of False Negatives is concerning.
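A sketch of how this model can be trained and scored, assuming a prior train/validation split (X_train, X_val, y_train, y_val) and the preprocess transformer from above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.pipeline import Pipeline

logistic = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
logistic.fit(X_train, y_train)

y_pred = logistic.predict(X_val)               # hard predictions at the 0.5 threshold
y_score = logistic.predict_proba(X_val)[:, 1]  # survival probabilities for the PR curve

print(classification_report(y_val, y_pred))
print("Average precision:", average_precision_score(y_val, y_score))
```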

Classification Report

| survival | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0 | 0.81 | 0.86 | 0.84 | 44 |
| 1 | 0.76 | 0.68 | 0.72 | 28 |
| accuracy | | | 0.79 | 72 |
| macro avg | 0.78 | 0.77 | 0.78 | 72 |
| weighted avg | 0.79 | 0.79 | 0.79 | 72 |

2/ Random Forest Classifier

Next, I train an ensemble of Decision Trees (i.e. a Random Forest Classifier), again without any tuning. It seems to perform better than the Logistic Classifier even though its average precision is 1% lower; the classification report shows much higher recall at the 0.5 threshold.

*Plots: precision-recall curve and confusion matrix. Average precision of 84%; very low false negatives.*

Classification Report

| survival | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0 | 0.84 | 0.86 | 0.85 | 44 |
| 1 | 0.78 | 0.75 | 0.76 | 28 |
| accuracy | | | 0.82 | 72 |
| macro avg | 0.81 | 0.81 | 0.81 | 72 |
| weighted avg | 0.82 | 0.82 | 0.82 | 72 |

3/ Support Vector Machine (SVM) Classifier

An SVM was the final model I trained and evaluated. On average, it records the lowest performance on this dataset compared to the Logistic and Random Forest Classifiers.

*Plots: precision-recall curve (pr_curve_svm) and confusion matrix (cm_svm). Average precision of 83%; higher FP and FN rates.*

This model recorded more False Positives and False Negatives than any other.
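One practical detail worth noting: scikit-learn's SVC does not expose predict_proba unless probability=True is set, so a sketch of scoring it for the precision-recall curve can use decision_function instead (same assumed split and preprocess as above):

```python
from sklearn.metrics import average_precision_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

svm = Pipeline([("prep", preprocess), ("model", SVC())])
svm.fit(X_train, y_train)

# SVC's decision_function scores work directly with average_precision_score
scores = svm.decision_function(X_val)
print("Average precision:", average_precision_score(y_val, scores))
```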

Classification Report

| survival | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0 | 0.83 | 0.91 | 0.87 | 44 |
| 1 | 0.83 | 0.71 | 0.77 | 28 |
| accuracy | | | 0.83 | 72 |
| macro avg | 0.83 | 0.81 | 0.82 | 72 |
| weighted avg | 0.83 | 0.83 | 0.83 | 72 |

Decision

After experimenting with and evaluating the above models, I decided to go with the Logistic and Random Forest Classifiers due to their more promising results.

Hyperparameter Tuning

I tuned a few of the model hyperparameters for both Logistic Regression and Random Forest, using Grid Search Cross Validation with 5 folds to get the best estimator.

```python
rf_params_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

logistic_params_grid = {
    'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}
```

This runs 5 folds for each of the 216 Random Forest candidates, totalling 1080 fits, while Logistic Regression's 24 candidates total 120 fits.
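A sketch of how these grids can be fed to scikit-learn's GridSearchCV, scoring on average precision to match the validation metric below (random_state and n_jobs are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumes X_train has already been through the preprocessing pipeline
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    rf_params_grid,
    cv=5,                         # 5 folds per candidate
    scoring="average_precision",  # the metric compared in the table below
    n_jobs=-1,
    verbose=1,
)
rf_search.fit(X_train, y_train)

best_rf = rf_search.best_estimator_
print("Best RF params:", rf_search.best_params_)
```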

Decision

| Model | Validation Set Average Precision |
| --- | --- |
| Random Forest | 0.8443 |
| Logistic Regression | 0.8528 |

The best estimator from Logistic Regression beats the best Random Forest estimator with an average precision of roughly 85% (0.8528). This will be used as my final model on the test set.

Final Model Performance on Test Set

Now that I've determined the best performing model, I evaluate its performance against the test set.

Model Weighted Average Precision: 81%

| survival | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0 | 0.83 | 0.88 | 0.85 | 110 |
| 1 | 0.79 | 0.71 | 0.75 | 69 |
| accuracy | | | 0.82 | 179 |
| macro avg | 0.81 | 0.80 | 0.80 | 179 |
| weighted avg | 0.81 | 0.82 | 0.81 | 179 |
*Plots: precision-recall curve (pr_curve_final) and confusion matrix (cm_final). Average precision of 80%.*

Conclusion

This report suggests inherent randomness in predicting survival. Future improvements could come from a larger dataset, more features, or possibly ensemble techniques.
