Titanic Survival

The goal of this project is to train a Binary Classifier to determine whether or not someone survives the infamous Titanic shipwreck, given some relevant features. You can find a copy of the dataset used here.

Required Packages

  • Jupyter: Jupyter metapackage. Installs all the Jupyter components in one go.
  • Numpy: Fundamental package for array computing in Python.
  • Pandas: Powerful data structures for data analysis, time series, and statistics.
  • Scikit-Learn: A set of Python modules for machine learning and data mining.
  • Matplotlib: Python plotting package.
  • Seaborn: Statistical data visualization.
  • Tabulate: Pretty-print tabular data.
  • Kagglehub: Access Kaggle resources anywhere.

Installing the Packages

After cloning the repository, switch to your workspace directory and run the following command.

```bash
python3 -m pip install -r requirements.txt
```

Downloading the Dataset

I use a dataset obtained from Kaggle. The notebook downloads the CSV file containing the data we will work with into your workspace directory under the "datasets" folder.
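A minimal sketch of that download step, assuming the kagglehub package and an illustrative dataset handle (the exact handle used in the notebook may differ):

```python
import kagglehub

# Download the Titanic dataset from Kaggle; kagglehub caches the files
# locally and returns the path to the downloaded directory.
path = kagglehub.dataset_download("yasserh/titanic-dataset")  # handle is illustrative
print("Dataset downloaded to:", path)
```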

Exploratory Data Analysis (EDA)

Before anything else, I get a feel for the dataset we are working with: I inspect its shape, columns (features), and other statistical information like the mean and standard deviation.

During Target Variable Analysis, I inspect the dataset for any class imbalance, which could impact the metrics we evaluate our model against.

I also look for any missing data and explore possible relationships each feature could have with survival.
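As a rough sketch, these checks boil down to a few pandas calls (the CSV path and filename are assumptions):

```python
import pandas as pd

df = pd.read_csv("datasets/titanic.csv")  # path/filename assumed

print(df.shape)                                     # rows x columns
print(df.describe())                                # mean, std, quartiles, etc.
print(df["Survived"].value_counts(normalize=True))  # check for class imbalance
print(df.isnull().sum())                            # missing values per feature
```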

Most Promising Indicators of Survival

| Selected Metric | Plot |
| --- | --- |
| Survival by Gender | survival_by_gender |
| Survival by Passenger Class | survival_by_p_class |
| Survival by Age | survival_by_age |

Correlation Based on Numerical Features

|  | Survived | Pclass | Age | SibSp | Parch | Fare |
| --- | --- | --- | --- | --- | --- | --- |
| Survived | 1 | -0.338481 | -0.0649104 | -0.0353225 | 0.0816294 | 0.257307 |
| Pclass | -0.338481 | 1 | -0.339898 | 0.0830814 | 0.0184427 | -0.5495 |
| Age | -0.0649104 | -0.339898 | 1 | -0.233296 | -0.172482 | 0.0966884 |
| SibSp | -0.0353225 | 0.0830814 | -0.233296 | 1 | 0.414838 | 0.159651 |
| Parch | 0.0816294 | 0.0184427 | -0.172482 | 0.414838 | 1 | 0.216225 |
| Fare | 0.257307 | -0.5495 | 0.0966884 | 0.159651 | 0.216225 | 1 |
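A table like this can be reproduced with pandas, assuming the df from the EDA sketch above:

```python
# Pearson correlation across the numerical features
num_cols = ["Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]
corr = df[num_cols].corr()

# Sort features by their correlation with survival
print(corr["Survived"].sort_values(ascending=False))
```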

EDA Observations

From our analysis, we can see some patterns hinting at survival, most especially:

  • Gender: Females were twice as likely to survive as males.
  • Passenger Class: Passengers in the higher-ranking classes (1 and 2) were more likely to survive.
  • Age: A good majority of individuals between the ages of 20 and 40 survived.
  • Status (Countess, Lady, etc.): People with higher status generally had better chances of survival.
  • Had Cabin: Passengers with a recorded cabin also had much higher chances of survival.

From the data, survival seems to depend heavily on gender, wealth, and age bracket.

Feature Engineering and Preprocessing Pipeline

Now that we've determined which features are of interest, I engineer additional features such as FamilySize and Title, and drop others that don't seem to have any correlation (or importance), such as Name and Ticket, as sketched below.
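A minimal sketch of these transformations (the Title regex is one common approach, not necessarily the exact one used in the notebook):

```python
# FamilySize: siblings/spouses + parents/children + the passenger themself
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Title: extract the honorific ("Mr", "Mrs", "Countess", ...) from Name
df["Title"] = df["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Drop features with no apparent predictive value
df = df.drop(columns=["Name", "Ticket"])
```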

A number of preprocessing steps are still needed before we can feed the data to our models. For numerical data, we want to impute any missing values; from our analysis, "Age" is one such feature with missing values.

I use the median as the imputation strategy. The other thing to take into account is standard scaling, which can potentially improve our model's performance.

For categorical data like Sex and Title, we use the most frequent value as the imputation strategy, then pass the result through a One Hot Encoder to derive the relevant extra features.
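Put together, the preprocessing described above maps naturally onto a scikit-learn ColumnTransformer; the exact column lists here are assumptions:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_features = ["Age", "Fare", "SibSp", "Parch", "FamilySize"]  # assumed
cat_features = ["Sex", "Title", "Pclass"]                       # assumed

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # e.g. missing Age values
    ("scale", StandardScaler()),
])
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric, num_features),
    ("cat", categorical, cat_features),
])
```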

Model Experimentation

1/ Logistic Classifier

First of all, I trained a Logistic Regression Classifier and got fairly good results without any Hyperparameter Tuning.

*Plots: precision-recall curve (pr_curve_logistic) and confusion matrix (cm_logistic). Average precision of 85%; low false positives.*

To give it credit, this model performs pretty well, although the number of False Negatives is concerning.
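A sketch of how this model can be trained and scored, assuming a prior train/validation split (X_train, X_val, y_train, y_val) and the preprocess transformer from above:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.pipeline import Pipeline

logistic = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
logistic.fit(X_train, y_train)

y_pred = logistic.predict(X_val)               # hard predictions at the 0.5 threshold
y_score = logistic.predict_proba(X_val)[:, 1]  # survival probabilities for the PR curve

print(classification_report(y_val, y_pred))
print("Average precision:", average_precision_score(y_val, y_score))
```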

Classification Report

| survival | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0 | 0.81 | 0.86 | 0.84 | 44 |
| 1 | 0.76 | 0.68 | 0.72 | 28 |
| accuracy | | | 0.79 | 72 |
| macro avg | 0.78 | 0.77 | 0.78 | 72 |
| weighted avg | 0.79 | 0.79 | 0.79 | 72 |

2/ Random Forest Classifier

Next, I train an ensemble of Decision Trees (i.e. a Random Forest Classifier), again without any tuning. It seems to perform better than the Logistic Classifier even though its average precision is 1% lower; the classification report shows much higher recall at the 0.5 threshold.

*Plots: precision-recall curve and confusion matrix. Average precision of 84%; very low false negatives.*

Classification Report

| survival | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0 | 0.84 | 0.86 | 0.85 | 44 |
| 1 | 0.78 | 0.75 | 0.76 | 28 |
| accuracy | | | 0.82 | 72 |
| macro avg | 0.81 | 0.81 | 0.81 | 72 |
| weighted avg | 0.82 | 0.82 | 0.82 | 72 |

3/ Support Vector Machine (SVM) Classifier

An SVM was the final model I trained and evaluated. On average, it records the lowest performance on this dataset compared to the Logistic and Random Forest Classifiers.

*Plots: precision-recall curve (pr_curve_svm) and confusion matrix (cm_svm). Average precision of 83%; higher FP and FN rates.*

This model recorded more False Positives and False Negatives than any other.
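One practical detail worth noting: scikit-learn's SVC does not expose predict_proba unless probability=True is set, so a sketch of scoring it for the precision-recall curve can use decision_function instead (same assumed split and preprocess as above):

```python
from sklearn.metrics import average_precision_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

svm = Pipeline([("prep", preprocess), ("model", SVC())])
svm.fit(X_train, y_train)

# SVC's decision_function scores work directly with average_precision_score
scores = svm.decision_function(X_val)
print("Average precision:", average_precision_score(y_val, scores))
```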

Classification Report

| survival | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0 | 0.83 | 0.91 | 0.87 | 44 |
| 1 | 0.83 | 0.71 | 0.77 | 28 |
| accuracy | | | 0.83 | 72 |
| macro avg | 0.83 | 0.81 | 0.82 | 72 |
| weighted avg | 0.83 | 0.83 | 0.83 | 72 |

Decision

After experimenting with and evaluating the above models, I decided to go with the Logistic and Random Forest Classifiers due to their more promising results.

Hyperparameter Tuning

I tuned a few of the model hyperparameters for both Logistic Regression and Random Forest, using Grid Search Cross Validation with 5 folds to get the best estimator.

```python
rf_params_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

logistic_params_grid = {
    'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}
```

This runs 5 folds for each of the 216 Random Forest candidates, totalling 1080 fits, while Logistic Regression's 24 candidates total 120 fits.
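A sketch of how these grids can be fed to scikit-learn's GridSearchCV, scoring on average precision to match the validation metric below (random_state and n_jobs are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Assumes X_train has already been through the preprocessing pipeline
rf_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    rf_params_grid,
    cv=5,                         # 5 folds per candidate
    scoring="average_precision",  # the metric compared in the table below
    n_jobs=-1,
    verbose=1,
)
rf_search.fit(X_train, y_train)

best_rf = rf_search.best_estimator_
print("Best RF params:", rf_search.best_params_)
```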

Decision

| Model | Validation Set Average Precision |
| --- | --- |
| Random Forest | 0.8443 |
| Logistic Regression | 0.8528 |

The best estimator from Logistic Regression beats the best Random Forest estimator with an average precision of roughly 85% (0.8528). This will be used as my final model on the test set.

Final Model Performance on Test Set

Now that I've determined the best performing model, I evaluate its performance against the test set.

Model Weighted Average Precision: 81%

| survival | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| 0 | 0.83 | 0.88 | 0.85 | 110 |
| 1 | 0.79 | 0.71 | 0.75 | 69 |
| accuracy | | | 0.82 | 179 |
| macro avg | 0.81 | 0.80 | 0.80 | 179 |
| weighted avg | 0.81 | 0.82 | 0.81 | 179 |
*Plots: precision-recall curve (pr_curve_final) and confusion matrix (cm_final). Average precision of 80%.*

Conclusion

This report suggests inherent randomness in predicting survival. Future improvements could come from a larger dataset, more features, or possibly ensemble techniques.
