The goal of this project is to train a binary classifier to determine whether or not someone survives the infamous Titanic shipwreck, given some relevant features. You can find a copy of the dataset used here.
- Jupyter: Jupyter metapackage. Install all the Jupyter components in one go.
- Numpy: Fundamental package for array computing in Python.
- Pandas: Powerful data structures for data analysis, time series, and statistics.
- Scikit-Learn: A set of Python modules for machine learning and data mining.
- Matplotlib: Python plotting package.
- Seaborn: Statistical data visualization.
- Tabulate: Pretty-print tabular data.
- Kagglehub: Access Kaggle resources anywhere.
After cloning the repository, switch to your workspace directory and run the following command.
```bash
python3 -m pip install -r requirements.txt
```

I use a dataset obtained from Kaggle; the notebook downloads the CSV file containing the data we will work with to your workspace directory, under the "datasets" folder.
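If you want to fetch the data outside the notebook, a minimal sketch along these lines should work (the dataset handle `yasserh/titanic-dataset` and the file layout are assumptions; use whichever handle the notebook actually references):

```python
import shutil
from pathlib import Path

import kagglehub
import pandas as pd

# Download the Titanic dataset from Kaggle (the handle below is an assumption,
# not necessarily the one the notebook uses).
download_path = Path(kagglehub.dataset_download("yasserh/titanic-dataset"))

# Copy the CSV into a local "datasets" folder in the workspace.
datasets_dir = Path("datasets")
datasets_dir.mkdir(exist_ok=True)
csv_path = next(download_path.glob("*.csv"))
shutil.copy(csv_path, datasets_dir / csv_path.name)

# Load it for the analysis that follows.
titanic = pd.read_csv(datasets_dir / csv_path.name)
print(titanic.head())
```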
Before anything else, I get a feel for the dataset we are working with: I inspect its shape, its columns (features), and other statistical information such as the mean and standard deviation.
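A minimal sketch of that first pass, assuming the CSV was saved under `datasets/` as in the setup step:

```python
import pandas as pd

# Path is an assumption -- point this at wherever the notebook saved the CSV.
titanic = pd.read_csv("datasets/Titanic-Dataset.csv")

print(titanic.shape)       # (rows, columns)
titanic.info()             # column names, dtypes, non-null counts
print(titanic.describe())  # mean, std, min/max and quartiles for numeric features
```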
During target variable analysis, I inspect the dataset for any class imbalance, since this could affect which metrics we evaluate our models against. I also look for any missing data and explore possible relationships each feature could have with survival.
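These checks are standard pandas one-liners; a sketch (column names follow the usual Kaggle Titanic schema):

```python
# Class balance of the target variable.
print(titanic["Survived"].value_counts(normalize=True))

# Missing values per column (Age, Cabin and Embarked are the usual culprits).
print(titanic.isnull().sum())

# Survival rate broken down by a couple of candidate features.
print(titanic.groupby("Sex")["Survived"].mean())
print(titanic.groupby("Pclass")["Survived"].mean())
```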
| Selected Metric | Plot |
|---|---|
| Survival by Gender | ![]() |
| Survival by Passenger Class | ![]() |
| Survival by Age | ![]() |
| | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| Survived | 1 | -0.338481 | -0.0649104 | -0.0353225 | 0.0816294 | 0.257307 |
| Pclass | -0.338481 | 1 | -0.339898 | 0.0830814 | 0.0184427 | -0.5495 |
| Age | -0.0649104 | -0.339898 | 1 | -0.233296 | -0.172482 | 0.0966884 |
| SibSp | -0.0353225 | 0.0830814 | -0.233296 | 1 | 0.414838 | 0.159651 |
| Parch | 0.0816294 | 0.0184427 | -0.172482 | 0.414838 | 1 | 0.216225 |
| Fare | 0.257307 | -0.5495 | 0.0966884 | 0.159651 | 0.216225 | 1 |
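The matrix above is a plain Pearson correlation over the numeric columns; it can be reproduced (and visualised) with something along these lines:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlations between the numeric columns, shown as a heatmap.
corr = titanic[["Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]].corr()
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation matrix of numeric features")
plt.tight_layout()
plt.show()
```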
From our analysis, we can see some patterns hinting at survival, most notably:
- Gender: Females were far more likely to survive than males.
- Passenger Class: Passengers in the higher classes (1 and 2) were more likely to survive.
- Age: Most survivors fell between the ages of 20 and 40.
- Status (Countess, Lady, etc.): People with higher status generally had better chances of survival.
- Had Cabin: If a cabin was recorded for you, your chances of survival were also much higher.
From the data, survival seems to depend heavily on gender, wealth, and age bracket.
Now that we've determined which features are of interest, I engineer additional features such as FamilySize and Title, and drop others that don't seem to carry any correlation (or importance), such as Name and Ticket.
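A sketch of that step (the exact transformations in the notebook may differ, e.g. in how rare titles are grouped):

```python
# FamilySize: the passenger plus any siblings/spouses and parents/children aboard.
titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"] + 1

# Title: the honorific between the comma and the period in the Name column,
# e.g. "Braund, Mr. Owen Harris" -> "Mr".
titanic["Title"] = titanic["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# HadCabin: whether a cabin was recorded at all.
titanic["HadCabin"] = titanic["Cabin"].notna().astype(int)

# Drop columns that carry little signal on their own.
titanic = titanic.drop(columns=["Name", "Ticket", "Cabin", "PassengerId"])
```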
A number of preprocessing steps are still needed before we can feed the data to our models. For numerical data, we want to impute any missing values; from our analysis, "Age" is one such feature with missing values. I use the median as the imputation strategy. The other thing to take into account is standard scaling, which can potentially improve our models' performance. For categorical data like Sex and Title, we use the most frequent value as the imputation strategy, then pass the result through a One Hot Encoder to get the relevant extra features.
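Put together, this preprocessing can be expressed as a `ColumnTransformer`; the feature lists below are assumptions based on the analysis above, not necessarily the exact ones in the notebook:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ["Age", "Fare", "FamilySize", "SibSp", "Parch", "HadCabin"]
categorical_features = ["Sex", "Title", "Pclass", "Embarked"]

# Numerical: median imputation followed by standard scaling.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical: most-frequent imputation followed by one-hot encoding.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])
```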
First, I trained a Logistic Regression Classifier and got fairly good results without any hyperparameter tuning.
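A sketch of that baseline, reusing the `preprocessor` defined above (the split sizes and random state are assumptions, not necessarily what the notebook uses):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X = titanic.drop(columns=["Survived"])
y = titanic["Survived"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

log_reg = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression(max_iter=1000)),
])
log_reg.fit(X_train, y_train)

y_pred = log_reg.predict(X_val)
y_scores = log_reg.predict_proba(X_val)[:, 1]
print(classification_report(y_val, y_pred))
print("Average precision:", average_precision_score(y_val, y_scores))
```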
| Avg Precision of 85% | Low False Positives |
|---|---|
| ![]() | ![]() |
To give it credit, this model performs fairly well, although the number of False Negatives is concerning.
| survival | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.81 | 0.86 | 0.84 | 44 |
| 1 | 0.76 | 0.68 | 0.72 | 28 |
| accuracy | | | 0.79 | 72 |
| macro avg | 0.78 | 0.77 | 0.78 | 72 |
| weighted avg | 0.79 | 0.79 | 0.79 | 72 |
Next, I train an ensemble of Decision Trees (i.e. a Random Forest Classifier), again without any tuning. It seems to perform better than the Logistic Regression Classifier even though its average precision is 1% lower; at the 0.5 threshold, the recall in the classification report is much higher.
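Since all the preprocessing lives inside the pipeline, swapping in a different estimator is a small change; roughly, continuing from the previous sketch:

```python
from sklearn.ensemble import RandomForestClassifier

# Same preprocessing, different classifier (defaults, no tuning yet).
rf_clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classifier", RandomForestClassifier(random_state=42)),
])
rf_clf.fit(X_train, y_train)
print(classification_report(y_val, rf_clf.predict(X_val)))
```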
| Avg Precision of 84% | Very Low False Negatives |
|---|---|
| ![]() | ![]() |
| survival | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.84 | 0.86 | 0.85 | 44 |
| 1 | 0.78 | 0.75 | 0.76 | 28 |
| accuracy | | | 0.82 | 72 |
| macro avg | 0.81 | 0.81 | 0.81 | 72 |
| weighted avg | 0.82 | 0.82 | 0.82 | 72 |
The final model I trained and evaluated was an SVM. On average, it records the lowest performance on this dataset compared to the Logistic Regression and Random Forest Classifiers.
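Same pattern again; the one wrinkle is that `SVC` needs `probability=True` (or its `decision_function`) to produce the scores behind a precision-recall curve. A sketch, continuing from the previous snippets:

```python
from sklearn.svm import SVC

# probability=True makes the pipeline expose predict_proba, which the
# precision-recall / average-precision evaluation relies on.
svm_clf = Pipeline(steps=[
    ("preprocess", preprocessor),
    ("classifier", SVC(probability=True, random_state=42)),
])
svm_clf.fit(X_train, y_train)
print(classification_report(y_val, svm_clf.predict(X_val)))
```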
| Avg Precision of 83% | Higher FP and FN rates |
|---|---|
| ![]() | ![]() |
This model recorded more False Positives and False Negatives than any other.
| survival | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.83 | 0.91 | 0.87 | 44 |
| 1 | 0.83 | 0.71 | 0.77 | 28 |
| accuracy | | | 0.83 | 72 |
| macro avg | 0.83 | 0.81 | 0.82 | 72 |
| weighted avg | 0.83 | 0.83 | 0.83 | 72 |
After experimenting with and evaluating the above models, I decided to go with the Logistic Regression and Random Forest Classifiers due to their more promising results.
I tuned a few of the hyperparameters for both Logistic Regression and Random Forest, using Grid Search Cross Validation with 5 folds to get the best estimator.
```python
rf_params_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

logistic_params_grid = {
    'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga']
}
```

This runs 5 folds for each of the 216 candidates, totalling 1080 fits for Random Forest, while Logistic Regression totals 120 fits.
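The grids are fed to `GridSearchCV`; a sketch of how that might look with the pipelines above (scoring on average precision is an assumption consistent with the metric reported below, and the `classifier__` prefixes assume the step names used in the earlier snippets):

```python
from sklearn.model_selection import GridSearchCV

# Prefix each parameter name with the pipeline step it belongs to.
rf_search = GridSearchCV(
    estimator=rf_clf,
    param_grid={f"classifier__{k}": v for k, v in rf_params_grid.items()},
    scoring="average_precision",
    cv=5,
    n_jobs=-1,
)
rf_search.fit(X_train, y_train)

log_search = GridSearchCV(
    estimator=log_reg,
    param_grid={f"classifier__{k}": v for k, v in logistic_params_grid.items()},
    scoring="average_precision",
    cv=5,
    n_jobs=-1,
)
log_search.fit(X_train, y_train)

print("Best Random Forest params:", rf_search.best_params_)
print("Best Logistic Regression params:", log_search.best_params_)
```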
| Model | Validation Set Average Precision |
|---|---|
| Random Forest | 0.8443 |
| Logistic Regression | 0.8528 |
The best Logistic Regression estimator beats the best Random Forest estimator with a validation average precision of roughly 85%. This will be used as my final model on the test set.
Now that I've determined the best performing model, I evaluate its performance against the test set.
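A sketch of that final check, assuming a held-out `X_test`/`y_test` split that was set aside before any training and the fitted `log_search` from the tuning step:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    average_precision_score,
    classification_report,
)

final_model = log_search.best_estimator_

y_test_pred = final_model.predict(X_test)
y_test_scores = final_model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_test_pred))
print("Average precision:", average_precision_score(y_test, y_test_scores))

# Confusion matrix for the final model on the test set.
ConfusionMatrixDisplay.from_predictions(y_test, y_test_pred)
plt.show()
```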
| survival | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.83 | 0.88 | 0.85 | 110 |
| 1 | 0.79 | 0.71 | 0.75 | 69 |
| accuracy | | | 0.82 | 179 |
| macro avg | 0.81 | 0.80 | 0.80 | 179 |
| weighted avg | 0.81 | 0.82 | 0.81 | 179 |
| Avg Precision of 80% | Confusion Matrix |
|---|---|
| ![]() | ![]() |
This report suggests there is some inherent randomness in predicting survival. Future improvements could come from a larger dataset, more features, or possibly ensemble techniques.










