Information Technology : Beyond the Iceberg: Predicting Who Survived the Titanic

This post explores using machine learning to predict which passengers survived the sinking of the RMS Titanic. It's a supervised learning problem: we fit a model to historical passenger data with known outcomes, then use it to predict the outcome for passengers whose fate is unknown.

Problem Statement:

The sinking of the Titanic is one of the most infamous shipwrecks in history. In 1912, during her first voyage, the “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

Given information about passengers (i.e., name, age, gender, socio-economic class, etc.) on the Titanic, can we build a model to predict whether they survived the disaster?

Data and Features:

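This code uses a small, hand-built sample of 17 passengers rather than the full passenger list. The features are social class (Pclass), sex, age, number of siblings/spouses aboard (SibSp), number of parents/children aboard (Parch), and ticket fare.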

Code (using Logistic Regression):

Python
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder  # For encoding categorical features
from sklearn.metrics import accuracy_score  # For model evaluation

# Create sample data. Passengers 16 and 17 are Jack and Rose.
data = {
  "PassengerId": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
  "Pclass": [1, 3, 2, 3, 1, 2, 3, 1, 3, 2, 3, 1, 2, 3, 1, 3, 1],  # Social class (1 = Upper, 2 = Middle, 3 = Lower)
  "Sex": ["female", "male", "female", "male", "female", "male", "female", "male", "female", "male", "female", "male", "female", "male", "female", "male", "female"],
  "Age": [30, 25, 40, 60, 22, 35, 6, 70, 18, 48, 55, 20, 38, 65, 28, 20, 17],
  "SibSp": [1, 0, 1, 2, 1, 0, 4, 1, 0, 3, 1, 2, 0, 1, 0, 0, 0],  # Number of siblings/spouses aboard
  "Parch": [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 2, 0, 1],  # Number of parents/children aboard
  "Fare": [71, 8, 13, 8, 35, 26, 7.75, 56, 8.05, 26, 16, 31, 80, 30, 8, 0, 50],  # Ticket fare
  "Survived": [1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, None, None]  # 1 = Survived, 0 = Died, None = Unknown (target variable)
}

# Convert data to Pandas DataFrame
data = pd.DataFrame(data)

# Feature selection
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]

# Target variable
target = "Survived"

# For model training, exclude the last 2 passengers, whose survival is unknown
data_train = data.dropna(subset=[target])  # Remove rows with missing survival information
X_train = data_train[features]
y_train = data_train[target]

# Encode categorical features (Sex in this case): female -> 0, male -> 1
encoder = LabelEncoder()

X_train = X_train.copy()  # Copy so the original DataFrame is not modified
X_train["Sex"] = encoder.fit_transform(X_train["Sex"])

# Build the prediction set: all 17 passengers, i.e. the 15 used for training
# plus the last 2 whose survival is unknown (this is not a held-out split)
X_test = data[features]
y_test = data[target]

# Create and train the model
model = LogisticRegression()
model.fit(X_train, y_train)


def print_passenger_details(passenger_data, prediction):
  survival = "Survived" if prediction == 1 else "Died"
  print(f"\nPassenger Details:")
  for key, value in passenger_data.items():
    print(f"{key}: {value}")
  print(f"Predicted Survival: {survival}")


# Encode Sex with the encoder already fitted on the training data
# (transform only; refitting here could assign different codes)
X_test = X_test.copy()
X_test["Sex"] = encoder.transform(X_test["Sex"])

# Make predictions for all 17 passengers (including the last 2)
predictions = model.predict(X_test)

# Print details for the last 2 passengers (without modifying the data)
for i in range(len(data) - 2, len(data)):  # Start from the second-last element (index -2)
  passenger_data = data.iloc[i].to_dict()  # Get a copy of passenger data
  prediction = predictions[i]
  print_passenger_details(passenger_data, prediction)

# Evaluate accuracy on the 15 passengers with known outcomes.
# Note: these are the same rows the model was trained on, so this is
# training accuracy rather than a true held-out test score.
accuracy = accuracy_score(y_test[:-2], predictions[:-2])  # Exclude the last 2 (unlabelled) passengers
print(f"\nModel Accuracy on Test Set (excluding passengers with missing survival data): {accuracy:.2f}")

Output:

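The code prints the predictions for the two passengers whose survival is unknown (Jack and Rose), followed by the accuracy on the 15 labelled passengers. Because those 15 passengers are the same ones the model was trained on, the accuracy figure describes how well the model fits its training data, not how well it generalises.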

Passenger Details: #Jack Dawson
PassengerId: 16
Pclass: 3
Sex: male
Age: 20
SibSp: 0
Parch: 0
Fare: 0.0
Survived: nan
Predicted Survival: Died

Passenger Details: #Rose DeWitt Bukater
PassengerId: 17
Pclass: 1
Sex: female
Age: 17
SibSp: 0
Parch: 1
Fare: 50.0
Survived: nan
Predicted Survival: Survived
Model Accuracy on Test Set (excluding passengers with missing survival data): 1.00
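
The code above imports train_test_split but never calls it, so the reported accuracy is measured on rows the model has already seen. As a rough sketch of what a held-out evaluation could look like, the block below sets aside 5 of the 15 labelled passengers before training. The pipeline layout, the choice of OneHotEncoder and StandardScaler, the 10/5 split, and random_state=0 are all assumptions made for illustration and are not part of the original code. With so few rows the score will swing widely from one split to the next, so treat it as a demonstration of the mechanics only.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The 15 labelled passengers from the post (Jack and Rose are left out
# because their outcome is unknown).
labelled = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2, 3, 1, 3, 2, 3, 1, 2, 3, 1],
    "Sex": ["female", "male"] * 7 + ["female"],
    "Age": [30, 25, 40, 60, 22, 35, 6, 70, 18, 48, 55, 20, 38, 65, 28],
    "SibSp": [1, 0, 1, 2, 1, 0, 4, 1, 0, 3, 1, 2, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 2],
    "Fare": [71, 8, 13, 8, 35, 26, 7.75, 56, 8.05, 26, 16, 31, 80, 30, 8],
    "Survived": [1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1],
})

numeric = ["Pclass", "Age", "SibSp", "Parch", "Fare"]
features = ["Sex"] + numeric

# Hold out 5 labelled passengers; stratify keeps both classes in each part.
X_train, X_test, y_train, y_test = train_test_split(
    labelled[features],
    labelled["Survived"],
    test_size=5,
    stratify=labelled["Survived"],
    random_state=0,  # assumed seed, only so the split is repeatable
)

# Encode Sex and scale the numeric columns inside one pipeline, so the
# preprocessing is fitted on the training rows only.
preprocess = ColumnTransformer([
    ("sex", OneHotEncoder(drop="if_binary"), ["Sex"]),
    ("num", StandardScaler(), numeric),
])
clf = Pipeline([("prep", preprocess), ("model", LogisticRegression())])
clf.fit(X_train, y_train)

held_out_accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"Held-out accuracy on {len(X_test)} passengers: {held_out_accuracy:.2f}")
```

Keeping the encoder and scaler inside the pipeline means the same fitted preprocessing is reused at prediction time, which avoids the risk of refitting an encoder on new data.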

Wednesday, June 5, 2024
