This post explores using machine learning to predict whether passengers on the RMS Titanic survived. It is a supervised learning problem: we fit a model to historical data where the outcome is already known.
Problem Statement:
The sinking of the Titanic is one of the most infamous shipwrecks in history. In 1912, during her first voyage, the “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
Given information about passengers on the Titanic (e.g. name, age, gender, socio-economic class), can we build a model that predicts whether each of them survived the disaster?
Data and Features:
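The code below uses a small, hand-made sample of 17 passengers rather than the full passenger list. Each passenger has six features: social class, gender, age, number of siblings/spouses aboard, number of parents/children aboard, and ticket fare.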
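Before fitting a model, it can help to check how survival varies across these features. The following is a minimal sketch, assuming the 15 labelled rows of this post's sample data (class, gender, and outcome only); the names `sample`, `rate_by_class`, and `rate_by_sex` are illustrative and not part of the original code.

```python
import pandas as pd

# The 15 labelled passengers from the post's sample data (illustrative subset of columns)
sample = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2, 3, 1, 3, 2, 3, 1, 2, 3, 1],
    "Sex": ["female", "male", "female", "male", "female", "male", "female", "male",
            "female", "male", "female", "male", "female", "male", "female"],
    "Survived": [1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1],
})

# The mean of a 0/1 column is the fraction of survivors in each group
rate_by_class = sample.groupby("Pclass")["Survived"].mean()
rate_by_sex = sample.groupby("Sex")["Survived"].mean()

print(rate_by_class)
print(rate_by_sex)
```

In this toy sample, social class separates the outcomes almost perfectly, so a model trained on it will lean heavily on `Pclass`. That is a property of the hand-made data, not a finding about the real disaster.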
Code (using Logistic Regression):
Python
# Import libraries
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder  # For encoding categorical features
from sklearn.metrics import accuracy_score  # For model evaluation

# Create sample data. Passengers 16 and 17 are Jack and Rose.
data = {
    "PassengerId": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17],
    # Social class (1 = Upper, 2 = Middle, 3 = Lower)
    "Pclass": [1, 3, 2, 3, 1, 2, 3, 1, 3, 2, 3, 1, 2, 3, 1, 3, 1],
    "Sex": ["female", "male", "female", "male", "female", "male", "female", "male", "female",
            "male", "female", "male", "female", "male", "female", "male", "female"],
    "Age": [30, 25, 40, 60, 22, 35, 6, 70, 18, 48, 55, 20, 38, 65, 28, 20, 17],
    # Number of siblings/spouses aboard
    "SibSp": [1, 0, 1, 2, 1, 0, 4, 1, 0, 3, 1, 2, 0, 1, 0, 0, 0],
    # Number of parents/children aboard
    "Parch": [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 2, 0, 1],
    # Ticket fare
    "Fare": [71, 8, 13, 8, 35, 26, 7.75, 56, 8.05, 26, 16, 31, 80, 30, 8, 0, 50],
    # Target variable: 1 = Survived, 0 = Died, None = Unknown
    "Survived": [1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, None, None],
}

# Convert data to a pandas DataFrame
data = pd.DataFrame(data)

# Feature selection
features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]
target = "Survived"

# For model training, exclude the last 2 passengers, whose survival is unknown
data_train = data.dropna(subset=[target])
X_train = data_train[features].copy()
y_train = data_train[target]

# Encode the categorical feature (Sex): female -> 0, male -> 1
encoder = LabelEncoder()
X_train["Sex"] = encoder.fit_transform(X_train["Sex"])

# Build the prediction set from all 17 passengers (including the last 2)
X_test = data[features].copy()
y_test = data[target]
# Reuse the encoder fitted on the training rows instead of refitting it
X_test["Sex"] = encoder.transform(X_test["Sex"])

# Create and train the model (higher iteration cap because the features are unscaled)
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

def print_passenger_details(passenger_data, prediction):
    survival = "Survived" if prediction == 1 else "Died"
    print("\nPassenger Details:")
    for key, value in passenger_data.items():
        print(f"{key}: {value}")
    print(f"Predicted Survival: {survival}")

# Make predictions for all passengers (including the last 2)
predictions = model.predict(X_test)

# Print details for the last 2 passengers (without modifying the data)
for i in range(len(data) - 2, len(data)):
    passenger_data = data.iloc[i].to_dict()  # Get a copy of passenger data
    prediction = predictions[i]
    print_passenger_details(passenger_data, prediction)

# Accuracy on the 15 labelled passengers. These are the same rows the model
# was trained on, so this is training accuracy, not a held-out test score.
accuracy = accuracy_score(y_test[:-2], predictions[:-2])
print(f"\nModel Accuracy on labelled passengers (training data): {accuracy:.2f}")
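In the listing above, accuracy is computed on the same 15 rows the model was fitted on, so it tends to overstate how well the model generalises. The sketch below shows one possible alternative: holding out a few labelled rows with `train_test_split`. It assumes the same 15 labelled passengers; the names `labelled`, `X_fit`, `X_holdout`, `holdout_model`, and `holdout_accuracy`, the split size of 5, and `random_state=0` are illustrative choices, not part of the original code.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# The 15 labelled passengers from the post's sample data
labelled = pd.DataFrame({
    "Pclass": [1, 3, 2, 3, 1, 2, 3, 1, 3, 2, 3, 1, 2, 3, 1],
    "Sex": ["female", "male", "female", "male", "female", "male", "female", "male",
            "female", "male", "female", "male", "female", "male", "female"],
    "Age": [30, 25, 40, 60, 22, 35, 6, 70, 18, 48, 55, 20, 38, 65, 28],
    "SibSp": [1, 0, 1, 2, 1, 0, 4, 1, 0, 3, 1, 2, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 2],
    "Fare": [71, 8, 13, 8, 35, 26, 7.75, 56, 8.05, 26, 16, 31, 80, 30, 8],
    "Survived": [1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1],
})

# Encode Sex with an explicit mapping (assumed here: female -> 0, male -> 1)
labelled["Sex"] = labelled["Sex"].map({"female": 0, "male": 1})

X = labelled[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare"]]
y = labelled["Survived"]

# Hold out 5 rows; stratify keeps survivors and non-survivors in both parts
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X, y, test_size=5, stratify=y, random_state=0
)

holdout_model = LogisticRegression(max_iter=1000)
holdout_model.fit(X_fit, y_fit)

# Score only on rows the model never saw during fitting
holdout_accuracy = accuracy_score(y_holdout, holdout_model.predict(X_holdout))
print(f"Held-out accuracy on {len(X_holdout)} passengers: {holdout_accuracy:.2f}")
```

With only 15 labelled rows, a 5-row hold-out gives a very noisy estimate, so the number is best read as a sanity check rather than a reliable score.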
Output:
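The code prints the details and predicted survival for the last two passengers (16 and 17), whose outcome is unknown, followed by the model's accuracy on the 15 passengers with known outcomes.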