My First Kaggle Competition


Today, I tackle the famous Titanic dataset for my first Kaggle competition.

Kaggle Machine Learning Competition: Predicting Titanic Survivors

https://www.kaggle.com/c/titanic

Competition Description

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

About the Data

The data is split up into a training set and a test set.

The training set is used to build my machine learning models. For the training set, I am provided the outcome (or "ground truth") for each passenger.

The test set is used to see how well my model performs on unseen data. The ground truth is not provided, so it is my job to predict these outcomes.

There's also gender_submission.csv, a set of predictions that assumes all (and only) female passengers survive, as an example of what a submission file should look like.

Data Dictionary

Variable    Definition                                    Key
survival    Survival                                      0 = No, 1 = Yes
pclass      Ticket class                                  1 = 1st, 2 = 2nd, 3 = 3rd
sex         Sex
age         Age in years
sibsp       # of siblings / spouses aboard the Titanic
parch       # of parents / children aboard the Titanic
ticket      Ticket number
fare        Passenger fare
cabin       Cabin number
embarked    Port of Embarkation                           C = Cherbourg, Q = Queenstown, S = Southampton

Variable Notes

pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower.

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister. Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way... Parent = mother, father. Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.

Alright, that's all the info we need. Let's get started.

Setup Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

Data Exploration

Read in training data:

In [27]:
train = pd.read_csv('train.csv')
In [28]:
train.tail()
Out[28]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.00 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.00 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.45 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.00 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.75 NaN Q
In [29]:
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

I have 5 'object' columns in my data. Later on I'll convert the useful ones into dummy variables so that my algorithm can work with them.

There is also a good amount of data missing in 'Age' (177 values), a ton missing in 'Cabin' (687), and 2 values missing in 'Embarked'.
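
To put exact numbers on the missing data, a quick per-column null count (a standard pandas one-liner) would confirm this:

# count the missing values in each column
train.isnull().sum()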

In [30]:
train.describe()
Out[30]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Data Visualization

In [31]:
sns.set_style('darkgrid')
sns.countplot(train['Survived'])
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a060c67080>

Most people did not survive. I'll dive into each feature to see who was more likely to survive:

Feature: Passenger Class

In [32]:
sns.countplot(x='Survived',hue='Pclass',data=train)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a0609e4978>

Passenger class matters a great deal to whether a passenger survived. Those in third class had a much lower chance of survival than those in first and second class.
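
To back the visual read with numbers, a quick sketch: since 'Survived' is 0/1, its mean within each class is the survival rate.

# survival rate within each ticket class
train.groupby('Pclass')['Survived'].mean()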

Feature: Sex

In [33]:
sns.countplot(x='Survived',hue='Sex',data=train)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a060d08470>

Sex also appears to be a very important factor for survival: females were far more likely to survive, while males did not fare too well.
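
The same kind of numeric check works here; a row-normalized crosstab (a quick sketch, assuming pandas 0.18.1 or newer for the normalize option) shows the split within each sex:

# fraction who died/survived within each sex
pd.crosstab(train['Sex'], train['Survived'], normalize='index')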

Feature: Age

In [34]:
train['Age'].hist(bins=40,edgecolor='black')
Out[34]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a060d4e898>

We saw in the describe() output above that the mean age was around 30. This histogram shows that the bulk of passengers are aged 20-40, with a small cohort of children aged 0-16 on board.

Feature: Siblings/Spouses

In [35]:
sns.countplot(train['SibSp'])
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a060daf668>

Most people don't have siblings or spouses on board, with a small percentage having one and only a tiny fraction having more than one.

Feature: Parents/Children

In [36]:
sns.countplot(train['Parch'])
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a060e59048>

Again, most people don't have family on board, with a small chunk traveling with 1 or 2 parents/children.

Feature: Fare

In [37]:
train['Fare'].hist(bins=40,figsize=(12,4),edgecolor='black')
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a060ed5320>

Most people paid very little, which makes sense because the majority of people were in third class.

Feature: Cabin

We saw earlier that a lot of 'Cabin' values appeared to be missing. Let's visualize the missing data:

In [38]:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='plasma')
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a060f4feb8>

'Cabin' seems pretty unsalvageable, so I'll just remove this feature.

In [39]:
train.drop('Cabin',axis=1,inplace=True)

Feature: Embarked

In [40]:
sns.countplot(train['Embarked'])
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a061face48>

Most passengers embarked from Southampton, so it should be safe to fill in the 2 empty values with 'S'.

Data Cleaning

I removed the 'Cabin' feature earlier, so here I only need to fill the 'Age' and 'Embarked' columns. I'll fill in the 2 empty 'Embarked' values with 'S', but 'Age' needs more exploration.

In [41]:
train['Embarked'] = train['Embarked'].fillna('S')
In [42]:
plt.figure(figsize=(12, 6))
sns.boxplot(x='Pclass',y='Age',data=train)
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a061fd8400>

As I suspected, higher classes have a higher average age. This makes sense, because older individuals have had more time to accumulate wealth and can afford a better ticket.
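
The exact per-class averages that the box plot hints at can be computed directly; these are the same values my imputation below will use:

# mean age within each passenger class
train.groupby('Pclass')['Age'].mean()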

To fill in 'Age', I'll write a function that returns the average age for the passenger's class wherever 'Age' is empty. I'll then use pandas' apply method to run this function over the 'Age' column.

In [43]:
def fill_age(cols):
    # unpack the row passed in by DataFrame.apply (axis=1)
    Age = cols['Age']
    Pclass = cols['Pclass']

    # if the age is missing, return the average age for that passenger class
    if pd.isnull(Age):
        return train[train['Pclass'] == Pclass]['Age'].mean()

    # otherwise keep the original age
    return Age
In [44]:
train['Age'] = train[['Age','Pclass']].apply(fill_age,axis=1)

Replace Categorical Variables

In [45]:
train.head(2)
Out[45]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C
In [46]:
train.dtypes[train.dtypes.map(lambda x: x == 'object')]
Out[46]:
Name        object
Sex         object
Ticket      object
Embarked    object
dtype: object

I'll get rid of 'PassengerId' since it is just another index. 'Name' and 'Ticket' are also not useful as-is. I'll remove those columns and create dummy variables for 'Sex' and 'Embarked'.

In [47]:
dum_sex = pd.get_dummies(train['Sex'],drop_first=True)
dum_embarked = pd.get_dummies(train['Embarked'],drop_first=True)
In [48]:
train.drop(['PassengerId','Name','Sex','Ticket','Embarked'],axis=1,inplace=True)

Now I combine my dummy variables with my updated dataframe.

In [49]:
train = pd.concat([train,dum_sex,dum_embarked],axis=1)
In [50]:
train.head()
Out[50]:
Survived Pclass Age SibSp Parch Fare male Q S
0 0 3 22.0 1 0 7.2500 1 0 1
1 1 1 38.0 1 0 71.2833 0 0 0
2 1 3 26.0 0 0 7.9250 0 0 1
3 1 1 35.0 1 0 53.1000 0 0 1
4 0 3 35.0 0 0 8.0500 1 0 1

Create Machine Learning Model

I was torn over which model to use, but I ended up choosing logistic regression for a few reasons.

Logistic regression is a simple algorithm and the go-to method for binary classification. It translates the output of a linear regression into a probability usable for binary classification by passing it through the sigmoid function:

$$p = \frac{1}{1 + e^{-y}}$$

where $y$ is the linear equation:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n$$

Essentially, if $p$ (the probability of survival, in this case) is 0.5 or greater, the passenger is assigned a value of 1; if it is below 0.5, a value of 0. Thus, binary classification occurs.
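
As a toy illustration of the cutoff (a standalone sketch, separate from the model fitting below):

import numpy as np

def sigmoid(y):
    # squash a real-valued linear score into a probability in (0, 1)
    return 1 / (1 + np.exp(-y))

# a linear score of 0.8 gives p of about 0.69, which the 0.5 cutoff maps to class 1
p = sigmoid(0.8)
print(p, int(p >= 0.5))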

In [51]:
from sklearn.linear_model import LogisticRegression

Fit the training data to the logistic regression model:

In [52]:
logmodel = LogisticRegression()
logmodel.fit(train.drop('Survived',axis=1),
             train['Survived'])
Out[52]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
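
Before moving to the test set, a quick sanity check of accuracy on the training data (a sketch; this is accuracy on data the model has already seen, so it is optimistic compared to a held-out validation split):

# accuracy on the training data itself -- an optimistic estimate
logmodel.score(train.drop('Survived', axis=1), train['Survived'])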

Bring in test data:

In [53]:
test = pd.read_csv('test.csv')
In [54]:
test.tail()
Out[54]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
413 1305 3 Spector, Mr. Woolf male NaN 0 0 A.5. 3236 8.0500 NaN S
414 1306 1 Oliva y Ocana, Dona. Fermina female 39.0 0 0 PC 17758 108.9000 C105 C
415 1307 3 Saether, Mr. Simon Sivertsen male 38.5 0 0 SOTON/O.Q. 3101262 7.2500 NaN S
416 1308 3 Ware, Mr. Frederick male NaN 0 0 359309 8.0500 NaN S
417 1309 3 Peter, Master. Michael J male NaN 1 1 2668 22.3583 NaN C
In [55]:
sns.heatmap(test.isnull(),yticklabels=False,cbar=False,cmap='plasma')
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a062473320>

The test data has no missing 'Embarked' values this time, but it is missing a fare value! I'll have to find that row and fill in the value appropriately.

In [56]:
test[test['Fare'].isnull()]
Out[56]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
152 1044 3 Storey, Mr. Thomas male 60.5 0 0 3701 NaN NaN S

The passenger with the missing fare is in third class, so I'll impute the value with the average fare for that class.

In [57]:
# assign via .loc so the value is set on the DataFrame itself
# (chained indexing like test['Fare'][152] raises a SettingWithCopyWarning)
test.loc[152, 'Fare'] = test[test['Pclass']==3]['Fare'].mean()
test.loc[152, 'Fare']
Out[57]:
12.459677880184334
Next I repeat the cleaning steps on the test set:
  • apply the fill_age function to fill in 'Age'
  • get dummy variables for the categorical columns
  • create a new data frame with these new values
In [58]:
test['Age'] = test[['Age','Pclass']].apply(fill_age,axis=1)
dum_sex = pd.get_dummies(test['Sex'],drop_first=True)
dum_embarked = pd.get_dummies(test['Embarked'],drop_first=True)
In [59]:
test.drop(['Name','Sex','Ticket','Cabin','Embarked'],axis=1,inplace=True)
In [60]:
test = pd.concat([test,dum_sex,dum_embarked],axis=1)
In [61]:
test.head()
Out[61]:
PassengerId Pclass Age SibSp Parch Fare male Q S
0 892 3 34.5 0 0 7.8292 1 1 0
1 893 3 47.0 1 0 7.0000 0 0 1
2 894 2 62.0 0 0 9.6875 1 1 0
3 895 3 27.0 0 0 8.6625 1 0 1
4 896 3 22.0 1 1 12.2875 0 0 1

My data is now all set up to run predictions on the test set!

In [62]:
y_pred = logmodel.predict(test.drop('PassengerId',axis=1))
In [63]:
y_pred
Out[63]:
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1,
       1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
       0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0,
       0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1,
       0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0],
      dtype=int64)

Create Submission File

In [64]:
submission = pd.DataFrame({
        'PassengerId': test['PassengerId'],
        'Survived': y_pred
    })
submission.to_csv('titanic.csv', index=False)
In [65]:
submission.head()
Out[65]:
PassengerId Survived
0 892 0
1 893 0
2 894 0
3 895 0
4 896 1

Compare with the given example submission called 'gender_submission':

In [66]:
example = pd.read_csv('gender_submission.csv')
In [67]:
example.head()
Out[67]:
PassengerId Survived
0 892 0
1 893 1
2 894 0
3 895 0
4 896 1
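
Out of curiosity, since both files list passengers in the same order, the agreement between my predictions and the all-females baseline is one line (a quick sketch):

# fraction of passengers where my model matches the gender-only baseline
(submission['Survived'].values == example['Survived'].values).mean()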

The submission looks good! Let's submit it and see how I did:

(screenshot of my Kaggle submission result)

A 75.598% prediction accuracy. Not bad, though there's still a lot of work to be done. I think I can improve my accuracy through feature engineering (maybe a column for whether someone is a child), gradient boosting, or trying a different machine learning model.
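
For instance, a minimal sketch of the child-feature idea (the age cutoff of 16 is my own assumption, echoing the histogram earlier, and the 'IsChild' column name is hypothetical):

# hypothetical engineered feature: flag passengers under 16 as children
train['IsChild'] = (train['Age'] < 16).astype(int)
test['IsChild'] = (test['Age'] < 16).astype(int)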

I hope you enjoyed this post! I'd love to hear any comments, questions, or suggestions down below. Until next time!