- Wed 12 September 2018
- datascience
- #logistic regression, #machine learning, #titanic
Today, I tackle the famous Titanic dataset for my first Kaggle competition.
Kaggle Machine Learning Competition: Predicting Titanic Survivors¶
https://www.kaggle.com/c/titanic
Competition Description¶
The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.
One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
About the Data¶
The data is split up into a training set and a test set.
The training set is used to build my machine learning models. For the training set, I am provided the outcome (or "ground truth") for each passenger.
The test set is used to see how well my model performs on unseen data. The ground truth is not provided, so it is my job to predict these outcomes.
There's also gender_submission.csv, a set of predictions that assumes all and only female passengers survive, as an example of what a submission file should look like.
Data Dictionary¶
Variable | Definition | Key |
---|---|---|
survival | Survival | 0 = No, 1 = Yes |
pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
sex | Sex | |
Age | Age in years | |
sibsp | # of siblings / spouses aboard the Titanic | |
parch | # of parents / children aboard the Titanic | |
ticket | Ticket Number | |
fare | Passenger fare | |
cabin | Cabin Number | |
embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Variable Notes¶
pclass: A proxy for socio-economic status (SES). 1st = Upper, 2nd = Middle, 3rd = Lower.
age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
sibsp: The dataset defines family relations in this way... Sibling = brother, sister, stepbrother, stepsister. Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way... Parent = mother, father. Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.
Alright, that's all the info we need. Let's get started.
Setup Imports¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
Data Exploration¶
Read in training data:
train = pd.read_csv('train.csv')
train.tail()
train.info()
I have 5 columns of 'object' type in my data. Later on I'll turn those into dummy variables so my algorithm can work with them.
There is also a good amount of data missing in 'Age', a ton missing in 'Cabin', and 2 missing in 'Embarked'.
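To quantify that (this quick check isn't in the original notebook), the per-column missing counts can be listed directly:
# Count missing values in each column of the training data
train.isnull().sum()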
train.describe()
Data Visualization¶
sns.set_style('darkgrid')
sns.countplot(train['Survived'])
Most people did not survive. I'll dive into each feature to see who was more likely to survive:
Feature: Passenger Class¶
sns.countplot(x='Survived',hue='Pclass',data=train)
Passenger class is pretty significant in whether a passenger survived or not. Those in third class had a very low chance of survival compared to first and second class.
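To put a number on that (an extra check, not part of the original notebook), the survival rate within each class can be computed with a groupby:
# Fraction of survivors within each ticket class
train.groupby('Pclass')['Survived'].mean()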
Feature: Sex¶
sns.countplot(x='Survived',hue='Sex',data=train)
Sex also seems to be a very important factor for survival. Females were very likely to live while males did not fare too well.
Feature: Age¶
train['Age'].hist(bins=40,edgecolor='black')
We saw earlier in the describe() output that the mean age was around 30. This graph shows that the bulk of passengers are aged between 20 and 40, with a small cohort of children aged 0-16 on board.
Feature: Siblings/Spouses¶
sns.countplot(train['SibSp'])
Most people don't have siblings or spouses on board, with a small percentage having one sibling or spouse and only a tiny percentage having more than one.
Feature: Parents/Children¶
sns.countplot(train['Parch'])
Again, most people are travelling alone, with a small chunk having 1 or 2 parents/children on board.
Feature: Fare¶
train['Fare'].hist(bins=40,figsize=(12,4),edgecolor='black')
Most people paid very little, which makes sense because the majority of people were in third class.
Feature: Cabin¶
We saw earlier that a lot of 'Cabin' values appeared to be missing. Let's visualize this:
sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='plasma')
'Cabin' seems pretty unsalvageable, so I'll just remove this feature.
train.drop('Cabin',axis=1,inplace=True)
Feature: Embarked¶
sns.countplot(train['Embarked'])
Most passengers embarked from Southampton, so it should be safe to fill in the 2 empty values with 'S'.
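A quick value count (again, a check I'm adding here rather than part of the original notebook) confirms how dominant 'S' is:
# Number of passengers per port of embarkation (missing rows are excluded)
train['Embarked'].value_counts()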
Data Cleaning¶
I removed the 'Cabin' feature earlier. Here I only need to fill in the 'Age' and 'Embarked' columns. I'll fill the 2 empty 'Embarked' values with 'S', but more exploration needs to be done on 'Age'.
train['Embarked'] = train['Embarked'].fillna('S')
plt.figure(figsize=(12, 6))
sns.boxplot(x='Pclass',y='Age',data=train)
As I suspected, higher classes have a higher average age. This makes sense, because older individuals are more likely to accumulate wealth and afford a better ticket.
To fill in 'Age', I'll write a function that fills in the average age based on passenger class wherever 'Age' is empty. I'll then use pandas' apply method to apply this function across the 'Age' and 'Pclass' columns.
def fill_age(cols):
    # unpack the Age and Pclass values that apply passes in row by row
    Age = cols[0]
    Pclass = cols[1]
    # if the age is missing, return the average age for the passenger's class
    if pd.isnull(Age):
        if Pclass == 1:
            return train[train['Pclass']==1]['Age'].mean()
        elif Pclass == 2:
            return train[train['Pclass']==2]['Age'].mean()
        else:
            return train[train['Pclass']==3]['Age'].mean()
    else:
        return Age
train['Age'] = train[['Age','Pclass']].apply(fill_age,axis=1)
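As a side note, an equivalent and arguably more idiomatic way to do this imputation is a groupby transform; a minimal sketch, assuming the same train dataframe:
# Fill missing ages with the mean age of each passenger's class in one line
# (does the same thing as the fill_age function above)
train['Age'] = train['Age'].fillna(train.groupby('Pclass')['Age'].transform('mean'))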
Replace Categorical Variables¶
train.head(2)
train.dtypes[train.dtypes.map(lambda x: x == 'object')]
I'll get rid of 'PassengerId' since it is just another index. 'Name' and 'Ticket' are also not useful data. I will remove those variables and provide dummy variables for 'Sex' and 'Embarked'.
dum_sex = pd.get_dummies(train['Sex'],drop_first=True)
dum_embarked = pd.get_dummies(train['Embarked'],drop_first=True)
train.drop(['PassengerId','Name','Sex','Ticket','Embarked'],axis=1,inplace=True)
Now I combine my dummy variables with my updated dataframe.
train = pd.concat([train,dum_sex,dum_embarked],axis=1)
train.head()
Create Machine Learning Model¶
I was torn over which model to use, but I ended up choosing logistic regression for a few reasons.
Logistic regression is a simple algorithm and the go-to method for binary classification. This model passes the output of a linear regression through the sigmoid (logistic) function, turning it into a probability usable for binary classification:
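$$p = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$$
Here z is the familiar linear combination of the features and p is read as the probability of survival (this is the standard textbook form of the logistic function, written out here for reference).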
Essentially, if the predicted probability p (of survival, in this case) is 0.5 or greater, the passenger is assigned a value of 1; if it is below 0.5, they are assigned a value of 0. Thus, binary classification occurs.
from sklearn.linear_model import LogisticRegression
Fit training data to logistic regression model:¶
logmodel = LogisticRegression()
logmodel.fit(train.drop('Survived',axis=1),
train['Survived'])
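As a quick sanity check (not part of the original notebook), the fitted model can be scored on the training set itself; this only measures in-sample accuracy, not how well the model generalizes:
# Accuracy on the data the model was trained on (an optimistic estimate)
logmodel.score(train.drop('Survived',axis=1), train['Survived'])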
Bring in test data:¶
test = pd.read_csv('test.csv')
test.tail()
sns.heatmap(test.isnull(),yticklabels=False,cbar=False,cmap='plasma')
The test data has no missing 'Embarked' values, but it is missing a fare value! I'll have to find this value and fill it in appropriately.
test[test['Fare'].isnull()]
The missing value is third class, so I'll take the average of that class and impute the missing value.
test.loc[152,'Fare'] = test[test['Pclass']==3]['Fare'].mean()
test['Fare'][152]
- apply the fill_age function to fill in 'Age'
- get dummy variables for categorical variables
- create new data frame with these new values
test['Age'] = test[['Age','Pclass']].apply(fill_age,axis=1)
dum_sex = pd.get_dummies(test['Sex'],drop_first=True)
dum_embarked = pd.get_dummies(test['Embarked'],drop_first=True)
test.drop(['Name','Sex','Ticket','Cabin','Embarked'],axis=1,inplace=True)
test = pd.concat([test,dum_sex,dum_embarked],axis=1)
test.head()
I now have my data set up to run predictions on my test data!
y_pred = logmodel.predict(test.drop('PassengerId',axis=1))
y_pred
Create Submission File¶
submission = pd.DataFrame({
'PassengerId': test['PassengerId'],
'Survived': y_pred
})
submission.to_csv('titanic.csv', index=False)
submission.head()
Compare with the given example submission called 'gender_submission':
example = pd.read_csv('gender_submission.csv')
example.head()
The submission looks good! Let's submit it and see how I did:
75.598% prediction accuracy. Not bad, though there's still lots of work to be done. I think I can improve my accuracy through feature engineering (maybe a column for whether someone is a child), gradient boosting, or using a different machine learning model.
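For example, a child indicator could be added with one line; a minimal sketch, where the 'IsChild' column name and the under-16 cutoff are my own choices for illustration:
# Flag passengers under 16 as children (cutoff chosen to match the age histogram above)
train['IsChild'] = (train['Age'] < 16).astype(int)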
I hope you enjoyed this post! I'd love to hear any comments, questions, or suggestions down below. Until next time!