introduction

The kaggle Titanic competition is the ‘hello world’ exercise for data science. Its stated goal is to

Predict survival on the Titanic using Excel, Python, R & Random Forests

In this post I will go over my solution, which scores 0.79426 on the kaggle public leaderboard. The code can be found on github. In short, my solution uses soft majority voting over logistic regression and random forest classifiers.

pre-processing

For this dataset, the pre-processing step includes the following operations:

  • imputing missing values
  • mapping nominal features

The data can be loaded with pandas’ read_csv() function.

import pandas as pd

# load data
df = pd.read_csv("train.csv")
df.info(null_counts=True)

The info(null_counts=True) call gives an overview of the data types of the columns and their non-null counts, as seen below.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

Several columns do not provide much information about the passenger’s survival, so I drop them: the passenger ID PassengerId, the name Name, and the ticket number Ticket. The Cabin column is also dropped because of its incompleteness: only 204 of the 891 entries are present.

There are missing values in the Age and Embarked columns. I fill them with the following code.

# fill in missing values
df.Age.fillna(df.Age.median(), inplace=True)
df.Embarked.fillna('S', inplace=True)

# drop columns that carry little information about survival
df.drop(['Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1, inplace=True)

# one-hot encoding on nominal features 'Sex' and 'Embarked'
df = pd.get_dummies(df)

Here the missing Age values are filled with the median age, and the missing Embarked values are filled with 'S', which occurs 644 times out of the 889 existing entries.

The two columns Embarked and Sex contain string values that are nominal categorical features. The get_dummies() function converts them into one-hot encoded vectors. For example, the Sex column is expanded into two columns Sex_female and Sex_male whose values are either 0 or 1.
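
To make the encoding concrete, here is a minimal, self-contained sketch (not part of the original solution; the toy frame is made up) of what get_dummies() does to a nominal column:

import pandas as pd

# a toy frame with a single nominal column
toy = pd.DataFrame({'Sex': ['male', 'female', 'female']})

# expands 'Sex' into the indicator columns 'Sex_female' and 'Sex_male'
print(pd.get_dummies(toy))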

training

My strategy is to do majority voting with multiple classifiers, which is conveniently implemented by sklearn.ensemble.VotingClassifier. The logistic regression and random forest classifiers have accuracies of 0.796 and 0.832 respectively, and the combined result has an accuracy of 0.833 on the training data. The relevant code is shown below.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import cross_val_score

predictors = df.columns.tolist()
predictors.remove('Survived')

clf1 = LogisticRegression(C=1, random_state=1, warm_start=True)
clf2 = RandomForestClassifier(random_state=1, n_estimators=10,
                min_samples_split=5, min_samples_leaf=2, max_features=3)
eclf = VotingClassifier([('lr', clf1), ('rf', clf2)], voting='soft')

scores = cross_val_score(eclf, df[predictors], df['Survived'], cv=15)
print('CV accuracy: %.3f +/- %.3f' % (scores.mean(), scores.std()))

I also played with more classifiers, such as SVM, naive Bayes, and K nearest neighbors. Each of them gives an accuracy of around 0.8, but when they are combined into the ensemble classifier, the performance gets slightly worse on the training data. In fact, even in the two-classifier case, the ensemble does not always perform better than the individual classifiers. My guess is that all these classifiers are correct on the same, say, 75% of the data. Recall each classifier gets about 80% of the data right; if the extra 5% of correct predictions from the different methods does not overlap, majority voting cannot help, because no example outside the shared 75% is ever classified correctly by a majority.

Let me demonstrate this explanation in the following pictures. For simplicity, let’s assume there are 3 classifiers for voting. When used individually, each gets 80% of the data right.

In the pictures, the horizontal direction denotes the data and the three horizontal bars denote the three classifiers. The blue regions denote the parts of the data where the classifiers give the correct result. Ideally, we want something like the picture below to happen, so that 100% accuracy is achieved with majority voting.

In the worst case, the three classifiers get the correct result on the same 70% of the data and the remaining three 10% chunks do not overlap. Then the ensemble classifier only achieves 70% accuracy, which is worse than the accuracy of the individual classifiers!

It’s not clear to me how to make sure the first case rather than the second case happens for our classifiers. Please drop me a comment if you have insights on this.
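
One way to probe how much the correct predictions of the two classifiers overlap is to compare their out-of-fold predictions. The sketch below is my own addition and assumes the df, predictors, clf1 and clf2 objects defined earlier:

from sklearn.model_selection import cross_val_predict

# out-of-fold predictions for each individual classifier
pred_lr = cross_val_predict(clf1, df[predictors], df['Survived'], cv=15)
pred_rf = cross_val_predict(clf2, df[predictors], df['Survived'], cv=15)

correct_lr = pred_lr == df['Survived'].values
correct_rf = pred_rf == df['Survived'].values

# fraction of passengers both classifiers get right vs. exactly one gets right
print('both correct:     %.3f' % (correct_lr & correct_rf).mean())
print('only one correct: %.3f' % (correct_lr ^ correct_rf).mean())

If the ‘only one correct’ fraction is small, the classifiers mostly succeed and fail on the same passengers, and voting has little room to help.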

feature engineering

Up to this point, my score on the kaggle public leaderboard is 0.78469. To get a better result, I tried several tricks from various blogs, for example:

  • Ultraviolet Analytics’ blog
  • Trevor Stephens’ blog

The one that worked for me is to add a title feature. The idea is that the survival rate may depend on how ‘important’ the person is.

title_map = {'Ms': 'Mrs', 'Mlle': 'Mrs', 'Mme': 'Mrs', 'Capt': 'Sir',
             'Dr': 'Sir', 'Rev': 'Sir', 'Major': 'Sir', 'Col': 'Sir', 'Don': 'Sir',
             'the Countess': 'Lady', 'Jonkheer': 'Lady', 'Dona': 'Lady'}
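
The extraction step is not shown above, so here is a rough sketch of how the map might be applied (my own code; the raw and Title names are mine, and it assumes the raw train.csv where the Name column is still present):

# re-load the raw data so the Name column is available
raw = pd.read_csv("train.csv")

# the title sits between the comma and the period,
# e.g. 'Braund, Mr. Owen Harris' -> 'Mr'
raw['Title'] = raw['Name'].str.split(', ').str[1].str.split('.').str[0]

# collapse rare titles; common ones like 'Mr', 'Mrs', 'Miss', 'Master' stay as they are
raw['Title'] = raw['Title'].replace(title_map)

# one-hot encode the new feature like the other nominal columns
raw = pd.get_dummies(raw, columns=['Title'])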

After adding the title feature, my score was boosted to 0.79426.

what next?

According to the kaggle forum, a submission with 82-84% accuracy is considered very good. Anything beyond 85% may require some cheating.

It will be nice to push my score beyond 80%. There are several possibilities:

  • more feature engineering??
    • Extract the title from Name column, like here
    • Maybe also family size
    • Turn Age into child/adult (see the rough sketch after this list)
  • fine tuning the individual classifiers more
  • change the classifier
  • stacking
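
For concreteness, here is a rough sketch of the family-size and child/adult ideas (hypothetical code, not part of the submitted solution; the FamilySize and IsChild names are mine):

# hypothetical extra features, computed on the raw training frame
raw = pd.read_csv("train.csv")

# family size = siblings/spouses + parents/children + the passenger themselves
raw['FamilySize'] = raw['SibSp'] + raw['Parch'] + 1

# coarse child/adult flag instead of the raw Age value
# (missing ages end up as 'adult' here; they should really be imputed first)
raw['IsChild'] = (raw['Age'] < 18).astype(int)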

Please let me know if you have other suggestions on improving the score.