
Rapid prototyping: Titanic passenger survival using fastai tabular learner

🚢 A deep learning solution for the Kaggle Titanic survival competition in under one hour
By Przemek, January 2024 (last update July 2024)

In this post we’re going to use fastai to solve the Titanic passenger survival prediction problem on Kaggle.

The landing page of the Titanic competition on Kaggle. It introduces the competition where the task is to predict the survival outcome of the Titanic passengers.

Titanic passenger survival

In this prediction problem, we’re given descriptions of Titanic passengers, along with their survival outcome (survived vs died). Based on this, we need to train a machine learning model that is later evaluated on another passenger data file, where the survival outcomes are hidden (Kaggle knows them, but we don’t).

The training file (first few lines):

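| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S | 0 |
| 2 | 1 | Cumings, Mrs. John Bradley | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 1 |
| 3 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S | 1 |
| 4 | 1 | Futrelle, Mrs. Jacques Heath | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 1 |
| 5 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S | 0 |
| 6 | 3 | Moran, Mr. James | male | | 0 | 0 | 330877 | 8.4583 | | Q | 0 |
| 7 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S | 0 |
| 8 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | | S | 0 |
| 9 | 3 | Johnson, Mrs. Oscar W | female | 27.0 | 0 | 2 | 347742 | 11.1333 | | S | 1 |
| 10 | 2 | Nasser, Mrs. Nicholas | female | 14.0 | 1 | 0 | 237736 | 30.0708 | | C | 1 |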

fastai

We’re going to use fastai.

fastai is a high-level deep learning library providing ready-to-use neural network architectures for a number of common problem types. The downside of using such a library is that we don’t get to learn all the fascinating details of how neural networks are designed and trained. The upside is that we get a working solution in no time.

The welcome banner on the fastai documentation website

Behind the scenes, fastai uses the popular PyTorch library. So overall our tech stack will look like this:

  • 🤖 fastai: high-level deep learning library providing ready-to-use neural network architectures
  • 🔥 PyTorch: low-level deep learning library. It provides an optimized framework for defining and running neural networks
  • 💻 Hardware: the actual CPU and GPU on the computer

Tabular learner

fastai comes with ready-to-go neural network architectures for common classes of problems. One of these classes is for making predictions based on tabular data, which is exactly what we need for the Titanic competition.

fastai tabular learner documentation

To use the Tabular learner, we need to feed it the passenger data. The framework can handle two types of features:

  • continuous: where the value of the feature is a number on some numeric scale. For example: age, ticket price
  • categorical: where the values represent abstract categories that are not part of a numeric scale. For example: port of embarkation, ticket class, sex

Continuous or categorical? Sometimes it’s not obvious which type a given feature is. For example, we could argue that “ticket class” is a continuous variable. However, to me it’s more realistic to model it as a categorical variable: different ticket classes may have different accommodations (cabins in different parts of the ship) that could affect survival chances in a way that doesn’t fit a neat numeric scale.
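One quick way to inform this choice is to count how many distinct values each column takes. The sketch below is only a heuristic, run on a tiny hand-made DataFrame with values copied from the first rows of the training file; the helper name summarize_columns is made up for this illustration.

```python
import pandas as pd

# A few rows copied from the training file, for illustration only
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
    "Embarked": ["S", "C", "S", "S", "S"],
})

def summarize_columns(df):
    """Return the dtype and the number of distinct values for each column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
    })

summary = summarize_columns(sample)
print(summary)
```

Columns with only a handful of distinct values (like Pclass) are natural candidates for the categorical list, even when they are stored as numbers. fastai also ships a helper along these lines, cont_cat_split, which splits columns by cardinality.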

There’s no obvious way of handling opaque features such as ticket number, passenger name or cabin number, so we’re going to simply ignore them. Let’s see some code! The train.csv data file comes from Kaggle.

import pandas as pd

df = pd.read_csv('../input/titanic/train.csv')

# Drop the opaque features we're going to ignore
df = df.drop(['Name', 'Cabin', 'Ticket'], axis=1)

df.columns

This produces the list of the remaining features: ['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']. Let’s configure the tabular learner, indicating which features are continuous, which are categorical, and which one is the target variable we want to predict.

from fastai.tabular.all import *

splits = RandomSplitter(valid_pct=0.2)(range_of(df))
to = TabularPandas(df, procs=[Categorify, FillMissing, Normalize],
                   cat_names = ['Pclass', 'Sex', 'Embarked'],
                   cont_names = ['Age', 'Fare', 'SibSp', 'Parch'],
                   y_names='Survived',
                   y_block=CategoryBlock,
                   splits=splits)

fastai takes care of preprocessing:

  • FillMissing replaces missing values in continuous features with a typical value for that feature (the median by default) and adds a flag column, such as Age_na, recording which entries were filled in. This way we don’t have to discard an entire passenger if we’re missing an entry for one of their features
  • Normalize rescales continuous variables so that each has a mean of 0 and a standard deviation of 1. This helps the neural network train better (bigger numbers tend to grow too much when they’re multiplied together).
  • Categorify converts each categorical variable into integer codes, which the network then turns into embeddings, more on this below
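To make the three preprocessing steps more concrete, here is a rough pandas imitation of what each one does. This is a simplified sketch on a hand-made four-row frame, not fastai's actual implementation; the column names Age_norm and Embarked_code are invented for the illustration.

```python
import pandas as pd

# Tiny illustrative frame with one missing age (like passenger 6 in train.csv)
df_demo = pd.DataFrame({
    "Age": [22.0, 38.0, None, 54.0],
    "Embarked": ["S", "C", "Q", "S"],
})

# FillMissing (roughly): remember which rows had a gap, then fill it with the median
df_demo["Age_na"] = df_demo["Age"].isna()
df_demo["Age"] = df_demo["Age"].fillna(df_demo["Age"].median())

# Normalize (roughly): subtract the mean and divide by the standard deviation
age_mean, age_std = df_demo["Age"].mean(), df_demo["Age"].std()
df_demo["Age_norm"] = (df_demo["Age"] - age_mean) / age_std

# Categorify (roughly): replace each category with an integer code.
# Code 0 is left free for missing values (assumption: mirrors fastai's "#na#" slot).
categories = sorted(df_demo["Embarked"].unique())
codes = {cat: i + 1 for i, cat in enumerate(categories)}
df_demo["Embarked_code"] = df_demo["Embarked"].map(codes)

print(df_demo)
```

In the real pipeline these statistics (median, mean, standard deviation, category list) are learned from the training split and then reused for validation and test data.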

Peeking inside the box

To use the Tabular learner, we don’t need to understand the neural network behind it. But it’s instructive to take a peek!

dls = to.dataloaders(bs=64)
learn = tabular_learner(dls, metrics=accuracy)
print(learn.model)

This prints out a detailed description of the underlying PyTorch neural network that fastai set up:

TabularModel(
  (embeds): ModuleList(
    (0): Embedding(4, 3)
    (1): Embedding(3, 3)
    (2): Embedding(4, 3)
    (3): Embedding(3, 3)
  )
  (emb_drop): Dropout(p=0.0, inplace=False)
  (bn_cont): BatchNorm1d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (layers): Sequential(
    (0): LinBnDrop(
      (0): Linear(in_features=16, out_features=200, bias=False)
      (1): ReLU(inplace=True)
      (2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): LinBnDrop(
      (0): Linear(in_features=200, out_features=100, bias=False)
      (1): ReLU(inplace=True)
      (2): BatchNorm1d(100, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (2): LinBnDrop(
      (0): Linear(in_features=100, out_features=2, bias=True)
    )
  )
)

Let’s look at the main pieces:

The four “embedding” modules. There is one for each “categorical” variable in our input: ticket class, sex and port of embarkation, plus a fourth one for the Age_na flag that FillMissing adds to mark passengers whose age was missing. Embeddings represent each category as an abstract vector of numbers, allowing the model to learn hidden relations between category elements. For example, if people who embarked in Southampton had similar survival outcomes to those who embarked in Cherbourg, but different from those in Queenstown, the model will be able to learn that.
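Conceptually, an embedding is just a lookup table with one small vector per category. The plain-Python sketch below imitates Embedding(4, 3), assuming it belongs to the port of embarkation (three ports plus a slot for missing values, which fastai labels "#na#"). The numbers are random starting values; in the real network they are adjusted during training.

```python
import random

random.seed(0)  # make the illustration repeatable

# Three ports plus a slot for missing values, matching Embedding(4, 3)
categories = ["#na#", "C", "Q", "S"]
embedding_dim = 3

# The embedding table: one small vector of trainable numbers per category
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)]
    for _ in categories
]

def embed(category):
    """Look up the vector for a category by its integer index."""
    index = categories.index(category)
    return embedding_table[index]

print(embed("S"))
```

If two ports end up with similar vectors after training, the network treats passengers from those ports similarly, which is how it can pick up relations like the Southampton and Cherbourg example.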

Linear modules. Linear(in_features=16, out_features=200) is the core of the network. Here the 16 input values for each passenger (12 numbers coming out of the four embeddings plus the 4 continuous features) are connected to 200 artificial neurons: mathematical formulas that will try to learn relations between data points and their survival outcomes. We also have a second layer of these, connecting 200 neurons in layer 1 with 100 neurons in layer 2.
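The number 16 can be reproduced from the printed model: each of the four embeddings outputs 3 numbers, and the four continuous features contribute one number each. The short calculation below also counts the weights in the two bias-free linear layers; the shapes are read directly off the printout above.

```python
# (number of categories, embedding size) pairs read off the printed model
embedding_shapes = [(4, 3), (3, 3), (4, 3), (3, 3)]
n_continuous = 4  # Age, Fare, SibSp, Parch

# Each embedding contributes its output size; continuous features contribute one each
n_inputs = sum(size for _, size in embedding_shapes) + n_continuous

# Weights in the bias-free linear layers: in_features * out_features
first_layer_weights = n_inputs * 200
second_layer_weights = 200 * 100

print(n_inputs)              # 16
print(first_layer_weights)   # 3200
print(second_layer_weights)  # 20000
```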

Output. At the end, the last linear layer connects the 100 neurons in layer 2 to just 2 output features, corresponding to two survival outcomes: survived or perished.
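The two output numbers are raw scores, not probabilities. The usual way to turn such scores into probabilities for classification is the softmax function; the sketch below assumes that convention and uses made-up scores for a single passenger.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # improves numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw outputs for one passenger: [score for "died", score for "survived"]
raw_outputs = [0.3, 1.2]
probabilities = softmax(raw_outputs)
predicted_class = probabilities.index(max(probabilities))  # 0 = died, 1 = survived

print(probabilities, predicted_class)
```

The predicted outcome is simply the class with the higher probability.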

With the neural network in place, we need just one more line of code to train it on the training data:

learn.fit_one_cycle(20)
fastai training loop output
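Once the model is trained, its predictions need to be written in the format Kaggle expects: one row per test passenger with PassengerId and Survived columns. The sketch below shows that last step on made-up probabilities; the helper name build_submission is invented here, and the commented-out fastai calls show where the real probabilities would likely come from, assuming test_df holds Kaggle's test.csv.

```python
import pandas as pd

def build_submission(passenger_ids, survival_probs):
    """Build a Kaggle-style submission from per-passenger [p_died, p_survived] pairs."""
    predictions = [int(p_survived > p_died) for p_died, p_survived in survival_probs]
    return pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# Hypothetical values; in practice they would come from the trained learner, e.g.
#   test_dl = learn.dls.test_dl(test_df)
#   probs, _ = learn.get_preds(dl=test_dl)
example_ids = [892, 893, 894]
example_probs = [[0.9, 0.1], [0.4, 0.6], [0.7, 0.3]]

submission = build_submission(example_ids, example_probs)

# Print the CSV text; pass a filename to to_csv to write the file instead
print(submission.to_csv(index=False))
```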

Results

When submitted on Kaggle, the fastai neural network solution reaches an accuracy of 78% out of the box.

fastai tabular learner gets 78% accuracy on Kaggle Titanic competition out of the box

This is much better than the baseline 62% for a solution that simply predicts that everyone dies, and a bit better than the 76% I got a few months back when experimenting with decision forests. Not bad!
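The "everyone dies" baseline is an instance of always predicting the most common outcome. A minimal sketch of how such a baseline accuracy is computed, using made-up labels rather than the real competition data:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# Hypothetical outcomes (0 = died, 1 = survived), not the real competition labels
example_labels = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]

label, accuracy = majority_baseline_accuracy(example_labels)
print(label, accuracy)  # 0 0.7
```

Any model worth keeping should beat this number, which is why it makes a useful reference point.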

Conclusion

With a high-level framework like fastai, it’s easy to train a neural network for a specific problem using one of the built-in architectures. These can be used as black boxes, but we can also peek at the underlying neural network architecture to understand how it works. In either case, rapid prototyping can be very helpful regardless of our expertise level, as it allows us to quickly try and compare different solutions.

Happy training!


