An introduction to Pytorch and Fastai v2 on the MNIST dataset.
Building a digit classifier with deep learning.
- How?
- Who does this blog post concern?
- Type of model built
- Why?
- Downloading the data
- The pytorch way
- The Fastai way
- Evaluating our model's inference on the testing dataset!
- How was it?
How?
We will build a deep learning model for digit classification on the MNIST dataset using the Pytorch library first and then using the fastai library based on Pytorch to showcase how easy it makes building models.
Who does this blog post concern?
This is addressed to people who have some basic knowledge of deep learning and want to start building models. I will explain some aspects of deep learning, but don't expect a full course starting from scratch!
Type of model built
We won't create a brand new architecture for our neural net. In the first part, using Pytorch, we will only use linear layers with some non-linearity between them; no convolutions or anything fancy. We aren't aiming to build a state-of-the-art model.
Why?
I made this as part of the homework recommendation from the Deep Learning for Coders with Fastai and PyTorch book I am currently reading. Go check it out!
from fastai.vision.all import *
import torchvision
import torchvision.transforms as transforms
from livelossplot import PlotLosses
URLs.MNIST
Using fastai's untar_data procedure, we will download and decompress the data from the above URL in one go. The data will only be downloaded the first time.
Take a look at the documentation if you want to learn more.
path = untar_data(URLs.MNIST, dest="/workspace/data")
Path.BASE_PATH = path
path.ls()
As you can see, the data was already split into a training and testing dataset for our convenience! Let's take a peek into what is inside.
(path/"training").ls()
We have a different directory for every digit, each of them containing images (see below) of their corresponding digit.
This makes labeling easy. The label of each image is the name of its parent directory!
(path/"training/1").ls()
For example, the 1 directory contains 6742 images. One is displayed below.
image = Image.open((path/"training/1").ls()[0])
image
image.size
image.mode
This image and all the others in the data we just downloaded are 28x28 grayscale images ('L' mode means gray-scale).
transform = transforms.Compose(
[transforms.Grayscale(), transforms.ToTensor(), transforms.Normalize([0.5], [0.5])]
)
Above are the transformations we will make to each of the images when creating our Pytorch datasets.
Step 1: Converting into a grayscale image, i.e. fusing the RGB color channels into a grayscale one (from what would be a [3, 28, 28] tensor to a [1, 28, 28]).
The default loader parameter of ImageFolder (see next cell) loads 3 channels even if the original image only has one. I didn't bother creating a custom loader, so this conversion does the trick.
Step 2: Converting the grayscale image (with pixel values in the range [0, 255]) into a 3-dimensional [1, 28, 28] PyTorch tensor (with values in the range [0, 1]).
Step 3: We normalize with mean = 0.5 and std = 0.5 to get values from pixels in the range [-1, 1]. (pixel = (image - mean) / std maps 0 to -1 and 1 to 1).
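To see these steps in action, here is a quick sketch (assuming the image variable from the cell above is still in scope) that applies the transform to our sample image and checks the resulting tensor:
x = transform(image)         # grayscale -> tensor -> normalize
x.shape, x.min(), x.max()    # torch.Size([1, 28, 28]), with values now in the [-1, 1] range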
full_dataset = torchvision.datasets.ImageFolder((path/"training").as_posix(), transform = transform)
# Splitting the above dataset into a training and validation dataset
train_size = int(0.8 * len(full_dataset))
valid_size = len(full_dataset) - train_size
training_set, validation_set = torch.utils.data.random_split(full_dataset, [train_size, valid_size])
# Dataset using the "testing" folder
testing_set = torchvision.datasets.ImageFolder((path/"testing").as_posix(), transform = transform)
We just built 3 datasets. A training dataset, a validation dataset and a testing dataset.
The images in the "training" folder were divided randomly into the training and validation datasets, with 80% and 20% of the images respectively.
Training dataset: Used to calculate our gradients and update our weights, using the loss obtained by forwarding the data through the network.
Validation dataset: Used to assess model performance on unseen data during the training. We tune our hyperparameters (learning rate, batch size, number of epochs, network structure etc.) to improve this performance.
Testing dataset: Used to get a final, unbiased, performance assessment. This data wasn't seen during the whole model building process.
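If you want to double-check the split, here is a quick sanity check (assuming the cells above ran as is):
len(training_set), len(validation_set), len(testing_set)
# MNIST's training folder holds 60000 images, so this should show roughly (48000, 12000, 10000)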
In pytorch, a "Data loader combines a dataset and a sampler, and provides an iterable over the given dataset". Look at the documentation to learn more.
bs = 64
The bs
variable above corresponds to the batch size. This is the number of observations forwarded at a time in our neural network (and used to calculate our mean loss and then our gradients for the training).
train_loader = torch.utils.data.DataLoader(training_set, batch_size=bs, shuffle=True)
validation_loader = torch.utils.data.DataLoader(validation_set, batch_size=bs)
dataloaders = {
    "train": train_loader,
    "validation": validation_loader
}
We created a training and a validation data loader that we will iterate over during our building process. The shuffle argument is set to True for the training data loader, meaning we reshuffle the data at every epoch.
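To see what a loader yields, we can grab a single batch and look at its shape (a small sketch; the exact images vary since the training loader is shuffled):
images, labels = next(iter(train_loader))
images.shape, labels.shape   # torch.Size([64, 1, 28, 28]) and torch.Size([64])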
Training a neural network
Deep learning is like making a dish. I like to see the neural network's architecture as the plates / the cutlery / cooking tools, the weights as the ingredients and the hyperparameters as the cooking time / temperature / seasoning etc.
Creating the architecture
Without the proper tools, it would be impossible to make the dish you want and for it to be good, even if you found all the ingredients that satisfy your needs.
pytorch_net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28*28, 128),
    nn.ReLU(),
    nn.Linear(128, 50),
    nn.ReLU(),
    nn.Linear(50, 10),
    nn.LogSoftmax(dim=1))
Here we chose a simple but good enough network architecture. It may not be state of the art but as you will see, it still performs quite well!
Flatten
: flattens our [1,28,28] tensor into a [1,784] tensor. Our model doesn't care that it was a square image to start with; it just sees numbers, and as long as the same pixel in our original image gets mapped to the same input variable (one of the 784 values) each time, our model will be able to learn.
We won't be doing any spatial treatment (like convolution, pooling etc.), so we just start by turning our input tensor into a feature vector that will be fed to our classifier.
Linear
: linear layer with an additive bias (bias
parameter is set to True
by default).
ReLU
: stands for rectified linear unit, an activation function, also called a nonlinearity. It replaces every negative number with 0. By adding a nonlinear function between each pair of linear layers, they become somewhat decoupled from each other and can each do their own useful work. In other words, with nonlinearities between linear layers we can learn nonlinear relations!
LogSoftmax
: applies log(Softmax(x)) to the last layer. Softmax maps all the values to [0, 1] so that they add up to 1 (a probability distribution); log(Softmax) then maps these values to [-inf, 0].
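To make these two less abstract, here is a tiny sketch of what they do to a hand-made tensor:
t = torch.tensor([[-2.0, 0.5, 3.0]])
print(nn.ReLU()(t))                   # tensor([[0.0000, 0.5000, 3.0000]]) : negatives are clamped to 0
print(nn.LogSoftmax(dim=1)(t).exp())  # exponentiating the log-probabilities gives probabilities summing to 1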
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
lr = 1e-2
nb_epoch = 77
Before moving on, we should define a bunch of variables.
"A torch.device
is an object representing the device on which a torch.Tensor
is or will be allocated". Head here for more info.
Here we want to perform our computations on a GPU if it is available.
lr
is our learning rate hyperparameter representing the size of the step we take when applying SGD.
nb_epoch
is our number of epochs, meaning the number of complete passes through the training dataset.
optimizer = torch.optim.SGD(pytorch_net.parameters(), lr=lr)
The optimizer
object above will handle the stochastic gradient descent (SGD) step for us. We need to pass it our model's parameters (so it can update them during each step) and a learning rate.
criterion = nn.NLLLoss()
We chose pytorch's nn.NLLLoss()
for our loss function. It stands for negative log likelihood loss and is useful to train a classification problem with more than 2 classes. It expects log-probabilities as input for each class, which is our case after applying LogSoftmax
.
Note that instead of putting a LogSoftmax layer at the end of our network and using NLLLoss, we could have used CrossEntropyLoss, a loss that combines the two into one single class. Read the doc for more.
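Here is a small sketch (with made-up logits) showing that the two formulations give the same value:
logits = torch.randn(4, 10)            # fake network outputs for a batch of 4 images
targets = torch.tensor([3, 0, 7, 1])   # fake labels
loss_a = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)
loss_b = nn.CrossEntropyLoss()(logits, targets)
print(loss_a, loss_b)                  # both losses match (up to floating point precision)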
def train_model(model, criterion, optimizer, dataloaders, num_epochs=10):
    liveloss = PlotLosses()  # Live training plot generic API
    model = model.to(device)  # Moves and/or casts the parameters and buffers to device.

    for epoch in range(num_epochs):  # Number of passes through the entire training & validation datasets
        logs = {}
        for phase in ['train', 'validation']:  # First train, then validate
            if phase == 'train':
                model.train()  # Set the module in training mode
            else:
                model.eval()  # Set the module in evaluation mode

            running_loss = 0.0  # keep track of loss
            running_corrects = 0  # count of correctly classified inputs

            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)  # Perform Tensor device conversion
                labels = labels.to(device)

                outputs = model(inputs)  # forward pass through network
                loss = criterion(outputs, labels)  # Calculate loss

                if phase == 'train':
                    optimizer.zero_grad()  # Set all previously calculated gradients to 0
                    loss.backward()  # Calculate gradients
                    optimizer.step()  # Step on the weights using those gradients: w -= gradient(w) * lr

                _, preds = torch.max(outputs, 1)  # Get model's predictions
                running_loss += loss.detach() * inputs.size(0)  # multiply mean loss by the number of elements
                running_corrects += torch.sum(preds == labels.data)  # add number of correct predictions to total

            epoch_loss = running_loss / len(dataloaders[phase].dataset)  # get the "mean" loss for the epoch
            epoch_acc = running_corrects.float() / len(dataloaders[phase].dataset)  # Get proportion of correct predictions

            # Logging
            prefix = ''
            if phase == 'validation':
                prefix = 'val_'

            logs[prefix + 'log loss'] = epoch_loss.item()
            logs[prefix + 'accuracy'] = epoch_acc.item()

        liveloss.update(logs)  # Update logs
        liveloss.send()  # draw, display stuff
We now have everything needed to cook our meal! The actual cooking takes place in the above function, which handles both the training and the validation phase.
The above graph illustrates what is going on during our training phase. We use our model to make predictions and calculate our loss (NLLLoss here) based on the real labels, then calculate the gradients using loss.backward() (which computes dloss/dx for every parameter x that has requires_grad=True, which is the case for the nn.Parameter objects we use under the hood), and finally step the weights with our optimizer before repeating the process.
The stop condition in our case is just the number of epochs.
The validation phase is basically the same process, without calculating gradients and stepping, since we are only interested in measuring model performance.
(Note: layers such as BatchNorm and Dropout behave differently during training and evaluation, hence the model.train() / model.eval() calls. Our model doesn't contain any of these layers, but it is still a good habit.)
train_model(pytorch_net, criterion, optimizer, dataloaders, nb_epoch)
After our 77 epochs we get 97.7% accuracy on the validation data, which is very good for a model as simple as this one!
Even though our validation loss and accuracy stabilized after around 50 epochs, I kept going for a few more epochs just in case I could squeeze a bit more out.
torch.save(pytorch_net, 'models/pytorch-97.7acc.pt')
Let's save our trained model for inference using torch.save
.
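To load it back later for inference, a minimal sketch (assuming the file saved above):
restored_net = torch.load('models/pytorch-97.7acc.pt')  # may need map_location=device on a CPU-only machine
restored_net.eval()                                      # switch to evaluation mode before inference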
The Fastai way
"fastai is a deep learning library which provides practitioners with high-level components that can quickly and easily provide state-of-the-art results in standard deep learning domains, and provides researchers with low-level components that can be mixed and matched to build new approaches.".
Read the docs to learn more!
Data preparation
block = DataBlock(
    blocks=(ImageBlock, CategoryBlock),
    get_items=get_image_files,
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=parent_label,
    batch_tfms=aug_transforms(mult=2., do_flip=False))
The DataBlock
class is a "generic container to quickly build Datasets
and DataLoaders
".
- blocks: this is the way of telling the API that our inputs are images and our targets are categories. Types are represented by blocks; here we use ImageBlock and CategoryBlock for inputs and targets respectively.
- get_items: expects a function to assemble our items inside the data block. get_image_files searches subfolders for all image filenames, recursively.
- splitter: controls how our validation set is created. RandomSplitter splits items randomly between training and validation (with a valid_pct portion in validation).
- get_y: expects a function to label data according to the file name. parent_label labels items with the name of the parent folder.
- batch_tfms: these are transformations applied to batched data samples on the GPU. aug_transforms is a "utility function to create a list of flip, rotate, zoom, warp and lighting transforms". (Here we disabled flipping because we don't want to train on mirrored digits, and we use twice the default amount of augmentation.) These augmentations are only applied to the training set; we don't want to evaluate our model's performance on distorted images.
"In fact, an entirely untrained neural network knows nothing whatsoever about how images behave. It doesn’t even recognize that when an object is rotated by one degree, it still is a picture of the same thing! So actually training the neural network with examples of images where the objects are in slightly different places and slightly different sizes helps it to understand the basic concept of what an object is, and how it can be represented in an image." (Deep Learning for Coders with Fastai and PyTorch)
This doesn't actually build the datasets and data loaders, since we haven't given it our images yet. But once we do, it knows exactly how to deal with them!
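For instance, we can already ask the block to build plain Datasets from the training folder and peek at one labeled item (a quick sketch, reusing the path from earlier):
dsets = block.datasets(path/"training")   # builds training and validation Datasets from the folder
dsets.train[0]                            # an (image, category) pair, labeled from the parent folder name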
loaders = block.dataloaders(path/"training")
loaders.train.show_batch(max_n=4, nrows=1)
block.dataloaders
creates a DataLoaders
object from the source we give it. Here we gave it the training folder.
The sample of images shown are outputs from the created training data loader. As you can see, they are correctly labeled and quite distorted due to the batch augmentations we made.
We use the default value of 64 for our batch size (bs
parameter).
learn = cnn_learner(loaders, resnet34, metrics=accuracy)
cnn_learner
builds a convolutional neural network style learner from dataloaders and an architecture. In our case we use the ResNet architecture. The 34 refers to the number of layers in this variant of the architecture.
cnn_learner
has a parameter called pretrained
which defaults to True
, that sets the weights in our model to values already trained by experts to recognize thousands of categories on the ImageNet dataset.
When using a pretrained model, cnn_learner will remove the last layer since that is always specifically customized to the original training task (i.e. ImageNet dataset classification), and replace it with one or more new layers with randomized weights (called the head), of an appropriate size for the dataset you are working with.
learn.lr_find()
learn.lr_find()
explores learning rates in a given range ([1e-7, 10] by default) over a number of iterations (100 default) and plots the loss versus the learning rates on a log scale.
We can then look at the lr_min and lr_steep indicators above and choose a learning rate between them.
learn.fine_tune(12, base_lr=1e-2, cbs=[ShowGraphCallback()])
learn.fine_tune
: "Fine tune with freeze
for freeze_epochs
then with unfreeze
for epochs
using discriminative LR" (docs)
By default a pretrained Learner
is in a frozen state, meaning that only the head of the model will train while the body stays frozen.
To sum up, fine_tune
trains the head (automatically added by cnn_learner
with random weights) without the body for a few epochs (defaults to 1) and then unfreezes the Learner
and trains the whole model for a number of epochs (here we chose 12) using discriminative learning rates (which means it applies different learning rates for different parts of the model).
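If you prefer seeing it spelled out, here is a rough sketch of what fine_tune does under the hood (simplified learning rates, not the exact fastai implementation, and not meant to be run on top of the call above):
learn.freeze()                                    # body frozen, only the head trains
learn.fit_one_cycle(1, slice(1e-2))               # freeze_epochs=1 by default
learn.unfreeze()                                  # the whole model becomes trainable
learn.fit_one_cycle(12, slice(1e-2/100, 1e-2))    # discriminative learning rates: smaller for early layers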
cbs
expects a list of callbacks. Here we passed ShowGraphCallback
which updates a graph of training and validation loss (as seen above).
CONGRATS!
After training our model for a while, we get around 99.5% accuracy on our validation set with minimal effort!
learn.export("models/fastai-99acc.pkl")
learn.export
saves the definition of how to create our DataLoaders
on top of saving the architecture and parameters of the model.
Saving the Dataloaders
allows us to transform the data for inference in the same manner as our validation set by default, so data augmentation will not be applied.
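Once exported, the learner can be loaded anywhere and used on a single image, for example (a sketch, assuming the exported file and one of the image paths from earlier):
learn_inf = load_learner("models/fastai-99acc.pkl")
pred, pred_idx, probs = learn_inf.predict((path/"training/1").ls()[0])
pred, probs[pred_idx]   # predicted label and the probability assigned to it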
interp = ClassificationInterpretation.from_learner(learn)
ClassificationInterpretation.from_learner()
constructs a ClassificationInterpretation
object from a learner
.
It gives a handful of interpretation methods for classification models.
interp.plot_confusion_matrix()
The above confusion matrix helps us visualize where our model made mistakes. It looks like the most confused pairs were 0 with 6, 6 with 8, 5 with 3 and 7 with 2.
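The same information can be read as a list with most_confused, which returns the (actual, predicted, count) combinations with the most errors:
interp.most_confused(min_val=3)   # only keep pairs confused at least 3 times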
interp.plot_top_losses(10)
We can also visualize which images resulted in the largest losses. It seems like the upper-left 9 was mislabeled as a 3, so our network was actually right. For some of these, even a human could have gotten them wrong!
def test_model(model, criterion, test_loader):
    model = model.to(device)  # Moves and/or casts the parameters and buffers to device.
    test_loss = 0.0  # keep track of loss
    test_corrects = 0  # count of correctly classified inputs

    with torch.no_grad():  # Disable gradient calculation
        for inputs, labels in test_loader:
            inputs = inputs.to(device)  # Perform Tensor device conversion
            labels = labels.to(device)

            outputs = model(inputs)  # forward pass through network
            loss = criterion(outputs, labels)  # Calculate loss

            _, preds = torch.max(outputs, 1)  # Get model's predictions
            test_loss += loss * inputs.size(0)  # multiply mean loss by the number of elements
            test_corrects += torch.sum(preds == labels.data)  # add number of correct predictions to total

    avg_loss = test_loss / len(test_loader.dataset)  # get the "mean" loss over the whole testing set
    avg_acc = test_corrects.float() / len(test_loader.dataset)  # Get proportion of correct predictions

    return avg_loss.item(), avg_acc.item()
Our testing procedure is basically the same as our validation phase from the training procedure apart from the absence of epochs. (To be expected since they serve the same purpose!)
We infer predictions from our inputs by batches, then calculate the loss from them (how "far" they were from the real labels) and record the loss and the number of correctly labeled inputs, before averaging it all at the end.
testing_loader = torch.utils.data.DataLoader(testing_set, batch_size=bs)
Creation of a testing DataLoader
to be passed to our testing procedure.
pytorch_loss, pytorch_accuracy = test_model(pytorch_net, criterion, testing_loader)
def print_loss_acc(loss, acc):
    print("Loss : {:.6f}".format(loss))
    print("Accuracy : {:.6f}".format(acc))
print_loss_acc(pytorch_loss, pytorch_accuracy)
The results on the testing data are approximately the same as on the validation set!
learn = load_learner('models/fastai-99acc.pkl')
test_dl = learn.dls.test_dl(get_image_files(path/"testing"), with_labels=True)
test_dl
creates a test dataloader from test_items
(list of image paths) using validation transforms of dls
.
We set with_labels
to True
because we want the labels of each image to check the inference accuracy of our model.
fastai_loss, fastai_accuracy = learn.validate(dl=test_dl)
learn.validate
returns the calculated loss and the metrics of the model on the dl
data loader.
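As an alternative sketch, we could also grab the raw predictions and compute the accuracy ourselves:
preds, targs = learn.get_preds(dl=test_dl)      # predicted probabilities and true labels for the test set
(preds.argmax(dim=1) == targs).float().mean()   # should match the accuracy reported by learn.validate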
print_loss_acc(fastai_loss, fastai_accuracy)
Our loss and accuracy are slightly better than on our validation set!
And that's about it! Not so hard, eh? We now have two digit classification models ready to be used for inference!
This is one of my first blog posts and it took me some time. Any feedback is welcome!
Was the whole model building process easier than expected?
Would you have done some parts a different way?
Was my explanation any good?
Please feel free to comment or annotate the text directly using Hypothes.is if you spotted any errors or have any questions!