Lecture 17 - Model Selection, Hyperparameter Tuning


17.1 Model Selection

Model selection in Machine Learning is selecting one final model for a given task, e.g., that will be deployed in production. In general, selecting the “best” model should be based not only on the obtained values of relevant performance metrics (accuracy, specificity, sensitivity), but also based on other considerations, such as computational expense, available resources, model complexity, maintainability, and similar.

An important phase in selecting a candidate Machine Learning model is hyperparameter tuning. In the previous lecture on scikit-learn, we mentioned that parameters (weights) of the model have values that are updated in an iterative process during training, whereas hyperparameters (tuning parameters) are a set of parameters that control the complexity and performance of the model, and are selected (tuned) by the user. The hyperparameters are not updated during model training, and they stay constant. Hyperparameter tuning (also known as hypertuning) is the process of selecting hyperparameter values to find a model that generalizes well to unseen data. In the previous lectures, we explained that scikit-learn offers functions for Grid Search and Random Search, which search for solutions over different values of the hyperparameters and select an optimal set of hyperparameters for a given performance metric.

Another important aspect of model selection is the evaluation of model performance. It is particularly important to evaluate candidate models on data that was not seen during training, and this is typically achieved by splitting the available data into training and test datasets. We also learned that k-fold cross-validation can be used to split the available data into k folds, and to evaluate the models k times, each time holding out a different fold for evaluation and training on the remaining folds. With k-fold cross-validation, each data point appears in exactly one of the held-out folds.
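As a minimal sketch of this idea, the following plain-Python code splits sample indices into k folds and shows that every index lands in exactly one validation fold. It is not the scikit-learn implementation, and the function name k_fold_indices is made up for illustration.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k (train, validation) pairs."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds receive one extra sample
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, val_idx))
        start += size
    return folds

# Example with 10 samples and 3 folds
for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, 3), start=1):
    print('Fold', fold, '- train:', train_idx, '- validation:', val_idx)
```

In practice the data is usually shuffled before splitting, and the KFold class in scikit-learn provides this functionality.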

Model selection in general can involve various data preprocessing techniques, evaluating different feature engineering strategies, applying Ensemble Methods to aggregate the predictions from several individual Machine Learning models, etc.

In this lecture, we will focus on hyperparameter tuning with neural networks. Namely, neural networks are more sensitive to hyperparameter tuning than conventional Machine Learning models (such as Linear Regression, k-Nearest Neighbors, etc.). We saw in the previous lectures that even if we use default values for the models in scikit-learn without any hyperparameter tuning, the models can still achieve solid performance. This is rarely the case with neural networks, as they usually require at least some hyperparameter tuning.

One note is to not confuse hyperparameter tuning with model fine-tuning, which refers to using a pretrained model on a large dataset and fine-tuning the model parameters on a smaller dataset.

Hyperparameters in Neural Networks

Let’s examine again the ConvNet that we used in a previous lecture to classify images in the CIFAR-10 dataset, and let’s try to identify the hyperparameters of the model.

# define the layers in the model
inputs = Input(shape=(32, 32, 3))
conv1a = Conv2D(filters=32, kernel_size=3, padding='same')(inputs)
conv1b = Conv2D(filters=32, kernel_size=3, padding='same')(conv1a)
pool1 = MaxPooling2D()(conv1b)
conv2a = Conv2D(filters=64, kernel_size=3, padding='same')(pool1)
conv2b = Conv2D(filters=64, kernel_size=3, padding='same')(conv2a)
pool2 = MaxPooling2D()(conv2b)
conv3a = Conv2D(filters=128, kernel_size=3, padding='same')(pool2)
conv3b = Conv2D(filters=128, kernel_size=3, padding='same')(conv3a)
pool3 = MaxPooling2D()(conv3b)
flat = Flatten()(pool3)
dense1 = Dense(128, activation='relu')(flat)
dropout1 = Dropout(0.25)(dense1)
dense2 = Dense(64, activation='relu')(dropout1)
dropout2  = Dropout(0.25)(dense2)
outputs = Dense(10, activation='softmax')(dropout2)

# define the model with inputs and outputs
cifar_cnn = Model(inputs, outputs)

# compile the model
cifar_cnn.compile(optimizer=Adam(learning_rate=1e-3), loss='categorical_crossentropy',  metrics=['accuracy'])

# train the model
cifar_cnn.fit(train_data, train_label_onehot, epochs=10, batch_size=128)

Hyperparameters in the above model include:

  • Learning rate of the optimizer

  • Batch size

  • Number of training epochs

  • Number of Convolutional layers

  • Number of convolutional filters in the Convolutional layers

  • Kernel size of the convolutional filters

  • Type of padding in the Convolutional layers

  • Number of Dense layers

  • Number of neurons in each Dense layer

  • Type of activation functions used in the layers

  • Number of Dropout layers

  • Dropout rate in the Dropout layers

  • Type of optimizer (e.g., Adam, SGD, Nadam, RMSProp)

  • Other parameters used in the optimizer (e.g., momentum)

  • Type of initialization for the parameters in the model

There can be other hyperparameters depending on the network. However, one immediate observation is that neural networks have a large number of hyperparameters, and tuning all of them can be challenging, as it may take significant time and resources.

On the other hand, not all of the hyperparameters have significant impact on the performance of the model. Out of all hyperparameters, probably the most important is the learning rate, and in most cases, some tuning of the learning rate is required. In this lecture we will present techniques for hyperparameter tuning of a ConvNet model built with Keras and TensorFlow.
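To get a feel for how quickly the search space grows, the sketch below counts the configurations in a small grid and draws a random subset of them, which is the basic idea behind Random Search. The candidate values in search_space are hypothetical and chosen only for illustration; they are not the values used later in this lecture.

```python
import itertools
import random

# Hypothetical search space: the candidate values are for illustration only
search_space = {
    'learning_rate': [1e-2, 1e-3, 1e-4, 1e-5],
    'batch_size': [32, 64, 128],
    'dropout_rate': [0.1, 0.25, 0.5],
    'dense_units': [64, 128, 256],
    'optimizer': ['adam', 'sgd', 'rmsprop'],
}

# Grid Search: every combination of the candidate values
names = list(search_space.keys())
grid = [dict(zip(names, values))
        for values in itertools.product(*search_space.values())]
print('Number of configurations in the full grid:', len(grid))

# Random Search: train only a fixed budget of randomly chosen configurations
rng = random.Random(0)  # fixed seed so that the selection is repeatable
random_configs = rng.sample(grid, 10)
print('Example configuration:', random_configs[0])
```

Even with only five hyperparameters and three or four candidate values each, the full grid contains 324 configurations, and each one requires training a model.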

17.2 Evaluate the Impact of the Learning Rate

Loading a Custom Image Dataset in Keras

In previous lectures we worked with datasets that are built into Keras or scikit-learn and can be loaded directly. Let’s look at loading a custom dataset that is not part of the popular ML libraries. The dataset is saved in a folder on my Google Drive, therefore I need to first mount the Google Drive in order to access the folder with the images.

For this lecture, we will use the LFW dataset (Labeled Faces in the Wild), which consists of about 5,000 images of 62 celebrities. The next cells load the dataset and plot a few images to make sure that the labels are correct.

[ ]:
# mount the google drive
from google.colab import drive
drive.mount('/content/drive')
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
[ ]:
# import packages
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from keras.utils import load_img, img_to_array
import os
from os import listdir
import csv
import natsort

# Print the version of tf
print("TensorFlow version:{}".format(tf.__version__))
TensorFlow version:2.14.0
[ ]:
# Unzip the dataset from Google Drive into the sample_data/ directory
!unzip -uq 'drive/MyDrive/Data_Science_Course/Fall_2023/Lecture_17-Model_Selection,Tuning/data/LFW-dataset.zip' -d 'sample_data/'
[ ]:
# Directories
train_dir = 'sample_data/LFW-dataset/Train/'
test_dir = 'sample_data/LFW-dataset/Test/'
val_dir = 'sample_data/LFW-dataset/Validation/'
labels_dir = 'sample_data/LFW-dataset/'

# Size of images (pixel width and height)
image_size = 100

# Function for loading the images
def load_imgs(path):
    # List of all images in the folder
    imgList = listdir(path)
    # Make sure that the images are sorted in ascending order
    imgList=natsort.natsorted(imgList)
    # Number of images
    number_imgs = len(imgList)
    # Initialize numpy arrays for the images
    images = np.zeros((number_imgs, image_size, image_size, 3))
    # Read the images
    for i in range(number_imgs):
        # target_size expects (height, width); the 3 color channels are handled by load_img
        tmp_img = load_img(path + imgList[i], target_size=(image_size, image_size))
        img = img_to_array(tmp_img)
        images[i] = img/255.0
    return images

# Call the above function to load the images as numpy arrays
imgs_train = load_imgs(train_dir)
imgs_test = load_imgs(test_dir)
imgs_val = load_imgs(val_dir)
[ ]:
# Load the labels as numpy arrays
labels_train = np.genfromtxt(labels_dir + "train_labels.csv", delimiter=',', dtype=np.int32)
labels_test = np.genfromtxt(labels_dir + "test_labels.csv", delimiter=',', dtype=np.int32)
labels_val = np.genfromtxt(labels_dir + "val_labels.csv", delimiter=',', dtype=np.int32)
[ ]:
# Display the shapes of train, validation, and test datasets
print('Images train shape: {} - Labels train shape: {}'.format(imgs_train.shape, labels_train.shape))
print('Images validation shape: {} - Labels validation shape: {}'.format(imgs_val.shape, labels_val.shape))
print('Images test shape: {} - Labels test shape: {}'.format(imgs_test.shape, labels_test.shape))

# Display the range of images (to make sure they are in the [0, 1] range)
print('\nMax pixel value', np.max(imgs_train))
print('Min pixel value', np.min(imgs_train))
print('Average pixel value', np.mean(imgs_train))
print('Data type', imgs_train[0].dtype)
Images train shape: (3043, 100, 100, 3) - Labels train shape: (3043,)
Images validation shape: (1021, 100, 100, 3) - Labels validation shape: (1021,)
Images test shape: (1049, 100, 100, 3) - Labels test shape: (1049,)

Max pixel value 1.0
Min pixel value 0.0
Average pixel value 0.47390351913451484
Data type float64
[ ]:
# Read the names of the celebrities in the dataset (there are 62 celebrities)
name_list = []
with open(labels_dir+'name_list.csv', 'r') as f:
    reader = csv.reader(f)
    for row in reader:
        name_list.append(row[1])

# Plot a few images to check if the labels are correct
# There are a few bad images in the dataset, it needs to be cleaned
plt.figure(figsize=(9, 6))
for n in range(9):
    i = np.random.randint(0, len(imgs_train), 1)
    ax = plt.subplot(3, 3, n+1)
    plt.imshow(imgs_train[i[0]])
    plt.title('Label:' + str(name_list[labels_train[i[0]]]))
    plt.axis('off')
Figure: Nine randomly selected training images with their labels.

Define the Model

We will use a pretrained VGG-16 model, and we will just add a classifier with 3 Dense layers on top of the model to fine-tune it to the LFW dataset.

[ ]:
from keras.models import Model
from keras.layers import Dense, GlobalAveragePooling2D, Dropout
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint, ReduceLROnPlateau, LearningRateScheduler
from keras.applications import vgg16

import datetime
now = datetime.datetime.now
[ ]:
def Network():

    base_model = vgg16.VGG16(weights='imagenet', include_top=False, input_shape=(image_size, image_size, 3))

    # Add a global spatial average pooling layer
    x = base_model.output
    x = GlobalAveragePooling2D()(x)
    # Add fully-connected layers
    x = Dense(1024, activation='relu')(x)
    x = Dropout(0.25)(x)
    x = Dense(256, activation='relu')(x)
    x = Dropout(0.25)(x)
    # Add a softmax layer
    predictions = Dense(62, activation='softmax')(x)

    # The model
    model = Model(inputs=base_model.input, outputs=predictions)

    return model

Let’s define a function for plotting the accuracy and loss called plot_accuracy_loss, which we can call with different models to examine the learning curves.

[ ]:
def plot_accuracy_loss():
    # plot the accuracy and loss
    train_loss = history.history['loss']
    val_loss = history.history['val_loss']
    acc = history.history['accuracy']
    val_acc = history.history['val_accuracy']

    epochsn = np.arange(1, len(train_loss)+1,1)
    plt.figure(figsize=(12, 4))

    plt.subplot(1,2,1)
    plt.plot(epochsn, acc, 'b', label='Training Accuracy')
    plt.plot(epochsn, val_acc, 'r', label='Validation Accuracy')
    plt.grid(color='gray', linestyle='--')
    plt.legend()
    plt.title('ACCURACY')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')

    plt.subplot(1,2,2)
    plt.plot(epochsn,train_loss, 'b', label='Training Loss')
    plt.plot(epochsn,val_loss, 'r', label='Validation Loss')
    plt.grid(color='gray', linestyle='--')
    plt.legend()
    plt.title('LOSS')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.show()

Learning rate = 1e-4, Epochs = 10

In Lecture 15, we explained that most NNs employ a version of the Gradient Descent algorithm for updating the network parameters during training, depicted in the figure below. The learning rate determines the size of the updates at each step, i.e., it controls how fast the network parameters are updated.


Figure: Gradient descent algorithm.

If the learning rate is too small, the algorithm will take many epochs to converge, and it may even get stuck in a local minimum. If the learning rate is too high, the algorithm may jump over the best solutions and may not be able to converge to a good solution.


Figure: Impact of the learning rate. Source: https://www.bdhammel.com/learning-rates/
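The effect of the learning rate can be reproduced on a toy problem. The sketch below minimizes f(w) = w**2 with plain gradient descent (not Adam, and not the network used in this lecture) for three learning rates; the function name gradient_descent and the chosen values are only for illustration.

```python
def gradient_descent(learning_rate, w_start=1.0, steps=20):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = w_start
    for _ in range(steps):
        grad = 2 * w                   # derivative of w**2
        w = w - learning_rate * grad   # gradient descent update
    return w

# Too small, suitable, and too large learning rates for this toy problem
for lr in [0.01, 0.4, 1.1]:
    w_final = gradient_descent(lr)
    print('learning rate = {:<5} final w = {:.4g}  final loss = {:.4g}'.format(
        lr, w_final, w_final**2))
```

With the smallest learning rate, w has moved only about a third of the way toward the minimum at w = 0 after 20 steps. The middle value converges almost exactly, and the largest value overshoots the minimum at every step, so the magnitude of w keeps growing.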

In the previous lectures, we used the following code to compile the models, which uses the Adam optimizer, but we didn’t specify the learning rate of the optimizer. For the implementation of Adam in Keras, the default learning rate is 1e-3.

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

If we would like to use another value for the learning rate, then we will need to import the Adam optimizer, and compile the model with:

from keras.optimizers import Adam
model.compile(optimizer = Adam(learning_rate=VALUE),
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

The following code trains a model for 10 epochs with a learning rate of 1e-4 = 0.0001. We have selected this learning rate because we know that it works well for this combination of model and data.

We can see that the model achieved close to 90% accuracy on the test set, and the training took about 45 seconds. From the plots of the accuracy and loss curves, we can tell that 10 epochs are not sufficient for training the model, because at the end of the 10th epoch, the accuracy was still increasing and the loss was still decreasing.

[ ]:
LEARNING_RATE = 1e-4
EPOCHS_NUM = 10

# create a model
model = Network()

# compile the model
model.compile(optimizer = Adam(learning_rate=LEARNING_RATE), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# fit model
t = now()
history = model.fit(imgs_train, labels_train, batch_size=32, epochs=EPOCHS_NUM,
                     validation_data=(imgs_val, labels_val), verbose=0)
print('\nTraining time: %s' % (now() - t))

# evaluate on test data
evals_test = model.evaluate(imgs_test, labels_test)
print("Classification Accuracy: ", 100*evals_test[1])

# plot the accuracy and loss
plot_accuracy_loss()

Training time: 0:00:44.932419
33/33 [==============================] - 1s 21ms/step - loss: 0.3862 - accuracy: 0.8990
Classification Accuracy:  89.89514112472534
Figure: Training and validation accuracy and loss curves (learning rate = 1e-4, 10 epochs).

Learning rate = 1e-4, Epochs = 30

Let’s train the model for 30 epochs using the same learning rate to see if this number of epochs would be sufficient.

From the results we can see that the model achieved 94% accuracy, and the training took about 2 minutes.

Based on the learning curves, at epoch 30 the validation accuracy and loss were converging to a plateau, and it was unclear if training the model for more than 30 epochs would improve the performance. An alternative is to use the EarlyStopping callback, so that the training is stopped automatically (e.g., when the validation loss stops decreasing). Such a case is shown in a subsequent section.

[ ]:
LEARNING_RATE = 1e-4
EPOCHS_NUM = 30

model = Network()
model.compile(optimizer = Adam(learning_rate=LEARNING_RATE), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# fit model
t = now()
history = model.fit(imgs_train, labels_train, batch_size=32, epochs=EPOCHS_NUM,
                     validation_data=(imgs_val, labels_val), verbose=0)
print('\nTraining time: %s' % (now() - t))

# Evaluate on test data
evals_test = model.evaluate(imgs_test, labels_test)
print("Classification Accuracy: ", 100*evals_test[1])

# plot the accuracy and loss
plot_accuracy_loss()

Training time: 0:01:49.333017
33/33 [==============================] - 0s 12ms/step - loss: 0.3438 - accuracy: 0.9428
Classification Accuracy:  94.28026676177979
Figure: Training and validation accuracy and loss curves (learning rate = 1e-4, 30 epochs).

Learning rate = 1e-3, Epochs = 20

Next, let’s try to train the model using different learning rates, for instance, by increasing the learning rate to 1e-3 = 0.001.

Increasing the learning rate will cause the model to apply larger values for the update of the parameters during the training. We can expect that the training will converge faster, and we can use a smaller number of epochs.

However, large learning rates can cause the model to update the parameters too fast, as in this case. As we can see, the model achieved only 16.87% accuracy, and the accuracy curves did not improve after that level.

[ ]:
LEARNING_RATE = 1e-3
EPOCHS_NUM = 20

model = Network()
model.compile(optimizer = Adam(learning_rate=LEARNING_RATE), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# fit model
t = now()
history = model.fit(imgs_train, labels_train, batch_size=32, epochs=EPOCHS_NUM,
                     validation_data=(imgs_val, labels_val), verbose=0)
print('\nTraining time: %s' % (now() - t))

# Evaluate on test data
evals_test = model.evaluate(imgs_test, labels_test)
print("Classification Accuracy: ", 100*evals_test[1])

# plot the accuracy and loss
plot_accuracy_loss()

Training time: 0:01:13.950914
33/33 [==============================] - 0s 11ms/step - loss: 3.7229 - accuracy: 0.1687
Classification Accuracy:  16.873212158679962
Figure: Training and validation accuracy and loss curves (learning rate = 1e-3, 20 epochs).

Learning rate = 1e-2, Epochs = 10

If we increase the learning rate even further to 0.01, we can expect that training will fail, since we saw that even a learning rate of 0.001 was too high. Based on the learning curves, we can tell that the learning is too fast and too aggressive.

[ ]:
LEARNING_RATE = 1e-2
EPOCHS_NUM = 10

model = Network()
model.compile(optimizer = Adam(learning_rate=LEARNING_RATE), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# fit model
t = now()
history = model.fit(imgs_train, labels_train, batch_size=32, epochs=EPOCHS_NUM,
                     validation_data=(imgs_val, labels_val), verbose=0)
print('\nTraining time: %s' % (now() - t))

# Evaluate on test data
evals_test = model.evaluate(imgs_test, labels_test)
print("Classification Accuracy: ", 100*evals_test[1])

# plot the accuracy and loss
plot_accuracy_loss()

Training time: 0:00:39.685677
33/33 [==============================] - 0s 12ms/step - loss: 3.7224 - accuracy: 0.1687
Classification Accuracy:  16.873212158679962
Figure: Training and validation accuracy and loss curves (learning rate = 1e-2, 10 epochs).

Learning rate = 1e-5, Epochs = 50

Let’s try the opposite case, and reduce the learning rate to 1e-5 = 0.00001. Smaller learning rates produce smaller updates of the model parameters, and slower learning. This may avoid the problems of learning too fast when large learning rates are used.

This model achieved similar accuracy as with the 1e-4 learning rate. Also, the learning curves look good, because the accuracy and loss gradually change, and the validation curves follow the training curves.

[ ]:
LEARNING_RATE = 1e-5
EPOCHS_NUM = 50

model = Network()
model.compile(optimizer = Adam(learning_rate=LEARNING_RATE), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# fit model
t = now()
history = model.fit(imgs_train, labels_train, batch_size=32, epochs=EPOCHS_NUM,
                     validation_data=(imgs_val, labels_val), verbose=0)
print('\nTraining time: %s' % (now() - t))

# Evaluate on test data
evals_test = model.evaluate(imgs_test, labels_test)
print("Classification Accuracy: ", 100*evals_test[1])

# plot the accuracy and loss
plot_accuracy_loss()

Training time: 0:02:58.985777
33/33 [==============================] - 0s 12ms/step - loss: 0.2841 - accuracy: 0.9409
Classification Accuracy:  94.08960938453674
Figure: Training and validation accuracy and loss curves (learning rate = 1e-5, 50 epochs).

Learning rate = 1e-6, Epochs = 50

Next, let’s reduce the learning rate even further to 1e-6.

Although smaller learning rates avoid the training failures caused by overly large learning rates, very small learning rates do not necessarily improve performance, since the learning can be too slow.

Based on the learning curves, we can tell that at the end of epoch 50 the model parameters were gradually being updated, and that with this learning rate we would need to train the model for at least 200 or 300 epochs to reach convergence, or maybe even more.

[ ]:
LEARNING_RATE = 1e-6
EPOCHS_NUM = 50

model = Network()
model.compile(optimizer = Adam(learning_rate=LEARNING_RATE), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# fit model
t = now()
history = model.fit(imgs_train, labels_train, batch_size=32, epochs=EPOCHS_NUM,
                     validation_data=(imgs_val, labels_val), verbose=0)
print('\nTraining time: %s' % (now() - t))

# Evaluate on test data
evals_test = model.evaluate(imgs_test, labels_test)
print("Classification Accuracy: ", 100*evals_test[1])

# plot the accuracy and loss
plot_accuracy_loss()

Training time: 0:02:58.017459
33/33 [==============================] - 0s 12ms/step - loss: 1.3921 - accuracy: 0.6969
Classification Accuracy:  69.68541741371155
Figure: Training and validation accuracy and loss curves (learning rate = 1e-6, 50 epochs).

17.2.1 Learning Rate Finder

There are several tools for estimating the learning rate, such as the LRFinder callback for Keras that is used below. This callback changes the learning rate over a range of values, typically starting with a very small learning rate and increasing it to a high learning rate. The model is trained and evaluated only for a few epochs using different learning rates. Based on a plot of the loss for different learning rates, this callback can help to find a suitable learning rate, which can afterward be used to fully train the model.

For the task at hand, the plot of loss versus learning rate is shown below. The best learning rate is the one at which the loss is decreasing the fastest, i.e., where the curve has the steepest downward slope. From the graph, this value is about \(10^{-4}\).

Although such tools can identify a range of suitable values for the learning rate, they should be used with caution, and the users should still run candidate models in the suggested range to fully evaluate the model performance.
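The two ingredients of a learning rate range test can be sketched in plain Python: an exponentially increasing learning rate schedule, and a rule that picks the learning rate where the loss drops the fastest. This is a simplified illustration and not the code of the LRFinder package; the function names and the loss values are made up.

```python
import math

def exponential_lr_schedule(min_lr, max_lr, num_steps):
    """Learning rates that grow by a constant factor from min_lr to max_lr."""
    ratio = max_lr / min_lr
    return [min_lr * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]

def steepest_descent_lr(lrs, losses):
    """Return the learning rate at which the loss fell the fastest,
    measured as the slope of the loss with respect to log10(lr)."""
    best_idx, best_slope = None, 0.0
    for i in range(1, len(lrs)):
        slope = (losses[i] - losses[i - 1]) / (math.log10(lrs[i]) - math.log10(lrs[i - 1]))
        if slope < best_slope:
            best_slope, best_idx = slope, i
    return None if best_idx is None else lrs[best_idx]

# Made-up losses recorded at five learning rates between 1e-6 and 1e-2
lrs = exponential_lr_schedule(min_lr=1e-6, max_lr=1e-2, num_steps=5)
losses = [4.1, 4.0, 3.0, 3.5, 50.0]
print('Suggested learning rate: {:.0e}'.format(steepest_descent_lr(lrs, losses)))
```

A real range test records the loss after every batch, so the curve is noisy and is usually smoothed before the slope is examined.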

[ ]:
!git clone https://github.com/WittmannF/LRFinder.git
Cloning into 'LRFinder'...
remote: Enumerating objects: 71, done.
remote: Total 71 (delta 0), reused 0 (delta 0), pack-reused 71
Receiving objects: 100% (71/71), 447.02 KiB | 12.08 MiB/s, done.
Resolving deltas: 100% (24/24), done.
[ ]:
from LRFinder.keras_callback import LRFinder
[ ]:
model = Network()
model.compile(optimizer='Adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Perform the Learning Rate Range Test
lr_finder = LRFinder(min_lr=1e-6, max_lr=1e-2)

model.fit(imgs_train, labels_train, batch_size=32, callbacks=[lr_finder], epochs=2)
Epoch 1/2
 6/96 [>.............................] - ETA: 11s - loss: 4.1732 - accuracy: 0.0000e+00
WARNING:tensorflow:Callback method `on_train_batch_end` is slow compared to the batch time (batch time: 0.0407s vs `on_train_batch_end` time: 0.0935s). Check your callbacks.
96/96 [==============================] - 36s 122ms/step - loss: 4.1457 - accuracy: 0.0214
Epoch 2/2
96/96 [==============================] - 12s 120ms/step - loss: 3660.4465 - accuracy: 0.0796
Figure: Loss versus learning rate from the learning rate range test.
<keras.callbacks.History at 0x7f5c50298650>

17.3 Callbacks

In programming, a callback is a function that is passed to another function and is invoked by it when certain events or conditions occur. In other words, callbacks allow some operations to be performed during the execution of another function.

In ML, callbacks are used to monitor the model performance at various stages of training and take certain actions (e.g., at the start or end of an epoch, before or after processing a single batch, etc.). Such actions can include saving the model to the disk after a certain number of epochs, obtaining a view of the internal states of the model and relevant statistics during training, writing logs after every batch of training data to monitor performance metrics, etc. Most ML libraries provide a callback class which allows users to create custom callbacks.

So far we have worked with the History callback in Keras, which stores the values of the loss and the metrics at each epoch; its attribute history.history allows us to plot the learning curves of a model. Also, in the lecture on ConvNets we explained how the EarlyStopping callback works. The next sections provide additional examples of applying callbacks in Keras.
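The mechanism behind callbacks can be illustrated with a small plain-Python sketch. The classes and the run_training function below are hypothetical stand-ins, loosely modeled on the hook names used in Keras; they are not the Keras implementation, and the loss values are made up.

```python
class Callback:
    """Base class: the hooks do nothing unless a subclass overrides them."""
    def on_train_begin(self):
        pass

    def on_epoch_end(self, epoch, logs):
        pass

class LossHistory(Callback):
    """Stores the loss reported at the end of every epoch."""
    def on_train_begin(self):
        self.losses = []

    def on_epoch_end(self, epoch, logs):
        self.losses.append(logs['loss'])

class PrintEveryN(Callback):
    """Prints the loss every n epochs."""
    def __init__(self, n):
        self.n = n

    def on_epoch_end(self, epoch, logs):
        if epoch % self.n == 0:
            print('Epoch {}: loss = {}'.format(epoch, logs['loss']))

def run_training(epoch_losses, callbacks):
    """Stand-in for a training loop; a real loop would train the model
    for one epoch where the made-up loss value is read."""
    for cb in callbacks:
        cb.on_train_begin()
    for epoch, loss in enumerate(epoch_losses, start=1):
        logs = {'loss': loss}
        for cb in callbacks:
            cb.on_epoch_end(epoch, logs)

history_cb = LossHistory()
run_training([2.3, 1.7, 1.2, 0.9], callbacks=[history_cb, PrintEveryN(2)])
print('Recorded losses:', history_cb.losses)
```

The training loop does not need to know what each callback does; it only calls the hooks at fixed points, which is why new behavior can be added without changing the loop itself.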

17.3.1 Early Stopping

As we know, using the EarlyStopping callback is often beneficial, since we don’t need to guess the optimal number of epochs to train the model. Instead, the callback will terminate the training when a selected metric is not improving. In the next cell, we specified to stop the training when the validation loss does not improve for 20 epochs (patience argument). We set EPOCHS_NUM to 1000, although we know that the training will terminate after about 50-60 epochs. Therefore, the number of epochs is not very important when we use this callback; it only needs to be large enough so that the training is not stopped prematurely.

This model achieved 92.56% accuracy. However, from the accuracy curve it seems that the accuracy was higher in the previous epoch, and it just dropped in the last epoch. We will see next how to avoid this by using the ModelCheckpoint callback.
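The patience logic can be sketched in a few lines of plain Python. This is a simplified illustration of the idea rather than the Keras source code; the function name early_stopping_epoch and the validation losses are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the epoch (1-based) at which training would stop."""
    best = float('inf')
    wait = 0   # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)   # patience was never exhausted

# Made-up validation losses: the best value is reached at epoch 3
val_losses = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
print('Training stops at epoch', early_stopping_epoch(val_losses, patience=3))
```

In this example the training stops at epoch 6, three epochs after the best validation loss, which is why the parameters at the end of training are not necessarily the best ones seen during training.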

[ ]:
LEARNING_RATE = 1e-4
EPOCHS_NUM = 1000

model = Network()
model.compile(optimizer=Adam(learning_rate=LEARNING_RATE), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# fit model
t = now()
callbacks = [EarlyStopping(monitor='val_loss', patience=20)]
history = model.fit(imgs_train, labels_train, batch_size=32, epochs=EPOCHS_NUM,
                     validation_data=(imgs_val, labels_val), verbose=0, callbacks=callbacks)
print('\nTraining time: %s' % (now() - t))

# Evaluate on test data
evals_test = model.evaluate(imgs_test, labels_test)
print("Classification Accuracy: ", 100*evals_test[1])

# plot the accuracy and loss
plot_accuracy_loss()

Training time: 0:04:01.178825
33/33 [==============================] - 0s 12ms/step - loss: 0.4387 - accuracy: 0.9256
Classification Accuracy:  92.56434440612793
Figure: Training and validation accuracy and loss curves (learning rate = 1e-4, Early Stopping).

17.3.2 Save CheckPoint

The ModelCheckpoint callback saves a checkpoint of the model (i.e., the model parameters). With save_best_only=True, a checkpoint is written at the end of an epoch only when the monitored metric improves. We set the monitored metric to be the validation loss, and the values of the model parameters will be saved at the specified filepath. This can be useful if training the model takes hours: if something goes wrong, we can just resume the training from a checkpoint.

We choose to set verbose=1 for the callback, to output the epochs at which a checkpoint is saved.

At the beginning of the training, the checkpoint will be saved after every epoch, and after the model reaches a plateau, a new checkpoint will be saved only when there is an improvement in the performance, by overwriting the latest checkpoint.
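The save-best-only behavior described above can be sketched in plain Python. The function name best_only_checkpoints and the validation losses are made up for illustration, and no files are written.

```python
def best_only_checkpoints(val_losses):
    """Return the epochs (1-based) at which a checkpoint would be written
    when only improvements of the validation loss are saved, together
    with the best validation loss."""
    best = float('inf')
    saved_epochs = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            saved_epochs.append(epoch)   # overwrite the previous checkpoint
    return saved_epochs, best

# Made-up validation losses
saved_epochs, best_loss = best_only_checkpoints([3.3, 3.1, 2.4, 2.5, 2.6, 2.0])
print('Checkpoint written at epochs:', saved_epochs)
print('Best validation loss:', best_loss)
```

Note that the callback only writes the file. After training, the model in memory still holds the parameters from the last epoch, so to evaluate the best checkpoint the saved weights can first be restored with model.load_weights and the same filepath.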

[ ]:
LEARNING_RATE = 1e-4
EPOCHS_NUM = 50

model = Network()
model.compile(optimizer=Adam(learning_rate=LEARNING_RATE), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# fit model
t = now()
callbacks = ModelCheckpoint(filepath='sample_data/model_celeb.h5', monitor='val_loss', mode='min',
                            save_weights_only=True, save_best_only=True, verbose=1)
history = model.fit(imgs_train, labels_train, batch_size=32, epochs=EPOCHS_NUM,
                     validation_data=(imgs_val, labels_val), verbose=0, callbacks=[callbacks])
print('Training time: %s' % (now() - t))

# Evaluate on test data
evals_test = model.evaluate(imgs_test, labels_test)
print("Classification Accuracy: ", 100*evals_test[1])

# plot the accuracy and loss
plot_accuracy_loss()

Epoch 1: val_loss improved from inf to 3.32760, saving model to sample_data/model_celeb.h5

Epoch 2: val_loss improved from 3.32760 to 3.08174, saving model to sample_data/model_celeb.h5

Epoch 3: val_loss improved from 3.08174 to 2.40263, saving model to sample_data/model_celeb.h5

Epoch 4: val_loss improved from 2.40263 to 1.87908, saving model to sample_data/model_celeb.h5

Epoch 5: val_loss improved from 1.87908 to 1.77787, saving model to sample_data/model_celeb.h5

Epoch 6: val_loss improved from 1.77787 to 1.03076, saving model to sample_data/model_celeb.h5

Epoch 7: val_loss improved from 1.03076 to 0.86254, saving model to sample_data/model_celeb.h5

Epoch 8: val_loss improved from 0.86254 to 0.83952, saving model to sample_data/model_celeb.h5

Epoch 9: val_loss improved from 0.83952 to 0.76099, saving model to sample_data/model_celeb.h5

Epoch 10: val_loss improved from 0.76099 to 0.45982, saving model to sample_data/model_celeb.h5

Epoch 11: val_loss improved from 0.45982 to 0.39744, saving model to sample_data/model_celeb.h5

Epoch 12: val_loss improved from 0.39744 to 0.36517, saving model to sample_data/model_celeb.h5

Epoch 13: val_loss improved from 0.36517 to 0.35960, saving model to sample_data/model_celeb.h5

Epoch 14: val_loss did not improve from 0.35960

Epoch 15: val_loss did not improve from 0.35960

Epoch 16: val_loss did not improve from 0.35960

Epoch 17: val_loss improved from 0.35960 to 0.33312, saving model to sample_data/model_celeb.h5

Epoch 18: val_loss did not improve from 0.33312

Epoch 19: val_loss did not improve from 0.33312

Epoch 20: val_loss improved from 0.33312 to 0.31346, saving model to sample_data/model_celeb.h5

Epoch 21: val_loss did not improve from 0.31346

Epoch 22: val_loss did not improve from 0.31346

Epoch 23: val_loss did not improve from 0.31346

Epoch 24: val_loss improved from 0.31346 to 0.30755, saving model to sample_data/model_celeb.h5

Epoch 25: val_loss did not improve from 0.30755

Epoch 26: val_loss did not improve from 0.30755

Epoch 27: val_loss did not improve from 0.30755

Epoch 28: val_loss did not improve from 0.30755

Epoch 29: val_loss did not improve from 0.30755

Epoch 30: val_loss did not improve from 0.30755

Epoch 31: val_loss improved from 0.30755 to 0.28651, saving model to sample_data/model_celeb.h5

Epoch 32: val_loss did not improve from 0.28651

Epoch 33: val_loss did not improve from 0.28651

Epoch 34: val_loss did not improve from 0.28651

Epoch 35: val_loss did not improve from 0.28651

Epoch 36: val_loss did not improve from 0.28651

Epoch 37: val_loss improved from 0.28651 to 0.26158, saving model to sample_data/model_celeb.h5

Epoch 38: val_loss did not improve from 0.26158

Epoch 39: val_loss did not improve from 0.26158

Epoch 40: val_loss did not improve from 0.26158

Epoch 41: val_loss improved from 0.26158 to 0.25302, saving model to sample_data/model_celeb.h5

Epoch 42: val_loss did not improve from 0.25302

Epoch 43: val_loss did not improve from 0.25302

Epoch 44: val_loss did not improve from 0.25302

Epoch 45: val_loss did not improve from 0.25302

Epoch 46: val_loss did not improve from 0.25302

Epoch 47: val_loss did not improve from 0.25302

Epoch 48: val_loss improved from 0.25302 to 0.23428, saving model to sample_data/model_celeb.h5

Epoch 49: val_loss improved from 0.23428 to 0.22223, saving model to sample_data/model_celeb.h5

Epoch 50: val_loss did not improve from 0.22223
Training time: 0:03:01.450305
33/33 [==============================] - 0s 11ms/step - loss: 0.2903 - accuracy: 0.9561
Classification Accuracy:  95.61486840248108
Figure: Accuracy and loss plots.

17.3.3 Reduce Learning Rate On Plateau

ReduceLROnPlateau stands for Reduce Learning Rate on Plateau. It is another very useful callback: prior work has reported that training generally benefits from a larger learning rate at the beginning, followed by a gradual reduction of the learning rate once the monitored metric stops improving.

This is exactly what this callback does. In the next cell, the learning rate is initially set to 1e-4 = 0.0001. ReduceLROnPlateau has a patience of 10 epochs, a factor of 0.1, and a minimum learning rate of 1e-6. This means that when the monitored metric (in this case, the validation loss) does not decrease for 10 epochs, the learning rate is multiplied by the factor and becomes 1e-5. When the model stops improving again, the learning rate is multiplied by the factor once more and becomes 1e-6. Since this is the minimum value, the learning rate cannot be reduced further, so we combine this callback with Early Stopping to terminate the training. Note that the patience for Early Stopping (20 epochs) is set longer than the patience for ReduceLROnPlateau (10 epochs), so that the learning rate is reduced before the training is stopped.

[ ]:
LEARNING_RATE = 1e-4
EPOCHS_NUM = 1000

model = Network()
model.compile(optimizer=Adam(learning_rate = LEARNING_RATE), loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# fit model
t = now()
callbacks = [EarlyStopping(monitor='val_loss', patience = 20),
             ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=10, min_lr=1e-6, verbose=1)]
history = model.fit(imgs_train, labels_train, batch_size=32, epochs=EPOCHS_NUM,
                     validation_data=(imgs_val, labels_val), verbose=0, callbacks=callbacks)
print('\nTraining time: %s' % (now() - t))

# Evaluate on test data
evals_test = model.evaluate(imgs_test, labels_test)
print("Classification Accuracy: ", 100*evals_test[1])


# plot the accuracy and loss
plot_accuracy_loss()

Epoch 28: ReduceLROnPlateau reducing learning rate to 9.999999747378752e-06.

Epoch 39: ReduceLROnPlateau reducing learning rate to 1e-06.

Training time: 0:02:54.702996
33/33 [==============================] - 0s 12ms/step - loss: 0.3021 - accuracy: 0.9466
Classification Accuracy:  94.66158151626587
Figure: Accuracy and loss plots.
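
The plateau rule described above can be traced without training a model. The following is a minimal sketch in plain Python, assuming a simplified version of the rule that ignores the callback's cooldown and min_delta options; the function name simulate_reduce_lr_on_plateau and the validation losses are hypothetical and used only for illustration.

```python
def simulate_reduce_lr_on_plateau(val_losses, lr=1e-4, factor=0.1,
                                  patience=10, min_lr=1e-6):
    """Simplified plateau rule: returns the learning rate used in each epoch."""
    best = float("inf")
    wait = 0
    lr_per_epoch = []
    for loss in val_losses:
        lr_per_epoch.append(lr)  # learning rate in effect during this epoch
        if loss < best:
            # The monitored metric improved: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience and lr > min_lr:
                # Plateau detected: shrink the learning rate, but not below min_lr
                lr = max(lr * factor, min_lr)
                wait = 0
    return lr_per_epoch

# Hypothetical validation losses: improve for 5 epochs, then stay flat for 25
val_losses = [1.0, 0.8, 0.6, 0.5, 0.4] + [0.4] * 25
lrs = simulate_reduce_lr_on_plateau(val_losses)
for epoch in (0, 14, 15, 24, 25, 29):
    print(f"epoch {epoch:2d}: lr = {lrs[epoch]:.1e}")
```

In this made-up run the learning rate is reduced twice, 10 epochs apart, and then stays at the minimum value, which is the point at which Early Stopping would be needed to end the training.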

17.3.4 Learning Rate Scheduler

The Learning Rate Scheduler callback allows users to define custom schedules for adjusting the learning rate during training. Popular learning rate schedules include:

  • Time-based decay

  • Step decay

  • Exponential decay

Time-based Decay

This scheduler gradually decreases the learning rate in each epoch, with the size of the reduction controlled by a decay parameter. An example is shown in the next figure, where a model is trained for 100 epochs, and the learning rate is gradually reduced from 0.01 in the first epoch to 0.006 in the last epoch.

Figure: Time-based decay.

The implementation is shown below, where the LearningRateScheduler callback accepts a function that defines the schedule for the learning rate. The function lr_time_based_decay divides the learning rate by (1 + decay * epoch) at each epoch, where lr is the learning rate from the previous epoch. The value of the decay is usually set as the quotient of the initial learning rate and the number of epochs.

[ ]:
INITIAL_LEARNING_RATE = 1e-4
EPOCHS_NUM = 50
decay = INITIAL_LEARNING_RATE / EPOCHS_NUM

def lr_time_based_decay(epoch, lr):
    return lr * 1 / (1 + decay * epoch)

model = Network()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# fit model
t = now()
callbacks = [LearningRateScheduler(lr_time_based_decay, verbose=0)]
history = model.fit(imgs_train, labels_train, batch_size=32, epochs=EPOCHS_NUM,
                     validation_data=(imgs_val, labels_val), verbose=0, callbacks=callbacks)
print('\nTraining time: %s' % (now() - t))

# Evaluate on test data
evals_test = model.evaluate(imgs_test, labels_test)
print("Classification Accuracy: ", 100*evals_test[1])

# plot the accuracy and loss
plot_accuracy_loss()

Training time: 0:02:57.472491
33/33 [==============================] - 0s 12ms/step - loss: 1.1274 - accuracy: 0.7750
Classification Accuracy:  77.50238180160522
Figure: Accuracy and loss plots.

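In this case, the performance decreased. Note that the model is compiled with optimizer='adam', so training starts from Adam's default learning rate of 0.001 rather than from INITIAL_LEARNING_RATE, and with decay = 2e-6 the schedule lowers the learning rate by less than 1% over the 50 epochs. The lower accuracy is therefore more likely related to the larger learning rate than to the learning rate being reduced too fast.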
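
To see the shape of a time-based schedule without training a model, the following minimal sketch applies the same update rule as lr_time_based_decay to the setting of the figure above (initial learning rate of 0.01 and 100 epochs). The helper name time_based_schedule is hypothetical and used only for illustration.

```python
def time_based_schedule(initial_lr, num_epochs):
    """Return the learning rate used in each epoch under time-based decay."""
    decay = initial_lr / num_epochs
    lr = initial_lr
    schedule = []
    for epoch in range(num_epochs):
        # Same rule as lr_time_based_decay: divide the previous lr by (1 + decay * epoch)
        lr = lr * 1 / (1 + decay * epoch)
        schedule.append(lr)
    return schedule

lrs = time_based_schedule(initial_lr=0.01, num_epochs=100)
print(f"first epoch: {lrs[0]:.4f}, last epoch: {lrs[-1]:.4f}")
```

Because each update is applied to the learning rate from the previous epoch, the reductions compound, and the learning rate ends close to the value of 0.006 shown in the figure.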

Step Decay

The step decay scheduler reduces the learning rate by a fixed factor after every fixed number of training epochs. An example is shown in the figure, where the learning rate is reduced by half every 10 epochs.

Figure: Step decay.

In the next cell, step decay is applied with the function lr_step_decay, where drop_rate is the factor by which the learning rate is multiplied at each step (0.5, i.e., the learning rate is halved), and epochs_drop is the number of epochs between steps (15).

[ ]:
import math

INITIAL_LEARNING_RATE = 1e-4
EPOCHS_NUM = 50

def lr_step_decay(epoch):
    drop_rate = 0.5
    epochs_drop = 15
    return INITIAL_LEARNING_RATE * math.pow(drop_rate, math.floor(epoch/epochs_drop))

model = Network()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# fit model
t = now()
callbacks = [LearningRateScheduler(lr_step_decay, verbose=0)]
history = model.fit(imgs_train, labels_train, batch_size=32, epochs=EPOCHS_NUM,
                     validation_data=(imgs_val, labels_val), verbose=0, callbacks=callbacks)
print('\nTraining time: %s' % (now() - t))

# Evaluate on test data
evals_test = model.evaluate(imgs_test, labels_test)
print("Classification Accuracy: ", 100*evals_test[1])

# plot the accuracy and loss
plot_accuracy_loss()

Training time: 0:02:58.313695
33/33 [==============================] - 0s 12ms/step - loss: 0.3283 - accuracy: 0.9523
Classification Accuracy:  95.233553647995
Figure: Accuracy and loss plots.
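
The staircase shape of the step decay schedule can be checked by evaluating the function at a few epochs. The sketch below repeats lr_step_decay from the cell above so that it runs on its own; the epochs that are printed are chosen only for illustration.

```python
import math

INITIAL_LEARNING_RATE = 1e-4

def lr_step_decay(epoch):
    drop_rate = 0.5    # the learning rate is multiplied by this factor at each step
    epochs_drop = 15   # number of epochs between two consecutive steps
    return INITIAL_LEARNING_RATE * math.pow(drop_rate, math.floor(epoch / epochs_drop))

for epoch in [0, 14, 15, 29, 30, 45]:
    print(f"epoch {epoch:2d}: lr = {lr_step_decay(epoch):.3e}")
```

The learning rate stays constant within each block of 15 epochs and is halved at epochs 15, 30, and 45.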

Exponential Decay

This scheduler decreases the learning rate exponentially with the epoch number. In the cell below, k is the rate of exponential decay.

Figure: Exponential decay.

[ ]:
INITIAL_LEARNING_RATE = 1e-4
EPOCHS_NUM = 50

def lr_exp_decay(epoch):
    k = 0.1
    return INITIAL_LEARNING_RATE * math.exp(-k*epoch)

model = Network()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# fit model
t = now()
callbacks = [LearningRateScheduler(lr_exp_decay, verbose=0)]
history = model.fit(imgs_train, labels_train, batch_size=32, epochs=EPOCHS_NUM,
                     validation_data=(imgs_val, labels_val), verbose=0, callbacks=callbacks)
print('\nTraining time: %s' % (now() - t))

# Evaluate on test data
evals_test = model.evaluate(imgs_test, labels_test)
print("Classification Accuracy: ", 100*evals_test[1])

# plot the accuracy and loss
plot_accuracy_loss()

Training time: 0:02:58.286062
33/33 [==============================] - 0s 12ms/step - loss: 0.4014 - accuracy: 0.9447
Classification Accuracy:  94.47092413902283
Figure: Accuracy and loss plots.
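
One way to interpret k is through the half-life of the learning rate, i.e., the number of epochs after which the learning rate is halved, which equals ln(2)/k. The sketch below repeats lr_exp_decay so that it is self-contained and computes the half-life for k = 0.1; the variable half_life is introduced here only for illustration.

```python
import math

INITIAL_LEARNING_RATE = 1e-4

def lr_exp_decay(epoch):
    k = 0.1  # rate of exponential decay
    return INITIAL_LEARNING_RATE * math.exp(-k * epoch)

# Number of epochs after which the learning rate is halved (for k = 0.1)
half_life = math.log(2) / 0.1
print(f"half-life: {half_life:.2f} epochs")
print(f"lr at epoch 0:  {lr_exp_decay(0):.3e}")
print(f"lr at epoch 49: {lr_exp_decay(49):.3e}")
```

With these settings the learning rate is halved roughly every 7 epochs and falls below 1e-6 during the last few of the 50 epochs, so a smaller k may be worth trying when training for longer.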

17.5 Keras Tuner

There are several libraries developed for tuning the hyperparameters of neural networks. One is the Keras Tuner for tuning Keras models.

The Keras Tuner is somewhat similar to Grid Search and Random Search in scikit-learn: it allows the user to define a search space for the hyperparameters over which the model will be fit, and it returns an optimal set of hyperparameters.

Keras Tuner is not part of the Keras package and it needs to be installed and imported.

[ ]:
pip install -q -U keras-tuner
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.5/129.5 kB 3.2 MB/s eta 0:00:00
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 950.8/950.8 kB 9.0 MB/s eta 0:00:00

[ ]:
import keras_tuner as kt
Using TensorFlow backend

Load Fashion MNIST Dataset

To demonstrate the use of the Keras Tuner we will work with the Fashion MNIST dataset.

[ ]:
(img_train, label_train), (img_test, label_test) = keras.datasets.fashion_mnist.load_data()

# Normalize pixel values between 0 and 1
img_train = img_train.astype('float32') / 255.0
img_test = img_test.astype('float32') / 255.0
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
29515/29515 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
26421880/26421880 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
5148/5148 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
4422102/4422102 [==============================] - 0s 0us/step

Model Builder

In the cell below, a function called model_builder is created, which defines the search space over two hyperparameters:

  • Number of neurons in the first Dense layer,

  • Learning rate.

In the code, hp is an instance of the HyperParameters class provided by Keras Tuner. The line hp_units = hp.Int('units', min_value=32, max_value=512, step=32) defines the search space for the number of neurons in the Dense layer as the values [32, 64, 96, …, 512].

Next, the search space for the learning rate is defined as the values [1e-2, 1e-3, 1e-4].

[ ]:
from keras.models import Sequential
from keras.layers import Flatten, Dense
from keras.losses import SparseCategoricalCrossentropy
from keras.optimizers import Adam


def model_builder(hp):
  model = Sequential()
  model.add(Flatten(input_shape=(28, 28)))

  # Tune the number of units in the first Dense layer
  # Choose an optimal value between 32-512
  hp_units = hp.Int('units', min_value=32, max_value=512, step=32)
  model.add(Dense(units=hp_units, activation='relu'))
  model.add(Dense(10))

  # Tune the learning rate for the optimizer
  # Choose an optimal value from 0.01, 0.001, or 0.0001
  hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])

  model.compile(optimizer=Adam(learning_rate=hp_learning_rate),
                loss=SparseCategoricalCrossentropy(from_logits=True),
                metrics=['accuracy'])

  return model
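
To get a sense of the size of the search space defined in model_builder, the candidate values can be enumerated in plain Python, without Keras Tuner. This is a minimal sketch; the variable names are chosen here only for illustration.

```python
from itertools import product

# Values covered by hp.Int('units', min_value=32, max_value=512, step=32)
units_values = list(range(32, 512 + 1, 32))
# Values passed to hp.Choice('learning_rate', ...)
learning_rate_values = [1e-2, 1e-3, 1e-4]

# Every (units, learning_rate) combination that a full grid would contain
search_space = list(product(units_values, learning_rate_values))
print(len(units_values), len(learning_rate_values), len(search_space))
```

Even with only two hyperparameters, an exhaustive search would require training dozens of models, which is why tuners that sample the space or stop weak configurations early are attractive.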

Hyperparameter Tuning

The Keras Tuner has four tuning algorithms available:

  • RandomSearch Tuner, similar to Random Search in scikit-learn, performs a random search over a distribution of values for the hyperparameters.

  • Hyperband Tuner, trains a large number of models for a few epochs and carries forward only the top-performing fraction of models to the next round, to converge to a high-performing model.

  • BayesianOptimization Tuner, performs Bayesian optimization by building a probabilistic model of how the hyperparameters affect the objective, and iteratively evaluating promising sets of hyperparameters.

  • Sklearn Tuner, designed for use with scikit-learn models.
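
The idea behind random search can be sketched in plain Python: instead of evaluating every combination, a fixed number of configurations is sampled from the search space. The sketch below is not Keras Tuner's implementation; the function sample_configurations, the seed, and the number of trials are hypothetical choices for illustration.

```python
import random

# Search space assumed for illustration (same values as in model_builder)
search_space = {
    "units": list(range(32, 512 + 1, 32)),
    "learning_rate": [1e-2, 1e-3, 1e-4],
}

def sample_configurations(space, num_trials, seed=0):
    """Draw `num_trials` random hyperparameter configurations from `space`."""
    rng = random.Random(seed)  # fixed seed so that the sampling is repeatable
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(num_trials)]

configs = sample_configurations(search_space, num_trials=5)
for config in configs:
    print(config)
```

Each sampled configuration would then be trained and scored on the validation data, and the best one kept. This simple version may draw the same configuration more than once.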

In the next cell, the Hyperband tuner is used. Its arguments are the model-building function, the objective (metric to monitor), the maximum number of epochs to train any single configuration of hyperparameters (poorly performing configurations are stopped earlier), and the factor (used to select the top-performing models, e.g., a reduction factor of 3 means that one third of the configurations will be kept for the next iteration, and the rest of the configurations will be eliminated).
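
The effect of the reduction factor can be illustrated with a small amount of arithmetic. The sketch below is a simplified, single-bracket view of successive halving and does not reproduce Keras Tuner's exact bookkeeping; the function name successive_halving_rounds and the starting number of 27 configurations are assumptions made for illustration.

```python
def successive_halving_rounds(num_configs, factor, min_epochs, max_epochs):
    """Return (number of configurations, epochs per configuration) for each round."""
    rounds = []
    epochs = min_epochs
    while True:
        rounds.append((num_configs, min(epochs, max_epochs)))
        if num_configs == 1:
            break
        # Keep roughly 1/factor of the configurations and train the survivors longer
        num_configs = max(1, num_configs // factor)
        epochs *= factor
    return rounds

rounds = successive_halving_rounds(num_configs=27, factor=3, min_epochs=1, max_epochs=10)
for round_index, (n, e) in enumerate(rounds):
    print(f"round {round_index}: {n} configurations, {e} epochs each")
```

Most configurations are trained for only a few epochs, and the full budget of max_epochs is spent only on the configurations that survive all rounds.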

[ ]:
tuner = kt.Hyperband(model_builder,
                     objective='val_accuracy',
                     max_epochs=10,
                     factor=3)
[ ]:
tuner.search(img_train, label_train, epochs=50, validation_split=0.2, callbacks=[EarlyStopping(monitor='val_loss', patience=5)])
Trial 30 Complete [00h 00m 41s]
val_accuracy: 0.8767499923706055

Best val_accuracy So Far: 0.8920833468437195
Total elapsed time: 00h 08m 42s
[ ]:
# Get the optimal hyperparameters
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]

print(f"Optimal number of neuron in the Dense layer: {best_hps.get('units')}")
print (f"Optimal learning rate: {best_hps.get('learning_rate')}")
Optimal number of neuron in the Dense layer: 480
Optimal learning rate: 0.001

Train and Evaluate the Model

Next, we will use the optimal hyperparameters from the Keras Tuner to create a model, and afterward we will evaluate the accuracy on the test dataset.

[ ]:
# Build the model with the optimal hyperparameters and train it on the data for 50 epochs
model = tuner.hypermodel.build(best_hps)
model.fit(img_train, label_train, epochs=50, validation_split=0.2, verbose=0)

eval_result = model.evaluate(img_test, label_test)
print("[test loss, test accuracy]:", eval_result)
313/313 [==============================] - 1s 2ms/step - loss: 0.5827 - accuracy: 0.8874
[test loss, test accuracy]: [0.5827264189720154, 0.8873999714851379]

Other libraries that perform model selection, hyperparameter tuning, and neural architecture search include AutoKeras, auto-sklearn, Auto-PyTorch, Ray Tune, Auto-WEKA, and others.

17.6 AutoML

AutoML or Automated Machine Learning refers to tools and libraries that are designed to allow non-ML experts to build Machine Learning systems and solve Data Science tasks without extensive knowledge in these fields.

AutoML systems range from automated No Code ML solutions that allow end-users to simply drag and drop their data, through Low Code systems that automate ML steps with minimal coding effort, to systems that require coding experience and are designed to increase the efficiency of data scientists by automating hyperparameter tuning and architecture search.

Most large providers of cloud computing and ML services offer some form of AutoML service. Examples include Google AutoML, Microsoft Azure AutoML, Amazon SageMaker, etc.

In a subsequent lecture on training and deploying a model on the cloud, we will present examples of training ML models using No Code and Low Code modes with Microsoft Azure ML.

References

  1. TensorFlow - ML Basics with Keras, Introduction to the Keras Tuner, available at https://www.tensorflow.org/tutorials/keras/keras_tuner#:~:text=The%20Keras%20Tuner%20is%20a,called%20hyperparameter%20tuning%20or%20hypertuning.

  2. Keras Learning Rate Finder, available at https://github.com/surmenok/keras_lr_finder.

  3. Learning Rate Schedule in Practice: an example with Keras and TensorFlow 2.0, B. Chen, available at https://towardsdatascience.com/learning-rate-schedule-in-practice-an-example-with-keras-and-tensorflow-2-0-2f48b2888a0c.

  4. AutoML.org Freiburg-Hannover, AutoML, available at https://www.automl.org/automl/.
