Lecture 20 - Transformer Networks


20.1 Introduction to Transformers

Transformer Neural Networks, or simply Transformers, are a neural network architecture introduced in 2017 in the now-famous paper “Attention Is All You Need”. The title refers to the attention mechanism, which forms the basis for data processing with Transformers.

Transformer Networks have been the predominant type of Deep Learning model for NLP in recent years. They have replaced Recurrent Neural Networks in most NLP tasks, and all Large Language Models employ the Transformer architecture. Transformers have also been adapted to other domains: they have outperformed other Machine Learning models in image and video processing, protein and DNA sequence prediction, and time-series processing, and they have been used for reinforcement learning tasks. Consequently, Transformers are currently the most important Neural Network architecture.

20.2 Self-attention Mechanism

Self-attention is a mechanism that allows a model to attend to relevant portions of the data when making predictions. In NLP, for instance, the self-attention mechanism is used to identify the words in a sentence that are significant for a given query word. That is, the model should pay more attention to some words in the sentence, and less attention to other words that are less relevant for the given task.

In the following two sentences, the word “it” refers to “street” in the left subfigure, while in the right subfigure the word “it” refers to “animal”. Understanding such relationships between words has been challenging for traditional NLP approaches. Transformers use the self-attention mechanism to model the relationships between all words in a sentence, and assign weights to the other words based on their importance. In the left subfigure, the mechanism estimated that the query word “it” is most related to the word “street”, but it is also somewhat related to the words “The” and “animal”. These words are referred to as key words for the query word “it”. The intensity of the lines connecting the words, as well as the intensity of the blue color, signifies the attention scores (i.e., weights): the wider and bluer the lines, the higher the attention score between two words.


Figure: Attention to words in sentences.

Specifically, the Transformer Network compares each word to every other word in the sentence and calculates attention scores. This is shown in the next figure, where, for example, the word “caves” has the highest attention scores for the words “glacier” and “formed”. The attention scores are calculated as the dot (i.e., inner) product of the input representations of two words. That is, for each Query word \(Q\) and Key word \(K\), the attention score is \(Q\cdot K\).


Figure: Attention scores.

Transformers employ word embeddings for representing the individual words in text sequences (where each text sequence can have one or several sentences). Recall from the previous lecture that word embeddings are vector representations of words, such that the vectors of words that have similar semantic meaning have close spatial positions in the embeddings space. Therefore, the attention scores are dot products of the embedding vectors for each pair of words in sentences.

The obtained attention scores for each word are then first scaled (by dividing the values by \(\sqrt d\)) and afterward normalized to the [0,1] range (by applying a softmax function). That is, the attention scores are calculated as \(a_{ij}=\text{softmax}_j\left(\frac{Q_i\cdot K_j}{\sqrt d}\right)\), where \(d\) is the dimensionality of the embedding vectors and the softmax is applied over all key positions \(j\) for a given query word \(i\). Dividing the values by \(\sqrt d\) is helpful for improving the flow of the gradients during training. The resulting scaled and normalized attention scores are then multiplied with the initial representation of the words, which in the self-attention module is referred to as the value, or \(V\).

This is shown in the next figure. The left subfigure shows the attention scores calculated as the dot product of the input representations of the words \(Q\) and \(K\), which are afterwards multiplied with the input representation \(V\) to obtain the output of the module. Note that for text classification, all three terms Query, Key, and Value are the same input representation of the words in the sentence. However, the original Transformer was developed for machine translation, where the words in the target language are queries, and the words in the source language are pairs of keys and values. This terminology is also related to search engines, which compare queries to keys and return values (e.g., the user submits a query, the search engine matches it against the keys of the indexed items, and it returns the corresponding values as search results). Self-attention works in a similar way, where each query word is matched to the other key words, and a weighted combination of the values is returned.

The right subfigure below shows how self-attention is implemented in Transformer Networks. Matmul stands for a matrix multiplication layer which calculates the dot product \(Q\cdot K\); the result is then divided by \(\sqrt d\), passed through an optional masking layer, and fed to a Softmax layer to obtain the attention scores \(\text{softmax}\left(\frac{Q_i\cdot K_j}{\sqrt d}\right)\). Finally, the attention scores are multiplied with \(V\) via another matrix multiplication layer Matmul to calculate the output of the self-attention module (a minimal code sketch of this computation is shown after the figure). The optional masking layer can be used for two purposes: (a) to ensure that attention scores are not calculated for the padding tokens in padded sequences (e.g., 0 is often used as the padding token), but only for the positions in the input sequences that contain actual words; or (b) to set the attention scores for future tokens to zero, so that the model can only attend to previous tokens, as explained in the section below on the decoder sub-network.


Figure: Self-attention in Transformer Networks
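
The computation in the figure can be summarized in a few lines of code. The following is a minimal sketch (not the exact Keras implementation used later), assuming \(Q\), \(K\), and \(V\) are batches of embedding vectors and the optional mask contains 1 for positions to attend to and 0 for masked positions.

[ ]:
import tensorflow as tf

def scaled_dot_product_attention(Q, K, V, mask=None):
    # MatMul: dot products between all query and key vectors
    scores = tf.matmul(Q, K, transpose_b=True)
    # Scale: divide by the square root of the embedding dimensionality d
    d = tf.cast(tf.shape(K)[-1], tf.float32)
    scores = scores / tf.sqrt(d)
    # Optional mask: set masked positions to a large negative value before the softmax
    if mask is not None:
        scores += (1.0 - mask) * -1e9
    # Softmax: normalize the attention scores to the [0, 1] range
    attention_weights = tf.nn.softmax(scores, axis=-1)
    # MatMul: weighted sum of the value vectors
    return tf.matmul(attention_weights, V), attention_weights

# Toy example: 1 sequence of 4 tokens with 8-dimensional embeddings, with Q = K = V
x = tf.random.normal((1, 4, 8))
output, weights = scaled_dot_product_attention(x, x, x)
print(output.shape, weights.shape)  # (1, 4, 8) (1, 4, 4)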

In conclusion, self-attention is applied to determine the meaning of the words in a sentence based on the context. That is, Transformers use the attention scores to modify the input vector representations for each word and generate a new representation based on the context of the sentence. During the training of the network, the representations of the words are updated and projected into a new embeddings space that takes the context into account.

20.3 Multi-Head Attention

Transformer Networks include multiple self-attention modules in their architecture. Each self-attention module is called an attention head, and the aggregation of the outputs of multiple attention heads is called multi-head attention. For instance, the original Transformer model had 8 attention heads per layer, while the largest GPT-3 language model uses 96 attention heads per layer.

The multi-head attention module is shown in the next figure: the inputs are first passed through linear layers (dense, fully-connected layers), next they are fed to the multiple attention heads, the outputs of all attention heads are concatenated, and they are passed through one more linear layer.

A logical question one can ask is: why are multiple attention heads needed? The reason is that multiple attention modules can learn different relationships between the words in a sentence. Each module can extract context independently from the other modules, which allows the model to capture less obvious context and enhances its learning capabilities. For example, one head may capture the relationship between nouns and numerical values in sentences, another head may focus on the relationship between adjectives, another head may focus on rhyming words, etc. And, if one head becomes too specialized in capturing one type of pattern, the other heads can compensate for it and provide redundancy that can improve the overall performance of the model.

Also, the computations of each attention head can be performed in parallel on different workers, which allows for accelerating the training and scaling up the models.
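
Keras provides a built-in MultiHeadAttention layer, which is used in the encoder implementation below. As a quick, illustrative check of its input and output shapes, consider the following example with made-up dimensions (2 sequences of 10 tokens with 64-dimensional embeddings, and 8 attention heads).

[ ]:
from keras.layers import MultiHeadAttention
import tensorflow as tf

mha = MultiHeadAttention(num_heads=8, key_dim=64)
x = tf.random.normal((2, 10, 64))

# For self-attention, the same tensor serves as query and value (the key defaults to the value)
output, attention_scores = mha(x, x, return_attention_scores=True)
print(output.shape)            # (2, 10, 64)
print(attention_scores.shape)  # (2, 8, 10, 10): one 10x10 attention matrix per head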


Figure: Multi-head attention

20.4 Encoder Block

The Encoder Block in Transformer Networks is shown in the next figure. It processes the input word embeddings and extracts representations of the text data that can afterwards be used for different NLP tasks.

The components in the Encoder Block are:

  • Multi-head Attention layer, which as explained, consists of multiple self-attention modules.

  • Dropout layer, is a standard dropout layer used for regularization.

  • Residual connections, are skip connections in neural networks, where the input to a layer is added to the processed output of the layer. Residual connections were popularized in the ResNets models, as they were shown to stabilize the training and mitigate the problems of vanishing and exploding gradients in neural networks (i.e., they refer to cases when the gradients become too small or too large during training). In the figure, the Add term in the layer refers to the residual connection, which adds the input embeddings to the output of the Dropout layer.

  • Layer Normalization, is an operation that is similar to the batch normalization in CNNs, but instead of normalizing across the samples in a batch, it normalizes the representation of each token independently of the other tokens and samples, and scales the data to have 0 mean and 1 standard deviation. This type of normalization is more adequate for text data. And, as we learned in the previous lectures, normalization improves the flow of gradients during training. The Norm term in the figure refers to the Layer Normalization operation.

  • Feed Forward network, consists of 2 fully-connected (dense) layers that extract useful data representations.

  • The Encoder Block also contains one more Dropout layer, and another Add & Norm layer that forms a residual connection for the input to the Feed Forward network and applies a layer normalization operation.

Larger Transformer networks typically include several encoder blocks in a sequence. For instance, in the original paper the authors used 6 encoder blocks.


Figure: Encoder block

The implementation of the Encoder Block in Keras and TensorFlow is shown in the cell following the imported libraries.

The Encoder Block is implemented as a custom layer which is a subclass of the Layer class in Keras. The __init__() constructor method lists the definitions of the layers in the Encoder, and the method call() provides the forward pass with the flow of information through the layers.

  • Multi-head attention layer is implemented in Keras, and it can be directly imported. The arguments in the layer are: num_heads is the number of attention heads, and key_dim is the dimensionality of the query and key projections (here set to the dimension of the token embeddings).

  • Dropout and Normalization layers are also directly imported, with arguments rate for the dropout rate, and epsilon, a small float added to the variance to avoid division by zero.

  • Feed forward network includes 2 dense layers, with the number of neurons set to ff_dim and embed_dim, respectively.

The call() method specifies the forward pass of the network, and takes two parameters: inputs (the input embeddings to the network) and training (an argument which can be True or False). For the dropout layers, during the model training this argument is set to True and dropout is applied, while during inference the argument is set to False and dropout is not applied.

Each step in the call() method performs the data processing for one layer. Note that the multi_head_attention layer receives the inputs twice, once as the query and once as the value (in Keras, the key defaults to the value when it is not provided). Also note the residual connections that are implemented in the Add & Norm steps, e.g., the inputs are added to the output of the multi-head attention before the layer normalization is applied.

[ ]:
import tensorflow as tf
from tensorflow import keras
from keras.layers import MultiHeadAttention, LayerNormalization, Dropout, Dense, Embedding, Layer
from keras import Sequential, Model
[ ]:
class TransformerEncoder(Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super().__init__()
        # Multi-head self-attention followed by a feed-forward network,
        # each with dropout, a residual connection, and layer normalization
        self.multi_head_attention = MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        self.feed_forward_net = Sequential([Dense(ff_dim, activation="relu"), Dense(embed_dim),])
        self.layer_normalization1 = LayerNormalization(epsilon=1e-6)
        self.layer_normalization2 = LayerNormalization(epsilon=1e-6)
        self.dropout1 = Dropout(rate)
        self.dropout2 = Dropout(rate)

    def call(self, inputs, training):
        # Self-attention: the inputs serve as both the query and the value
        multi_head_att_output = self.multi_head_attention(inputs, inputs)
        multi_head_att_dropout = self.dropout1(multi_head_att_output, training=training)
        # Add & Norm: residual connection followed by layer normalization
        add_norm_output_1 = self.layer_normalization1(inputs + multi_head_att_dropout)
        # Feed-forward network with dropout
        feed_forward_output = self.feed_forward_net(add_norm_output_1)
        feed_forward_dropout = self.dropout2(feed_forward_output, training=training)
        # Second Add & Norm
        add_norm_output_2 = self.layer_normalization2(add_norm_output_1 + feed_forward_dropout)
        return add_norm_output_2
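
As a quick, illustrative sanity check (with made-up dimensions), the encoder block should preserve the shape of its input, since it only transforms the token representations.

[ ]:
# A batch of 2 sequences, each with 10 tokens embedded in 32 dimensions
encoder_block = TransformerEncoder(embed_dim=32, num_heads=2, ff_dim=32)
dummy_embeddings = tf.random.normal((2, 10, 32))
print(encoder_block(dummy_embeddings, training=False).shape)  # (2, 10, 32)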

20.5 Positional Encoding

We mentioned that Transformers use word embeddings as inputs; however, the embeddings alone don’t provide information about the order of the words in sentences. Understandably, the order of the words in a sentence is important, and a different word order can convey a different meaning. To provide such information, the Transformer Network introduces a positional encoding for each word, which is added to the input embedding, as shown in the next figure.


Figure: Positional encoding

There are different ways in which positional encoding can be implemented. In the original Transformer paper, the positional encoding is a vector that has the same size as the word embedding vector, and the authors used sine and cosine functions of different frequencies to create position vectors, whose values lie in the range from -1 to 1. Using such positional encoding, each encoding vector corresponds to a unique position in a sequence of words. This type is called sinusoidal positional encoding.
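
For reference, the following is a minimal NumPy sketch of such a sinusoidal positional encoding, with sine values on the even embedding dimensions and cosine values on the odd dimensions (the sequence length and embedding dimension below are arbitrary).

[ ]:
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, np.newaxis]      # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]           # (1, d_model)
    # Each pair of dimensions uses a different frequency
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])        # sine on even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])        # cosine on odd dimensions
    return encoding                                    # values are in the range [-1, 1]

print(sinusoidal_positional_encoding(seq_len=50, d_model=32).shape)  # (50, 32)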

The following cell implements the addition of positional encoding to word embeddings in Keras. In this case, we will not use the approach for obtaining positional encodings based on sine and cosine functions, but instead we will use a simpler approach and learn the positional encodings in the same way the word embeddings are learned. This type of positional encoding is referred to as learned positional encodings/embeddings. Therefore, for both token and positional embeddings we will use the Embedding layer in Keras which we introduced in the previous lecture. The arguments in the Embedding layer are the input dimension input_dim and the dimension of the embedding vectors output_dim. For the token embeddings layer, the input dimension is the size of the vocabulary, whereas for the positional embeddings layer the input dimension is the length of the text sequences.

In the call method, first the length of the text sequences is assigned to maxlen. The function tf.range is similar to NumPy’s arange and creates numbers in the range from start to limit with a step delta. Next, the two separate Embedding layers are called, and the sum of the token and positional embeddings is returned.

[ ]:
class TokenAndPositionEmbedding(Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super().__init__()
        # One embedding table for the tokens and one for the positions
        self.token_embeddings = Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.positional_embeddings = Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, inputs):
        # Positions 0, 1, ..., maxlen-1 for the current sequence length
        maxlen = tf.shape(inputs)[-1]
        positions = tf.range(start=0, limit=maxlen, delta=1)
        position_embeddings = self.positional_embeddings(positions)
        input_embeddings = self.token_embeddings(inputs)
        # The positional embeddings are added to the token embeddings
        return input_embeddings + position_embeddings

20.6 Using a Transformer Model for Classification

Model Definition

We will now employ the layers that we defined above, to create a Transformer model for text classification.

It is a simple model that consists of the following parts:

  • Encoder, which includes an Input layer that defines the maximum length of input sequences, TokenAndPositionEmbedding layer, and the TransformerEncoder layer.

  • Classifier, which consists of a GlobalAveragePooling1D layer, and two Dropout and Dense layers. Global Average Pooling averages the representations over all words in a sequence, producing a single vector per text sequence, which is passed to the dense layers to classify the text sequence.

[ ]:
from keras.layers import Input, GlobalAveragePooling1D

maxlen = 200  # Maximum length of input sequences is 200 words
embed_dim = 32  # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32  # Dense layer size in the feed forward network inside transformer
vocab_size = 20000  # The size of the vocabulary is 20k words

# encoder
inputs = Input(shape=(maxlen,))
embedding_layer = TokenAndPositionEmbedding(maxlen, vocab_size, embed_dim)(inputs)
x = TransformerEncoder(embed_dim, num_heads, ff_dim)(embedding_layer)

# classifier
x = GlobalAveragePooling1D()(x)
x = Dropout(0.1)(x)
x = Dense(20, activation="relu")(x)
x = Dropout(0.1)(x)
outputs = Dense(1, activation="sigmoid")(x)

model = Model(inputs=inputs, outputs=outputs)

The summary of the model is shown below.

[ ]:
model.summary()
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 input_1 (InputLayer)        [(None, 200)]             0

 token_and_position_embeddi  (None, 200, 32)           646400
 ng (TokenAndPositionEmbedd
 ing)

 transformer_encoder (Trans  (None, 200, 32)           10656
 formerEncoder)

 global_average_pooling1d (  (None, 32)                0
 GlobalAveragePooling1D)

 dropout_2 (Dropout)         (None, 32)                0

 dense_2 (Dense)             (None, 20)                660

 dropout_3 (Dropout)         (None, 20)                0

 dense_3 (Dense)             (None, 1)                 21

=================================================================
Total params: 657737 (2.51 MB)
Trainable params: 657737 (2.51 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________

Loading the Dataset

Let’s apply the model for sentiment analysis of the movie reviews in the IMDB dataset. The data is loaded from the Keras datasets, and it contains 25,000 training sequences and 25,000 validation sequences.

[ ]:
from keras.preprocessing.sequence import pad_sequences

(x_train, y_train), (x_val, y_val) = keras.datasets.imdb.load_data(num_words=vocab_size)
print(len(x_train), "Training sequences")
print(len(x_val), "Validation sequences")
x_train = pad_sequences(x_train, maxlen=maxlen)
x_val = pad_sequences(x_val, maxlen=maxlen)
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/imdb.npz
17464789/17464789 [==============================] - 1s 0us/step
25000 Training sequences
25000 Validation sequences

Model Training

[ ]:
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

model.fit(x_train, y_train, batch_size=32, epochs=2, validation_data=(x_val, y_val))
Epoch 1/2
782/782 [==============================] - 69s 76ms/step - loss: 0.3943 - accuracy: 0.8097 - val_loss: 0.2872 - val_accuracy: 0.8773
Epoch 2/2
782/782 [==============================] - 20s 26ms/step - loss: 0.1968 - accuracy: 0.9239 - val_loss: 0.3153 - val_accuracy: 0.8746
<keras.src.callbacks.History at 0x7920b23f9000>
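
If needed, the trained model can also be evaluated on the validation set directly (the exact numbers will vary from run to run).

[ ]:
val_loss, val_accuracy = model.evaluate(x_val, y_val, verbose=0)
print(f"Validation loss: {val_loss:.4f}, validation accuracy: {val_accuracy:.4f}")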

20.7 Fine-tuning a Pretrained BERT Model

BERT (Bidirectional Encoder Representations from Transformers) is a Transformer Network that can be used for a variety of NLP tasks, such as question answering, text classification, named entity recognition, and others.

In this section we will use a pretrained version of BERT and will fine-tune it for classification of news articles in the AG News dataset (which we used in the previous lecture).

TensorFlow Hub is a repository of pretrained machine learning models, and it offers several versions of BERT, such as Small BERT, ALBERT, and BERT Experts. The different versions of BERT are optimized for different use cases. In our case, we will use Small BERT.

To use this model we will need to install the TensorFlow Text library for text processing.

[ ]:
!pip install -q tensorflow_text

import tensorflow_text as text
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 6.5/6.5 MB 36.9 MB/s eta 0:00:00

The BERT model in TensorFlow Hub has a corresponding text preprocessing model for converting texts into tokens.

[ ]:
import tensorflow_hub as hub
import numpy as np

bert_handle = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-4_H-512_A-8/2'
preprocessing_model = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'

The output of the preprocessing model has 3 elements:

  • input_word_ids: token ids of the input sequences.

  • input_mask: has value 1 for all input tokens before padding, and value 0 for the padding tokens.

  • input_type_ids: has different values for the segments in the text; e.g., if there are 3 sentences in the input text, the tokens in the same sentence will have the same index.

Let’s wrap preprocessing_model into a hub.KerasLayer and test it on a sample sentence.

[ ]:
preprocess_layer = hub.KerasLayer(preprocessing_model)
[ ]:
sample_news = ['Tech rumors: The tech giant Apple is working on a self driving car']

preprocessed_news = preprocess_layer(sample_news)

print('Keys:', preprocessed_news.keys())
# length of the input sequence
print('Shape:', preprocessed_news["input_word_ids"].shape)
print('Word Ids:', preprocessed_news["input_word_ids"][0,:10])
print('Input Mask:', preprocessed_news["input_mask"][0, :10])
print('Type Ids:', preprocessed_news["input_type_ids"][0, :10])
Keys: dict_keys(['input_mask', 'input_word_ids', 'input_type_ids'])
Shape: (1, 128)
Word Ids: tf.Tensor([  101  6627 11256  1024  1996  6627  5016  6207  2003  2551], shape=(10,), dtype=int32)
Input Mask: tf.Tensor([1 1 1 1 1 1 1 1 1 1], shape=(10,), dtype=int32)
Type Ids: tf.Tensor([0 0 0 0 0 0 0 0 0 0], shape=(10,), dtype=int32)

Loading the Dataset

The news articles in the AG dataset are classified into 4 categories: World, Sports, Business, and Sci/Tech.

[ ]:
import tensorflow_datasets as tfds

(train_data, val_data), info = tfds.load('ag_news_subset:1.0.0', #version 1.0.0
                                         split=('train', 'test'),
                                         with_info=True,
                                         as_supervised=True)
Downloading and preparing dataset 11.24 MiB (download: 11.24 MiB, generated: 35.79 MiB, total: 47.03 MiB) to /root/tensorflow_datasets/ag_news_subset/1.0.0...
Dataset ag_news_subset downloaded and prepared to /root/tensorflow_datasets/ag_news_subset/1.0.0. Subsequent calls will reuse this data.
[ ]:
# Dataset information
class_names = info.features['label'].names
print('Classes:', class_names)

print('Number of training samples:', info.splits['train'].num_examples)
print('Number of test samples:', info.splits['test'].num_examples)
Classes: ['World', 'Sports', 'Business', 'Sci/Tech']
Number of training samples: 120000
Number of test samples: 7600
[ ]:
buffer_size = 1000
batch_size = 32

# prepare the data
train_data = train_data.shuffle(buffer_size)
train_data = train_data.batch(batch_size).prefetch(1)
val_data = val_data.batch(batch_size).prefetch(1)
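
As an optional sanity check, we can take one batch from the training pipeline and inspect a raw news article and its label (the text shown will differ between runs because of shuffling).

[ ]:
for text_batch, label_batch in train_data.take(1):
    print(text_batch.numpy()[0][:100])                  # first 100 characters of a raw news article
    print('Label:', class_names[label_batch.numpy()[0]])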

Model Definition with BERT

The model defined below includes an Input layer, a preprocessing layer to convert the text data into tokens (word ids, input mask, and type ids), and a layer for the BERT model.

Afterward, the pooled output of BERT is passed through a classifier head, which includes two dense layers and a dropout layer.

[ ]:
# input layer
input_text = Input(shape=(), dtype=tf.string)

# preprocessing model
preprocessing_layer = hub.KerasLayer(preprocessing_model)(input_text)
# Bert model, set trainable to True
bert_encoder = hub.KerasLayer(bert_handle, trainable=True)(preprocessing_layer)

# For fine-tuning use pooled output
pooled_bert_output = bert_encoder['pooled_output']

# classifier
x = Dense(16, activation='relu')(pooled_bert_output)
x = Dropout(0.2)(x)
final_output = Dense(4, activation='softmax')(x)

# Combine input and output
news_model = Model(input_text, final_output)

Model Training

Let’s compile and train the model.

[ ]:
# compile
news_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
                        loss='sparse_categorical_crossentropy',
                        metrics=['accuracy'])
# train
news_model.fit(train_data, epochs=3, validation_data=val_data)
Epoch 1/3
3750/3750 [==============================] - 735s 193ms/step - loss: 0.3221 - accuracy: 0.8924 - val_loss: 0.2472 - val_accuracy: 0.9178
Epoch 2/3
3750/3750 [==============================] - 715s 191ms/step - loss: 0.2130 - accuracy: 0.9302 - val_loss: 0.2213 - val_accuracy: 0.9272
Epoch 3/3
3750/3750 [==============================] - 714s 190ms/step - loss: 0.1623 - accuracy: 0.9473 - val_loss: 0.2526 - val_accuracy: 0.9242
<keras.src.callbacks.History at 0x791f163977f0>

Model Evaluation

Finally, let’s predict the class of two news articles.

[ ]:
sample_news_1 = ['Tesla, a self driving car company is also planning to make a humanoid robot. This humanoid robot appeared dancing in the latest Tesla AI day']

predictions_1 = news_model.predict(np.array(sample_news_1))

predicted_class_1 = np.argmax(predictions_1)

print('Predicted class:', predicted_class_1)
print('Predicted class name:', class_names[predicted_class_1])
1/1 [==============================] - 0s 426ms/step
Predicted class: 3
Predicted class name: Sci/Tech
[ ]:
sample_news_2 = ["In the last weeks, there has been many transfer suprises in footbal. Ronaldo went back to Old Trafford, "
                "while Messi went to Paris Saint Germain to join his former colleague Neymar."
                "We can't wait to see these two clubs will perform in upcoming leagues"]

predictions_2 = news_model.predict(np.array(sample_news_2))

predicted_class_2 = np.argmax(predictions_2)

print('Predicted class:', predicted_class_2)
print('Predicted class name:', class_names[predicted_class_2])
1/1 [==============================] - 0s 22ms/step
Predicted class: 1
Predicted class name: Sports

20.8 Decoder Sub-network

The Transformer Network in the original paper was designed for machine translation. Differently from the text classification task, where for an input text sequence the model predicts a class label, in machine translation for an input text sequence in a source language the model predicts the corresponding text sequence in a target language. Therefore, both the input and the output of the model are text sequences. This type of model is called a sequence-to-sequence model, oftentimes abbreviated to seq2seq model. Besides machine translation, other NLP tasks that employ seq2seq models include question answering, text summarization, dialog generation, and others.

The architecture of Transformer Networks designed to handle seq2seq tasks consists of encoder and decoder sub-networks.

  • Encoder sub-network takes a source text sequence as an input, and extracts a useful representation of the text data.

  • Decoder sub-network takes a target text sequence as an input, and it also receives the intermediate representation from the encoder sub-network. The decoder combines the information from the target sequence and the encoded source sequence, and learns to predict the next word (token) in the target sequence.

This is shown in the next figure, where the French sequence “Je suis etudiant” is translated into “I am a student”. The decoder outputs one word at each time step, until the end-of-sequence token is reached.


Figure: Decoder block

Models that predict future values based on past observations, under the assumption that the current value depends on the previous values, are called autoregressive models. Autoregressive text generation involves iteratively generating one token at a time, by predicting the next word or token based on the preceding words in the sequence. This is the approach that allows models such as chatbots to produce coherent and relevant responses.

The architecture of the decoder is similar to the encoder and it is shown in the next figure. The upper part of the decoder is practically the same as the encoder, and it consists of a multi-head attention module with residual connections and layer normalization, followed by a feed-forward network with residual connections and layer normalization.

The main difference from the encoder is the masked multi-head attention module in the lower part of the decoder. This module is inserted before the multi-head attention module in the decoder. The masked multi-head attention module applies masking to the subsequent words in the target sequence, so that the network does not have access to those words. That is, during training, if the model needs to predict the 4th word in a sentence, masks are applied to all words after the 3rd word, so that the model has access only to words 1, 2, and 3 when predicting the 4th word. This step ensures that the model uses only the previous steps to predict the word in the next step in the target sequence. This type of mask is also referred to as a causal attention mask, because it enforces causality by ensuring that the model relies only on information available up to the current token for predicting the next token.
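
For illustration, a causal attention mask for a short sequence can be constructed as a lower-triangular matrix, where 1 means that a position may be attended to and 0 means it is masked. The following is a small sketch (not part of the original implementation); such a mask can be passed to attention layers that accept an attention mask argument.

[ ]:
# Causal (look-ahead) mask for a sequence of 5 tokens:
# position i may attend only to positions <= i
seq_len = 5
causal_mask = tf.linalg.band_part(tf.ones((seq_len, seq_len)), -1, 0)
print(causal_mask.numpy())
# [[1. 0. 0. 0. 0.]
#  [1. 1. 0. 0. 0.]
#  [1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 0.]
#  [1. 1. 1. 1. 1.]]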

This also explains why in the figure below the inputs to the decoder sub-network are labeled “Outputs (shifted right)”. At each step, the target sequence is shifted to the right by one position and fed again into the decoder. E.g., after predicting the 4th word, to predict the 5th word the input to the decoder will be words 1, 2, 3, and 4, and so on.

Finally, the output representations from the decoder are passed to a linear (dense) layer and a softmax layer, which outputs a probability distribution over the vocabulary (learned from the training dataset) for the next word.

Also note the Nx marks in the figure. They indicate that the shown encoder and decoder blocks are repeated multiple times in the network. In the original Transformer Network, there are 6 encoder blocks, and similarly there are 6 decoder blocks. Introducing multiple blocks in the encoder and decoder sub-networks increases the learning ability, as it allows the model to learn more abstract representations.


Figure: Transformer Network

Note that Recurrent Neural Networks can also be used as seq2seq models. Transformer Networks have several advantages over RNNs: they can inspect entire text sequences at once, capture context in long sequences, are parallelizable, and are more powerful in general. Conversely, RNNs process tokens one at a time (and therefore have difficulty finding correlations in long sequences, because the information needs to pass through many processing steps), cannot perform parallel computations (and are therefore slow to train), and their gradients can become unstable.

20.9 Vision Transformers

After the initial success of Transformer Networks in NLP, recently they have been adapted for computer vision tasks as well. The initial Transformer model for vision tasks proposed in 2021 was called Vision Transformer (ViT).

The architecture of ViT is very similar to the Transformers used in NLP. However, Transformer Networks were designed for working with sequential data, while images are a spatial data type. Treating each pixel in an image as a token in a sequence would be impractical and too time-consuming. Therefore, ViT splits images into a set of smaller image patches (16x16 pixels), and it uses the sequence of image patches as input to the model (i.e., each image patch is treated as a token). Each image patch is first flattened into a one-dimensional vector, and those vectors are afterward passed through a dense layer to learn lower-dimensional embeddings for each patch. Positional embeddings are added, a learnable class embedding (class token) is prepended to the sequence, and the sequence is fed to a standard Transformer encoder; the output representation of the class token is used for the classification. The encoder block in ViT is identical to the encoder in the original Transformer Network. The steps are depicted in the figure below.


Figure: Vision Transformer
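
To make the patching step concrete, the following is a small, illustrative sketch (not the actual ViT implementation) that splits a dummy 224x224 RGB image into non-overlapping 16x16 patches with tf.image.extract_patches and flattens each patch into a token vector.

[ ]:
# Split a batch of one 224x224 RGB image into 16x16 patches
images = tf.random.normal((1, 224, 224, 3))
patches = tf.image.extract_patches(images,
                                   sizes=[1, 16, 16, 1],
                                   strides=[1, 16, 16, 1],
                                   rates=[1, 1, 1, 1],
                                   padding='VALID')
num_patches = patches.shape[1] * patches.shape[2]            # 14 * 14 = 196 patches
patch_tokens = tf.reshape(patches, (1, num_patches, 16 * 16 * 3))
print(patch_tokens.shape)  # (1, 196, 768): a sequence of 196 tokens of dimension 768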

The authors trained 3 versions of ViT, called Base (12 encoder blocks, 768 embeddings dimension, 86M parameters), Large (24 encoder blocks, 1,024 embeddings dimension, 307M parameters), and Huge (32 encoder blocks, 1,280 embeddings dimension, 632M parameters).

Various other versions of vision transformers have been introduced recently, including MaxViT (Multi-axis ViT), Swin (Shifted Window ViT), DeiT (Data-efficient image Transformer), T2T-ViT (Token-to-token ViT), and others. These models have achieved higher accuracy on many vision tasks in comparison to Convolutional Neural Networks (e.g., EfficientNet, ConvNeXt, NFNet). The following figure shows the accuracy on ImageNet.


Figure: Accuracy on the ImageNet dataset

References

  1. The Illustrated Transformer, Jay Alammar, available at: https://jalammar.github.io/illustrated-transformer/.

  2. Keras Examples, Text classification with Transformer, available at: https://keras.io/examples/nlp/text_classification_with_transformer/.

  3. Using Pretrained BERT for Text Classification, Jean de Dieu Nyandwi, available at: https://github.com/Nyandwi/machine_learning_complete/blob/main/9_nlp_with_tensorflow/5_using_pretrained_bert_for_text_classification.ipynb.

  4. Deep Learning with Python, Francois Chollet, Second Edition, Manning Publications, 2021.

  5. TensorFlow Tutorials, Neural Machine Translation with a Transformer and Keras, available at https://www.tensorflow.org/text/tutorials/transformer.

  6. How the Vision Transformer (ViT) Works in 10 Minutes: An Image is Worth 16x16 Words, Nikolas Adaloglou, available at https://theaisummer.com/vision-transformer/.
