Lecture 23 - Large Language Models


23.1 Introduction to LLMs

Large Language Models (LLMs) are a class of Deep Neural Networks designed to understand and generate natural human language. These models have achieved state-of-the-art performance across various NLP tasks.

LLMs are a result of many years of research and advancement in NLP and Machine Learning. Important phases in the development include:

  • Statistical language models (1980s-2000s): developed to predict the probability of a word in a text sequence based on the preceding words. Examples of statistical language models include N-gram models. These models were used in tasks like speech recognition and machine translation, but struggled with capturing long-range dependencies and context-related information in text.

  • Neural network models (2000-2017): fully-connected NNs and recurrent NNs emerged as an alternative to statistical language models. Long Short-Term Memory (LSTM) RNN models were used for sequence-to-sequence tasks like machine translation, and they formed the basis for several early LLMs. Similar to statistical language models, RNNs struggled with capturing context-related information; other limitations of RNNs include the inability to parallelize data processing and unstable gradients during training.

  • Transformer network models (2017-present): Transformer architecture introduced the self-attention mechanism as a replacement for the recurrent layers in RNNs. This breakthrough enabled the development of more powerful and efficient LLMs, laying the foundation for BERT, GPT, and modern LLMs.

23.1.1 Architecture of Large Language Models

The architecture of LLMs is based on Transformer Networks, which we covered in Lecture 20. The main components of the Transformer Networks architecture include:

  • Input embeddings: fixed-size continuous vectors that represent the tokens in the input text.

  • Positional encodings: fixed-size continuous vectors that are added to the input embeddings to provide information about the positions of the tokens in the input text sequence.

  • Encoder: composed of a stack of multi-head self-attention modules and fully-connected (feed-forward) modules. The encoder block also includes dropout layers, residual connections, and layer normalization.

  • Decoder: composed of a stack of multi-head attention modules and fully-connected (feed-forward) modules, similarly to the encoder block. The decoder block also has a masked multi-head self-attention module, which masks the subsequent tokens in the text sequence so that the module cannot access them when predicting the next token.

  • Output fully-connected layer: the output of the decoder is passed through a fully-connected (dense, linear) layer that produces the scores for predicting the next token in the text sequence.


Figure: Architecture of the Transformer Network. Source: [2].

The architecture of Transformer Networks includes multiple successive encoder and decoder blocks to create deep networks with many layers that allow learning complex patterns in input text. For example, the original Transformer Network has 6 encoder and 6 decoder blocks, as shown in the above figure.

The self-attention mechanism is a key component of the Transformer Network architecture that enables the model to weigh the importance of each token with respect to the other tokens in a sequence. It allows the model to capture long-range dependencies and relationships between the tokens (words) and helps the model understand the context and structure of the input text sequence.
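
To make this computation concrete, below is a minimal sketch of single-head scaled dot-product self-attention on random toy data (the projection matrices here are random, not learned); actual Transformer implementations add multiple heads, masking, and learned per-head projections, but the core weighting of tokens is the same.

[ ]:
import torch
import torch.nn.functional as F

def self_attention(x, W_q, W_k, W_v):
    # x: (seq_len, d_model) token embeddings; W_q, W_k, W_v: (d_model, d_k) projections
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    # Attention scores measure how much each token attends to every other token
    scores = Q @ K.T / (K.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)   # each row sums to 1
    return weights @ V                    # weighted sum of the value vectors

torch.manual_seed(0)
d_model, d_k, seq_len = 16, 8, 5
x = torch.randn(seq_len, d_model)
W_q, W_k, W_v = (torch.randn(d_model, d_k) for _ in range(3))
print(self_attention(x, W_q, W_k, W_v).shape)   # torch.Size([5, 8])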

23.1.2 Variants of Transformer Network Architectures

Various LLMs have been built on top of the Transformer Network architecture. The popular variants include:

  • Decoder-only models: are autoregressive models that utilize only the decoder part of the Transformer Network architecture. These models are particularly suitable for generating text and content. An example of decoder-only LLMs is the family of GPT models.

  • Encoder-only models: use only the encoder part of the Transformer Network architecture, and perform well on tasks related to language understanding, such as classification and sentiment analysis. An example is the BERT model.

  • Encoder-decoder models: employ the original Transformer Network architecture and combine encoder and decoder sub-networks, enabling them to both understand language and generate content. These models can be used for various NLP tasks with minimal task-specific modifications. An example of this class of models is T5 (Text-to-Text Transfer Transformer).
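
As a short illustration of the first two variants, the following sketch uses the Hugging Face transformers library with the small, publicly available GPT-2 and BERT checkpoints (downloaded on first use); it is only meant to show a decoder-only model generating text and an encoder-only model filling in a masked token.

[ ]:
from transformers import pipeline

# Decoder-only (GPT-style): autoregressive text generation
generator = pipeline("text-generation", model="gpt2")
print(generator("Large language models are", max_length=20)[0]["generated_text"])

# Encoder-only (BERT-style): masked-token prediction, suited for language understanding tasks
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("Large language models can [MASK] natural language.")[0]["token_str"])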

List of LLMs

A large number of LLMs were developed in the past several years. Some of the most well-known LLMs include:

  • GPT (Generative Pretrained Transformers): Developed by OpenAI, the GPT family are the best-known LLMs. They include GPT 1, 2, 3, 3.5 (initial ChatGPT), and 4 (current ChatGPT). According to some sources, GPT-4 has 1.76 trillion parameters, and it is trained on 13T tokens.

  • BERT (Bidirectional Encoder Representations from Transformers): Developed by Google in 2018, BERT is an early LLM with 340M parameters that can understand natural language and answer questions.

  • LlaMA (Large Language Model Meta AI): Developed by Meta AI, LlaMA is an open-source LLM that can be used for both research and commercial purposes. It consists of several models, including the LlaMA base model, LlaMA-Chat, and Code-LlaMA. LlaMA 2 includes models with 7B, 13B, and 70B parameters, trained on 2T tokens.

  • Falcon: Developed by UAE’s Technology Innovation Institute (TII), it is an open-source family of models with 1.3B, 7.5B, 40B, and 180B parameters, trained on 3.5T tokens.

  • Bard: Developed by Google, Bard is an LLM with 137B parameters trained on 1.56T tokens, based on the LaMDA model.

  • Claude: Developed by Anthropic, Claude is an LLM with 137B parameters.

  • PaLM (Pathways Language Model): Developed by Google, PaLM is an LLM with 540B parameters capable of common-sense and arithmetic reasoning, code generation, and translation. It was trained on 3.6T tokens.

  • Cohere LLM: Developed by Cohere, it is a family of LLMs with 6B, 13B, and 52B parameters, designed for enterprise use cases.

  • Vicuna: Developed by LMSYS, Vicuna is a 13B-parameter chat assistant finetuned from LLaMA on user-shared conversations.

  • Alpaca: Developed by Stanford, Alpaca is an LLM finetuned from LLaMA on instruction-following samples.

  • Dolly: Developed by Databricks, Dolly is an open-source instruction-following LLM with 12B parameters.

23.2 Creating LLMs

Creating modern LLMs such as ChatGPT or LlaMA 2 typically involves three main phases:

  1. Pretraining: the model extracts knowledge from large unlabeled text datasets.

  2. Supervised finetuning: the model is refined to improve the quality of the generated responses.

  3. Alignment: the model is further refined to generate safe and helpful responses that are aligned with human preferences.

23.2.1 Pretraining

The first step in creating LLMs is pretraining the model on massive amounts of text data. The datasets usually consist of a large collection of web pages or e-books comprising billions or trillions of tokens, and ranging from gigabytes to terabytes of text. During pretraining, the model learns the structure of the language, grammar rules, facts about the world, and reasoning abilities. It also learns any biases and harmful content present in the training data.

Pretraining is performed using unsupervised learning techniques. Two common approaches for pretraining LLMs are:

  • Causal Language Modeling, also known as autoregressive language modeling, involves training the model to predict the next token in the text sequence given the previous tokens. This approach was used for pretraining ChatGPT and LlaMA 2, and it is the more common approach in modern LLMs.

  • Masked Language Modeling, where a certain percentage of the input tokens are randomly masked, and the model is trained to predict the masked tokens based on the surrounding context. BERT and earlier LLMs were pretrained with masked language modeling.

The following figure depicts the pretraining phase with Causal Language Modeling, where the model learns to predict the next word in a sentence given the previous words.


Figure: Pretraining LLMs. Source: [3].

Pretraining allows extracting knowledge from very large unlabeled datasets in an unsupervised manner, without the need for manual labeling. Or, to be more precise, the “label” in LLM pretraining is the next word in the text, to which we already have access since it is part of the training text. Such a pretraining approach is also called self-supervised training, since the model uses each next word in the text to supervise its own training.
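
The following sketch uses the GPT-2 tokenizer (assuming the Hugging Face transformers library) to illustrate how the "labels" for causal language modeling come directly from the text: the target at each position is simply the next token.

[ ]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
text = "LLMs learn to predict the next word"
token_ids = tokenizer(text)["input_ids"]

# Each training example pairs a context with the token that follows it
for i in range(1, len(token_ids)):
    context = tokenizer.decode(token_ids[:i])
    target = tokenizer.decode([token_ids[i]])
    print(f"{context!r} -> {target!r}")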

Note that pretraining LLMs from scratch is computationally expensive and time-consuming. As we stated before, the pretraining phase can cost millions of dollars (e.g., the estimated cost for training GPT-4 is $100 million). Also, pretraining LLMs requires access to large datasets and technical expertise with strong understanding of deep learning workflows, working with distributed software and hardware, and managing model training with thousands of GPUs simultaneously.

23.2.2 Supervised Finetuning

After the pretraining phase, the model is finetuned on a much smaller dataset, which is carefully generated with human supervision. This dataset consists of samples where AI trainers provide both queries (instructions) and model responses (outputs), as depicted in the following figure. That is, the instruction is the input text given to the model, and the output is the desired response from the model. The model takes the instruction text as input (e.g., “Write a limerick about a pelican”) and uses next-token prediction to generate the output text (e.g., “There once was a pelican so fine …”).

The finetuning process involves updating the model’s weights using supervised learning techniques. The objective of supervised finetuning is to improve the quality of the generated responses by the pretrained LLM.

To compile datasets for supervised finetuning, AI trainers need to write the desired instructions and responses, which is a laborious process. Typical datasets include between 1K and 100K instruction-output pairs. Based on the provided instruction-output pairs, the model is finetuned to generate responses that are similar to those provided by AI trainers.


Figure: Finetuning a pretrained LLM. Source: [3].
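
As a small illustration, the sketch below shows one way an instruction-output pair can be concatenated into a single training string for next-token prediction; the [INST] ... [/INST] markers follow the LlaMA 2 chat convention that also appears in the finetuning example later in this lecture (the exact prompt template varies between models).

[ ]:
def format_example(instruction, response):
    # Concatenate the instruction and the desired response into one training string,
    # using the [INST] ... [/INST] markers of the LlaMA 2 chat format
    return f"<s>[INST] {instruction} [/INST] {response} </s>"

print(format_example("Write a limerick about a pelican",
                     "There once was a pelican so fine ..."))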

23.2.3 Alignment with Reinforcement Learning from Human Feedback (RLHF)

To further improve the performance and align the model responses with human preferences, LLMs are typically refined in one additional phase with Reinforcement Learning from Human Feedback (RLHF). This process is depicted in the figure below and involves the following steps:

  1. Collect human feedback. For this step a new dataset is created by collecting sample prompts from a database or by creating a set of new prompts. For each prompt, multiple responses are generated by the supervised finetuned model. Next, AI trainers are asked to rank by quality all responses generated by the model for the same prompt, from best to worst. Such feedback is used to define the human preferences and expectations about the responses by the model. Although this ranking process is time-consuming, it is usually less labor-intensive than creating the dataset for supervised finetuning, since ranking the responses is faster than writing the responses.

  2. Create a reward model. The collected data with human feedback containing the prompts and the ranking scores of the different responses is used to train a Reward Model (denoted with RM in the figure). The task for the Reward Model is to predict the quality of the different responses to a given prompt and output a ranking score. The ranking scores provided by AI trainers are used to establish the ground-truth for training the Reward Model. Note that the Reward Model is a different model than the LLM that is being finetuned, and it only needs to rank the generated responses by the LLM.

  3. Finetune the LLM with RL. The LLM is finetuned using a Reinforcement Learning (RL) algorithm, and for this step typically the Proximal Policy Optimization (PPO) algorithm is used. For a new prompt, the original LLM generates a response, which the Reward Model evaluates and calculates a reward score \(r_k\). Next, the PPO algorithm uses the reward score \(r_k\) to finetune the LLM so that the total rewards for the generated responses by the LLM are maximized. I.e., the goal is to generate responses by the LLM that maximize the predicted reward scores, and by that, the responses become more aligned with human preferences and are more useful to human users.

  4. Iterative improvement. The RLHF process is performed iteratively, with multiple rounds of collecting additional feedback from human labelers, re-training the Reward Model, and applying Reinforcement Learning. This leads to continuous refinement and improvement of the LLM’s performance.


Figure: Reinforcement Learning from Human Feedback. Source: [4].

In summary, the RLHF approach creates a reward system that is augmented by human feedback and is used to teach LLMs which responses are more aligned with human preferences. Through these iterations, LLMs can be better aligned with our human values and can lead to higher-quality responses, as well as improved performance on specific tasks.

RLHF has been successfully applied to finetune models like ChatGPT and LlaMA 2. Note also that there are several variants of the RLHF approach for finetuning LLMs. For example, LlaMA 2 employs two reward models: one based on the ranks of helpfulness of the responses, and another based on the ranks of safety of the responses. The final reward score is obtained as a combination of the helpfulness and safety scores.
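
To make the reward-model training step more concrete, below is a minimal sketch of the pairwise ranking loss described in [4]: the Reward Model is trained so that the response ranked higher by the human labelers receives a larger reward score than the response ranked lower.

[ ]:
import torch
import torch.nn.functional as F

def reward_ranking_loss(reward_chosen, reward_rejected):
    # Penalize the Reward Model when the human-preferred response does not
    # receive a higher score than the less-preferred response
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy scores produced by a Reward Model for pairs of responses to the same prompts
reward_chosen = torch.tensor([1.2, 0.3])     # responses ranked higher by the labelers
reward_rejected = torch.tensor([0.4, 0.9])   # responses ranked lower
print(reward_ranking_loss(reward_chosen, reward_rejected))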

23.3 Finetuning LLMs

Finetuning LLMs involves updating the weights of an LLM on new data to improve its performance on a specific task and make the model more suitable for a specific use case. It involves additional re-training of the model on a new dataset that is specific to that task. That is, finetuning is a transfer learning technique, where the knowledge gained by a trained model is transferred to improve the performance on a target task.

To adapt LLMs to a custom task, different finetuning techniques have been applied. Full model finetuning updates all the parameters of all the layers of a pretrained model. It typically achieves the best performance, but it is also the most resource-intensive and time-consuming. Parameter-efficient finetuning updates only a small number of the parameters, reducing the required computational resources and costs.

In this section, we will demonstrate how to finetune LlaMA 2 (Large Language Model Meta AI 2), which is an open-source LLM developed by Meta AI. Released in July 2023, LlaMA 2 was among the first LLMs open for both research and commercial use. LlaMA 2 is a successor to the original LlaMA model, also developed by Meta AI. LlaMA 2 has three variants with 7B, 13B, and 70B parameters. It was trained on 2 trillion tokens, and it has a context window of 4,096 tokens, enabling it to process long documents. For instance, for the task of summarizing a pdf document the context can include the entire text of the document, or for a dialog with a chatbot the context can include the previous conversation history with the chatbot.

Furthermore, several specialized versions of LlaMA 2 were recently released, including LlaMA-2-Chat optimized for dialog generation, and Code LlaMA optimized for code generation tasks.

23.3.1 Parameter-Efficient Finetuning (PEFT)

Finetuning LLMs is challenging, since the large number of parameters of modern LLMs requires substantial computational resources for storing the models and for re-training the weights. Thus, it can be prohibitively expensive for most users. For instance, loading the largest version of the LLaMA 2 model with 70 billion parameters into GPU memory requires approximately 280 GB of RAM (70B parameters × 4 bytes per 32-bit weight). Full model finetuning of the LlaMA 2 model with 70 billion parameters requires 780 GB of GPU memory, since the gradients and optimizer states must be stored in addition to the weights. This is equivalent to about 10 A100 GPUs with 80 GB RAM each, or about 48 T4 GPUs with 16 GB RAM each. The free version of Google Colab offers one T4 GPU with 16 GB RAM.

Fortunately, several Parameter-Efficient FineTuning (PEFT) techniques have been introduced recently, which allow updating only a small number of the model weights. Consequently, these techniques enable finetuning LLMs with lower computational resources by reducing memory usage and speeding up the training process. PEFT techniques include prompt tuning, prefix tuning, adding adapter layers in the transformer blocks, and low-rank adaptation (LoRA). Among these techniques, LoRA finetuning has been the most popular, since it allows training LLMs with a single GPU.

Hugging Face has developed a PEFT library that contains implementations of common finetuning techniques, such as weight quantization, LoRA, QLoRA, prefix tuning, and others. We will use the PEFT library to finetune LlaMA 2 on a custom dataset.

23.3.2 Low-Rank Adaptation (LoRA)

Low-Rank Adaptation (LoRA) involves freezing the pretrained model and finetuning a small number of additional weights. After the additional weights are updated, these weights are merged with the weights of the original model.

This is depicted in the following figure, where regular finetuning is shown in the left figure, and it involves updating all weights \(W\) in a pretrained model. As we know, the weight update matrix \(\Delta{W}\) is calculated based on the negative gradient of the loss function. Finetuning with LoRA is shown in the right figure, where the weight update matrix \(\Delta{W}\) is decomposed into two smaller matrices, \(\Delta{W}=W_A*W_B\), with sizes \(W_A \in \mathbb{R}^{m \times r}\) and \(W_B \in \mathbb{R}^{r \times n}\) for a weight matrix \(W \in \mathbb{R}^{m \times n}\). The matrices \(W_A\) and \(W_B\) are called low-rank adapters, since their rank \(r\) is much smaller than the dimensions of the original weight matrix, i.e., they have far fewer columns or rows, respectively. During training, gradients are backpropagated only through the matrices \(W_A\) and \(W_B\), while the pretrained weights \(W\) remain frozen.

For instance, if the full weight matrix \(W\) is of size \(100 \times 100\), it has \(10,000\) elements (model weights). If we decompose the weight update matrix \(\Delta{W}\) using rank \(r=5\), the total number of elements of \(W_A \in \mathbb{R}^{100 \times 5}\) and \(W_B \in \mathbb{R}^{5 \times 100}\) is \(500 + 500 = 1,000\). Hence, with LoRA the number of trainable elements is reduced from \(10,000\) to \(1,000\).


Figure: Regular finetuning versus LoRA finetuning. Source: [5].
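
The sketch below reproduces this numerical example in code: a frozen 100×100 weight matrix is adapted by training only the two low-rank matrices (for simplicity, the scaling factor and dropout used in practical LoRA implementations are omitted).

[ ]:
import torch

torch.manual_seed(0)
d, r = 100, 5                     # layer size and LoRA rank from the example above
W = torch.randn(d, d)             # pretrained weights, kept frozen (10,000 elements)

# Low-rank adapters: only these 2 * 100 * 5 = 1,000 values are trained
W_A = torch.randn(d, r, requires_grad=True)
W_B = torch.zeros(r, d, requires_grad=True)   # zero-initialized, so W_A @ W_B starts at 0

def lora_forward(x):
    # Output of the adapted layer: frozen weights plus the low-rank update
    return x @ W + x @ (W_A @ W_B)

x = torch.randn(1, d)
print(lora_forward(x).shape)      # torch.Size([1, 100])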

23.3.3 Quantized LoRA (QLoRA)

Quantized LoRA (QLoRA) is a modified version of LoRA that uses 4-bit quantized weights. Quantization reduces the precision of the values of the network weights. In TensorFlow and PyTorch, the network weights are stored by default with 32-bit floating-point precision. With quantization techniques, the network weights are stored with lower precision, such as 16-bit, 8-bit, or 4-bit precision.

QLoRA introduces a new 4-bit quantization format called “nf4” (NormalFloat 4), in which the weight values are normalized to the range [-1, 1] and mapped to one of 16 representable levels (4 bits allow \(2^4=16\) values). While standard 4-bit floating-point precision (fp4) spaces its representable values according to an exponent-mantissa scheme, nf4 places its 16 levels at the quantiles of a normal distribution, which matches the approximately normal distribution of pretrained network weights and therefore preserves more information.

In addition, QLoRA combines 4-bit quantization of the pretrained model weights with LoRA, which adds low-rank adapter layers. The benefits of QLoRA with 4-bit quantization of the model weights include a reduced model size and increased inference speed, with a modest decrease in the overall model performance.

For example, with QLoRA a 70B parameter model can be finetuned with 48 GB VRAM, in comparison to the 780 GB VRAM required for finetuning all weights of the original model (using 32-bit floating-point precision). Similarly, QLoRA enables training the smallest version of LlaMA 2 with 7B parameters on a T4 GPU (provided by Google Colab) that has 16 GB VRAM. When only a single GPU is available, using quantization is typically necessary for finetuning LLMs.
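
The sketch below illustrates the basic idea of 4-bit quantization: the weights are rescaled by their maximum absolute value and each is mapped to one of 16 representable levels. For simplicity, the sketch uses evenly spaced levels and stores one index per weight, whereas nf4 places its levels at the quantiles of a normal distribution and packs two 4-bit indices per byte.

[ ]:
import torch

def quantize_4bit(w, levels):
    # Normalize the weights to [-1, 1] by their maximum absolute value (absmax scaling)
    scale = w.abs().max()
    w_norm = w / scale
    # Map each weight to the index of the nearest of the 16 representable levels
    idx = (w_norm.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return idx, scale

def dequantize_4bit(idx, scale, levels):
    # Recover approximate weights from the 4-bit indices and the scale
    return levels[idx] * scale

levels = torch.linspace(-1, 1, 16)   # 16 evenly spaced levels (2^4 = 16)
w = torch.randn(6)
idx, scale = quantize_4bit(w, levels)
print(w)
print(dequantize_4bit(idx, scale, levels))   # close to w, up to 4-bit rounding error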

23.3.4 Finetuning Example: Finetuning LlaMA-2 7B

Import Libraries

We will begin by installing the required libraries and importing modules from these packages. These include accelerate (for optimized training on GPUs), peft (for Parameter-Efficient Fine-Tuning), bitsandbytes (to quantize the LlaMA model to 4-bit precision), transformers (for working with Transformer Networks), and trl (for supervised finetuning, where trl stands for Transformer Reinforcement Learning).

[ ]:
!pip install -q accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7

[ ]:
import os
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
    HfArgumentParser, TrainingArguments, pipeline, logging)
from peft import LoraConfig, PeftModel, get_peft_model
from trl import SFTTrainer

Load the Model

We will download the smallest version of LlaMA-2-Chat model with 7B parameters from Hugging Face. Understandably, the larger LlaMA 2 models with 13B and 70B parameters require larger memory and computational resources for finetuning.

Also, we will use the BitsAndBytes library to apply quantization with 4-bit precision format for loading the model weights. Loading a quantized model reduces the GPU memory requirement and makes it possible to train the model with a single GPU, as a tradeoff for some loss in precision. In the next cell we define the configuration for BitsAndBytes, and afterward we will use the configuration in the from_pretrained function to load the LlaMA 2 model. The parameters in BitsAndBytes configuration are described in the commented code below.

The compute type can be "float16", "bfloat16", or "float32", since the computations are performed in either 16-bit or 32-bit precision. In this case, we specified the torch.float16 compute data type (i.e., 16-bit floating-point numbers) for memory-saving purposes. Note that although the model weights are loaded with 4-bit precision, the weights are dequantized to 16-bit precision for performing the calculations in the forward and backward passes through the network, as 4-bit precision is too low for performing the calculations.

[ ]:
# The model is Llama 2 from the Hugging Face hub
model_name = "NousResearch/Llama-2-7b-chat-hf"
[ ]:
# BitsAndBytes configuration
bnb_config = BitsAndBytesConfig(
    # Load the model using 4-bit precision
    load_in_4bit=True,
    # Quantization type (fp4 or nf4)
    # nf4 is the "NormalFloat 4" format, a 4-bit data type tailored to normally distributed weights
    bnb_4bit_quant_type="nf4",
    # Compute dtype for 4-bit models
    bnb_4bit_compute_dtype= torch.float16,
    # Use double quantization for 4-bit models
    # Double quantization applies further quantization to the quantization constants
    bnb_4bit_use_double_quant=False,
)

We will use AutoModelForCausalLM to load the model with the from_pretrained function, and we will use the above BitsAndBytes configuration to load the model parameters with 4-bit precision.

In the following cell we will load the corresponding tokenizer for LlaMA 2 by using AutoTokenizer and from_pretrained.

[ ]:
# Load Llama 2 model from Hugging Face
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # Apply quantization by using the bnb configuration from the previous cell
    quantization_config=bnb_config,
    # Disable the key/value cache, which is not needed during finetuning
    use_cache=False,
    # Tensor parallelism rank used during Llama-2 pretraining; 1 disables the experimental feature and is fine in most cases
    pretraining_tp=1,
    # Load the entire model on the GPU if available
    device_map="auto"
)
[ ]:
# Load tokenizer from Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Needed for LLaMA tokenizer
tokenizer.pad_token = tokenizer.eos_token
# Fix an overflow issue with fp16 training
tokenizer.padding_side = "right"

Define LoRA Configuration

Next, the model will be packed into the LoRA format, which will introduce additional weights and keep the original weights frozen. The parameters in the LoRA configuration include:

  • r, determines the rank of the update matrices, where a lower rank results in smaller update matrices with fewer trainable parameters, and a greater rank results in larger update matrices with more trainable parameters that can capture more task-specific information.

  • lora_alpha, controls the LoRA scaling factor.

  • lora_dropout, is the dropout rate for LoRA layers.

  • bias, specifies if the bias parameters should be trained.

  • task_type, is causal language modeling (CAUSAL_LM) for the considered task.

[ ]:
# LoRA configuration
peft_config = LoraConfig(
    # LoRA rank dimension
    r=64,
    # Alpha parameter for LoRA scaling
    lora_alpha=16,
    # Dropout rate for LoRA layers
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

In order to understand how LoRA impacts the finetuning of the LlaMA 2 model, let’s compare the total number of parameters in LlaMA 2 with the number of trainable parameters in the LoRA model. As we can note in the cell below, the LoRA model has about 33M trainable parameters, which is about 1% of the total parameters in LlaMA 2 (the helper function doubles all counts to compensate for the 4-bit packing of the base weights, which is why the printout reports 67M). This makes it possible to finetune the model on a single GPU.

[ ]:
def print_number_of_trainable_model_parameters(model, use_4bit=True):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    if use_4bit:
        # Weights loaded in 4-bit precision are packed two per stored element, so numel()
        # undercounts them; doubling compensates (it also doubles the LoRA adapter count)
        all_model_params *= 2
        trainable_model_params *= 2
    print(f"Total model parameters: {all_model_params:,d}. Trainable model parameters: {trainable_model_params:,d}. Percent of trainable parameters: {100 * trainable_model_params/ all_model_params:4.2f} %")
[ ]:
# Create the QLoRA model by adding the LoRA adapters to the quantized model
qlora_model = get_peft_model(model, peft_config)

# print trainable parameters
print_number_of_trainable_model_parameters(qlora_model)
Total model parameters: 7,067,934,720. Trainable model parameters: 67,108,864. Percent of trainable parameters: 0.95 %

Load the Dataset

We will use the Lamini docs dataset, which contains questions and answers about the Lamini framework for training and developing Language Models. The dataset contains 1,260 question/answer pairs. Here are a few samples from the dataset.

Question: Does Lamini support generating code?
Answer: Yes, Lamini supports generating code through its API.

Question: How do I report a bug or issue with the Lamini documentation?
Answer: You can report a bug or issue with the Lamini documentation by submitting an issue on the Lamini GitHub page.

Question: Can Lamini be used in an online learning setting, where the model is updated continuously as new data becomes available?
Answer: It is possible to use Lamini in an online learning setting where the model is updated continuously as new data becomes available. However, this would require some additional implementation and configuration to ensure that the model is updated appropriately and efficiently.

A preprocessed version of the dataset in a format that matches the instruction-output pairs for LlaMA 2 is available on Hugging Face, and we will directly load the preprocessed version of the dataset.

[ ]:
# Lamini dataset
dataset = load_dataset("mwitiderrick/llamini_llama", split="train")
[ ]:
print(f'Number of prompts: {len(dataset)}')
Number of prompts: 1260

Model Training

The next cell defines the training arguments, and the commented notes describe the arguments. Note that we will finetune the model for only 1 epoch (if we finetune for more than 1 epoch it will take longer but it will probably result in improved performance).

[ ]:
# Set training parameters
training_arguments = TrainingArguments(
    # Output directory where the model predictions and checkpoints will be stored
    output_dir="./results",
    # Number of training epochs
    num_train_epochs=1,
    # Batch size per GPU for training
    per_device_train_batch_size=4,
    # Number of update steps to accumulate the gradients for
    gradient_accumulation_steps=1,
    # Optimizer to use
    optim="paged_adamw_32bit",
    # Save checkpoint every number of steps
    save_steps=0,
    # Log updates every number of steps
    logging_steps=25,
    # Initial learning rate (AdamW optimizer)
    learning_rate=2e-4,
    # Weight decay to apply
    weight_decay=0.001,
    # Enable fp16/bf16 training (set bf16 to True with an A100)
    fp16=False,
    bf16=False,
    # Maximum gradient norm (gradient clipping)
    max_grad_norm=0.3,
    # Group sequences with same length into batches (to minimize padding)
    # Saves memory and speeds up training considerably
    group_by_length=True,
    # Learning rate schedule
    lr_scheduler_type="constant"
)

Next, we will use the SFTTrainer class in Hugging Face to create a trainer instance by passing the loaded LlaMA 2 model, training dataset, PEFT configuration, tokenizer, and the training arguments. SFTTrainer stands for Supervised Fine-Tuning Trainer.

[ ]:
# Set supervised finetuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    tokenizer=tokenizer,
    args=training_arguments,
    # Column in the dataset that contains the data
    dataset_text_field="text",
    # Maximum sequence length to use
    max_seq_length=None,
    # Pack multiple short examples in the same input sequence to increase efficiency
    packing=False,
)
/usr/local/lib/python3.10/dist-packages/peft/utils/other.py:102: FutureWarning: prepare_model_for_int8_training is deprecated and will be removed in a future version. Use prepare_model_for_kbit_training instead.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/trl/trainer/sft_trainer.py:159: UserWarning: You didn't pass a `max_seq_length` argument to the SFTTrainer, this will default to 1024
  warnings.warn(

Finally, we can train the model with the train() function in Hugging Face. In the output of the cell we can see the loss for every 25 training steps, because we set logging_steps=25 in the training arguments.

The training took about 15 minutes on a T4 GPU with High-RAM memory on Google Colab Pro.

[ ]:
# Train the model
trainer.train()
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
[315/315 16:50, Epoch 1/1]
Step Training Loss
25 1.931500
50 0.579800
75 0.656700
100 0.454200
125 0.582700
150 0.435500
175 0.624900
200 0.404000
225 0.579700
250 0.438200
275 0.564200
300 0.397300

TrainOutput(global_step=315, training_loss=0.6278291732545883, metrics={'train_runtime': 1020.6457, 'train_samples_per_second': 1.235, 'train_steps_per_second': 0.309, 'total_flos': 5614695674511360.0, 'train_loss': 0.6278291732545883, 'epoch': 1.0})

Generate Text

To generate text with the trained model we will use the Hugging Face pipeline with the task set to "text-generation". We can set the length of the generated text tokens with the max_length argument.

The output displays the instruction prompt between the start <s>[INST] and end [/INST] markers, followed by the text generated by the model.

[ ]:
# Run text generation pipeline with the finetuned model
prompt = "What are Lamini models?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
output = pipe(f"<s>[INST] {prompt} [/INST]")
print(output[0]['generated_text'])
Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py:1270: UserWarning: You have modified the pretrained model configuration to control generation. This is a deprecated strategy to control generation and will be removed soon, in a future version. Please use a generation configuration file (see https://huggingface.co/docs/transformers/main_classes/text_generation )
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/utils/checkpoint.py:61: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn(
<s>[INST] What are Lamini models? [/INST]  Lamini is a language model training platform that allows developers to train and fine-tune their own custom language models using their own data. everybody can train a model with Lamini, regardless of their technical expertise. Lamini models are trained on large datasets of text and can be used for a variety of natural language processing tasks, such as text classification, sentiment analysis, and language translation.

Lamini models are trained using a technique called "fine-tuning," which involves adjusting the weights of a pre-trained language model to fit a specific task or dataset. This allows developers to train a model that is tailored to their specific needs, without having to start from scratch.

Lamini models can be trained on any dataset, including text data from the internet, social media, or internal company data. They can also be trained on a combination of different datasets, allowing developers
[ ]:
# Run text generation pipeline with the finetuned model
prompt = "How to evaluate the quality of the generated text with Lamini models"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=500)
output = pipe(f"<s>[INST] {prompt} [/INST]")
print(output[0]['generated_text'])
<s>[INST] How to evaluate the quality of the generated text with Lamini models [/INST]  Lamini offers a Python library called Lamini Python Library that allows you to train and use LLMs. everybody can use Lamini's python library to train and use LLMs.

Here are some ways to evaluate the quality of the generated text with Lamini models:

1. Perplexity: Measure the perplexity of the generated text by comparing it to a reference text. Lower perplexity indicates better quality.
2. BLEU score: Use the BLEU score to evaluate the quality of the generated text. BLEU is a metric that measures the similarity between the generated text and a reference text.
3. ROUGE score: Use the ROUGE score to evaluate the quality of the generated text. ROUGE is a metric that measures the similarity between the generated text and a reference text.
4. METEOR score: Use the METEOR score to evaluate the quality of the generated text. METEOR is a metric that measures the similarity between the generated text and a reference text.
5. Human evaluation: Have human evaluators rate the quality of the generated text on a scale from 1 to 5.
6. Automatic metrics: Use automatic metrics such as perplexity, entropy, and coherence to evaluate the quality of the generated text.
7. LLM's performance on specific tasks: Evaluate the performance of the LLM on specific tasks such as text generation, question answering, and language translation.
8. LLM's performance on specific domains: Evaluate the performance of the LLM on specific domains such as medical text, legal text, or technical text.
9. LLM's performance on specific types of data: Evaluate the performance of the LLM on specific types of data such as text, images, or audio.
10. LLM's performance on specific evaluation metrics: Evaluate the performance of the LLM on specific evaluation metrics such as accuracy, precision, recall, and F1 score.

It's important to note that the quality of the generated text can vary depending on the specific use case and the data used to train the LLM. Therefore, it's important to evaluate the quality of the generated text in the context
[ ]:
# Run text generation pipeline with the finetuned model
prompt = "Write a poem about Data Science"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=500)
output = pipe(f"<s>[INST] {prompt} [/INST]")
print(output[0]['generated_text'])
<s>[INST] Write a poem about Data Science [/INST]  In the realm of code and algorithms,
 nobody knows like Data Science
With a dash of math and a pinch of magic,
She weaves a tale of insight and wonder.

With a click of her mouse, she uncovers
The secrets hidden deep within the data's cloak.
She sifts through the noise and finds the gems,
And turns them into insights that make us think.

She's a detective of the digital age,
Uncovering patterns and connections that we've never seen.
She's a wizard of the data realm,
Weaving spells of prediction and machine learning.

With a nod of her head, she conjures up
A world of possibilities, a world of wonder.
She's a master of the data craft,
A weaver of tales that make our hearts sing.

So let us hail the Data Queen,
Who brings us closer to the truth we seek.
For in the realm of data science,
She's the one who holds the key.

23.3.5 Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) refers to using external sources of information for improving the quality of generated responses by LLMs. RAG enables LLMs to retrieve facts from external sources (such as Wikipedia, news articles) and provide responses that are more accurate and/or are up-to-date.

In general, the internal knowledge of LLMs is static, as it is fixed by the cutoff date of the training dataset. Therefore, LLMs cannot answer questions about current events: they are frozen at the point in time of their training data. Updating LLMs with knowledge about current events requires continuously retraining the models on new data. Such a process is very expensive, as it requires collecting updated datasets and finetuning the model to update the weights.

RAG avoids expensive retraining of LLMs by retrieving information from up-to-date external databases when generating responses. The RAG approach involves two phases: retrieval and content generation. The retrieval phase performs a relevancy search over external databases for a given user query and retrieves supporting documents and snippets of important information. Afterward, in the content generation phase, these supporting documents are appended to the user query as context and fed to the LLM for generating the final response.

Instead of relying only on the information contained in the dataset used for training an LLM, RAG provides an interface to external knowledge to ensure that the model has access to the most current and reliable facts. For example, in an enterprise setting, external sources of information for RAG can comprise various company-specific files, documents, and databases. Employing RAG can result in more relevant responses and can reduce the problem of hallucination by LLMs. It also allows the users to review the sources used by the LLM and verify the accuracy of the generated responses.
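
The sketch below illustrates the two RAG phases on a toy document collection, using TF-IDF retrieval from scikit-learn for simplicity; production RAG systems typically use dense vector embeddings and a vector database for the retrieval phase. The augmented prompt would then be passed to an LLM, for example through the text-generation pipeline used earlier.

[ ]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A toy external knowledge base; in practice this would be a collection of document
# chunks from company files, Wikipedia articles, news, or other external sources
documents = [
    "Lamini is a platform for training and finetuning custom language models.",
    "LoRA finetunes a small number of low-rank adapter weights on top of a frozen model.",
    "The Eiffel Tower was completed in 1889 for the Paris World's Fair.",
]
query = "How can I finetune a model by training only a small number of weights?"

# Retrieval phase: rank the documents by relevance to the user query
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)
scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
best_doc = documents[scores.argmax()]

# Content generation phase: append the retrieved context to the user query and
# feed the augmented prompt to the LLM
augmented_prompt = f"Context: {best_doc}\n\nQuestion: {query}\n\nAnswer:"
print(augmented_prompt)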

23.3.6 Prompt Engineering

Prompt engineering is a technique for improving the performance of LLMs by providing detailed context and information about a specific task. It involves creating text prompts that provide additional information or guidance to the model, such as the topic of the generated response. With prompt engineering, the model can better understand the kind of expected output and produce more accurate and relevant results.

The following tips for creating effective prompts as part of prompt engineering can improve the performance of LLMs:

  • Use clear and concise prompts: The prompt should be easy to understand and provide enough information for the model to generate relevant output. Avoid using jargon or technical terms.

  • Use specific examples: Providing specific examples can help the model better understand the expected output. For example, if you want the model to generate a story about a particular topic, include a few sentences about the setting, characters, and plot.

  • Vary the prompts: Use prompts with different styles, tones, and formats to obtain more diverse outputs from the model.

  • Test and refine: Test the prompts on the model and refine them by adding more detail or adjusting the tone and style.

  • Use feedback: Use feedback from users or other sources to identify areas where the model needs more guidance and make adjustments accordingly.

The chain-of-thought technique involves providing the LLM with a series of instructions or intermediate reasoning steps to help guide the model and generate a more coherent and relevant response. This technique is useful for obtaining well-reasoned responses from LLMs.

An example of a chain-of-thought prompt is as follows: “You are a virtual tour guide from 1901. You have tourists visiting Eiffel Tower. Describe Eiffel Tower to your audience. Begin with (1) why it was built, (2) how long it took to build, (3) where were the materials sourced to build, (4) number of people it took to build it, and (5) number of people visiting the Eiffel tour annually in the 1900’s, the amount of time it completes a full tour, and why so many people visit it each year. Make your tour funny by including one or two funny jokes at the end of the tour.”

23.4 Limitations and Ethical Considerations of LLMs

Although LLMs have demonstrated impressive performance across a wide range of tasks, there are several limitations and ethical considerations that raise concerns.

Limitations:

  • Computational resources: Training LLMs requires significant computational resources, making it difficult for researchers with limited access to GPUs or specialized hardware to develop and use these models.

  • Data bias: LLMs are trained on vast amounts of data from the internet, which often contain biases present in the data. As a result, the models may unintentionally learn and reproduce biases in their generated responses.

  • Producing hallucinations: LLMs can produce hallucinations, which are responses that are false, inaccurate, unexpected, or contextually inappropriate. One example of hallucination is when ChatGPT is asked to list academic papers by an author and provides papers that don’t exist.

  • Inability to explain: LLMs are inherently black-box models, making it challenging to explain their reasoning or decision-making processes, which is essential in certain applications like healthcare, finance, and legal domains.

Ethical considerations:

  • Privacy concerns: LLMs can memorize information from their training data, and can potentially reveal sensitive information or violate user privacy.

  • Misinformation and manipulation: Text generated by LLMs can be exploited to create disinformation, fake news, or deepfake content that manipulates public opinion and undermines trust.

  • Accessibility and fairness: The computational resources and expertise required to train LLMs may lead to an unequal distribution of benefits, where only a few organizations have the resources to develop and control these powerful models.

  • Environmental impact: The large-scale training of LLMs consumes a significant amount of energy contributing to carbon emissions, which raises concerns about the environmental sustainability of these models.

In conclusion, it is important to encourage transparency, collaboration, and responsible AI practices to ensure that LLMs benefit all members of society without causing harm.

23.5 Foundation Models

Foundation Models are extremely large NN models trained on tremendous amounts of data with substantial computational resources, resulting in high capabilities for transfer learning to a wide range of downstream tasks. In other words, these models are scaled along each of three factors: number of model parameters, size of the training dataset, and amount of computation. They are typically trained using self-supervised learning on unlabeled data. The scale of Foundation Models leads to new emergent capabilities, such as the ability to perform well on tasks that the models were not explicitly trained to do. This allows few-shot learning, which refers to adapting Foundation Models to new downstream tasks using only a few training examples of the new task. Similarly, zero-shot learning extends this concept even further, and refers to a model’s ability to generalize to new tasks for which the model hasn’t seen any examples during training.

LLMs represent early examples of Foundation Models, because LLMs are trained at scale and can be adapted for various NLP tasks, even for tasks they were not trained to perform.

The term Foundation Models is more general than LLMs, and it typically refers to large models trained on multimodal data, where the inputs can include text, images, audio, video, and other data sources.

The importance of Foundation Models is in their potential to replace task-specific ML models that are specialized in solving one task (i.e., optimized to perform well on one dataset) with general models that have the capabilities to solve multiple tasks. I.e., these models can serve as a foundation that is adaptable to a broad range of applications.


Figure: Foundation model. Source: link.

References

  1. Introduction to Large Language Models, by Bernhard Mayrhofer, available at https://github.com/datainsightat/introduction_llm.

  2. Understanding Encoder and Decoder LLMs, by Sebastian Raschka, available at https://magazine.sebastianraschka.com/p/understanding-encoder-and-decoder.

  3. LLM Training: RLHF and Its Alternatives, by Sebastian Raschka, available at https://magazine.sebastianraschka.com/p/llm-training-rlhf-and-its-alternatives.

  4. Training Language Models to Follow Instructions with Human Feedback, by Long Ouyang et al., available at https://arxiv.org/abs/2203.02155.

  5. Parameter-Efficient LLM Finetuning With Low-Rank Adaptation (LoRA), by Sebastian Raschka, available at https://sebastianraschka.com/blog/2023/llm-finetuning-lora.html.

  6. How to Fine-tune Llama 2 With LoRA, by Derrick Mwiti, available at https://www.mldive.com/p/how-to-fine-tune-llama-2-with-lora.

  7. Fine-Tuning Llama 2.0 with Single GPU Magic, by Chee Kean, available at https://ai.plainenglish.io/fine-tuning-llama2-0-with-qloras-single-gpu-magic-1b6a6679d436.

  8. Fine-Tuning LLaMA 2 Models using a single GPU, QLoRA and AI Notebooks, by Mathieu Busquet, available at https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/.

  9. Getting started with Llama, by Meta AI, available at https://ai.meta.com/llama/get-started/.
