
How to Fine-Tune LLaMA 2: Step by Step using SFT & LoRA
September 22, 2024
What is LLaMA 2?
LLaMA 2, released by Meta in 2023, is a powerful open-source language model. It is part of the LLaMA (Large Language Model Meta AI) family, which includes models of different sizes, ranging from 7 billion to 70 billion parameters. Parameters are like the brain cells of the model: the more it has, the more it can learn, and a model with more parameters can understand and respond in more complex ways.
LLaMA 2 was trained on a massive amount of data: 2 trillion tokens (roughly, pieces of words). It can process up to 4,096 tokens of text at once, double the context window of the earlier LLaMA 1, which means it can take in more information at a time and generate better responses.
There are different versions of LLaMA 2 made for specific tasks. For example, LLaMA 2-Chat is fine-tuned to be better at conversations, trained with over 1 million human annotations of dialogue. Another version, Code Llama, is designed to help with coding and supports programming languages like Python, Java, and C++.
Key Concepts in Fine-Tuning Language Models
Fine-tuning is a way to improve a language model, like LLaMA 2, for specific tasks by retraining it on focused data. Below are key methods used in fine-tuning:
- Supervised Fine-Tuning (SFT):
In this method, the model is trained on labeled data with explicit guidance. For example, to make LLaMA 2 good at medical analysis, it would be trained on medical records and literature. Each piece of training data includes both the input and the correct output (like a question and its answer), which teaches the model to make better predictions for the task it's being fine-tuned for.
- Reinforcement Learning from Human Feedback (RLHF):
RLHF improves the model by teaching it through human feedback. Humans interact with the model, rating or correcting its responses, and the model adjusts itself to give answers that better match what people expect. This method is useful for making the model better at conversations or creative tasks.
- Prompt Template:
A prompt template is a set pattern or structure used to guide the model's responses. For example, a weather forecast prompt might start with "The weather for [location] is..." and the model would fill in the details. Templates help ensure that the model generates the right type of response for a specific task.
- Parameter-Efficient Fine-Tuning (PEFT):
PEFT is a way to fine-tune a model without adjusting all of its parameters, which makes the process faster and requires far less computing power. Two common PEFT techniques are:
  - LoRA (Low-Rank Adaptation): LoRA freezes the original weights and trains only a small number of added low-rank matrices, making fine-tuning much more efficient (see the short sketch below).
  - QLoRA (Quantized Low-Rank Adaptation): QLoRA combines LoRA with quantization, lowering the precision of the frozen weights so large models fit on devices with limited memory.
These methods make it easier to fine-tune large models like LLaMA 2 for specific tasks without needing huge amounts of computing resources.
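To make the LoRA idea concrete, here is a minimal, self-contained sketch in plain PyTorch. It illustrates only the math, not how the peft library implements it; the dimensions and rank are arbitrary.
import torch

# Toy illustration of LoRA: instead of updating a full weight matrix W
# (d x k), we freeze W and train two small matrices, A (r x k) and
# B (d x r), where the rank r is much smaller than d and k.
d, k, r = 4096, 4096, 64
alpha = 16                    # scaling factor (lora_alpha in peft)

W = torch.randn(d, k)         # frozen pre-trained weight
A = torch.randn(r, k) * 0.01  # trainable, small random init
B = torch.zeros(d, r)         # trainable, zero init so B @ A = 0 at the start

# Effective weight during fine-tuning: W + (alpha / r) * B @ A
W_adapted = W + (alpha / r) * (B @ A)

# Trainable parameters shrink from d*k (~16.8M) to r*(d+k) (~0.5M).
print(f"full: {d * k:,}  low-rank: {r * (d + k):,}")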
How to Fine-Tune LLaMA 2: Step-by-Step Guide
This guide will show you how to fine-tune LLaMA 2 using an example dataset. We'll focus on two methods: Supervised Fine-Tuning (SFT) and Parameter-Efficient Fine-Tuning (PEFT) with LoRA (Low-Rank Adaptation).
For this tutorial, we'll use the Guanaco dataset from Hugging Face. The full dataset has 534,530 entries covering 175 language tasks, including English grammar, natural language understanding, cross-language awareness, and explicit-content detection. To keep training fast, we'll fine-tune on mlabonne/guanaco-llama2-1k, a 1,000-sample subset that is already formatted with LLaMA 2's chat template.
Here’s the complete script you can use in a Jupyter notebook, provided you have access to a GPU and enough memory. Below, we'll explain how each part of the code works step by step.
# Import necessary libraries
import os
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, pipeline, logging
from peft import LoraConfig
from trl import SFTTrainer
import gc

# Force garbage collection
gc.collect()
# Helper to inspect GPU memory usage at any point
def display_cuda_memory():
    print("\n--------------------------------------------------\n")
    print("torch.cuda.memory_allocated: %fGB" % (torch.cuda.memory_allocated(0) / 1024 / 1024 / 1024))
    print("torch.cuda.memory_reserved: %fGB" % (torch.cuda.memory_reserved(0) / 1024 / 1024 / 1024))
    print("torch.cuda.max_memory_reserved: %fGB" % (torch.cuda.max_memory_reserved(0) / 1024 / 1024 / 1024))
    print("\n--------------------------------------------------\n")
# Install required libraries (uncomment the following line on first run)
# %pip install accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7
# For PyTorch memory management, cap the allocator's split size
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:1024"
# Define model, dataset, and new model name
base_model = "NousResearch/Llama-2-7b-chat-hf"
guanaco_dataset = "mlabonne/guanaco-llama2-1k"
new_model = "llama-2-7b-chat-guanaco"
# Load dataset
dataset = load_dataset(guanaco_dataset, split="train")
# 4-bit Quantization Configuration
compute_dtype = getattr(torch, "float16")
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=compute_dtype, bnb_4bit_use_double_quant=False)
# Load model with 4-bit precision
model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=quant_config, device_map={"": 0})
model.config.use_cache = False
model.config.pretraining_tp = 1
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
# Set PEFT Parameters
peft_params = LoraConfig(lora_alpha=16, lora_dropout=0.1, r=64, bias="none", task_type="CAUSAL_LM")
# Define training parameters
training_params = TrainingArguments(output_dir="./results", num_train_epochs=1, per_device_train_batch_size=4, gradient_accumulation_steps=1, optim="paged_adamw_32bit", save_steps=25, logging_steps=25, learning_rate=2e-4, weight_decay=0.001, fp16=False, bf16=False, max_grad_norm=0.3, max_steps=-1, warmup_ratio=0.03, group_by_length=True, lr_scheduler_type="constant", report_to="tensorboard")
# Initialize the trainer
trainer = SFTTrainer(model=model, train_dataset=dataset, peft_config=peft_params, dataset_text_field="text", max_seq_length=None, tokenizer=tokenizer, args=training_params, packing=False)
# Force-clean the PyTorch cache
gc.collect()
torch.cuda.empty_cache()
# Train the model
trainer.train()
# Save the model and tokenizer
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)
# Evaluate the model (optional, requires Tensorboard installation)
# from tensorboard import notebook
# log_dir = "results/runs"
# notebook.start("--logdir {} --port 4000".format(log_dir))
# Test the model
logging.set_verbosity(logging.CRITICAL)
prompt = "Who is Leonardo Da Vinci?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])
Installing Libraries
The code below installs the necessary libraries. We'll be installing accelerate, peft, bitsandbytes, transformers, and trl.
- The transformers library gives us access to pre-trained models, tokenizers, and training utilities.
- `bitsandbytes` helps with efficient model quantization, reducing memory usage.
- `accelerate` handles device placement and efficient execution under the hood.
- `peft` provides parameter-efficient fine-tuning methods such as LoRA.
- `trl` supplies the SFTTrainer class we use for supervised fine-tuning.
If you're not working in a Jupyter notebook, run this install command in your shell (with pip instead of %pip) before running the script.
%pip install accelerate==0.21.0 peft==0.4.0 bitsandbytes==0.40.2 transformers==4.31.0 trl==0.4.7
Importing Modules
We’ll start by importing the necessary classes and functions.
- torch is the main library for PyTorch, a popular machine learning framework.
- load_dataset is used to load the training data.
- AutoModelForCausalLM and AutoTokenizer from the transformers library are used to load the model and tokenizer.
- Additional imports like BitsAndBytesConfig, TrainingArguments, pipeline, and logging provide tools for configuring and interacting with the model.
- LoraConfig from peft and SFTTrainer from trl configure and drive the parameter-efficient supervised fine-tuning.
import os
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments, pipeline, logging
from peft import LoraConfig
from trl import SFTTrainer
Model Configuration
Next, we’ll set up the base model and the dataset for fine-tuning. We'll define the following variables:
- Base model: NousResearch/Llama-2-7b-chat-hf (the pre-trained model we want to fine-tune; this copy of Meta's Llama 2 7B chat model can be downloaded without requesting access).
- Dataset: mlabonne/guanaco-llama2-1k (the dataset used for training).
- New model name: You can give the fine-tuned model a custom name.
This will set the foundation for the fine-tuning process.
base_model = "NousResearch/Llama-2-7b-chat-hf"
guanaco_dataset = "mlabonne/guanaco-llama2-1k"
new_model = "llama-2-7b-chat-guanaco"
Loading Dataset
Next, we'll retrieve and prepare the dataset for training. Using the load_dataset function, we fetch the specified dataset from Hugging Face. The parameter split="train" ensures that we are working with the training portion of the dataset. This dataset will then be used to fine-tune the model.
dataset = load_dataset(guanaco_dataset, split="train")
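Before moving on, it's worth peeking at what was loaded; the "text" column name matters because it's the field we point SFTTrainer at below.
# Sanity check: the dataset should have a single "text" column with 1,000 rows,
# each already wrapped in LLaMA 2's [INST] ... [/INST] chat format
print(dataset)
print(dataset[0]["text"][:250])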
4-bit Quantization Configuration
Now, we’ll configure the model for efficient training on consumer-grade hardware. To do this, we'll apply 4-bit quantization using `BitsAndBytesConfig`. This technique helps lower the model’s memory usage and computational needs, making it possible to train large models on less powerful hardware without greatly affecting performance.
compute_dtype = getattr(torch, "float16")
quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=compute_dtype, bnb_4bit_use_double_quant=False)
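As a side note, float16 is a safe default, but on Ampere-class or newer GPUs (e.g., A100 or RTX 30-series) bfloat16 is often more numerically stable as the compute dtype. An optional tweak, rebuilding the config so the new dtype takes effect:
# Optional: prefer bfloat16 on GPUs that support it
if torch.cuda.is_bf16_supported():
    compute_dtype = torch.bfloat16
    quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=compute_dtype, bnb_4bit_use_double_quant=False)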
Loading Model
Next, we'll initialize the base model with the 4-bit quantization settings. Using the AutoModelForCausalLM.from_pretrained function, we load the pre-trained causal language model with the quantization config defined above and map it onto GPU 0. We also disable use_cache, since key/value caching only helps at inference time and isn't needed during training, and set pretraining_tp to 1 to use the standard linear-layer computation rather than the tensor-parallel behavior from pretraining.
model = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=quant_config, device_map={"": 0})
model.config.use_cache = False
model.config.pretraining_tp = 1
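If you want to verify that quantization paid off, transformers models expose a get_memory_footprint() method; for the 7B model in 4-bit this should come out to roughly 4 GB, though the exact figure varies.
# Confirm the quantized model's memory footprint
print(f"{model.get_memory_footprint() / 1024**3:.2f} GB")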
Loading Tokenizer
Now, we'll prepare the tokenizer to process text from the training dataset, in line with the model's requirements. The tokenizer converts text into token IDs that the model can understand. Because LLaMA's tokenizer doesn't define a padding token, we reuse the end-of-sequence token for padding, and setting padding_side to "right" avoids an overflow issue that can occur with fp16 (16-bit floating-point) training.
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
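A quick round trip shows what the tokenizer does with raw text:
# Encode a sample string to token IDs, then decode it back
encoded = tokenizer("Who is Leonardo Da Vinci?", return_tensors="pt")
print(encoded["input_ids"])
print(tokenizer.decode(encoded["input_ids"][0]))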
Set PEFT Parameters
Now, we’ll set up the fine-tuning process by adjusting a small subset of the model's parameters using the LoRA (Low-Rank Adaptation) method. We use the `LoraConfig` class to configure the settings for Parameter-Efficient Fine-Tuning (PEFT). Key parameters include:
- lora_alpha, lora_dropout, and r: These define the structure and operation of the LoRA layers; r is the rank of the low-rank matrices, lora_alpha scales their contribution, and lora_dropout regularizes them.
- bias: Specifies whether to adjust bias terms.
- `task_type`: Set to "CAUSAL_LM" because LLaMA 2 is a causal language model.
This method enables fine-tuning with minimal computational resources.
peft_params = LoraConfig(lora_alpha=16, lora_dropout=0.1, r=64, bias="none", task_type="CAUSAL_LM")
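Note that we don't pass target_modules here, so peft falls back to its built-in defaults for LLaMA-style models (the attention query and value projections, q_proj and v_proj). If you want to adapt more of the attention block, you can name the modules explicitly; a sketch, with module names taken from the Llama implementation in transformers:
# Optional: explicitly target all four attention projections instead of
# relying on peft's default (q_proj and v_proj) for Llama models
peft_params = LoraConfig(lora_alpha=16, lora_dropout=0.1, r=64, bias="none", task_type="CAUSAL_LM", target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])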
Training Parameters
Next, we’ll configure the settings that manage the training process using `TrainingArguments`. This class allows us to specify key parameters like:
- per_device_train_batch_size: The number of samples processed together on each GPU in one step.
- learning_rate: The initial learning rate for the optimizer.
- weight_decay: Helps prevent overfitting by applying a penalty to larger weights.
- num_train_epochs: The number of times the model will train over the entire dataset.
These settings control how the model learns during fine-tuning, ensuring efficient and effective training.
training_params = TrainingArguments(output_dir="./results", num_train_epochs=1, per_device_train_batch_size=4, gradient_accumulation_steps=1, optim="paged_adamw_32bit", save_steps=25, logging_steps=25, learning_rate=2e-4, weight_decay=0.001, fp16=False, bf16=False, max_grad_norm=0.3, max_steps=-1, warmup_ratio=0.03, group_by_length=True, lr_scheduler_type="constant", report_to="tensorboard")
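With these values, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps = 4, so one epoch over the 1,000-sample dataset comes to about 250 optimizer steps. You can compute this directly:
# Roughly how many optimizer steps one epoch will take
effective_batch = training_params.per_device_train_batch_size * training_params.gradient_accumulation_steps
print(len(dataset) // effective_batch)  # ~250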
Model Fine-Tuning
Finally, we can start the actual fine-tuning process of the model with the dataset. SFTTrainer is used to train the model using the defined parameters. It takes the model, dataset, PEFT configuration, tokenizer, and training parameters as inputs and packs them into a training setup. This step is where the model learns from the new dataset.
trainer = SFTTrainer(model=model, train_dataset=dataset, peft_config=peft_params, dataset_text_field="text", max_seq_length=None, tokenizer=tokenizer, args=training_params, packing=False)
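Because we passed peft_config, SFTTrainer wraps the model with the LoRA adapters for us. A useful check, assuming the trl version pinned above (where trainer.model is a PeftModel), is to print how few parameters are actually trainable:
# With r=64 LoRA on a 7B model, well under 1% of parameters should be trainable
trainer.model.print_trainable_parameters()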
Training Execution
To execute the training process, we’ll run the train() method of SFTTrainer. It adjusts the model's weights based on the input data and training parameters.
trainer.train()
During training you'll see a progress bar, with the training loss reported every 25 steps (the logging_steps value we set above).
Save and Evaluate
Now that the training is complete, we’ll save the fine-tuned model and evaluate its performance.
We’ll use TensorBoard to visualize training metrics, which will help us assess how well the model has performed during fine-tuning. This visualization can provide insights into training progress and potential areas for improvement.
trainer.model.save_pretrained(new_model)
trainer.tokenizer.save_pretrained(new_model)
from tensorboard import notebook
log_dir = "results/runs"
notebook.start("--logdir {} --port 4000".format(log_dir))
Test the Model
We can now test the capabilities of the fine-tuned model by using a simple prompt to generate text. This is done with the pipeline function, which is a convenient tool for text generation. The output will show how well the model has adapted to the new data and how effectively it can generate relevant responses.
logging.set_verbosity(logging.CRITICAL)
prompt = "Who is Leonardo Da Vinci?"
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=200)
result = pipe(f"<s>[INST] {prompt} [/INST]")
print(result[0]['generated_text'])
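Keep in mind that save_pretrained above stored only the small LoRA adapter, not the full 7B model. To reuse the fine-tuned model later, reload the base model and attach the adapter; a minimal sketch using peft's PeftModel:
# Reload later: attach the saved LoRA adapter to a fresh copy of the base model
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained(base_model, quantization_config=quant_config, device_map={"": 0})
tuned = PeftModel.from_pretrained(base, new_model)
# Optionally merge the adapter into the base weights for standalone use
# (merging requires the base model loaded in full precision, not 4-bit):
# merged = tuned.merge_and_unload()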
In conclusion, fine-tuning LLaMA 2 allows us to customize a powerful language model for specific tasks while making it efficient for consumer-grade hardware. By leveraging methods like LoRA for Parameter-Efficient Fine-Tuning and utilizing tools such as TensorBoard for monitoring, we can effectively enhance the model's performance. The process involves careful preparation of datasets, configuration of training parameters, and evaluation of the model’s output. Overall, this approach enables us to harness the capabilities of advanced AI in a way that is accessible and adaptable to various applications.