
Fine-Tuning with LoRA and QLoRA

Lesson 2 of 4 · Estimated Time: 55 min

Parameter-efficient fine-tuning (PEFT) adapts large pre-trained models by training only a small set of additional parameters. LoRA (Low-Rank Adaptation) adds trainable low-rank matrices, cutting the number of trainable parameters by more than 99% while maintaining quality close to full fine-tuning. QLoRA combines LoRA with 4-bit quantization of the base model for even greater memory savings.

Core Concepts

LoRA: Low-Rank Adaptation

Instead of fine-tuning all parameters, LoRA adds decomposed weight updates:

W' = W + ΔW = W + BA

Where:

  • W: Pre-trained weight matrix of shape d × k (frozen, not updated)
  • B: Trainable matrix of shape d × r (initialized to zero, so ΔW starts at zero)
  • A: Trainable matrix of shape r × k (randomly initialized)
  • r: Rank (typically 8-64, much smaller than d or k)
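A minimal NumPy sketch of this update, with illustrative dimensions; the α/r scaling mirrors the lora_alpha parameter that appears in the configuration examples below:

```python
import numpy as np

d, k, r = 512, 512, 8           # illustrative dimensions; r << d, k
alpha = 32                      # scaling factor (lora_alpha in PEFT)

rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))         # frozen pre-trained weights
A = rng.standard_normal((r, k)) * 0.01  # trainable, small random init
B = np.zeros((d, r))                    # trainable, zero init => ΔW starts at 0

def lora_forward(x):
    # Equivalent to (W + (alpha / r) * B @ A) @ x, but computed without
    # ever materializing the full d x k update matrix BA
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(k)
# With B = 0, the adapted layer matches the frozen layer exactly
assert np.allclose(lora_forward(x), W @ x)
```

Because B starts at zero, training begins from the unmodified pre-trained model and the adapter only gradually introduces a task-specific update.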

Complexity Reduction:

For a single d × k weight matrix, a full update trains d × k parameters, while LoRA trains only r × (d + k). For one 4096 × 4096 attention projection in Llama-2-7B:

Full update: 4096 × 4096 ≈ 16.8 million parameters
LoRA update (r=8): 8 × (4096 + 4096) = 65,536 parameters
Reduction: ~99.6% per adapted matrix
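These figures are easy to verify directly; the sketch below assumes Llama-2-7B's hidden size of 4096 and 32 transformer layers, with LoRA applied to the q_proj and v_proj projections as in the training example later in this lesson:

```python
d = k = 4096           # hidden size of Llama-2-7B
r = 8                  # LoRA rank
n_layers = 32          # transformer blocks in Llama-2-7B
modules_per_layer = 2  # q_proj and v_proj

full_update = d * k        # params in one full weight-matrix update
lora_update = r * (d + k)  # params in one LoRA update (A plus B)

print(full_update)                    # 16777216
print(lora_update)                    # 65536
print(1 - lora_update / full_update)  # 0.99609375 -> ~99.6% reduction

# Total trainable LoRA parameters across the whole model
total_lora = lora_update * modules_per_layer * n_layers
print(total_lora)                     # 4194304
```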

QLoRA: Quantized LoRA

Combines LoRA with 4-bit quantization:

  • Quantize base model to 4-bit integers
  • Add LoRA adapters on top
  • Train only adapters (millions of parameters)
  • Result: Fine-tune a 65B model on a single 48GB GPU

Memory Comparison:

Full fine-tuning 65B: 260GB for FP32 weights alone (gradients and optimizer states add several times more)
QLoRA 65B: ~48GB total (4-bit base weights plus LoRA adapters and activations)
Efficiency gain: ~5.4x on weight storage alone
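A back-of-the-envelope check, counting weight storage only; activations, gradients, optimizer state, and the LoRA adapters themselves come on top, which is why the practical QLoRA figure lands nearer 48GB than the raw 4-bit weight size:

```python
params = 65e9                 # 65B-parameter model

fp32_bytes = params * 4       # 4 bytes per parameter in full precision
nf4_bytes = params * 0.5      # 4 bits per parameter after NF4 quantization

print(fp32_bytes / 1e9)       # 260.0 (GB)
print(nf4_bytes / 1e9)        # 32.5 (GB)
print(fp32_bytes / nf4_bytes) # 8.0x smaller, weights only
```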

Adapter Modules

Alternative approach: insert small trainable bottleneck modules between layers (a down-projection, a nonlinearity, an up-projection, and a residual connection). Each adapter layer adds roughly 2-4% of the original parameters while approaching full fine-tuning quality.
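A minimal NumPy sketch of one bottleneck adapter; the zero-initialized up-projection and the residual connection follow the standard adapter design, and the dimensions are illustrative:

```python
import numpy as np

d, bottleneck = 768, 64  # illustrative: hidden size and adapter bottleneck

rng = np.random.default_rng(0)
W_down = rng.standard_normal((bottleneck, d)) * 0.01  # trainable down-projection
W_up = np.zeros((d, bottleneck))                      # trainable up-projection, zero init

def adapter(h):
    # Down-project, apply a nonlinearity, up-project, then add the residual,
    # so the adapter starts out as an identity function
    z = np.maximum(W_down @ h, 0.0)  # ReLU
    return h + W_up @ z

h = rng.standard_normal(d)
assert np.allclose(adapter(h), h)  # zero-init up-projection => identity at start

# Parameter overhead relative to one transformer block (~12 * d^2 parameters)
overhead = (2 * d * bottleneck) / (12 * d * d)
print(f"{overhead:.2%}")  # 1.39%
```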

Practical Implementation

LoRA Configuration and Training

from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

# Load pre-trained model
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')

# Configure LoRA
lora_config = LoraConfig(
    r=8,  # Rank of low-rank matrices
    lora_alpha=32,  # Scaling factor
    target_modules=['q_proj', 'v_proj'],  # Which linear layers to adapt
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)

# Apply LoRA to model
model = get_peft_model(model, lora_config)

# Print parameter counts
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622

# Standard training loop
training_args = TrainingArguments(
    output_dir='./lora_checkpoints',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_steps=100,
    weight_decay=0.01,
    logging_steps=10,
    save_steps=500,
    evaluation_strategy='steps',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()

# Save only the adapter (not full model)
model.save_pretrained('./lora_adapter')  # ~17MB adapter instead of the ~13GB full model

QLoRA: Quantized LoRA for Large Models

from peft import prepare_model_for_kbit_training
from transformers import BitsAndBytesConfig
import torch

# Quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,  # Double quantization
    bnb_4bit_quant_type='nf4',  # NormalFloat4
)

# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-2-70b-hf',
    quantization_config=quantization_config,
    device_map='auto',  # Auto distribute across GPUs
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Apply LoRA on top of quantized model
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj'],
    lora_dropout=0.05,
    bias='none',
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)

# Training as normal
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()

Multi-Task Adapter Management

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model and tokenizer once
base_model = AutoModelForCausalLM.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2')

# Wrapping the same base model object in several PeftModels would make the
# wrappers interfere; instead, register each adapter under a name on one model
model = PeftModel.from_pretrained(
    base_model,
    './lora_adapters/summarization',
    adapter_name='summarization',
)
model.load_adapter('./lora_adapters/translation', adapter_name='translation')
model.load_adapter('./lora_adapters/qa', adapter_name='qa')

ADAPTERS = {'summarize': 'summarization', 'translate': 'translation', 'qa': 'qa'}

# Activate the task-specific adapter before generating
def route_task(task, input_text):
    model.set_adapter(ADAPTERS[task])
    model.eval()
    inputs = tokenizer(input_text, return_tensors='pt')
    output_ids = model.generate(**inputs, max_new_tokens=100)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

Advanced Techniques

DoRA: Weight-Decomposed Low-Rank Adaptation

Decomposes each pre-trained weight into a magnitude component and a direction component, applying the low-rank update to the direction; this often improves adaptation quality over standard LoRA at the same rank:

dora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    use_dora=True,  # Enable DoRA instead of standard LoRA
    target_modules=['q_proj', 'v_proj'],
)

model = get_peft_model(model, dora_config)

Prefix Tuning

Insert learnable prefix tokens that condition the model:

from peft import PrefixTuningConfig

prefix_config = PrefixTuningConfig(
    num_virtual_tokens=20,  # Length of learned prefix
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, prefix_config)
# Adds ~2 * num_layers * 20 * hidden_size trainable parameters
# (a learned key and value prefix at every layer)

Applying LoRA to More Modules

# Extend LoRA beyond attention to the MLP projection layers
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj', 'up_proj', 'down_proj'],
    lora_dropout=0.05,
)

model = get_peft_model(model, lora_config)

Production Considerations

Merging LoRA Weights for Deployment

def merge_lora_and_save(model, output_path):
    """Merge LoRA weights into base model for inference"""
    merged_model = model.merge_and_unload()
    merged_model.save_pretrained(output_path)
    return merged_model

# Now can use standard transformers inference
merged_model = merge_lora_and_save(peft_model, './deployed_model')
from transformers import pipeline
pipe = pipeline('text-generation', model='./deployed_model')
output = pipe("Once upon a time")

Inference Optimization with Quantization

# Use 8-bit quantization for memory-efficient inference
model = AutoModelForCausalLM.from_pretrained(
    'gpt2',
    load_in_8bit=True,
    device_map='auto',
)

# Attach a previously trained LoRA adapter; prepare_model_for_kbit_training
# is only needed for training, not for inference
model = PeftModel.from_pretrained(model, './lora_adapter')

# Inference
model.eval()
with torch.no_grad():
    outputs = model.generate(input_ids, max_length=100)

Benchmark: LoRA vs Full Fine-tuning

import time
import torch

def benchmark_training(model, train_loader, num_epochs=1):
    torch.cuda.reset_peak_memory_stats()
    start = time.time()

    for epoch in range(num_epochs):
        for batch in train_loader:
            # Training step...
            pass

    duration = time.time() - start
    memory = torch.cuda.max_memory_allocated() / 1e9

    return duration, memory

# Illustrative results (exact numbers depend on model, hardware, and batch size):
# Full fine-tuning: ~8 hours, 32GB memory
# LoRA: ~2 hours, 8GB memory
# QLoRA: ~3 hours, 6GB memory (slower per step due to dequantization
#        overhead, but the lowest memory footprint)

Key Takeaway

LoRA and QLoRA democratize fine-tuning of large models by bringing memory and compute requirements down to consumer-grade hardware while maintaining quality comparable to full fine-tuning. Choose LoRA when training speed matters and memory is sufficient; choose QLoRA when memory is the binding constraint and somewhat slower steps are acceptable.

Practical Exercise

Task: Fine-tune Llama-2-7B with LoRA on custom instruction-following dataset.

Requirements:

  1. Prepare 1000+ instruction-response pairs
  2. Configure LoRA with rank tuning experiments
  3. Implement QLoRA variant for comparison
  4. Train on single GPU with gradient accumulation
  5. Evaluate instruction following quality
  6. Merge and deploy final model
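Requirement 2 (rank tuning) is usually organized as a sweep; a minimal sketch, where each configuration dict would feed the LoraConfig and training setup shown earlier in this lesson:

```python
from itertools import product

# Sweep LoRA rank and scaling factor; each dict maps directly onto the
# LoraConfig arguments used in the training example above
ranks = [4, 8, 16, 32]
alphas = [16, 32]

configs = [
    {'r': r, 'lora_alpha': a, 'target_modules': ['q_proj', 'v_proj']}
    for r, a in product(ranks, alphas)
]

print(len(configs))  # 8 runs to compare on the held-out set
```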

Evaluation:

  • BLEU score > 0.70 on held-out test
  • Training time < 2 hours on single A100
  • Peak memory usage < 12GB
  • Compare training speed: full vs LoRA vs QLoRA
  • Human evaluation of response quality