Fine-Tuning with LoRA and QLoRA
Parameter-efficient fine-tuning (PEFT) adapts large pre-trained models by training only a small number of additional parameters. LoRA (Low-Rank Adaptation) freezes the pre-trained weights and adds trainable low-rank matrices, typically cutting the number of trainable parameters by more than 99% while maintaining quality close to full fine-tuning. QLoRA combines LoRA with 4-bit quantization of the base model for even greater memory savings.
Core Concepts
LoRA: Low-Rank Adaptation
Instead of fine-tuning all parameters, LoRA adds decomposed weight updates:
W' = W + ΔW = W + BA
Where:
- W: Pre-trained weights (frozen, not updated)
- B: Low-rank matrix (d × r)
- A: Low-rank matrix (r × k)
- r: Rank (typically 8-64, much smaller than original dimensions)
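The update can be seen concretely in a few lines of numpy (illustrative dimensions, not tied to any real model). Note that because the low-rank path is additive, the forward pass is identical whether the adapter is kept separate or merged into W:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 16, 2            # output dim, input dim, LoRA rank
x = rng.normal(size=(1, k))    # one input row

W = rng.normal(size=(d, k))            # frozen pre-trained weight
B = np.zeros((d, r))                   # zero init: training starts exactly at W
A = rng.normal(size=(r, k)) * 0.01     # small random init

# Forward pass: frozen path plus trainable low-rank path
y = x @ W.T + x @ (B @ A).T

# Equivalent to multiplying by the merged weight W' = W + BA
y_merged = x @ (W + B @ A).T
```

This equivalence is what makes zero-overhead deployment possible: after training, BA can be folded into W once and the adapter discarded.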
Complexity Reduction (for one 4096 × 4096 projection matrix, as in 7B-class models):
Full update: 4096 × 4096 ≈ 16.8 million parameters
LoRA update (r=8): 8 × (4096 + 4096) = 65,536 parameters
Reduction: ~99.6%
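The savings are easy to verify with plain-Python arithmetic, using a 4096 × 4096 projection (the size found in 7B-class models) as the example:

```python
d, k, r = 4096, 4096, 8

full_params = d * k              # updating the whole matrix
lora_params = r * (d + k)        # B is d x r, A is r x k
reduction = 1 - lora_params / full_params

print(full_params)          # 16777216
print(lora_params)          # 65536
print(f"{reduction:.1%}")   # 99.6%
```

The ratio shrinks further as r decreases or the matrix grows, which is why larger models benefit even more.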
QLoRA: Quantized LoRA
Combines LoRA with 4-bit quantization:
- Quantize base model to 4-bit integers
- Add LoRA adapters on top
- Train only adapters (millions of parameters)
- Result: fine-tune a 65B model on a single 48GB GPU (33B-class models fit in 24GB)
Memory Comparison (65B model):
Full fine-tuning: 260GB for FP32 weights alone, before optimizer states
QLoRA: ~33GB for the 4-bit base model, plus megabytes of LoRA adapters
Weights-only reduction: ~8x (4 bytes vs 0.5 bytes per parameter)
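These figures follow directly from bytes per parameter; a back-of-the-envelope check in plain Python (weights only, ignoring activations and optimizer states):

```python
params = 65e9  # 65B parameters

fp32_weights = params * 4 / 1e9      # 4 bytes per parameter
fp16_weights = params * 2 / 1e9      # 2 bytes per parameter
nf4_weights = params * 0.5 / 1e9     # 4 bits per parameter

print(f"FP32: {fp32_weights:.0f} GB")  # FP32: 260 GB
print(f"FP16: {fp16_weights:.0f} GB")  # FP16: 130 GB
print(f"NF4:  {nf4_weights:.1f} GB")   # NF4:  32.5 GB
```

In practice the gap is larger still, because full fine-tuning with Adam also stores gradients and two optimizer moments per weight, while QLoRA stores those only for the tiny adapter.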
Adapter Modules
Alternative approach: insert small trainable bottleneck modules inside each transformer layer. Adapters typically add a few percent of the model's parameters while approaching full fine-tuning quality.
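A bottleneck adapter can be sketched in numpy: down-project, apply a nonlinearity, up-project, and add a residual connection. This is a simplified illustration with assumed sizes, not a real adapter implementation; in practice one such module sits inside every transformer layer:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, bottleneck = 768, 64   # illustrative sizes

W_down = rng.normal(size=(hidden, bottleneck)) * 0.02
W_up = np.zeros((bottleneck, hidden))  # zero init: adapter starts as identity

def adapter(h):
    """Bottleneck adapter: down-project, ReLU, up-project, residual."""
    return h + np.maximum(h @ W_down, 0) @ W_up

h = rng.normal(size=(1, hidden))
out = adapter(h)

# Trainable parameters: the two projections (biases omitted)
adapter_params = hidden * bottleneck * 2
```

The zero-initialized up-projection means the adapter initially passes activations through unchanged, so training starts from the pre-trained model's behavior, the same trick LoRA uses with its zero-initialized B matrix.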
Practical Implementation
LoRA Configuration and Training
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
# Load pre-trained model
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')
# Configure LoRA
lora_config = LoraConfig(
r=8, # Rank of low-rank matrices
lora_alpha=32, # Scaling factor
target_modules=['q_proj', 'v_proj'], # Which linear layers to adapt
lora_dropout=0.05,
bias='none',
task_type=TaskType.CAUSAL_LM
)
# Apply LoRA to model
model = get_peft_model(model, lora_config)
# Print parameter counts
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622
# Standard training loop
training_args = TrainingArguments(
output_dir='./lora_checkpoints',
num_train_epochs=3,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
warmup_steps=100,
weight_decay=0.01,
logging_steps=10,
save_steps=500,
evaluation_strategy='steps',
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
)
trainer.train()
# Save only the adapter (not full model)
model.save_pretrained('./lora_adapter')  # ~16MB adapter instead of the full ~13GB model
QLoRA: Quantized LoRA for Large Models
from peft import prepare_model_for_kbit_training
from transformers import BitsAndBytesConfig
import torch
# Quantization configuration
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True, # Double quantization
bnb_4bit_quant_type='nf4', # NormalFloat4
)
# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
'meta-llama/Llama-2-70b-hf',
quantization_config=quantization_config,
device_map='auto', # Auto distribute across GPUs
)
# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)
# Apply LoRA on top of quantized model
lora_config = LoraConfig(
r=8,
lora_alpha=32,
target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj'],
lora_dropout=0.05,
bias='none',
task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
# Training proceeds as before; gradients flow only through the LoRA adapters
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
Multi-Task Adapter Management
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model and tokenizer once
base_model = AutoModelForCausalLM.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2')

# Attach the first adapter, then register the others by name.
# All adapters share the same frozen base weights.
model = PeftModel.from_pretrained(
    base_model,
    './lora_adapters/summarization',
    adapter_name='summarization',
)
model.load_adapter('./lora_adapters/translation', adapter_name='translation')
model.load_adapter('./lora_adapters/qa', adapter_name='qa')

# Switch the active adapter per task at inference time
TASK_TO_ADAPTER = {
    'summarize': 'summarization',
    'translate': 'translation',
    'qa': 'qa',
}

def route_task(task, input_text):
    model.set_adapter(TASK_TO_ADAPTER[task])
    model.eval()
    inputs = tokenizer(input_text, return_tensors='pt')
    return model.generate(**inputs)
Advanced Techniques
DoRA: Weight-Decomposed Low-Rank Adaptation
DoRA decomposes each weight matrix into a magnitude vector and a direction, applying the low-rank update to the direction only, which often improves quality at the same rank:
dora_config = LoraConfig(
r=8,
lora_alpha=32,
use_dora=True, # Enable DoRA instead of standard LoRA
target_modules=['q_proj', 'v_proj'],
)
model = get_peft_model(model, dora_config)
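The decomposition itself is simple to sketch in numpy. This is an illustration of the idea only (not the peft implementation): the updated weight's direction comes from W + BA, while a learned magnitude vector m rescales each column:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 6, 2
W = rng.normal(size=(d, k))            # frozen pre-trained weight
B = rng.normal(size=(d, r)) * 0.1      # LoRA factors
A = rng.normal(size=(r, k)) * 0.1

# Learnable magnitudes, initialized from W's column norms
m = np.linalg.norm(W, axis=0)

V = W + B @ A                                # LoRA-style directional update
W_dora = m * V / np.linalg.norm(V, axis=0)   # renormalize columns, scale by m
```

Separating magnitude from direction lets training adjust the two independently, which is the paper's explanation for why DoRA tracks full fine-tuning more closely than plain LoRA.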
Prefix Tuning
Insert learnable prefix tokens that condition the model:
from peft import PrefixTuningConfig
prefix_config = PrefixTuningConfig(
num_virtual_tokens=20, # Length of learned prefix
task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, prefix_config)
# Adds roughly 2 * num_layers * num_virtual_tokens * hidden_size trainable
# parameters (a key and a value vector per virtual token, per layer)
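The parameter count is quick to verify in plain Python; the layer and hidden sizes below are those of GPT-2 small, assumed purely for illustration:

```python
num_virtual_tokens = 20
num_layers = 12     # GPT-2 small (assumed for illustration)
hidden_size = 768

# One key vector and one value vector per virtual token, per layer
prefix_params = 2 * num_layers * num_virtual_tokens * hidden_size
print(prefix_params)  # 368640
```

Even so, this is well under 0.5% of GPT-2 small's 124M parameters, in the same regime as LoRA.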
Targeting More Modules
# Apply LoRA to both attention and MLP projections for more capacity
lora_config = LoraConfig(
r=8,
lora_alpha=32,
target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj', 'up_proj', 'down_proj'],
lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
Production Considerations
Merging LoRA Weights for Deployment
def merge_lora_and_save(model, output_path):
"""Merge LoRA weights into base model for inference"""
merged_model = model.merge_and_unload()
merged_model.save_pretrained(output_path)
return merged_model
# Now can use standard transformers inference
merged_model = merge_lora_and_save(peft_model, './deployed_model')
from transformers import pipeline
pipe = pipeline('text-generation', model='./deployed_model')
output = pipe("Once upon a time")
Inference Optimization with Quantization
# Load a quantized base model and attach a previously trained LoRA adapter
from transformers import AutoTokenizer, BitsAndBytesConfig

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained(
    'gpt2',
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map='auto',
)
model = PeftModel.from_pretrained(model, './lora_adapter')  # trained adapter

# Inference
model.eval()
input_ids = tokenizer("Once upon a time", return_tensors='pt').input_ids
with torch.no_grad():
    outputs = model.generate(input_ids, max_length=100)
Benchmark: LoRA vs Full Fine-tuning
import time
import torch
def benchmark_training(model, train_loader, num_epochs=1):
torch.cuda.reset_peak_memory_stats()
start = time.time()
for epoch in range(num_epochs):
for batch in train_loader:
# Training step...
pass
duration = time.time() - start
memory = torch.cuda.max_memory_allocated() / 1e9
return duration, memory
# Representative outcomes for a 7B model (exact figures are hardware-dependent):
# Full fine-tuning: slowest, largest footprint (optimizer states for every weight)
# LoRA: several times faster, a fraction of the memory
# QLoRA: lowest memory (4-bit base weights), though dequantization makes each
# training step somewhat slower than plain LoRA
Key Takeaway
LoRA and QLoRA democratize fine-tuning of large models by reducing memory and compute requirements to consumer-grade hardware while maintaining quality comparable to full fine-tuning. Choose LoRA when the base model fits in GPU memory; choose QLoRA when memory is the binding constraint.
Practical Exercise
Task: Fine-tune Llama-2-7B with LoRA on custom instruction-following dataset.
Requirements:
- Prepare 1000+ instruction-response pairs
- Configure LoRA with rank tuning experiments
- Implement QLoRA variant for comparison
- Train on single GPU with gradient accumulation
- Evaluate instruction following quality
- Merge and deploy final model
Evaluation:
- BLEU score > 0.70 on held-out test
- Training time < 2 hours on single A100
- Peak memory usage < 12GB
- Compare training speed: full vs LoRA vs QLoRA
- Human evaluation of response quality