Fine-Tuning with LoRA and QLoRA
Parameter-efficient fine-tuning (PEFT) adapts large pre-trained models by training only a small number of additional parameters. LoRA (Low-Rank Adaptation) freezes the pre-trained weights and adds trainable low-rank matrices, typically cutting the number of trainable parameters by more than 99% while maintaining quality close to full fine-tuning. QLoRA combines LoRA with 4-bit quantization of the base model for even greater memory savings.
Core Concepts
LoRA: Low-Rank Adaptation
Instead of fine-tuning all parameters, LoRA adds decomposed weight updates:
W' = W + ΔW = W + BA
Where:
- W: Pre-trained weights (frozen, not updated)
- B: Low-rank matrix (d × r)
- A: Low-rank matrix (r × k)
- r: Rank (typically 8-64, much smaller than original dimensions)
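The update can be seen concretely in a few lines of numpy (illustrative dimensions, not tied to any real model). Note that because the low-rank path is additive, the forward pass is identical whether the adapter is kept separate or merged into W:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 16, 16, 2            # output dim, input dim, LoRA rank
x = rng.normal(size=(1, k))    # one input row

W = rng.normal(size=(d, k))            # frozen pre-trained weight
B = np.zeros((d, r))                   # zero init: training starts exactly at W
A = rng.normal(size=(r, k)) * 0.01     # small random init

# Forward pass: frozen path plus trainable low-rank path
y = x @ W.T + x @ (B @ A).T

# Equivalent to multiplying by the merged weight W' = W + BA
y_merged = x @ (W + B @ A).T
```

This equivalence is what makes zero-overhead deployment possible: after training, BA can be folded into W once and the adapter discarded.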
Complexity Reduction (for one 4096 × 4096 projection matrix, as in 7B-class models):
Full update: 4096 × 4096 ≈ 16.8 million parameters
LoRA update (r=8): 8 × (4096 + 4096) = 65,536 parameters
Reduction: ~99.6%
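The savings are easy to verify with plain-Python arithmetic, using a 4096 × 4096 projection (the size found in 7B-class models) as the example:

```python
d, k, r = 4096, 4096, 8

full_params = d * k              # updating the whole matrix
lora_params = r * (d + k)        # B is d x r, A is r x k
reduction = 1 - lora_params / full_params

print(full_params)          # 16777216
print(lora_params)          # 65536
print(f"{reduction:.1%}")   # 99.6%
```

The ratio shrinks further as r decreases or the matrix grows, which is why larger models benefit even more.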
QLoRA: Quantized LoRA
Combines LoRA with 4-bit quantization:
- Quantize base model to 4-bit integers
- Add LoRA adapters on top
- Train only adapters (millions of parameters)
- Result: fine-tune a 65B model on a single 48GB GPU (33B-class models fit in 24GB)
Memory Comparison (65B model):
Full fine-tuning: 260GB for FP32 weights alone, before optimizer states
QLoRA: ~33GB for the 4-bit base model, plus megabytes of LoRA adapters
Weights-only reduction: ~8x (4 bytes vs 0.5 bytes per parameter)
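These figures follow directly from bytes per parameter; a back-of-the-envelope check in plain Python (weights only, ignoring activations and optimizer states):

```python
params = 65e9  # 65B parameters

fp32_weights = params * 4 / 1e9      # 4 bytes per parameter
fp16_weights = params * 2 / 1e9      # 2 bytes per parameter
nf4_weights = params * 0.5 / 1e9     # 4 bits per parameter

print(f"FP32: {fp32_weights:.0f} GB")  # FP32: 260 GB
print(f"FP16: {fp16_weights:.0f} GB")  # FP16: 130 GB
print(f"NF4:  {nf4_weights:.1f} GB")   # NF4:  32.5 GB
```

In practice the gap is larger still, because full fine-tuning with Adam also stores gradients and two optimizer moments per weight, while QLoRA stores those only for the tiny adapter.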
Adapter Modules
Alternative approach: insert small trainable bottleneck modules inside each transformer layer. Adapters typically add a few percent of the model's parameters while approaching full fine-tuning quality.
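A bottleneck adapter can be sketched in numpy: down-project, apply a nonlinearity, up-project, and add a residual connection. This is a simplified illustration with assumed sizes, not a real adapter implementation; in practice one such module sits inside every transformer layer:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, bottleneck = 768, 64   # illustrative sizes

W_down = rng.normal(size=(hidden, bottleneck)) * 0.02
W_up = np.zeros((bottleneck, hidden))  # zero init: adapter starts as identity

def adapter(h):
    """Bottleneck adapter: down-project, ReLU, up-project, residual."""
    return h + np.maximum(h @ W_down, 0) @ W_up

h = rng.normal(size=(1, hidden))
out = adapter(h)

# Trainable parameters: the two projections (biases omitted)
adapter_params = hidden * bottleneck * 2
```

The zero-initialized up-projection means the adapter initially passes activations through unchanged, so training starts from the pre-trained model's behavior, the same trick LoRA uses with its zero-initialized B matrix.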
Practical Implementation
LoRA Configuration and Training
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
# Load pre-trained model
model = AutoModelForCausalLM.from_pretrained('meta-llama/Llama-2-7b-hf')
# Configure LoRA
lora_config = LoraConfig(
r=8, # Rank of low-rank matrices
lora_alpha=32, # Scaling factor
target_modules=['q_proj', 'v_proj'], # Which linear layers to adapt
lora_dropout=0.05,
bias='none',
task_type=TaskType.CAUSAL_LM
)
# Apply LoRA to model
model = get_peft_model(model, lora_config)
# Print parameter counts
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.0622
# Standard training loop
training_args = TrainingArguments(
output_dir='./lora_checkpoints',
num_train_epochs=3,
per_device_train_batch_size=4,
per_device_eval_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=2e-4,
warmup_steps=100,
weight_decay=0.01,
logging_steps=10,
save_steps=500,
evaluation_strategy='steps',
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=eval_dataset,
)
trainer.train()
# Save only the adapter (not full model)
model.save_pretrained('./lora_adapter')  # ~16MB adapter instead of the full ~13GB model
QLoRA: Quantized LoRA for Large Models
from peft import prepare_model_for_kbit_training
from transformers import BitsAndBytesConfig
import torch
# Quantization configuration
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True, # Double quantization
bnb_4bit_quant_type='nf4', # NormalFloat4
)
# Load model with quantization
model = AutoModelForCausalLM.from_pretrained(
'meta-llama/Llama-2-70b-hf',
quantization_config=quantization_config,
device_map='auto', # Auto distribute across GPUs
)
# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)
# Apply LoRA on top of quantized model
lora_config = LoraConfig(
r=8,
lora_alpha=32,
target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj'],
lora_dropout=0.05,
bias='none',
task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, lora_config)
# Training proceeds as before; gradients flow only through the LoRA adapters
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
Multi-Task Adapter Management
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load base model and tokenizer once
base_model = AutoModelForCausalLM.from_pretrained('gpt2')
tokenizer = AutoTokenizer.from_pretrained('gpt2')

# Attach the first adapter, then register the others by name.
# All adapters share the same frozen base weights.
model = PeftModel.from_pretrained(
    base_model,
    './lora_adapters/summarization',
    adapter_name='summarization',
)
model.load_adapter('./lora_adapters/translation', adapter_name='translation')
model.load_adapter('./lora_adapters/qa', adapter_name='qa')

# Switch the active adapter per task at inference time
TASK_TO_ADAPTER = {
    'summarize': 'summarization',
    'translate': 'translation',
    'qa': 'qa',
}

def route_task(task, input_text):
    model.set_adapter(TASK_TO_ADAPTER[task])
    model.eval()
    inputs = tokenizer(input_text, return_tensors='pt')
    return model.generate(**inputs)
Advanced Techniques
DoRA: Weight-Decomposed Low-Rank Adaptation
DoRA decomposes each weight matrix into a magnitude vector and a direction, applying the low-rank update to the direction only, which often improves quality at the same rank:
dora_config = LoraConfig(
r=8,
lora_alpha=32,
use_dora=True, # Enable DoRA instead of standard LoRA
target_modules=['q_proj', 'v_proj'],
)
model = get_peft_model(model, dora_config)
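The decomposition itself is simple to sketch in numpy. This is an illustration of the idea only (not the peft implementation): the updated weight's direction comes from W + BA, while a learned magnitude vector m rescales each column:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 6, 2
W = rng.normal(size=(d, k))            # frozen pre-trained weight
B = rng.normal(size=(d, r)) * 0.1      # LoRA factors
A = rng.normal(size=(r, k)) * 0.1

# Learnable magnitudes, initialized from W's column norms
m = np.linalg.norm(W, axis=0)

V = W + B @ A                                # LoRA-style directional update
W_dora = m * V / np.linalg.norm(V, axis=0)   # renormalize columns, scale by m
```

Separating magnitude from direction lets training adjust the two independently, which is the paper's explanation for why DoRA tracks full fine-tuning more closely than plain LoRA.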
Prefix Tuning
Insert learnable prefix tokens that condition the model:
from peft import PrefixTuningConfig
prefix_config = PrefixTuningConfig(
num_virtual_tokens=20, # Length of learned prefix
task_type=TaskType.CAUSAL_LM
)
model = get_peft_model(model, prefix_config)
# Adds roughly 2 * num_layers * num_virtual_tokens * hidden_size trainable
# parameters (a key and a value vector per virtual token, per layer)
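The parameter count is quick to verify in plain Python; the layer and hidden sizes below are those of GPT-2 small, assumed purely for illustration:

```python
num_virtual_tokens = 20
num_layers = 12     # GPT-2 small (assumed for illustration)
hidden_size = 768

# One key vector and one value vector per virtual token, per layer
prefix_params = 2 * num_layers * num_virtual_tokens * hidden_size
print(prefix_params)  # 368640
```

Even so, this is well under 0.5% of GPT-2 small's 124M parameters, in the same regime as LoRA.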
Targeting More Modules
# Apply LoRA to both attention and MLP projections for more capacity
lora_config = LoraConfig(
r=8,
lora_alpha=32,
target_modules=['q_proj', 'v_proj', 'k_proj', 'o_proj', 'up_proj', 'down_proj'],
lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
Production Considerations
Merging LoRA Weights for Deployment
def merge_lora_and_save(model, output_path):
"""Merge LoRA weights into base model for inference"""
merged_model = model.merge_and_unload()
merged_model.save_pretrained(output_path)
return merged_model
# Now can use standard transformers inference
merged_model = merge_lora_and_save(peft_model, './deployed_model')
from transformers import pipeline
pipe = pipeline('text-generation', model='./deployed_model')
output = pipe("Once upon a time")
Inference Optimization with Quantization
# Load a quantized base model and attach a previously trained LoRA adapter
from transformers import AutoTokenizer, BitsAndBytesConfig

tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained(
    'gpt2',
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map='auto',
)
model = PeftModel.from_pretrained(model, './lora_adapter')  # trained adapter

# Inference
model.eval()
input_ids = tokenizer("Once upon a time", return_tensors='pt').input_ids
with torch.no_grad():
    outputs = model.generate(input_ids, max_length=100)
Benchmark: LoRA vs Full Fine-tuning
import time
import torch
def benchmark_training(model, train_loader, num_epochs=1):
torch.cuda.reset_peak_memory_stats()
start = time.time()
for epoch in range(num_epochs):
for batch in train_loader:
# Training step...
pass
duration = time.time() - start
memory = torch.cuda.max_memory_allocated() / 1e9
return duration, memory
# Representative outcomes for a 7B model (exact figures are hardware-dependent):
# Full fine-tuning: slowest, largest footprint (optimizer states for every weight)
# LoRA: several times faster, a fraction of the memory
# QLoRA: lowest memory (4-bit base weights), though dequantization makes each
# training step somewhat slower than plain LoRA
Key Takeaway
LoRA and QLoRA democratize fine-tuning of large models by reducing memory and compute requirements to consumer-grade hardware while maintaining quality comparable to full fine-tuning. Choose LoRA when the base model fits in GPU memory; choose QLoRA when memory is the binding constraint.
Practical Exercise
Task: Fine-tune Llama-2-7B with LoRA on custom instruction-following dataset.
Requirements:
- Prepare 1000+ instruction-response pairs
- Configure LoRA with rank tuning experiments
- Implement QLoRA variant for comparison
- Train on single GPU with gradient accumulation
- Evaluate instruction following quality
- Merge and deploy final model
Evaluation:
- BLEU score > 0.70 on held-out test
- Training time < 2 hours on single A100
- Peak memory usage < 12GB
- Compare training speed: full vs LoRA vs QLoRA
- Human evaluation of response quality