Advanced
Model Compression and Quantization
Quantization, pruning, and distillation reduce model size and inference latency. This lesson covers GPTQ and AWQ quantization, the GGUF format, Flash Attention optimization, and structured pruning techniques.
Core Concepts
Weight-Only Quantization
Quantize weights to 4 bits while keeping activations at higher precision (FP16/BF16):
Original (FP32): 7B parameters × 4 bytes ≈ 28 GB
GPTQ (4-bit): 7B parameters × 0.5 bytes ≈ 3.5 GB
Compression: 8x vs FP32 (4x vs an FP16 baseline)
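The arithmetic behind those numbers as a quick sketch; 7e9 parameters and the listed bit widths are just the example values from above:

def weight_memory_gb(num_params, bits_per_weight):
    # Bytes = params * bits / 8; divide by 1e9 for GB
    return num_params * bits_per_weight / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4 (GPTQ/AWQ)", 4)]:
    print(f"{name}: {weight_memory_gb(7e9, bits):.1f} GB")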
GPTQ: Post-Training Quantization
Quantizes each layer's weights by solving a layer-wise reconstruction problem on a small calibration set:
- Minimize the error between the original and quantized layer outputs
- Update the not-yet-quantized weights to compensate, using approximate second-order (Hessian) information
- One-shot: no retraining needed
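The per-layer objective GPTQ approximately solves, where W is the layer's original weight matrix, X the calibration inputs to that layer, and Ŵ the quantized weights constrained to the low-bit grid:

\hat{W} = \arg\min_{\hat{W}} \lVert W X - \hat{W} X \rVert_F^2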
AWQ: Activation-Aware Weight Quantization
Uses calibration activations to find the small fraction of weight channels that matter most for output quality:
- Protects those salient channels with per-channel scaling before quantization
- Better accuracy than uniform quantization at the same bit width
- Same memory and speed benefits
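A rough sketch of the idea (illustrative only, not the AutoAWQ implementation; awq_style_scales and alpha are names made up for this example):

import torch

def awq_style_scales(activations, alpha=0.5):
    # activations: (num_tokens, in_features) calibration activations feeding one Linear layer.
    # Channels multiplied by large activations are the ones whose quantization error hurts most.
    saliency = activations.abs().mean(dim=0)
    # Scale salient weight channels up before quantizing (and divide activations accordingly)
    scales = saliency.clamp(min=1e-5) ** alpha
    return scales / scales.mean()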
GGUF Format
Optimized format for inference with llama.cpp:
- Single file containing weights, tokenizer, and metadata
- Quantization schemes built in (e.g. Q4_K_M, Q8_0)
- CPU-friendly inference, with optional GPU offload
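Once a model has been converted and quantized with llama.cpp's tools, it can be loaded from Python via the llama-cpp-python bindings; the file name below is an example:

from llama_cpp import Llama

llm = Llama(model_path="./llama2-7b-q4_k_m.gguf", n_ctx=2048)
out = llm("Q: What does 4-bit quantization trade off? A:", max_tokens=64)
print(out["choices"][0]["text"])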
Practical Implementation
GPTQ Quantization
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,            # 4-bit weights
    group_size=128,    # one scale/zero-point per group of 128 weights
    desc_act=False,    # act-order off: faster inference, slightly lower accuracy
)

model = AutoGPTQForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantize_config=quantize_config,
)

# Quantize on a small calibration set: a list of tokenized examples,
# each a dict with "input_ids" and "attention_mask"
model.quantize(calibration_dataset)

# Save the quantized checkpoint
model.save_pretrained("./llama2-7b-gptq")
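For inference, the saved checkpoint can be reloaded with auto-gptq (the device string here is just an example):

quantized_model = AutoGPTQForCausalLM.from_quantized("./llama2-7b-gptq", device="cuda:0")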
AWQ Quantization
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

quant_config = {
    "zero_point": True,     # asymmetric quantization (zero point per group)
    "q_group_size": 128,    # group size for quantization scales
    "w_bit": 4,             # 4-bit weights
    "version": "GEMM",      # GEMM kernels, a good general-purpose choice
}

# AutoAWQ needs the tokenizer for its calibration pass;
# calibration_dataset is a list of raw text strings (optional, a default set is used if omitted)
model.quantize(
    tokenizer,
    quant_config=quant_config,
    calib_data=calibration_dataset,
)

model.save_quantized("./llama2-7b-awq")
tokenizer.save_pretrained("./llama2-7b-awq")
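For inference, AutoAWQ can reload the quantized weights and, where supported, fuse layers for faster kernels:

model = AutoAWQForCausalLM.from_quantized("./llama2-7b-awq", fuse_layers=True)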
Flash Attention Optimization
from flash_attn import flash_attn_func
import torch.nn as nn

# Use Flash Attention in the attention forward pass.
# q, k, v: (batch, seq_len, num_heads, head_dim), fp16/bf16 tensors on GPU
class OptimizedAttention(nn.Module):
    def forward(self, q, k, v, causal=False):
        return flash_attn_func(q, k, v, causal=causal)

# Memory scales linearly (not quadratically) with sequence length;
# attention is typically 2-3x faster on long sequences
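In practice you rarely call the kernel directly; Hugging Face transformers can enable Flash Attention 2 at load time, assuming the flash-attn package and a supported GPU are available:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)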
Advanced Techniques
Structured Pruning
import torch.nn as nn
import torch.nn.utils.prune as prune

class PrunedModel:
    def __init__(self, model, pruning_ratio=0.9):
        self.model = model
        self.pruning_ratio = pruning_ratio

    def prune_structured(self):
        # Structured pruning: remove whole output rows (dim=0) of each Linear
        # layer by L2 norm, which can translate into real speedups
        for module in self.model.modules():
            if isinstance(module, nn.Linear):
                prune.ln_structured(module, "weight", amount=self.pruning_ratio, n=2, dim=0)

    def prune_magnitude(self):
        # Unstructured L1 (magnitude) pruning: zeros individual small weights
        for module in self.model.modules():
            if isinstance(module, nn.Linear):
                prune.l1_unstructured(module, "weight", amount=self.pruning_ratio)

    def make_permanent(self):
        # Fold the pruning masks into the weights and drop the reparametrization
        for module in self.model.modules():
            if isinstance(module, nn.Linear):
                prune.remove(module, "weight")
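A minimal usage sketch; aggressive ratios (such as the 0.9 default above) usually require fine-tuning afterwards to recover accuracy:

pruner = PrunedModel(model, pruning_ratio=0.5)
pruner.prune_structured()   # or pruner.prune_magnitude()
pruner.make_permanent()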
Production Considerations
Benchmarking Quantized Models
def compare_models(original, quantized):
    # evaluate, benchmark_latency, and get_model_size stand in for your own
    # evaluation harness (see the helper sketch below)
    results = {}
    for model, name in [(original, "original"), (quantized, "quantized")]:
        acc = evaluate(model)               # task accuracy or perplexity
        latency = benchmark_latency(model)  # average forward/generation latency
        memory = get_model_size(model)      # parameter memory in bytes
        results[name] = {"accuracy": acc, "latency": latency, "memory": memory}
    return results
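Sketches of two of those helpers, assuming standard PyTorch models; these are illustrative implementations, not library functions, and the dummy-input handling is an assumption:

import time
import torch

def get_model_size(model):
    # Bytes used by parameters and buffers (buffers hold e.g. quantization scales)
    return sum(t.numel() * t.element_size()
               for t in list(model.parameters()) + list(model.buffers()))

def benchmark_latency(model, input_ids=None, n_runs=10):
    # Average latency of a forward pass; uses a dummy 128-token batch if no input is given
    if input_ids is None:
        device = next(model.parameters()).device
        input_ids = torch.randint(0, 1000, (1, 128), device=device)
    model.eval()
    with torch.no_grad():
        start = time.perf_counter()
        for _ in range(n_runs):
            model(input_ids)
    return (time.perf_counter() - start) / n_runs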
Key Takeaway
Quantization and compression techniques can shrink model weights by roughly 8x (4-bit vs FP32) with only a small accuracy loss, enabling deployment on edge devices and improving serving throughput. GPTQ and AWQ are production-proven methods for LLM compression.
Practical Exercise
Task: Compress Llama-2-7B to 4-bit and benchmark end-to-end.
Requirements:
- Apply GPTQ or AWQ quantization
- Save in GGUF format for llama.cpp
- Benchmark latency and accuracy
- Compare multiple quantization methods
- Deploy on resource-constrained device
Evaluation:
- Quantized model accuracy drop < 2%
- 4-6x speedup vs FP32
- 8x memory reduction
- Successful inference on CPU