Advanced
LLM Evaluation and Benchmarking
Evaluating large language models requires comprehensive benchmarks spanning knowledge, reasoning, coding, and alignment. This lesson covers standard benchmarks (MMLU, HumanEval, GSM8K), evaluation frameworks, and best practices for assessing model capabilities.
Core Concepts
Evaluation Dimensions
Capability Evaluation:
- Knowledge: MMLU (57 tasks), GPQA (graduate-level)
- Reasoning: GSM8K (math), BIG-Bench
- Coding: HumanEval, MBPP
- Alignment: TruthfulQA, Harmlessness
Efficiency Evaluation:
- Throughput: tokens/second
- Latency: time per token
- Memory: peak allocation
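These efficiency numbers can be collected with a few lines of timing code. A minimal sketch, assuming `generate_fn` is a hypothetical stand-in for any model's generation call that returns the generated tokens:

```python
import time

def measure_generation(generate_fn, prompt):
    """Time one generation call and derive throughput/latency stats.
    `generate_fn` is any callable returning the list of generated tokens;
    peak memory would come from e.g. torch.cuda.max_memory_allocated()."""
    start = time.perf_counter()
    tokens = generate_fn(prompt)
    elapsed = time.perf_counter() - start
    n = len(tokens)
    return {
        'tokens': n,
        'latency_s': elapsed,  # total wall-clock time for the call
        'tokens_per_s': n / elapsed if elapsed > 0 else float('inf'),
        's_per_token': elapsed / n if n else float('inf'),
    }
```

In practice you would average over many prompts and separate prefill time (time to first token) from decode time, since the two scale differently with prompt length.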
Standard Benchmarks
MMLU: Massive Multitask Language Understanding
- ~14,000 test questions across 57 subjects
- STEM, humanities, social sciences
- Typical scores: strong 7B models ~45-60%, 70B-class models ~70-80% (random chance: 25%)
HumanEval: Code Generation
- 164 programming problems
- Pass@k metric
- Reported pass@1: base GPT-3 near 0%, Codex ~29%, GPT-4 ~67%
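Pass@k is usually computed with the unbiased estimator from the Codex paper: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k drawn samples passes:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k),
    where n = samples generated, c = samples that passed the tests."""
    if n - c < k:
        return 1.0  # fewer than k failures, so any k-subset contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over all 164 problems gives the benchmark score; naively taking the empirical fraction of passing k-subsets would be biased for small n.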
GSM8K: Grade School Math
- 8,500 problems, chain-of-thought reasoning
- Scores range from ~15-35% for early 7B models to 90%+ for frontier models
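GSM8K reference answers end with the marker `#### <number>`, so scoring typically reduces to extracting the final number from both the reference and the model's chain-of-thought output. A common, simple heuristic:

```python
import re

def extract_final_number(text):
    """Return the last number in a completion as a string.
    Works for GSM8K references ('... #### 18') and for model outputs,
    where the final number in the reasoning chain is taken as the answer."""
    matches = re.findall(r'-?\d+(?:\.\d+)?', text.replace(',', ''))
    return matches[-1] if matches else None
```

An example is then counted correct when `extract_final_number(prediction) == extract_final_number(reference)`.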
TruthfulQA: Truthfulness
- 817 questions where models commonly hallucinate
- Measures factuality, not capability
Practical Implementation
Evaluation with LM Eval Harness
import lm_eval
from lm_eval.models.huggingface import HFLM

# Wrap a Hugging Face checkpoint for the harness (v0.4+ API)
model = HFLM(
    pretrained='meta-llama/Llama-2-7b-hf',
    batch_size=32,
)

results = lm_eval.simple_evaluate(
    model=model,
    tasks=['mmlu', 'humaneval', 'gsm8k'],
    num_fewshot=5,
)

print(results['results'])
Custom Evaluation Framework
import numpy as np
import torch

class EvaluationFramework:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.results = {}

    def evaluate_multiple_choice(self, dataset):
        correct = 0
        for item in dataset:
            prompt, choices = item['question'], item['choices']
            prompt_len = self.tokenizer(prompt, return_tensors='pt').input_ids.shape[1]
            scores = []
            for choice in choices:
                ids = self.tokenizer(f"{prompt} {choice}", return_tensors='pt').input_ids
                with torch.no_grad():
                    logits = self.model(ids).logits
                # Score each choice by the summed log-likelihood of its own
                # tokens given the prompt (not by the max logit at the last
                # position, which ignores the choice text entirely)
                log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
                targets = ids[0, 1:]
                token_ll = log_probs[torch.arange(targets.shape[0]), targets]
                scores.append(token_ll[prompt_len - 1:].sum().item())
            if choices[int(np.argmax(scores))] == item['answer']:
                correct += 1
        return correct / len(dataset)

    def evaluate_generation(self, dataset, metric_fn):
        scores = []
        for item in dataset:
            inputs = self.tokenizer(item['prompt'], return_tensors='pt')
            with torch.no_grad():
                outputs = self.model.generate(**inputs, max_new_tokens=200)
            # Decode only the newly generated tokens, not the echoed prompt
            new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
            prediction = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
            scores.append(metric_fn(item['reference'], prediction))
        return float(np.mean(scores))
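The `metric_fn` passed to `evaluate_generation` is user-supplied. For short free-form answers, a common choice is SQuAD-style token-level F1, sketched here:

```python
from collections import Counter

def token_f1(reference, prediction):
    """SQuAD-style token-level F1 between a reference and a prediction."""
    ref, pred = reference.lower().split(), prediction.lower().split()
    overlap = sum((Counter(ref) & Counter(pred)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Exact match is stricter and pairs well with F1; for longer generations, metrics like ROUGE, BLEU, or model-based judges are more appropriate.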
Advanced Techniques
Few-Shot Evaluation
def evaluate_fewshot(model, tokenizer, examples, k=5):
    # Build a prompt from the first k examples
    prompt = "Examples:\n"
    for ex in examples[:k]:
        prompt += f"Q: {ex['q']}\nA: {ex['a']}\n\n"

    # Score exact-match accuracy on the held-out examples
    scores = []
    for ex in examples[k:]:
        full_prompt = prompt + f"Q: {ex['q']}\nA:"
        inputs = tokenizer(full_prompt, return_tensors='pt')
        outputs = model.generate(**inputs, max_new_tokens=32)
        # Decode only the continuation and compare its first line to the
        # gold answer (decoding the full output would include the prompt)
        new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
        pred = tokenizer.decode(new_tokens, skip_special_tokens=True)
        scores.append(1 if pred.strip().split('\n')[0] == ex['a'] else 0)
    return np.mean(scores)
Long-Context Evaluation
def evaluate_long_context(model, tokenizer, dataset, context_length=4096):
    scores = []
    for item in dataset:
        # Truncate the document in token space (string slicing cuts
        # characters, not tokens), leaving room for question and answer
        doc_ids = tokenizer(item['document'], truncation=True,
                            max_length=context_length - 128).input_ids
        document = tokenizer.decode(doc_ids, skip_special_tokens=True)
        prompt = f"Document: {document}\nQuestion: {item['q']}\nAnswer:"
        inputs = tokenizer(prompt, return_tensors='pt')
        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=64)
        new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
        prediction = tokenizer.decode(new_tokens, skip_special_tokens=True)
        scores.append(compute_metric(prediction, item['answer']))  # task-specific metric
    return np.mean(scores)
Production Considerations
Model Monitoring
import logging
import time

class ModelMonitor:
    def __init__(self, eval_fn, eval_dataset):
        self.eval_fn = eval_fn            # callable(dataset) -> score
        self.eval_dataset = eval_dataset
        self.history = []

    def evaluate_periodically(self, interval_hours=24):
        while True:
            score = self.eval_fn(self.eval_dataset)
            self.history.append({'timestamp': time.time(), 'score': score})
            if len(self.history) > 1:
                prev = self.history[-2]['score']
                if score < 0.95 * prev:  # alert on a >5% relative drop
                    logging.warning("Degradation: %.4f -> %.4f", prev, score)
            time.sleep(interval_hours * 3600)
Key Takeaway
Comprehensive evaluation across multiple dimensions—knowledge, reasoning, coding, alignment—is essential for understanding capabilities and limitations. Use benchmark suites, custom evaluations, and continuous monitoring.
Practical Exercise
Task: Build an evaluation framework for a custom domain.
Requirements:
- Curate 500+ domain test examples
- Implement 5+ evaluation metrics
- Benchmark vs public baseline
- Analyze by difficulty/category
- Create dashboard
Evaluation:
- Correlation with human judgment
- Statistical significance testing
- Reproducibility across runs
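For the significance-testing requirement, a percentile bootstrap over per-example scores is a simple, assumption-light starting point (a sketch; `n_resamples` and `alpha` are illustrative defaults):

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean benchmark score:
    resample the per-example scores with replacement and take the
    alpha/2 and 1 - alpha/2 quantiles of the resampled means."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Two models can then be compared by bootstrapping the per-example score differences; if the resulting interval excludes zero, the gap is unlikely to be resampling noise.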