Advanced

LLM Evaluation and Benchmarking

Lesson 4 of 4 Estimated Time 55 min

Evaluating large language models requires comprehensive benchmarks spanning knowledge, reasoning, coding, and alignment. This lesson covers standard benchmarks (MMLU, HumanEval, GSM8K), evaluation frameworks, and best practices for assessing model capabilities.

Core Concepts

Evaluation Dimensions

Capability Evaluation:

  • Knowledge: MMLU (57 tasks), GPQA (graduate-level)
  • Reasoning: GSM8K (math), BIG-Bench
  • Coding: HumanEval, MBPP
  • Alignment: TruthfulQA, harmlessness and safety evaluations

Efficiency Evaluation:

  • Throughput: tokens/second
  • Latency: time per token
  • Memory: peak allocation
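The throughput and latency metrics above can be measured with a small timing harness. A minimal, model-agnostic sketch (it assumes only that `generate_fn` takes a prompt and returns a list of token ids; the function name is illustrative):

```python
import time

def measure_generation(generate_fn, prompt, n_runs=3):
    """Measure throughput (tokens/s) and mean per-token latency for a
    generate_fn that returns a list of token ids."""
    latencies, token_counts = [], []
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate_fn(prompt)
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        token_counts.append(len(tokens))
    total_tokens = sum(token_counts)
    total_time = sum(latencies)
    return {
        'tokens_per_second': total_tokens / total_time,
        'seconds_per_token': total_time / total_tokens,
    }
```

Peak memory would be tracked separately (e.g., `torch.cuda.max_memory_allocated()` on GPU).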

Standard Benchmarks

MMLU: Massive Multitask Language Understanding

  • ~16,000 test questions across 57 subjects
  • STEM, humanities, social sciences, and more
  • Typical 5-shot scores: early 7B models ~45%; frontier models ~86%
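MMLU items are conventionally presented as lettered multiple choice with a few in-context examples. A sketch of that prompt format (the function name and the example-tuple layout are illustrative, not from a specific library):

```python
def format_mmlu_prompt(question, choices, fewshot_examples=()):
    """Build an MMLU-style multiple-choice prompt.
    `choices` is a list of four answer strings; each few-shot example is
    a (question, choices, answer_letter) tuple."""
    letters = ['A', 'B', 'C', 'D']
    parts = []
    for q, ch, ans in fewshot_examples:
        body = q + '\n' + '\n'.join(
            f"{l}. {c}" for l, c in zip(letters, ch))
        parts.append(f"{body}\nAnswer: {ans}")
    body = question + '\n' + '\n'.join(
        f"{l}. {c}" for l, c in zip(letters, choices))
    parts.append(f"{body}\nAnswer:")  # model completes with a letter
    return '\n\n'.join(parts)
```

The model's completion after the final "Answer:" (or its per-letter log-probabilities) is then compared against the gold letter.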

HumanEval: Code Generation

  • 164 programming problems
  • Pass@k metric
  • Codex (GPT-3-based): ~29% pass@1; GPT-4: ~67% pass@1
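Pass@k is normally computed with the unbiased estimator from the HumanEval paper: given n samples per problem of which c pass the tests, pass@k = 1 − C(n−c, k)/C(n, k), the probability that at least one of k drawn samples is correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples per problem, c of them
    correct; returns P(at least one of k drawn samples passes)."""
    if n - c < k:
        # Fewer failures than draws: some draw must be correct
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over all problems gives the benchmark score, and it avoids the bias of simply checking the first k samples.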

GSM8K: Grade School Math

  • ~8,500 word problems requiring multi-step arithmetic, typically solved with chain-of-thought prompting
  • Typical scores: early 7B models ~15%; frontier models with chain-of-thought 90%+
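Scoring chain-of-thought outputs requires pulling the final numeric answer out of the reasoning text. A simple extraction sketch (the `#### <number>` pattern follows GSM8K's reference format; falling back to the last number in the text is a common heuristic, not part of the benchmark itself):

```python
import re

def extract_final_answer(generation):
    """Pull the final numeric answer from a chain-of-thought response.
    GSM8K references end in '#### <number>'; model outputs usually
    state the answer last, so we fall back to the last number found."""
    match = re.search(r'####\s*(-?[\d,]+(?:\.\d+)?)', generation)
    if match:
        return match.group(1).replace(',', '')
    numbers = re.findall(r'-?\d[\d,]*(?:\.\d+)?', generation)
    return numbers[-1].replace(',', '') if numbers else None
```

Exact match between the extracted string and the reference answer then yields the accuracy.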

TruthfulQA: Truthfulness

  • 817 questions built around common misconceptions that models tend to reproduce
  • Measures truthfulness rather than raw capability

Practical Implementation

Evaluation with the LM Evaluation Harness

import lm_eval

# lm-eval-harness v0.4+ API; earlier versions used
# evaluator.evaluate with a task_dict instead
results = lm_eval.simple_evaluate(
    model='hf',
    model_args='pretrained=meta-llama/Llama-2-7b-hf,device_map=auto',
    tasks=['mmlu', 'humaneval', 'gsm8k'],
    num_fewshot=5,
    batch_size=32,
)

print(results['results'])

Custom Evaluation Framework

import numpy as np
import torch

class EvaluationFramework:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.results = {}

    def evaluate_multiple_choice(self, dataset):
        correct = 0
        for item in dataset:
            prompt = item['question']
            choices = item['choices']
            answer = item['answer']

            # Score each choice by the log-likelihood the model assigns
            # to "<question> <choice>" (higher is better)
            scores = []
            for choice in choices:
                inputs = self.tokenizer(f"{prompt} {choice}", return_tensors='pt')
                with torch.no_grad():
                    outputs = self.model(**inputs, labels=inputs['input_ids'])
                # outputs.loss is the mean negative log-likelihood per token
                scores.append(-outputs.loss.item())

            prediction = choices[np.argmax(scores)]
            if prediction == answer:
                correct += 1

        return correct / len(dataset)

    def evaluate_generation(self, dataset, metric_fn):
        scores = []
        for item in dataset:
            inputs = self.tokenizer(item['prompt'], return_tensors='pt')
            with torch.no_grad():
                outputs = self.model.generate(**inputs, max_new_tokens=200)
            # Decode only the newly generated tokens, not the prompt
            new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
            prediction = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
            scores.append(metric_fn(item['reference'], prediction))

        return np.mean(scores)
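A concrete `metric_fn` for `evaluate_generation` could be token-overlap F1, the SQuAD-style score for free-form answers. A minimal version:

```python
from collections import Counter

def token_f1(reference, prediction):
    """Token-overlap F1 between a reference and a predicted answer,
    as used in SQuAD-style free-form QA scoring."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    if not ref_tokens or not pred_tokens:
        # Both empty -> perfect match; one empty -> zero
        return float(ref_tokens == pred_tokens)
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match, BLEU, or ROUGE can be dropped in the same way, since the framework only assumes `metric_fn(reference, prediction) -> float`.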

Advanced Techniques

Few-Shot Evaluation

def evaluate_fewshot(model, tokenizer, examples, k=5):
    # Build the prompt from the first k examples
    prompt = "Examples:\n"
    for ex in examples[:k]:
        prompt += f"Q: {ex['q']}\nA: {ex['a']}\n\n"

    # Evaluate on the remaining held-out examples
    scores = []
    for ex in examples[k:]:
        full_prompt = prompt + f"Q: {ex['q']}\nA:"
        inputs = tokenizer(full_prompt, return_tensors='pt')
        outputs = model.generate(**inputs, max_new_tokens=32)
        # Decode only the continuation and keep its first line
        new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
        pred = tokenizer.decode(new_tokens, skip_special_tokens=True)
        pred = pred.strip().split('\n')[0]
        scores.append(1 if pred == ex['a'] else 0)

    return np.mean(scores)

Long-Context Evaluation

def evaluate_long_context(model, tokenizer, dataset, context_length=4096):
    scores = []
    for item in dataset:
        # Rough character cap (~4 chars/token); tokenizer truncation
        # below enforces the actual token limit
        document = item['document'][:context_length * 4]
        prompt = f"Document: {document}\nQuestion: {item['q']}\nAnswer:"

        inputs = tokenizer(prompt, return_tensors='pt',
                           truncation=True, max_length=context_length)
        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=100)

        # Decode only the generated answer, not the prompt
        new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
        prediction = tokenizer.decode(new_tokens, skip_special_tokens=True)
        scores.append(compute_metric(prediction, item['answer']))  # any string metric

    return np.mean(scores)

Production Considerations

Model Monitoring

import logging
import time

class ModelMonitor:
    def __init__(self, eval_fn, eval_dataset):
        self.eval_fn = eval_fn          # e.g. framework.evaluate_generation
        self.eval_dataset = eval_dataset
        self.history = []

    def evaluate_periodically(self, interval_hours=24):
        while True:
            score = self.eval_fn(self.eval_dataset)
            self.history.append({
                'timestamp': time.time(),
                'score': score,
            })

            # Alert on a >5% relative drop from the previous run
            if len(self.history) > 1:
                prev = self.history[-2]['score']
                if score < 0.95 * prev:
                    logging.warning(f"Degradation: {prev:.4f} -> {score:.4f}")

            time.sleep(interval_hours * 3600)

Key Takeaway

Comprehensive evaluation across multiple dimensions—knowledge, reasoning, coding, alignment—is essential for understanding capabilities and limitations. Use benchmark suites, custom evaluations, and continuous monitoring.

Practical Exercise

Task: Build an evaluation framework for a custom domain.

Requirements:

  1. Curate 500+ domain test examples
  2. Implement 5+ evaluation metrics
  3. Benchmark vs public baseline
  4. Analyze by difficulty/category
  5. Create dashboard

Evaluation:

  • Correlation with human judgment
  • Statistical significance testing
  • Reproducibility across runs
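For the significance-testing requirement, a percentile bootstrap over per-example scores is a simple starting point. A sketch (a confidence interval on the mean score, not a full hypothesis test):

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean
    benchmark score over per-example results."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement and record the mean
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Two models can then be compared by checking whether their intervals overlap, or more rigorously by bootstrapping the paired score differences.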