Advanced

LLM Evaluation and Benchmarking

Lesson 4 of 4 Estimated Time 55 min

Evaluating large language models requires comprehensive benchmarks spanning knowledge, reasoning, coding, and alignment. This lesson covers standard benchmarks (MMLU, HumanEval, GSM8K), evaluation frameworks, and best practices for assessing model capabilities.

Core Concepts

Evaluation Dimensions

Capability Evaluation:

  • Knowledge: MMLU (57 tasks), GPQA (graduate-level)
  • Reasoning: GSM8K (math), BIG-Bench
  • Coding: HumanEval, MBPP
  • Alignment: TruthfulQA, harmlessness and safety evaluations

Efficiency Evaluation:

  • Throughput: tokens/second
  • Latency: time per token
  • Memory: peak allocation
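The throughput and latency metrics above can be measured with a small timing harness. A minimal, model-agnostic sketch (it assumes only that `generate_fn` takes a prompt and returns a list of token ids; the function name is illustrative):

```python
import time

def measure_generation(generate_fn, prompt, n_runs=3):
    """Measure throughput (tokens/s) and mean per-token latency for a
    generate_fn that returns a list of token ids."""
    latencies, token_counts = [], []
    for _ in range(n_runs):
        start = time.perf_counter()
        tokens = generate_fn(prompt)
        elapsed = time.perf_counter() - start
        latencies.append(elapsed)
        token_counts.append(len(tokens))
    total_tokens = sum(token_counts)
    total_time = sum(latencies)
    return {
        'tokens_per_second': total_tokens / total_time,
        'seconds_per_token': total_time / total_tokens,
    }
```

Peak memory would be tracked separately (e.g., `torch.cuda.max_memory_allocated()` on GPU).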

Standard Benchmarks

MMLU: Massive Multitask Language Understanding

  • ~16,000 test questions across 57 subjects
  • STEM, humanities, social sciences, and more
  • Typical 5-shot scores: early 7B models ~45%; frontier models ~86%
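MMLU items are conventionally presented as lettered multiple choice with a few in-context examples. A sketch of that prompt format (the function name and the example-tuple layout are illustrative, not from a specific library):

```python
def format_mmlu_prompt(question, choices, fewshot_examples=()):
    """Build an MMLU-style multiple-choice prompt.
    `choices` is a list of four answer strings; each few-shot example is
    a (question, choices, answer_letter) tuple."""
    letters = ['A', 'B', 'C', 'D']
    parts = []
    for q, ch, ans in fewshot_examples:
        body = q + '\n' + '\n'.join(
            f"{l}. {c}" for l, c in zip(letters, ch))
        parts.append(f"{body}\nAnswer: {ans}")
    body = question + '\n' + '\n'.join(
        f"{l}. {c}" for l, c in zip(letters, choices))
    parts.append(f"{body}\nAnswer:")  # model completes with a letter
    return '\n\n'.join(parts)
```

The model's completion after the final "Answer:" (or its per-letter log-probabilities) is then compared against the gold letter.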

HumanEval: Code Generation

  • 164 programming problems
  • Pass@k metric
  • Codex (GPT-3-based): ~29% pass@1; GPT-4: ~67% pass@1
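Pass@k is normally computed with the unbiased estimator from the HumanEval paper: given n samples per problem of which c pass the tests, pass@k = 1 − C(n−c, k)/C(n, k), the probability that at least one of k drawn samples is correct:

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples per problem, c of them
    correct; returns P(at least one of k drawn samples passes)."""
    if n - c < k:
        # Fewer failures than draws: some draw must be correct
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over all problems gives the benchmark score, and it avoids the bias of simply checking the first k samples.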

GSM8K: Grade School Math

  • ~8,500 word problems requiring multi-step arithmetic, typically solved with chain-of-thought prompting
  • Typical scores: early 7B models ~15%; frontier models with chain-of-thought 90%+
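Scoring chain-of-thought outputs requires pulling the final numeric answer out of the reasoning text. A simple extraction sketch (the `#### <number>` pattern follows GSM8K's reference format; falling back to the last number in the text is a common heuristic, not part of the benchmark itself):

```python
import re

def extract_final_answer(generation):
    """Pull the final numeric answer from a chain-of-thought response.
    GSM8K references end in '#### <number>'; model outputs usually
    state the answer last, so we fall back to the last number found."""
    match = re.search(r'####\s*(-?[\d,]+(?:\.\d+)?)', generation)
    if match:
        return match.group(1).replace(',', '')
    numbers = re.findall(r'-?\d[\d,]*(?:\.\d+)?', generation)
    return numbers[-1].replace(',', '') if numbers else None
```

Exact match between the extracted string and the reference answer then yields the accuracy.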

TruthfulQA: Truthfulness

  • 817 questions built around common misconceptions that models tend to reproduce
  • Measures truthfulness rather than raw capability

Practical Implementation

Evaluation with the LM Evaluation Harness

import lm_eval

# lm-eval-harness v0.4+ API; earlier versions used
# evaluator.evaluate with a task_dict instead
results = lm_eval.simple_evaluate(
    model='hf',
    model_args='pretrained=meta-llama/Llama-2-7b-hf,device_map=auto',
    tasks=['mmlu', 'humaneval', 'gsm8k'],
    num_fewshot=5,
    batch_size=32,
)

print(results['results'])

Custom Evaluation Framework

import numpy as np
import torch

class EvaluationFramework:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.results = {}

    def evaluate_multiple_choice(self, dataset):
        correct = 0
        for item in dataset:
            prompt = item['question']
            choices = item['choices']
            answer = item['answer']

            # Score each choice by the log-likelihood the model assigns
            # to "<question> <choice>" (higher is better)
            scores = []
            for choice in choices:
                inputs = self.tokenizer(f"{prompt} {choice}", return_tensors='pt')
                with torch.no_grad():
                    outputs = self.model(**inputs, labels=inputs['input_ids'])
                # outputs.loss is the mean negative log-likelihood per token
                scores.append(-outputs.loss.item())

            prediction = choices[np.argmax(scores)]
            if prediction == answer:
                correct += 1

        return correct / len(dataset)

    def evaluate_generation(self, dataset, metric_fn):
        scores = []
        for item in dataset:
            inputs = self.tokenizer(item['prompt'], return_tensors='pt')
            with torch.no_grad():
                outputs = self.model.generate(**inputs, max_new_tokens=200)
            # Decode only the newly generated tokens, not the prompt
            new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
            prediction = self.tokenizer.decode(new_tokens, skip_special_tokens=True)
            scores.append(metric_fn(item['reference'], prediction))

        return np.mean(scores)
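A concrete `metric_fn` for `evaluate_generation` could be token-overlap F1, the SQuAD-style score for free-form answers. A minimal version:

```python
from collections import Counter

def token_f1(reference, prediction):
    """Token-overlap F1 between a reference and a predicted answer,
    as used in SQuAD-style free-form QA scoring."""
    ref_tokens = reference.lower().split()
    pred_tokens = prediction.lower().split()
    if not ref_tokens or not pred_tokens:
        # Both empty -> perfect match; one empty -> zero
        return float(ref_tokens == pred_tokens)
    overlap = sum((Counter(ref_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match, BLEU, or ROUGE can be dropped in the same way, since the framework only assumes `metric_fn(reference, prediction) -> float`.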

Advanced Techniques

Few-Shot Evaluation

def evaluate_fewshot(model, tokenizer, examples, k=5):
    # Build the prompt from the first k examples
    prompt = "Examples:\n"
    for ex in examples[:k]:
        prompt += f"Q: {ex['q']}\nA: {ex['a']}\n\n"

    # Evaluate on the remaining held-out examples
    scores = []
    for ex in examples[k:]:
        full_prompt = prompt + f"Q: {ex['q']}\nA:"
        inputs = tokenizer(full_prompt, return_tensors='pt')
        outputs = model.generate(**inputs, max_new_tokens=32)
        # Decode only the continuation and keep its first line
        new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
        pred = tokenizer.decode(new_tokens, skip_special_tokens=True)
        pred = pred.strip().split('\n')[0]
        scores.append(1 if pred == ex['a'] else 0)

    return np.mean(scores)

Long-Context Evaluation

def evaluate_long_context(model, tokenizer, dataset, context_length=4096):
    scores = []
    for item in dataset:
        # Rough character cap (~4 chars/token); tokenizer truncation
        # below enforces the actual token limit
        document = item['document'][:context_length * 4]
        prompt = f"Document: {document}\nQuestion: {item['q']}\nAnswer:"

        inputs = tokenizer(prompt, return_tensors='pt',
                           truncation=True, max_length=context_length)
        with torch.no_grad():
            outputs = model.generate(**inputs, max_new_tokens=100)

        # Decode only the generated answer, not the prompt
        new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
        prediction = tokenizer.decode(new_tokens, skip_special_tokens=True)
        scores.append(compute_metric(prediction, item['answer']))  # any string metric

    return np.mean(scores)

Production Considerations

Model Monitoring

import logging
import time

class ModelMonitor:
    def __init__(self, eval_fn, eval_dataset):
        self.eval_fn = eval_fn          # e.g. framework.evaluate_generation
        self.eval_dataset = eval_dataset
        self.history = []

    def evaluate_periodically(self, interval_hours=24):
        while True:
            score = self.eval_fn(self.eval_dataset)
            self.history.append({
                'timestamp': time.time(),
                'score': score,
            })

            # Alert on a >5% relative drop from the previous run
            if len(self.history) > 1:
                prev = self.history[-2]['score']
                if score < 0.95 * prev:
                    logging.warning(f"Degradation: {prev:.4f} -> {score:.4f}")

            time.sleep(interval_hours * 3600)

Key Takeaway

Comprehensive evaluation across multiple dimensions—knowledge, reasoning, coding, alignment—is essential for understanding capabilities and limitations. Use benchmark suites, custom evaluations, and continuous monitoring.

Practical Exercise

Task: Build an evaluation framework for a custom domain.

Requirements:

  1. Curate 500+ domain test examples
  2. Implement 5+ evaluation metrics
  3. Benchmark vs public baseline
  4. Analyze by difficulty/category
  5. Create dashboard

Evaluation:

  • Correlation with human judgment
  • Statistical significance testing
  • Reproducibility across runs
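For the significance-testing requirement, a percentile bootstrap over per-example scores is a simple starting point. A sketch (a confidence interval on the mean score, not a full hypothesis test):

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean
    benchmark score over per-example results."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement and record the mean
        sample = [rng.choice(scores) for _ in scores]
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Two models can then be compared by checking whether their intervals overlap, or more rigorously by bootstrapping the paired score differences.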