Model Evaluation and Selection
You can’t improve what you don’t measure. A common mistake is optimizing the wrong metric, leading to models that look good on paper but fail in production. This lesson teaches you to measure model performance correctly and make informed decisions about which models to deploy.
Train/Test Split: Detecting Overfitting
The fundamental concept: you can’t evaluate on data you trained on. That would be like grading your own homework—you’ll always get 100%.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
# Load data
X, y = ... # Your data
# Standard: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Train on training data only
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
# Evaluate on test data only
train_accuracy = model.score(X_train, y_train) # On training data
test_accuracy = model.score(X_test, y_test) # On unseen data
print(f"Train accuracy: {train_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")
# The gap reveals overfitting
if train_accuracy - test_accuracy > 0.1:
print("Model is overfitting!")
The stratify parameter is crucial for imbalanced datasets—it ensures both train and test have the same class distribution as the original:
import numpy as np

# Without stratify: test set might have different distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# With stratify: preserves class distribution
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y
)
print(f"Original class distribution: {np.bincount(y) / len(y)}")
print(f"Train distribution: {np.bincount(y_train) / len(y_train)}")
print(f"Test distribution: {np.bincount(y_test) / len(y_test)}")
Cross-Validation: More Robust Evaluation
With limited data, a single train/test split can be misleading. Cross-validation uses multiple splits:
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
# K-Fold: divide data into k folds, train k times
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f"Fold scores: {scores}")
print(f"Mean: {scores.mean():.4f}")
print(f"Std: {scores.std():.4f}")
# For classification with imbalanced classes: StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
Each fold is trained and evaluated separately. The final score is the average, giving you a robust estimate of true performance:
# Detailed cross-validation report with multiple metrics
from sklearn.model_selection import cross_validate
import numpy as np
cv_results = cross_validate(
model, X, y, cv=5,
scoring=['accuracy', 'precision', 'recall', 'f1'],
return_train_score=True
)
print("Train scores:")
print(f" Accuracy: {cv_results['train_accuracy'].mean():.4f}")
print(f" Precision: {cv_results['train_precision'].mean():.4f}")
print("Test scores:")
print(f" Accuracy: {cv_results['test_accuracy'].mean():.4f}")
print(f" Precision: {cv_results['test_precision'].mean():.4f}")
Classification Metrics: Beyond Accuracy
Accuracy is a trap when you have imbalanced data. If 99% of emails are legitimate, a classifier that says “always legitimate” gets 99% accuracy but is useless.
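To see the trap concretely, here is a small synthetic sketch (the labels are randomly generated, not real email data): a classifier that always predicts the majority class scores near 99% accuracy while catching zero positives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: ~1% positive class (e.g. spam), ~99% negative
rng = np.random.default_rng(42)
y_true = (rng.random(10_000) < 0.01).astype(int)

# A "classifier" that always predicts the majority (negative) class
y_pred = np.zeros_like(y_true)

print(f"Accuracy: {accuracy_score(y_true, y_pred):.4f}")  # near 0.99
print(f"Recall:   {recall_score(y_true, y_pred, zero_division=0):.4f}")  # 0.0
```

High accuracy, zero recall: the model never finds a single positive, which is exactly what the metrics below are designed to expose.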
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
# Make predictions on test set
y_pred = model.predict(X_test)
# Confusion Matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"True Negatives: {tn}")
print(f"False Positives: {fp}")
print(f"False Negatives: {fn}")
print(f"True Positives: {tp}")
# Metrics
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp) # Of predicted positive, how many are truly positive?
recall = tp / (tp + fn) # Of actual positive, how many did we catch?
f1 = 2 * (precision * recall) / (precision + recall)
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
Choosing metrics depends on your problem:
# Spam detection: minimize false positives (users missing legitimate emails)
# Metric: Precision (of emails we mark spam, how many truly are spam?)
# Disease diagnosis: minimize false negatives (missing actual diseases)
# Metric: Recall (of people with disease, how many do we diagnose?)
# Balanced problem: no specific cost to false positives vs false negatives
# Metric: F1 (harmonic mean of precision and recall)
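These trade-offs are not fixed at training time: for any probabilistic classifier, moving the decision threshold trades precision against recall. A sketch on synthetic data (the dataset and threshold values here are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

# Toy imbalanced data; in practice use your own X, y
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

# Raising the threshold raises precision and lowers recall (and vice versa)
for threshold in (0.3, 0.5, 0.7):
    pred = (proba >= threshold).astype(int)
    p = precision_score(y_test, pred, zero_division=0)
    r = recall_score(y_test, pred)
    print(f"threshold={threshold}: precision={p:.3f} recall={r:.3f}")
```

If false negatives are costly (disease diagnosis), lower the threshold; if false positives are costly (spam filtering), raise it.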
ROC Curves and AUC: Threshold-Independent Evaluation
For binary classification, ROC curves show the trade-off between true positive rate and false positive rate across different thresholds:
from sklearn.metrics import roc_curve, auc, roc_auc_score
import matplotlib.pyplot as plt
# Get probability predictions (not binary)
y_proba = model.predict_proba(X_test)[:, 1]
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc_score = auc(fpr, tpr)
# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {auc_score:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.title('ROC Curve')
plt.show()
print(f"AUC Score: {auc_score:.4f}")
AUC (Area Under the Curve) ranges from 0 to 1; rough rules of thumb:
- 0.5 = random guessing
- 0.7-0.8 = acceptable
- 0.8-0.9 = excellent
- 0.9+ = outstanding
AUC is especially useful for imbalanced datasets because it’s threshold-independent.
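That said, with heavy imbalance the ROC curve can still look optimistic, because the false positive rate is diluted by the many negatives. A common complement is average precision (the area under the precision-recall curve), which focuses on the rare positive class. A quick comparison on synthetic imbalanced data (dataset is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic data with ~5% positives
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

proba = (LogisticRegression(max_iter=1000)
         .fit(X_train, y_train)
         .predict_proba(X_test)[:, 1])

# Average precision is typically lower and more sensitive to the rare class
print(f"ROC AUC:           {roc_auc_score(y_test, proba):.4f}")
print(f"Average precision: {average_precision_score(y_test, proba):.4f}")
```

Reporting both gives a fuller picture than either alone.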
Hyperparameter Tuning: Finding Optimal Parameters
Every model has hyperparameters (parameters not learned from data). Tuning them is essential:
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
'n_estimators': [50, 100, 200],
'max_depth': [5, 10, 15, None],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
# GridSearchCV tries all combinations
grid_search = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring='f1',
n_jobs=-1,
verbose=1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
# Use best model
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print(f"Test score with best params: {test_score:.4f}")
GridSearchCV exhaustively tries all parameter combinations. For large parameter spaces, use RandomizedSearchCV:
from sklearn.model_selection import RandomizedSearchCV
# Randomly samples parameter combinations instead of trying all
random_search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
n_iter=20, # Try 20 random combinations
cv=5,
n_jobs=-1,
random_state=42
)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
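RandomizedSearchCV also accepts scipy probability distributions instead of fixed lists, so each iteration samples fresh values rather than picking from a predefined grid. A sketch (the parameter ranges are illustrative, not recommendations):

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic dataset to keep the search fast
X, y = make_classification(n_samples=500, random_state=42)

# Distributions instead of lists: each draw samples a fresh integer
param_distributions = {
    'n_estimators': randint(50, 201),
    'max_depth': randint(3, 16),
    'min_samples_split': randint(2, 11),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5, cv=3, random_state=42, n_jobs=-1,
)
search.fit(X, y)
print(f"Best parameters: {search.best_params_}")
print(f"Best CV score: {search.best_score_:.4f}")
```

Distributions cover the space more evenly than a coarse grid when compute is limited.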
Learning Curves: Diagnosing Problems
Learning curves show how performance changes with dataset size. They reveal whether you need more data:
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np
# Generate learning curve
train_sizes, train_scores, val_scores = learning_curve(
RandomForestClassifier(n_estimators=100, random_state=42),
X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10),
scoring='accuracy', n_jobs=-1
)
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, label='Training score')
plt.plot(train_sizes, val_mean, label='Validation score')
plt.xlabel('Training set size')
plt.ylabel('Score')
plt.legend()
plt.title('Learning Curve')
plt.show()
# Interpretation:
# - Both curves plateau at a low score with a SMALL gap: UNDERFITTING
#   (more data won't help; try a more complex model or better features)
# - LARGE gap between curves, validation still rising with more data: OVERFITTING
#   (collect more data, simplify the model, or add regularization)
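A related diagnostic is the validation curve, which varies a single hyperparameter instead of the training-set size: watching the train/validation gap grow as model capacity increases shows where overfitting sets in. A sketch on synthetic data (the dataset and depth values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import validation_curve

# Small synthetic dataset for a quick demonstration
X, y = make_classification(n_samples=600, random_state=42)

# Vary tree depth and compare train vs validation accuracy at each setting
depths = [1, 3, 5, 10]
train_scores, val_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X, y, param_name='max_depth', param_range=depths,
    cv=3, scoring='accuracy', n_jobs=-1,
)

for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d:2d}: train={tr:.3f} val={va:.3f} gap={tr - va:.3f}")
```

A widening gap at higher depths is the overfitting signature; a low score at all depths suggests underfitting.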
Comparing Models Statistically
When the difference between two models' scores is small, a statistical test helps decide whether it is real or just noise from the splits:
from scipy.stats import ttest_rel
# Get cross-validation scores for two models
cv1 = cross_val_score(model1, X, y, cv=5, scoring='f1')
cv2 = cross_val_score(model2, X, y, cv=5, scoring='f1')
# Paired t-test
t_stat, p_value = ttest_rel(cv1, cv2)
print(f"Model 1 mean: {cv1.mean():.4f}")
print(f"Model 2 mean: {cv2.mean():.4f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
print("Difference is statistically significant")
else:
print("Difference is not statistically significant")
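With only 5 folds the t-test has little power. Repeating the cross-validation with different shuffles yields more paired scores and a steadier comparison. A sketch using RepeatedStratifiedKFold (the models and data here are placeholders, not a recommendation):

```python
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic data standing in for your real X, y
X, y = make_classification(n_samples=500, random_state=42)

# 5 folds x 3 repeats = 15 paired scores instead of 5
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=3, random_state=42)
s1 = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
s2 = cross_val_score(RandomForestClassifier(n_estimators=50, random_state=42),
                     X, y, cv=cv)

t_stat, p_value = ttest_rel(s1, s2)
print(f"Means: {s1.mean():.4f} vs {s2.mean():.4f}, p={p_value:.4f}")
```

Note that folds share training data, so even this test is only approximate; treat the p-value as a guide, not a guarantee.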
Key Takeaway
Never trust a single metric on a single test set. Use cross-validation, choose domain-appropriate metrics, and validate your assumptions about model performance. A “70% accurate” model might be amazing or worthless depending on your problem.
Practical Exercise
You’re evaluating models for detecting fraudulent credit card transactions. The dataset has 100,000 transactions with 1,000 frauds (1% fraud rate).
Your task:
- Split data appropriately for this imbalanced scenario
- Train Logistic Regression, Random Forest, and Gradient Boosting
- Evaluate using: Accuracy, Precision, Recall, F1, AUC
- Create visualizations: Confusion matrices, ROC curves
- Perform hyperparameter tuning with cross-validation
- Determine which model to deploy and why
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
# Generate imbalanced fraud data
X, y = make_classification(n_samples=100000, n_features=20, weights=[0.99, 0.01],
n_informative=15, n_redundant=5, random_state=42)
# Your implementation:
# 1. Train/test split with stratification
# 2. Train three models
# 3. Evaluate with appropriate metrics for fraud detection
# 4. Visualization (ROC curves, confusion matrices)
# 5. Hyperparameter tuning
# 6. Final recommendation with justification
# Expected output:
# - Metrics comparison table
# - ROC curves for all models
# - Tuned hyperparameters
# - Business recommendation
This exercise teaches the real-world discipline of model evaluation in a realistic, heavily imbalanced scenario.