Foundations

Supervised Learning: Regression and Classification

Lesson 1 of 4 · Estimated time: 55 min


Supervised learning is the workhorse of applied ML. Whether you’re predicting house prices, detecting spam, or recommending products, you’re almost certainly using supervised learning. This lesson covers the core algorithms and when to use each one.

The fundamental idea: you have labeled examples (features and their corresponding targets), and you want to learn a function that maps features to targets. This function works on new, unseen data.
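The whole workflow fits in a few lines. Here is a minimal sketch with synthetic data (the dataset and seed are illustrative): fit on labeled examples, then check the learned function on data it never saw.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Toy data: the target is a noisy linear function of one feature
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.squeeze() + rng.normal(0, 1, size=100)

# Hold out unseen data to check that the learned mapping generalizes
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_train, y_train)
print(f"Test R²: {model.score(X_test, y_test):.3f}")
```

A high score on the held-out test set is the evidence that the function works on new data, not just the examples it was fit to.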

Linear Regression: The Simplest Starting Point

Linear regression assumes the relationship between features and target is… linear. Despite its simplicity, it’s often surprisingly effective:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create synthetic data with a linear relationship (seed for reproducibility)
np.random.seed(42)
X = np.array([[1], [2], [3], [4], [5]])
y = 2 * X.squeeze() + 3 + np.random.randn(5) * 0.5

# Fit linear regression
model = LinearRegression()
model.fit(X, y)

# Make predictions
y_pred = model.predict(X)

# Evaluate
mse = mean_squared_error(y, y_pred)
r2 = r2_score(y, y_pred)

print(f"Coefficient: {model.coef_[0]:.4f}")
print(f"Intercept: {model.intercept_:.4f}")
print(f"MSE: {mse:.4f}")
print(f"R² Score: {r2:.4f}")

Linear regression gives you interpretable coefficients. If your coefficient for “house_size” is 200, it means each additional square foot adds $200 to the expected price (holding other variables constant).
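To make that interpretation concrete, here is a small sketch on synthetic data (the `house_size` and `num_bedrooms` features and the $200/sq-ft effect are invented for illustration). With enough data, the fitted coefficients recover the per-unit effects we built in:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical housing data: price = 200 * size + 10,000 * bedrooms + noise
rng = np.random.default_rng(42)
size = rng.uniform(800, 3000, 200)       # square feet
bedrooms = rng.integers(1, 6, 200)
price = 200 * size + 10_000 * bedrooms + rng.normal(0, 5_000, 200)

X = np.column_stack([size, bedrooms])
model = LinearRegression().fit(X, price)

# Each coefficient is the expected change in price per unit change in that
# feature, holding the other feature constant
for name, coef in zip(["house_size", "num_bedrooms"], model.coef_):
    print(f"{name}: {coef:.1f}")
```

The `house_size` coefficient lands close to the 200 we planted, which is exactly the "each additional square foot adds $200" reading.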

Logistic Regression: Binary Classification

Despite its name, logistic regression is for classification. It models the probability that an instance belongs to the positive class:

from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create sample classification problem
X, y = make_classification(n_samples=200, n_features=10, n_informative=8,
                           n_redundant=2, random_state=42)

# Fit logistic regression
model = LogisticRegression(random_state=42, max_iter=1000)
model.fit(X, y)

# Predictions (scored on the training data here for brevity; use a held-out set in practice)
y_pred = model.predict(X)

# Prediction probabilities (0-1 probability of class 1)
y_proba = model.predict_proba(X)[:, 1]

# Evaluation
accuracy = accuracy_score(y, y_pred)
print(f"Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y, y_pred))

The sigmoid function converts the linear combination of features into a probability:

$$P(y=1|x) = \frac{1}{1 + e^{-z}}$$

where z is the linear combination of features and weights.
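You can verify this formula directly against scikit-learn: compute z from the fitted weights, apply the sigmoid by hand, and compare with `predict_proba` (synthetic data below is just for the check).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# z = w·x + b, the linear combination of features and weights
z = X @ model.coef_.ravel() + model.intercept_[0]

# The sigmoid squashes z into a probability in (0, 1)
p_manual = 1 / (1 + np.exp(-z))

# Matches scikit-learn's own class-1 probabilities
p_sklearn = model.predict_proba(X)[:, 1]
print(np.allclose(p_manual, p_sklearn))  # True
```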

Decision Trees: Interpretable Non-Linear Models

Decision trees work by recursively splitting the data on features, creating a tree of yes/no questions. They handle non-linear relationships naturally:

from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Fit decision tree
tree = DecisionTreeClassifier(max_depth=3, random_state=42)
tree.fit(X, y)

# Visualize the tree
plt.figure(figsize=(20, 10))
plot_tree(tree, feature_names=[f'Feature {i}' for i in range(X.shape[1])],
          class_names=['Class 0', 'Class 1'], filled=True)
plt.show()

# Feature importance
importances = tree.feature_importances_
for name, importance in zip([f'Feature {i}' for i in range(X.shape[1])], importances):
    if importance > 0:
        print(f"{name}: {importance:.4f}")

Decision trees are interpretable but prone to overfitting. Control overfitting with:

  • max_depth: Limit tree depth
  • min_samples_split: Minimum samples to split a node
  • min_samples_leaf: Minimum samples in a leaf
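The effect of these controls is easy to see in practice. In this sketch (synthetic data, illustrative depth values), an unconstrained tree memorizes the training set while a depth-capped tree trades a little training accuracy for better generalization:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

# Unconstrained tree: grows until every training example is classified perfectly
full = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

# Depth-capped tree: forced to learn coarser, more general rules
capped = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_tr, y_tr)

print(f"Unconstrained: train={full.score(X_tr, y_tr):.3f}, test={full.score(X_te, y_te):.3f}")
print(f"max_depth=3:   train={capped.score(X_tr, y_tr):.3f}, test={capped.score(X_te, y_te):.3f}")
```

A large gap between train and test accuracy is the signature of overfitting; the depth cap shrinks that gap.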

Random Forests: Ensemble Power

Random Forests combine many decision trees, reducing overfitting and improving robustness. Each tree is trained on a bootstrap sample of the data and considers only a random subset of features at each split:

from sklearn.ensemble import RandomForestClassifier

# Random Forest: ensemble of 100 trees (oob_score=True enables OOB evaluation)
rf = RandomForestClassifier(n_estimators=100, max_depth=10, oob_score=True,
                            random_state=42, n_jobs=-1)
rf.fit(X, y)

# Out-of-bag evaluation: each tree is scored on the samples left out of its bootstrap,
# giving a free validation estimate
print(f"OOB Score: {rf.oob_score_:.4f}")

# Feature importance (averaged across trees)
importances = rf.feature_importances_
sorted_idx = np.argsort(importances)[::-1]

print("\nTop features:")
for idx in sorted_idx[:5]:
    print(f"Feature {idx}: {importances[idx]:.4f}")

Random Forests handle:

  • Non-linear relationships
  • Feature interactions automatically
  • Both classification and regression
  • Missing values (to some extent; scikit-learn's forests only gained native NaN support in recent versions)
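You can see the "ensemble power" directly by comparing the forest against its own member trees. In this sketch (synthetic data, illustrative sizes), the averaged vote of the forest beats the typical individual tree on held-out data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Score each individual (bootstrap-trained) tree on the test set
tree_accs = [t.score(X_te, y_te) for t in rf.estimators_]

print(f"Mean single-tree accuracy: {np.mean(tree_accs):.3f}")
print(f"Forest accuracy:           {rf.score(X_te, y_te):.3f}")
```

The individual trees are noisy and partly wrong in different ways; averaging their votes cancels much of that noise, which is why the ensemble wins.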

Gradient Boosting: Sequential Improvement

Gradient Boosting works by building trees sequentially, each correcting errors of the previous one:

from sklearn.ensemble import GradientBoostingClassifier

# Gradient Boosting
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=5, random_state=42)
gb.fit(X, y)

# Feature importance
importances = gb.feature_importances_
print(f"Training accuracy: {gb.score(X, y):.4f}")

Gradient Boosting vs Random Forest:

  • Random Forest: Parallel trees, fast, less prone to overfitting
  • Gradient Boosting: Sequential trees, slower but often slightly better performance, more careful tuning needed
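The sequential nature of boosting is visible with `staged_predict`, which replays the ensemble's predictions after each added tree. This sketch (synthetic data, illustrative checkpoints) shows test accuracy improving as later trees correct earlier errors:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                random_state=42).fit(X_tr, y_tr)

# staged_predict yields predictions after 1 tree, 2 trees, ... n_estimators trees
accs = [accuracy_score(y_te, y_pred) for y_pred in gb.staged_predict(X_te)]

print(f"  1 tree : {accs[0]:.3f}")
print(f" 10 trees: {accs[9]:.3f}")
print(f"100 trees: {accs[-1]:.3f}")
```

Plotting `accs` against the number of trees is also a quick way to pick `n_estimators`: when the curve flattens, extra trees stop paying for themselves.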

Regression Beyond Linear

When your relationship isn’t linear, you need alternatives:

# Data with non-linear relationship
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([1, 4, 9, 16, 25, 36]) + np.random.randn(6) * 2

# Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)
y_pred = tree_reg.predict(X)

# Random Forest Regression
from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
rf_reg.fit(X, y)
y_pred = rf_reg.predict(X)

# Evaluate both
print(f"Tree R²: {tree_reg.score(X, y):.4f}")
print(f"RF R²: {rf_reg.score(X, y):.4f}")

Multi-Class Classification

When you have more than 2 classes:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# Load multi-class dataset
X, y = load_iris(return_X_y=True)
print(f"Classes: {np.unique(y)}")  # [0 1 2]

# Fit classifier
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X, y)

# Predictions
y_pred = model.predict(X)

# Multi-class probabilities
y_proba = model.predict_proba(X)
print(f"Probabilities shape: {y_proba.shape}")  # (150, 3)

# Evaluation
from sklearn.metrics import classification_report
print(classification_report(y, y_pred, target_names=['Setosa', 'Versicolor', 'Virginica']))

For multi-class problems, scikit-learn applies one of these strategies automatically (which one depends on the estimator):

  • One-vs-Rest: Train binary classifier for each class
  • One-vs-One: Train binary classifier for each pair of classes
  • Multinomial: Direct multi-class algorithm (like softmax)
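If you want to control the strategy yourself, scikit-learn exposes explicit wrappers. This sketch compares them on Iris (the base estimator choice is illustrative); note One-vs-One trains one model per class pair, so 3 classes means 3 pairwise models:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)

# Direct multi-class: LogisticRegression handles all 3 classes at once
direct = LogisticRegression(max_iter=1000).fit(X, y)

# Explicit One-vs-Rest: one binary classifier per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Explicit One-vs-One: one binary classifier per pair of classes
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(f"Direct:      {direct.score(X, y):.3f}")
print(f"One-vs-Rest: {ovr.score(X, y):.3f} ({len(ovr.estimators_)} binary models)")
print(f"One-vs-One:  {ovo.score(X, y):.3f} ({len(ovo.estimators_)} pairwise models)")
```

On a small, well-separated dataset like Iris the three strategies perform similarly; the differences matter more as the number of classes grows.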

Algorithm Selection Guide

Choose based on your problem characteristics:

# Small dataset, interpretability important?
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()

# Medium dataset, good balance of speed/accuracy?
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=100)

# Large dataset, computational power available?
from sklearn.ensemble import GradientBoostingClassifier
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05)

# Need probability calibration?
from sklearn.calibration import CalibratedClassifierCV
model = CalibratedClassifierCV(RandomForestClassifier(), method='sigmoid')

Key Takeaway

Start simple (linear regression/logistic regression), then add complexity only if needed. Tree-based ensembles (Random Forests) are your safest bet for most real-world problems—they’re fast, robust, and require minimal hyperparameter tuning.

Practical Exercise

You’re building a loan approval system. You have:

  • 10,000 customers with features: credit_score, income, debt_to_income, employment_years
  • Target: approved (1) or rejected (0), with 70% approval rate

Your task:

  1. Load and split the data (80/20 train/test)
  2. Train 3 models: Logistic Regression, Decision Tree, Random Forest
  3. Compare accuracy, precision, recall, and F1 score
  4. Identify which features matter most
  5. Visualize one decision tree to understand the decision boundaries

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd

# Generate synthetic loan data
X, y = make_classification(n_samples=10000, n_features=4, n_informative=3,
                           n_redundant=1, weights=[0.3, 0.7], random_state=42)

# Your implementation here:
# 1. Train/test split
# 2. Train three models
# 3. Compare performance
# 4. Feature importance
# 5. Visualization

# Expected output:
# - Comparison table of metrics for all 3 models
# - Feature importance ranking
# - Confusion matrices
# - Decision tree visualization

This exercise teaches you the practical workflow: experiment with multiple algorithms, compare rigorously, and choose based on business requirements (speed, interpretability, accuracy).