Foundations

Unsupervised Learning and Clustering

Lesson 3 of 4 · Estimated time: 50 min

Not all problems come with labels. When you’re exploring data without a target variable, you need unsupervised learning. Clustering helps you discover natural groupings in your data, while dimensionality reduction reveals the underlying structure.

K-Means Clustering: Finding Centroids

K-Means is one of the simplest and most widely used clustering algorithms. It partitions data into k clusters by iteratively minimizing the sum of squared distances from each point to its cluster center:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt

# Generate synthetic data with 3 natural clusters
X, y_true = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Fit K-Means
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

# Visualize
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', alpha=0.6)
# Plot cluster centers
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', s=200, marker='X', edgecolor='black', linewidth=2, label='Centroids')
plt.colorbar(scatter)
plt.title('K-Means Clustering')
plt.legend()
plt.show()

print(f"Inertia (sum of squared distances): {kmeans.inertia_:.2f}")
print(f"Cluster sizes: {np.bincount(labels)}")
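Under the hood, K-Means alternates two steps (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its assigned points, repeating until the centroids stop moving. A minimal NumPy sketch (the function name `lloyd_kmeans` is my own, not part of any library):

```python
import numpy as np

def lloyd_kmeans(X, k, n_iters=100, seed=0):
    """Plain NumPy K-Means: assignment step + update step until convergence."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # init from data points
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated synthetic clusters
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centroids, labels = lloyd_kmeans(X_demo, k=2)
print(centroids.shape)  # (2, 2)
```

scikit-learn's KMeans adds smarter initialization (k-means++) and multiple restarts (`n_init`) on top of this same loop.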

Critical question: How many clusters should you use? Use the elbow method:

from sklearn.metrics import silhouette_score

# Try different k values
inertias = []
silhouette_scores = []
K_range = range(1, 10)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
    if k > 1:
        silhouette_scores.append(silhouette_score(X, kmeans.labels_))

# Plot elbow curve
fig, axes = plt.subplots(1, 2, figsize=(14, 4))

axes[0].plot(K_range, inertias, 'bo-')
axes[0].set_xlabel('Number of Clusters (k)')
axes[0].set_ylabel('Inertia')
axes[0].set_title('Elbow Method')
axes[0].grid()

axes[1].plot(range(2, 10), silhouette_scores, 'go-')
axes[1].set_xlabel('Number of Clusters (k)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Score vs k')
axes[1].grid()

plt.tight_layout()
plt.show()

# The "elbow"—where inertia stops dropping sharply—suggests a good k

The silhouette score measures how well each point fits its assigned cluster relative to the next-nearest cluster (range: -1 to 1, higher is better). It is undefined for k = 1, which is why the loop above skips it.
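A single aggregate silhouette number can hide poorly placed points. scikit-learn's `silhouette_samples` returns a per-point score, which helps spot points sitting near a cluster boundary (a small sketch on the same kind of blob data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X)

per_point = silhouette_samples(X, labels)  # one score per sample, range -1 to 1
print(f"Mean silhouette: {per_point.mean():.3f}")        # equals silhouette_score
print(f"Worst-fitting points: {np.sort(per_point)[:5]}") # candidates for review
```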

DBSCAN: Density-Based Clustering

DBSCAN finds clusters of arbitrary shape by looking at point density. It doesn’t require specifying k in advance:

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# DBSCAN is sensitive to scale—standardize first
X_scaled = StandardScaler().fit_transform(X)

dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X_scaled)

# -1 indicates noise points (not in any cluster)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = list(labels).count(-1)

print(f"Number of clusters: {n_clusters}")
print(f"Number of noise points: {n_noise}")

# Visualize
plt.figure(figsize=(10, 6))
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))

for k, col in zip(unique_labels, colors):
    if k == -1:
        col = [0, 0, 0, 1]  # Black for noise
    class_member_mask = (labels == k)
    xy = X_scaled[class_member_mask]
    plt.scatter(xy[:, 0], xy[:, 1], c=[col], alpha=0.6, s=50, edgecolors='k')

plt.title('DBSCAN Clustering')
plt.show()

Key parameters:

  • eps: Maximum distance for two points to count as neighbors (larger eps merges more points into fewer, larger clusters)
  • min_samples: Minimum points in eps-neighborhood to form a cluster
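A common heuristic for choosing eps is the k-distance plot: sort every point's distance to its min_samples-th nearest neighbor and look for a "knee"; points past the knee sit in sparse regions. A sketch using scikit-learn's NearestNeighbors (note that when querying the training data, each point counts as its own nearest neighbor, which is acceptable for a rough heuristic):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

min_samples = 5
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)  # distances to the min_samples nearest neighbors
k_dist = np.sort(distances[:, -1])      # distance to the farthest of those, sorted

plt.plot(k_dist)
plt.xlabel('Points sorted by distance')
plt.ylabel(f'Distance to {min_samples}th nearest neighbor')
plt.title('k-Distance Plot (look for the knee to choose eps)')
plt.show()
```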

Hierarchical Clustering: Building Dendrograms

Hierarchical clustering builds a tree of nested clusters, showing relationships at multiple scales:

from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

# Compute linkage matrix
linkage_matrix = linkage(X, method='ward')

# Plot dendrogram
plt.figure(figsize=(14, 6))
dendrogram(linkage_matrix, truncate_mode='lastp', p=30)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index or (Cluster Size)')
plt.ylabel('Distance')
plt.show()

# Extract clusters at specific distance threshold
clusters = fcluster(linkage_matrix, t=10, criterion='distance')
print(f"Number of clusters at distance threshold 10: {len(np.unique(clusters))}")

Hierarchical clustering is useful when you want to understand relationships at multiple scales.
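If you want cluster labels directly rather than a dendrogram, scikit-learn's AgglomerativeClustering runs the same bottom-up procedure; you can cut the tree either by a target number of clusters or by a distance threshold (a sketch on blob data like the one above):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Cut by a fixed number of clusters
agg = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels = agg.fit_predict(X)

# Or cut by a distance threshold (n_clusters must then be None)
agg_t = AgglomerativeClustering(n_clusters=None, distance_threshold=10, linkage='ward')
labels_t = agg_t.fit_predict(X)
print(f"Clusters at threshold 10: {agg_t.n_clusters_}")
```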

Dimensionality Reduction: PCA and Beyond

High-dimensional data is hard to visualize and can be computationally expensive. Dimensionality reduction projects data into fewer dimensions while preserving important information.

Principal Component Analysis (PCA)

PCA finds linear combinations of features that capture the most variance:

from sklearn.decomposition import PCA

# Apply PCA (on this 2-feature toy data the projection is essentially a rotation;
# in practice you would reduce from many more features)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Explained variance
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Cumulative variance: {np.cumsum(pca.explained_variance_ratio_)}")

# Visualize reduced-dimensional data
plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_true, cmap='viridis', alpha=0.6)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
plt.title('PCA Projection')
plt.colorbar(scatter)
plt.show()

# How many components for 95% variance? (most useful on high-dimensional data;
# here X has only 2 features, so the answer is trivially 1 or 2)
pca_95 = PCA(n_components=0.95)  # keep components until 95% of variance is explained
X_reduced = pca_95.fit_transform(X)
print(f"Dimensions for 95% variance: {X_reduced.shape[1]}")

PCA is linear and interpretable, but may not work well when clusters aren’t separable by linear projections.
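PCA's interpretability comes from its `components_` attribute: each principal component is a weighted combination of the original features, and the weights (loadings) show which features drive it. A sketch on the iris dataset, where the features have names (standardizing first so loadings are comparable):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_iris()
X_std = StandardScaler().fit_transform(data.data)  # put features on a common scale

pca = PCA(n_components=2).fit(X_std)
for i, component in enumerate(pca.components_):
    top = np.argsort(np.abs(component))[::-1]  # features with the largest loadings
    print(f"PC{i + 1} ({pca.explained_variance_ratio_[i]:.1%} of variance):")
    for j in top[:2]:
        print(f"  {data.feature_names[j]}: {component[j]:+.2f}")
```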

t-SNE: Non-Linear Visualization

t-SNE creates stunning 2D visualizations by preserving local neighborhoods:

from sklearn.manifold import TSNE

# t-SNE is computationally expensive for large datasets
# Use a subset for speed (capped at the dataset size—X here has only 300 points)
n_sub = min(1000, len(X))
sample_idx = np.random.choice(len(X), n_sub, replace=False)
X_sample = X[sample_idx]
y_sample = y_true[sample_idx]

tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_sample)

plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample, cmap='viridis', alpha=0.6)
plt.title('t-SNE Visualization')
plt.colorbar(scatter)
plt.show()

t-SNE is excellent for visualization but not for analysis—it doesn’t preserve global structure well.

UMAP: Practical Non-Linear Reduction

UMAP is faster than t-SNE and preserves more global structure:

# Requires the umap-learn package (pip install umap-learn)
from umap import UMAP

reducer = UMAP(n_components=2, random_state=42)
X_umap = reducer.fit_transform(X_sample)

plt.figure(figsize=(10, 6))
scatter = plt.scatter(X_umap[:, 0], X_umap[:, 1], c=y_sample, cmap='viridis', alpha=0.6)
plt.title('UMAP Visualization')
plt.colorbar(scatter)
plt.show()

Anomaly Detection: Finding Outliers

Unsupervised anomaly detection identifies unusual points:

from sklearn.ensemble import IsolationForest

# Isolation Forest: isolates anomalies by random partitioning
iso_forest = IsolationForest(contamination=0.1, random_state=42)  # Expect 10% anomalies
anomaly_labels = iso_forest.fit_predict(X)

# -1 = anomaly, 1 = normal
n_anomalies = (anomaly_labels == -1).sum()
print(f"Anomalies detected: {n_anomalies}")

# Visualize
plt.figure(figsize=(10, 6))
colors = ['red' if label == -1 else 'blue' for label in anomaly_labels]
plt.scatter(X[:, 0], X[:, 1], c=colors, alpha=0.6)
plt.title('Anomaly Detection with Isolation Forest')
plt.show()

Other anomaly detection methods:

  • Local Outlier Factor (LOF): Compares local density
  • One-Class SVM: Finds boundary around normal data
  • Mahalanobis Distance: Measures statistical distance
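Of these, Local Outlier Factor is available directly in scikit-learn: it flags points whose local density is much lower than that of their neighbors. A sketch with one deliberately injected outlier:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               np.array([[8.0, 8.0]])])  # one obvious outlier far from the cloud

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)  # -1 = outlier, 1 = inlier

print(f"Outliers flagged: {(labels == -1).sum()}")
print(f"Injected point flagged? {labels[-1] == -1}")
```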

Evaluation: When You Don’t Have Labels

Without true labels, how do you know your clustering is good?

from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X)

# Silhouette Score (higher is better, range -1 to 1)
silhouette = silhouette_score(X, labels)

# Davies-Bouldin Index (lower is better)
db_index = davies_bouldin_score(X, labels)

# Calinski-Harabasz Index (higher is better)
ch_index = calinski_harabasz_score(X, labels)

print(f"Silhouette Score: {silhouette:.4f}")
print(f"Davies-Bouldin Index: {db_index:.4f}")
print(f"Calinski-Harabasz Index: {ch_index:.2f}")
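No single internal metric is definitive, so it helps to compare all three across candidate values of k and look for agreement. A sketch (picking the winner by silhouette at the end is a simplifying choice, not a rule):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

results = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    results[k] = (silhouette_score(X, labels),       # higher is better
                  davies_bouldin_score(X, labels),   # lower is better
                  calinski_harabasz_score(X, labels))  # higher is better

for k, (sil, db, ch) in results.items():
    print(f"k={k}: silhouette={sil:.3f}, davies_bouldin={db:.3f}, "
          f"calinski_harabasz={ch:.1f}")

best_k = max(results, key=lambda k: results[k][0])  # best silhouette
print(f"Best k by silhouette: {best_k}")
```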

Key Takeaway

Unsupervised learning is exploratory—use visualization and multiple metrics to understand your data. Start with simple K-Means, but don’t be afraid to try DBSCAN or hierarchical clustering if you don’t know k or expect non-spherical clusters.

Practical Exercise

You have customer transaction data with 50 features. Your task:

  1. Apply PCA and t-SNE to visualize data
  2. Use K-Means with the elbow method to find optimal k
  3. Compare K-Means vs DBSCAN on original data
  4. Identify and visualize anomalies using Isolation Forest
  5. Interpret the results: what customer segments emerged?
Starter code:

from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Use digits dataset as a proxy for customer features
X, y = load_digits(return_X_y=True)

# Your implementation:
# 1. PCA reduction and visualization
# 2. t-SNE visualization (sample subset for speed)
# 3. K-Means with elbow method
# 4. DBSCAN comparison
# 5. Anomaly detection
# 6. Business interpretation

# Expected output:
# - PCA variance explained plot
# - 2D visualizations (t-SNE, UMAP, PCA)
# - Elbow curve and silhouette scores
# - Anomalies highlighted
# - Summary of business insights

This exercise teaches you to apply unsupervised methods to understand real data structure.