Python Data Structures for ML
When you’re building machine learning systems, you’ll spend more time working with data than writing model code. Python’s data structure ecosystem (NumPy arrays, pandas DataFrames, and related tools) forms the foundation of most data science work. Understanding how these structures work, especially their memory layout and performance characteristics, is crucial for writing efficient ML pipelines.
Why NumPy Arrays Over Python Lists
You might wonder: why not just use Python’s built-in lists? Let’s explore the differences that matter for ML.
Python lists are flexible and can contain mixed types, but they’re not optimized for numerical computation. Each element in a Python list is a pointer to a separately allocated object, so every access follows that pointer, and the interpreter must check the object’s type before doing any arithmetic on it. This overhead adds up quickly when you’re operating on millions of numbers.
NumPy arrays, by contrast, store data in contiguous blocks of memory in a specific numeric type. This layout allows:
- Vectorized operations: Perform the same operation on all elements without explicit loops
- Broadcasting: Operate on arrays of different shapes intelligently
- Memory efficiency: Store data densely without object overhead
- Speed: Leverage compiled C code under the hood
Let’s see this in action:
import numpy as np
import time
# Python list approach
python_list = list(range(1_000_000))
start = time.perf_counter()  # monotonic, high-resolution timer suited to benchmarks
result = [x * 2 for x in python_list]
list_time = time.perf_counter() - start
# NumPy approach
numpy_array = np.arange(1_000_000)
start = time.perf_counter()
result = numpy_array * 2
numpy_time = time.perf_counter() - start
print(f"List time: {list_time:.4f}s")
print(f"NumPy time: {numpy_time:.4f}s")
print(f"Speedup: {list_time / numpy_time:.1f}x")
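On a typical machine, NumPy is often 10-100x faster for element-wise arithmetic like this, though the exact ratio depends on hardware, array size, and dtype. This difference compounds across your entire pipeline.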
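The memory side of the comparison can be sketched in the same spirit. The snippet below is a rough illustration, not a precise accounting: `sys.getsizeof` reports CPython-specific sizes, and the exact byte counts are assumptions that will vary by Python version and platform.

```python
import sys
import numpy as np

n = 100_000
python_list = list(range(n))
numpy_array = np.arange(n, dtype=np.int64)

# A list stores one pointer per element...
list_pointer_bytes = sys.getsizeof(python_list)
# ...and each pointer refers to a separate int object with its own header.
list_object_bytes = sum(sys.getsizeof(x) for x in python_list)

# A NumPy array stores the raw values back to back: itemsize bytes each.
array_bytes = numpy_array.nbytes

print(f"List (pointers only):          {list_pointer_bytes:,} bytes")
print(f"List (pointers + int objects): {list_pointer_bytes + list_object_bytes:,} bytes")
print(f"NumPy int64 array:             {array_bytes:,} bytes")
print(f"Bytes per array element:       {numpy_array.itemsize}")
```

Choosing a smaller dtype (for example `np.float32` instead of `np.float64`) halves the array's footprint again, which can matter for large feature matrices.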
Creating and Shaping Arrays
NumPy provides several ways to create arrays:
import numpy as np
# From Python sequences
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])
# Creating arrays with specific patterns
zeros = np.zeros((3, 4)) # 3x4 array of zeros
ones = np.ones((2, 5)) # 2x5 array of ones
identity = np.eye(4) # 4x4 identity matrix
linspace = np.linspace(0, 10, 50) # 50 evenly spaced values from 0 to 10
# Random arrays (important for ML)
random_normal = np.random.randn(100, 50) # 100x50 from standard normal
random_uniform = np.random.uniform(0, 1, (100, 50)) # 100x50 uniform [0,1)
random_int = np.random.randint(0, 10, 1000) # 1000 random integers [0,10)
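For reproducible experiments, NumPy's documentation recommends the newer `Generator` interface over the legacy `np.random.*` functions used above. The following is a minimal sketch of the equivalent calls; the seed value 42 is an arbitrary choice for illustration.

```python
import numpy as np

# A Generator carries its own state, so results do not depend on global seeding
rng = np.random.default_rng(seed=42)

random_normal = rng.standard_normal((100, 50))       # like np.random.randn(100, 50)
random_uniform = rng.uniform(0, 1, size=(100, 50))   # uniform on [0, 1)
random_int = rng.integers(0, 10, size=1000)          # integers in [0, 10)

# A second generator with the same seed reproduces the same stream
rng_again = np.random.default_rng(seed=42)
same_normal = rng_again.standard_normal((100, 50))
print(np.array_equal(random_normal, same_normal))  # True
```

Passing an explicit generator into data-splitting or initialization code makes each experiment repeatable without touching global state.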
Reshaping is a critical operation when preparing data:
arr = np.arange(12)
print(arr.shape) # (12,)
# Reshape to 3x4
arr_2d = arr.reshape(3, 4)
print(arr_2d.shape) # (3, 4)
# Reshape to 2x2x3
arr_3d = arr.reshape(2, 2, 3)
print(arr_3d.shape) # (2, 2, 3)
# Flatten back to 1D
arr_flat = arr_2d.flatten()
print(arr_flat.shape) # (12,)
# Transpose (swap dimensions)
arr_t = arr_2d.T
print(arr_t.shape) # (4, 3)
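One detail worth checking when reshaping is whether the result shares memory with the original. The sketch below reuses the `arr` example and also shows `-1`, which asks NumPy to infer one dimension from the total size.

```python
import numpy as np

arr = np.arange(12)

# -1 lets NumPy infer the missing dimension (12 / 4 = 3 rows)
arr_2d = arr.reshape(-1, 4)
print(arr_2d.shape)  # (3, 4)

# reshape on a contiguous array returns a view: no data is copied
print(np.shares_memory(arr, arr_2d))  # True

# flatten always returns a copy; ravel returns a view when it can
arr_flat = arr_2d.flatten()
arr_ravel = arr_2d.ravel()
print(np.shares_memory(arr, arr_flat))   # False
print(np.shares_memory(arr, arr_ravel))  # True

# Writing through the view changes the original, but not the earlier copy
arr_2d[0, 0] = 99
print(arr[0])       # 99
print(arr_flat[0])  # 0
```

Views avoid copying large arrays, but they also mean an in-place edit to a reshaped array can silently modify the data it came from.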
Broadcasting: The Magic of NumPy Operations
Broadcasting is NumPy’s mechanism for operating on arrays of different shapes. It’s powerful but can be confusing—understanding it deeply will save you hours of debugging.
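The key rule: when operating on two arrays, NumPy aligns their shapes starting from the rightmost (trailing) dimension. Two dimensions are compatible when they are equal or when one of them is 1; a size-1 dimension is stretched, without copying data, to match the other. Any other mismatch raises a ValueError: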
# Example 1: Adding scalar to array
arr = np.array([[1, 2, 3], [4, 5, 6]])
result = arr + 10 # 10 is broadcast to match arr's shape
print(result)
# [[11 12 13]
# [14 15 16]]
# Example 2: Adding row vector to matrix
arr = np.array([[1, 2, 3], [4, 5, 6]])
row = np.array([10, 20, 30]) # Shape (3,)
result = arr + row # row broadcasts to match each row of arr
print(result)
# [[11 22 33]
# [14 25 36]]
# Example 3: Adding column vector to matrix
arr = np.array([[1, 2, 3], [4, 5, 6]])
col = np.array([[10], [20]]) # Shape (2, 1)
result = arr + col # col broadcasts across columns
print(result)
# [[11 12 13]
# [24 25 26]]
Broadcasting rules (shapes are aligned right-to-left, starting from the trailing dimension):
- If arrays have different numbers of dimensions, pad the smaller shape with 1s on the left
- Two dimensions are compatible if they are equal OR one of them is 1
- Size-1 dimensions are stretched to match the other array
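The rules above can be written out as a small function. `broadcast_shape` is a hypothetical helper introduced here purely for illustration (it is not part of NumPy); the sketch compares its answer against what NumPy itself computes.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Illustrative helper: apply the broadcasting rules to two shapes."""
    ndim = max(len(shape_a), len(shape_b))
    # Rule 1: pad the shorter shape with 1s on the left
    padded_a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    padded_b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for dim_a, dim_b in zip(padded_a, padded_b):
        # Rules 2 and 3: equal sizes pass through, size-1 stretches to the other
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f"incompatible dimensions: {dim_a} vs {dim_b}")
    return tuple(result)

print(broadcast_shape((2, 3), (3,)))        # (2, 3)
print(broadcast_shape((2, 3), (2, 1)))      # (2, 3)
print(broadcast_shape((4, 1, 3), (5, 1)))   # (4, 5, 3)

# Cross-check against NumPy's own result
numpy_shape = np.broadcast(np.empty((4, 1, 3)), np.empty((5, 1))).shape
print(numpy_shape)  # (4, 5, 3)

# (2, 3) and (2,) fail: aligned from the right, 3 meets 2 and neither is 1
try:
    broadcast_shape((2, 3), (2,))
    failed = False
except ValueError as exc:
    failed = True
    print(f"Not broadcastable: {exc}")
```

The last case is a common source of bugs: a per-row vector of shape (2,) must be reshaped to (2, 1) before it can be combined with a (2, 3) matrix.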
Here’s where it gets practical for ML:
# Standardizing a dataset (zero mean, unit variance)
data = np.random.randn(1000, 50) # 1000 samples, 50 features
# Compute mean for each feature (shape: 50,)
mean = data.mean(axis=0)
# Compute standard deviation for each feature (shape: 50,)
std = data.std(axis=0)
# Subtract mean from all samples (broadcasting happens automatically)
centered = data - mean # mean broadcasts across all 1000 samples
# Divide by std (broadcasting again)
standardized = centered / std
print(f"Original mean: {data.mean(axis=0)[:5]}")
print(f"Standardized mean: {standardized.mean(axis=0)[:5]}")
print(f"Standardized std: {standardized.std(axis=0)[:5]}")
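Two pitfalls tend to show up when adapting this pattern, sketched below under assumed synthetic data: per-row statistics need `keepdims=True` to broadcast correctly, and a constant feature has a standard deviation of 0. Replacing a zero std with 1.0 is one possible convention, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((1000, 50))  # 1000 samples, 50 features

# Per-row means have shape (1000,), which does not align with (1000, 50)
row_mean_flat = data.mean(axis=1)
try:
    data - row_mean_flat
    row_broadcast_failed = False
except ValueError as exc:
    row_broadcast_failed = True
    print(f"Broadcast failed: {exc}")

# keepdims=True keeps the reduced axis as size 1, giving shape (1000, 1)
row_mean = data.mean(axis=1, keepdims=True)
row_centered = data - row_mean
print(row_mean.shape)                               # (1000, 1)
print(np.allclose(row_centered.mean(axis=1), 0.0))  # True

# A constant feature has std 0, so dividing by it would produce NaN or inf
data[:, 0] = 3.0
std = data.std(axis=0)
safe_std = np.where(std == 0, 1.0, std)  # assumption: leave constant columns unscaled
standardized = (data - data.mean(axis=0)) / safe_std
print(np.isfinite(standardized).all())  # True
```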
Indexing and Slicing
Efficiently accessing array elements is critical for data preprocessing:
arr = np.arange(20).reshape(4, 5)
print(arr)
# [[ 0 1 2 3 4]
# [ 5 6 7 8 9]
# [10 11 12 13 14]
# [15 16 17 18 19]]
# Single element access
print(arr[2, 3]) # 13
# Slicing
print(arr[1:3, 2:4]) # Rows 1-2, columns 2-3
# [[ 7 8]
# [12 13]]
# Boolean indexing (crucial for filtering)
mask = arr > 10
filtered = arr[mask]
print(filtered) # [11 12 13 14 15 16 17 18 19]
# Fancy indexing
rows = np.array([0, 2])
cols = np.array([1, 3])
print(arr[rows, cols]) # [ 1 13], i.e. the elements arr[0, 1] and arr[2, 3]
# Advanced: using indices to reorder
indices = np.array([3, 1, 2, 0])
reordered = arr[indices] # Reorder rows
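These indexing styles differ in one way that often causes subtle bugs: basic slices return views, while boolean and fancy indexing return copies. A short sketch on the same 4x5 array:

```python
import numpy as np

arr = np.arange(20).reshape(4, 5)

# Basic slicing returns a view: writing to it modifies arr
window = arr[1:3, 2:4]
window[0, 0] = -1
print(arr[1, 2])  # -1

# Fancy indexing returns a copy: writing to it leaves arr untouched
picked = arr[[0, 2]]
picked[0, 0] = 100
print(arr[0, 0])  # 0

# Boolean indexing on the right-hand side also returns a copy
masked = arr[arr > 10]
masked[:] = 0
print(arr[3, 4])  # 19

# Assigning through a mask on the left-hand side does modify arr in place
arr[arr > 10] = 0
print(arr[3, 4])  # 0
```

When a slice must be independent of its source, calling `.copy()` on it makes that explicit.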
Boolean indexing deserves special attention—you’ll use it constantly:
scores = np.array([45, 78, 92, 34, 88, 76, 91])
# Find scores above threshold
passing = scores[scores >= 80]
print(passing) # [92 88 91]
# Find indices of passing scores
passing_indices = np.where(scores >= 80)[0]
print(passing_indices) # [2 4 6]
# Selecting columns by a computed statistic
data = np.random.randn(1000, 5)
high_variance_cols = np.where(data.var(axis=0) > 0.5)[0]
print(f"Columns with variance > 0.5: {high_variance_cols}")
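Masks can also be combined. The sketch below reuses the `scores` array to show the element-wise operators `&`, `|`, and `~`; each comparison needs its own parentheses because these operators bind more tightly than comparisons. The thresholds are arbitrary values chosen for illustration.

```python
import numpy as np

scores = np.array([45, 78, 92, 34, 88, 76, 91])

# AND: scores from 70 up to (but not including) 90
mid_range = scores[(scores >= 70) & (scores < 90)]
print(mid_range)  # [78 88 76]

# OR: very low or very high scores
extremes = scores[(scores < 50) | (scores > 90)]
print(extremes)  # [45 92 34 91]

# NOT: invert a mask
not_passing = scores[~(scores >= 80)]
print(not_passing)  # [45 78 34 76]

# Three-argument np.where picks a value per element instead of returning indices
labels = np.where(scores >= 80, "pass", "fail")
print(labels)
```

Python's `and`, `or`, and `not` keywords do not work element-wise on arrays and raise an error about ambiguous truth values.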
From NumPy to Pandas: DataFrames
While NumPy arrays are homogeneous (all elements same type), real-world data is often heterogeneous—some columns are numeric, some are categorical strings. This is where pandas DataFrames shine:
import pandas as pd
# Create from dict of arrays
data = {
'age': [25, 34, 28, 45, 31],
'salary': [50000, 75000, 62000, 95000, 68000],
'department': ['Engineering', 'Sales', 'Engineering', 'Executive', 'Marketing']
}
df = pd.DataFrame(data)
print(df)
# DataFrame structure
print(df.shape) # (5, 3)
print(df.dtypes) # Data type of each column
print(df.columns) # Column names
print(df.index) # Row labels
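Because each column keeps its own dtype, converting a DataFrame back to NumPy behaves differently depending on which columns are selected. A minimal sketch using the same example data:

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 34, 28, 45, 31],
    'salary': [50000, 75000, 62000, 95000, 68000],
    'department': ['Engineering', 'Sales', 'Engineering', 'Executive', 'Marketing'],
})

# Numeric columns convert to a compact integer array
numeric = df[['age', 'salary']].to_numpy()
print(numeric.dtype, numeric.shape)

# Mixing numbers and strings falls back to a generic object array,
# which loses the speed and memory benefits of typed arrays
mixed = df.to_numpy()
print(mixed.dtype)  # object

# select_dtypes picks out the numeric columns without naming them
numeric_cols = df.select_dtypes(include='number').columns.tolist()
print(numeric_cols)  # ['age', 'salary']
```

In practice this means categorical columns usually need encoding before a DataFrame is handed to numerical code.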
DataFrames enable column-specific operations while maintaining the relationship between them:
# Access columns
print(df['age']) # Series
print(df[['age', 'salary']]) # DataFrame with 2 columns
# Calculate statistics by group
print(df.groupby('department')['salary'].mean())
# Vectorized operations on columns
df['salary_after_raise'] = df['salary'] * 1.1 # salary after a 10% raise
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 40, 100],
labels=['Young', 'Mid', 'Senior'])
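The same ideas extend to several statistics at once and to row filtering. The following sketch rebuilds the example DataFrame so it runs on its own; the age cutoff of 30 is an arbitrary value for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [25, 34, 28, 45, 31],
    'salary': [50000, 75000, 62000, 95000, 68000],
    'department': ['Engineering', 'Sales', 'Engineering', 'Executive', 'Marketing'],
})

# Several statistics per group in one pass
summary = df.groupby('department')['salary'].agg(['mean', 'max', 'count'])
print(summary)

# Boolean masks work on DataFrames just as they do on NumPy arrays
is_young_engineer = (df['department'] == 'Engineering') & (df['age'] < 30)
young_engineers = df.loc[is_young_engineer, ['age', 'salary']]
print(young_engineers)

# Bin ages into labeled groups, then count rows per bin
df['age_group'] = pd.cut(df['age'], bins=[0, 30, 40, 100],
                         labels=['Young', 'Mid', 'Senior'])
group_counts = df['age_group'].value_counts()
print(group_counts)
```

Note that `pd.cut` uses right-closed intervals by default, so an age of exactly 30 falls in the 'Young' bin.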
Key Takeaway
NumPy’s vectorized operations and broadcasting eliminate the need for explicit loops in most numerical code, making it faster (often 10-100x for element-wise work), more readable, and more Pythonic. Master these concepts and you’ll write ML pipelines that are both elegant and efficient.
Practical Exercise
You’re working on a recommendation system. You have:
- A matrix of user ratings: shape (10000, 5000) — 10,000 users, 5,000 products
- Each user has rated only a subset of products (missing values are 0)
Your task:
- Compute the mean rating for each product (ignoring zeros)
- Normalize each user’s ratings to have mean 0 (center each row, using only the products that user has rated)
- Find the top 10 products by average rating
- Create a binary matrix indicating which products each user has rated
Write the code using only NumPy operations (no loops).
import numpy as np
# Generate synthetic rating matrix
np.random.seed(42)
ratings = np.random.choice([0, 1, 2, 3, 4, 5], size=(10000, 5000), p=[0.7, 0.05, 0.05, 0.1, 0.05, 0.05])
# Your solution here:
# 1. Mean rating per product (ignoring zeros)
# 2. Normalize users' ratings to mean 0
# 3. Top 10 products
# 4. Binary matrix of rated products
The solution requires combining your understanding of axes, broadcasting, and boolean indexing. Try it before checking the answer!