Scientific Python Ecosystem
The Python scientific ecosystem extends far beyond NumPy and pandas. Understanding the full landscape—SciPy for advanced algorithms, scikit-learn for preprocessing, joblib for parallelization, and environment tools—transforms you from someone who can write Python into someone who can build reproducible, professional ML systems.
SciPy: Advanced Scientific Computing
SciPy builds on NumPy with specialized algorithms for optimization, integration, linear algebra, and statistics:
import numpy as np
from scipy.optimize import minimize, curve_fit
from scipy.stats import norm, chi2
from scipy.spatial.distance import euclidean, cosine
# Optimization: finding minimum of function
def objective(x):
    return (x - 3) ** 2 + 2 * np.sin(x)
result = minimize(objective, x0=0)
print(f"Minimum found at x={result.x[0]:.4f}, f(x)={result.fun:.4f}")
# Curve fitting: fit a function to data
x_data = np.linspace(0, 10, 100)
y_data = 2.5 * np.exp(-0.3 * x_data) + np.random.normal(0, 0.1, 100)
def exponential(x, a, b):
    return a * np.exp(-b * x)
params, _ = curve_fit(exponential, x_data, y_data, p0=[2.5, 0.3])
print(f"Fitted parameters: a={params[0]:.4f}, b={params[1]:.4f}")
# Statistical distributions
# Compute probability and percentiles
p_value = 1 - norm.cdf(2.5) # P(Z > 2.5) for standard normal
quantile = norm.ppf(0.95) # 95th percentile
print(f"P(Z > 2.5): {p_value:.4f}")
print(f"95th percentile: {quantile:.4f}")
# Distance metrics (crucial for clustering, KNN)
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
euclidean_dist = euclidean(v1, v2)
cosine_dist = cosine(v1, v2)  # note: scipy's cosine() returns distance = 1 - similarity
print(f"Euclidean distance: {euclidean_dist:.4f}")
print(f"Cosine distance: {cosine_dist:.4f}")
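For clustering or KNN you rarely want one distance at a time; `scipy.spatial.distance.cdist` computes the full pairwise distance matrix in a single call. A small sketch with toy points:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Five 3-D points and two query points (toy data)
points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.5, 1.0]])
queries = np.array([[0.0, 0.0, 0.0],
                    [1.0, 1.0, 1.0]])

# Full (2 x 5) Euclidean distance matrix in one vectorized call
dists = cdist(queries, points, metric='euclidean')
print(dists.shape)  # (2, 5)

# Nearest neighbour of each query: argmin along axis 1
nearest = dists.argmin(axis=1)
print(nearest)  # [0 4]
```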
Scikit-Learn: Preprocessing Transformers
Scikit-learn’s transformer API is essential for building robust pipelines:
from sklearn.preprocessing import StandardScaler, RobustScaler, PolynomialFeatures, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import pandas as pd
# Sample heterogeneous data
X = pd.DataFrame({
    'age': [25, 35, 45, 55, 28],
    'income': [50000, 75000, 95000, 120000, 62000],
    'education': ['HS', 'Bachelor', 'Masters', 'PhD', 'Bachelor']
})
# Step 1: Define what to do with each column
numeric_features = ['age', 'income']
categorical_features = ['education']
# Step 2: Create transformers for each group
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(drop='first', sparse_output=False))
])
# Step 3: Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
# Step 4: Add a final estimator (e.g., classifier) into the full pipeline
from sklearn.linear_model import LogisticRegression
full_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=200))
])
# Now you can fit the entire pipeline
# full_pipeline.fit(X, y)
This pattern is gold: it keeps preprocessing and modeling together, preventing data leakage and making code reproducible.
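To see the leakage protection concretely: because the scaler lives inside the pipeline, cross-validation refits it on each training fold, so held-out fold statistics never leak into preprocessing. A minimal sketch with synthetic data (the dataset and scores here are purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),        # refit on each CV training fold
    ('clf', LogisticRegression(max_iter=200)),
])

# cross_val_score clones the whole pipeline per fold, so the scaler
# never sees the held-out fold during fitting
scores = cross_val_score(pipe, X, y, cv=5)
print(f"CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Had you scaled X once up front and cross-validated only the classifier, each fold's scaler would have seen test-fold data — exactly the leak the pipeline prevents.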
Feature Selection and Dimensionality Reduction
When you have many features, you need tools to identify which ones matter:
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.decomposition import PCA
import numpy as np
# Sample data: 1000 samples, 50 features
X = np.random.randn(1000, 50)
y = np.random.randint(0, 2, 1000)
# Statistical feature selection (univariate)
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(f"Shape after selecting top 10: {X_selected.shape}")
# See which features were selected
selected_indices = selector.get_support(indices=True)
print(f"Selected feature indices: {selected_indices}")
# Principal Component Analysis: linear dimensionality reduction
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)
print(f"Shape after PCA: {X_pca.shape}")
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Cumulative variance: {np.cumsum(pca.explained_variance_ratio_)}")
# How many components for 95% variance?
n_components_95 = np.argmax(np.cumsum(pca.explained_variance_ratio_) >= 0.95) + 1
print(f"Components needed for 95% variance: {n_components_95}")
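A handy shortcut for that last question: scikit-learn also accepts a float between 0 and 1 as `n_components`, in which case PCA keeps just enough components to explain that fraction of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))

# Float n_components = target explained-variance fraction
pca95 = PCA(n_components=0.95)
X_reduced = pca95.fit_transform(X)
print(f"Components kept for 95% variance: {pca95.n_components_}")
```

This replaces the manual `np.cumsum` / `np.argmax` bookkeeping with one constructor argument.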
Joblib: Parallel Processing
When you’re doing expensive computations (grid search, cross-validation), parallelization becomes critical:
from joblib import Parallel, delayed
import numpy as np
# Serial approach: slow
def expensive_computation(x):
    """Simulate expensive task"""
    result = 0
    for i in range(10_000_000):
        result += np.sin(i * x)
    return result
# Without parallelization
results = [expensive_computation(x) for x in range(10)]
# With parallelization: use multiple CPU cores
results = Parallel(n_jobs=-1)(
    delayed(expensive_computation)(x) for x in range(10)
)
print(f"Results: {results}")
Joblib is also crucial for caching expensive computations:
from joblib import Memory
# Set up a cache directory
cache = Memory('cache_dir', verbose=1)
@cache.cache
def expensive_function(x):
    """This result will be cached"""
    print(f"Computing for x={x}...")
    return x ** 2
# First call: computes
result1 = expensive_function(5)
# Second call with same input: loads from cache
result2 = expensive_function(5)
# Different input: computes again
result3 = expensive_function(6)
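To convince yourself the cache is actually working, count real executions: the function body should run once per distinct input, no matter how many times you call it. A small sketch (the temporary cache directory is a stand-in for a real cache path):

```python
import tempfile
from joblib import Memory

# Throwaway cache directory so the sketch leaves nothing behind
cache = Memory(tempfile.mkdtemp(), verbose=0)
calls = {'n': 0}

@cache.cache
def square(x):
    calls['n'] += 1          # counts real executions, not cache hits
    return x ** 2

square(5)   # computes
square(5)   # served from cache, body not executed
square(6)   # new input: computes
print(calls['n'])  # 2
```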
Environment Management and Reproducibility
In production, reproducibility is non-negotiable. You need to capture exactly what versions you used:
# requirements.txt approach (basic)
"""
numpy==1.24.3
pandas==2.0.2
scikit-learn==1.2.2
scipy==1.10.1
matplotlib==3.7.1
"""
# pip install -r requirements.txt
Better: use a lock file that includes transitive dependencies:
# Create lock file with all dependencies (including sub-dependencies)
pip freeze > requirements.txt
# Or with poetry (modern approach)
poetry add numpy pandas scikit-learn
poetry export --format=requirements.txt > requirements.txt
Most importantly: always set random seeds for reproducibility:
import numpy as np
import random
from sklearn.model_selection import train_test_split
# Set seeds
np.random.seed(42)
random.seed(42)
# Now all random operations are reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # Also set here!
)
# This ensures same train/test split every run
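To verify the seeding actually worked, run the split twice and compare: with the same `random_state` the resulting partitions match exactly. A quick sketch with toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# Two splits with the same seed produce identical partitions
X_tr1, X_te1, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_te1, X_te2))  # True

# A different seed gives a different partition
X_tr3, X_te3, _, _ = train_test_split(X, y, test_size=0.2, random_state=0)
print(np.array_equal(X_te1, X_te3))  # very likely False
```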
Building a Reproducible ML Project
Here’s a complete example of a reproducible ML workflow:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
import joblib
import json
from datetime import datetime
class MLProject:
    def __init__(self, random_state=42):
        """Initialize with fixed random state for reproducibility"""
        self.random_state = random_state
        np.random.seed(random_state)
        self.metadata = {
            'timestamp': datetime.now().isoformat(),
            'random_state': random_state
        }
        self.model = None
        self.scaler = None

    def load_and_preprocess(self, filepath):
        """Load and preprocess data"""
        df = pd.read_csv(filepath)
        # Handle missing values
        df = df.dropna()
        # Separate features and target
        X = df.drop('target', axis=1)
        y = df['target']
        # Store feature names (for later prediction)
        self.feature_names = X.columns.tolist()
        return X, y

    def train(self, X, y):
        """Train model with proper train/test split"""
        # Split data
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=self.random_state, stratify=y
        )
        # Scale features
        self.scaler = StandardScaler()
        X_train_scaled = self.scaler.fit_transform(X_train)
        X_test_scaled = self.scaler.transform(X_test)
        # Train model
        self.model = RandomForestClassifier(
            n_estimators=100,
            random_state=self.random_state,
            n_jobs=-1
        )
        self.model.fit(X_train_scaled, y_train)
        # Evaluate
        train_score = self.model.score(X_train_scaled, y_train)
        test_score = self.model.score(X_test_scaled, y_test)
        # Cross-validation for robust estimate
        cv_scores = cross_val_score(
            self.model, X_train_scaled, y_train, cv=5, n_jobs=-1
        )
        print(f"Train accuracy: {train_score:.4f}")
        print(f"Test accuracy: {test_score:.4f}")
        print(f"CV mean: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
        return X_test_scaled, y_test

    def save(self, model_path='model.joblib', metadata_path='metadata.json'):
        """Save model and metadata for later use"""
        joblib.dump(self.model, model_path)
        joblib.dump(self.scaler, 'scaler.joblib')
        with open(metadata_path, 'w') as f:
            json.dump({
                **self.metadata,
                'feature_names': self.feature_names,
                'n_features': len(self.feature_names)
            }, f, indent=2)

    def load(self, model_path='model.joblib'):
        """Load previously trained model"""
        self.model = joblib.load(model_path)
        self.scaler = joblib.load('scaler.joblib')

    def predict(self, X):
        """Make predictions with loaded model"""
        if self.model is None:
            raise ValueError("Model not trained or loaded")
        X_scaled = self.scaler.transform(X)
        return self.model.predict(X_scaled)

# Usage
project = MLProject(random_state=42)
X, y = project.load_and_preprocess('data.csv')
X_test, y_test = project.train(X, y)
project.save()

# Later: load and predict
project_loaded = MLProject()
project_loaded.load()
predictions = project_loaded.predict(X_test)
Key Takeaway
The difference between a prototype and a production ML system isn’t more complex models—it’s reproducibility, proper pipelines, and thoughtful environment management. Master these tools and practices, and you’ll build systems that work the same way six months later.
Practical Exercise
Build a complete reproducible ML system:
- Create a dataset generator that creates reproducible data
- Build a Pipeline with preprocessing and a classifier
- Implement proper train/test split with cross-validation
- Save the entire pipeline (preprocessor + model) to disk
- Load it back and verify predictions are identical
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import joblib
import numpy as np
# Generate reproducible dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)
# Your solution: build a complete reproducible pipeline that:
# 1. Scales features
# 2. Trains a classifier
# 3. Saves to disk
# 4. Loads back and produces identical predictions
# Bonus: Add cross-validation and print results
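One possible solution sketch (not the only design; the file path and model choice here are arbitrary):

```python
import os
import tempfile
import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# 1. Reproducible data and split
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Preprocessing + classifier in one object
pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=500)),
])

# Bonus: cross-validation on the training set only
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# 3. Fit and save the whole pipeline (scaler + model) as one artifact
pipe.fit(X_train, y_train)
model_path = os.path.join(tempfile.mkdtemp(), 'pipeline.joblib')
joblib.dump(pipe, model_path)

# 4. Load back and verify predictions are identical
pipe_loaded = joblib.load(model_path)
preds = pipe.predict(X_test)
preds_loaded = pipe_loaded.predict(X_test)
assert np.array_equal(preds, preds_loaded)
print("Loaded pipeline reproduces predictions exactly")
```

Saving the pipeline as a single artifact, rather than the scaler and model separately, is what guarantees the exact same preprocessing is applied at prediction time.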
The key insight: a well-designed pipeline is not just cleaner code—it’s the foundation of reproducible science and production-ready ML systems.