Foundations

Data Loading, Cleaning, and Preprocessing

Lesson 2 of 4 · Estimated time: 45 min

In real-world ML projects, the harsh truth is that 80% of your time is spent on data loading, cleaning, and preprocessing. The model-building part? That’s often just 20%. Understanding how to efficiently load, clean, and transform data is the difference between a production ML system and a science experiment that doesn’t work in practice.

Loading Data: From Files to Memory

Your data could live in various formats. Let’s cover the most common ones:

import pandas as pd
import numpy as np

# CSV files (most common)
df = pd.read_csv('data.csv')

# JSON files (great for hierarchical data)
df = pd.read_json('data.json')

# Parquet files (compressed, columnar format—faster for large datasets)
df = pd.read_parquet('data.parquet')

# Excel files (if you really must)
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM users', conn)

For large datasets, loading efficiency matters:

# Read only columns you need
df = pd.read_csv('large_file.csv', usecols=['age', 'income', 'purchased'])

# Read in chunks
chunks = []
for chunk in pd.read_csv('huge_file.csv', chunksize=10000):
    # Process chunk
    chunks.append(chunk)
df = pd.concat(chunks)

# Specify data types to reduce memory
dtypes = {'user_id': 'int32', 'age': 'int8', 'income': 'float32'}
df = pd.read_csv('data.csv', dtype=dtypes)
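The same dtype trick works after loading, too: `astype` downcasts existing columns. A minimal sketch with synthetic data (column names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic frame: 100,000 rows with deliberately wide default dtypes
n = 100_000
df = pd.DataFrame({
    'user_id': np.arange(n, dtype='int64'),
    'age': np.random.randint(18, 90, size=n).astype('int64'),
    'income': np.random.rand(n) * 100_000,  # float64 by default
})

before = df.memory_usage(deep=True).sum()

# Downcast each column to the smallest dtype that still fits its values
df_small = df.astype({'user_id': 'int32', 'age': 'int8', 'income': 'float32'})
after = df_small.memory_usage(deep=True).sum()

print(f"{before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")
```

Note that int8 only holds values from -128 to 127, so it is safe for ages but would silently overflow for larger counts.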

Parquet is increasingly the standard for ML pipelines because its files are both smaller on disk and faster to read (pandas needs the pyarrow or fastparquet package installed):

# Write to parquet
df.to_parquet('data.parquet')

# Read back (much faster than CSV)
df = pd.read_parquet('data.parquet')

# Parquet even preserves data types correctly
print(df.dtypes)

Exploratory Data Analysis: Understanding Your Data

Before you preprocess, you need to understand what you’re working with:

import pandas as pd
import numpy as np

df = pd.read_csv('customer_data.csv')

# Basic info
print(f"Shape: {df.shape}")  # How many rows and columns
print(df.info())  # Data types and non-null counts
print(df.head(10))  # First 10 rows
print(df.tail(5))  # Last 5 rows

# Descriptive statistics
print(df.describe())  # Mean, std, min, max for numeric columns

# Identify missing values
print(df.isnull().sum())  # Count of nulls per column
print(df.isnull().sum() / len(df) * 100)  # Percentage missing

# Unique values per column
print(df.nunique())

# Check for duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")

Handling Missing Values

Missing values are unavoidable. Your strategy depends on why they’re missing and what column they’re in:

# Check missing patterns
print(df.isnull().sum())

# Strategy 1: Drop rows with any missing values
df_complete = df.dropna()  # Aggressive, might lose data

# Strategy 2: Drop rows where specific columns are missing
df_clean = df.dropna(subset=['critical_column'])

# Strategy 3: Fill missing values
# For numeric columns, common approaches:
df['age'] = df['age'].fillna(df['age'].mean())  # Fill with mean
df['salary'] = df['salary'].fillna(df['salary'].median())  # Median is robust to outliers
df['experience'] = df['experience'].ffill()  # Forward fill (carry the last value forward)

# For categorical columns:
df['department'] = df['department'].fillna('Unknown')  # Fill with a default
df['category'] = df['category'].fillna(df['category'].mode()[0])  # Fill with the most common value

# Advanced: interpolation for time series
df['temperature'] = df['temperature'].interpolate(method='linear')

For your preprocessing pipeline, be strategic about which strategy you choose:

# Smart missing value handling for a customer dataset
def preprocess_customers(df):
    df = df.copy()

    # Drop if customer ID is missing (every customer needs an ID)
    df = df.dropna(subset=['customer_id'])

    # Age: fill with median (reasonable imputation)
    df['age'] = df['age'].fillna(df['age'].median())

    # Purchase history: fill with 0 (no purchase is different from unknown)
    df['purchases'] = df['purchases'].fillna(0)

    # Phone: fill with 'Not provided'
    df['phone'] = df['phone'].fillna('Not provided')

    return df
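The fillna calls above compute statistics from whatever DataFrame they are given. In an sklearn workflow, SimpleImputer does the same job but learns the statistic from training data only, which matters once you split train and test. A minimal sketch with toy numbers:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[25.0], [np.nan], [40.0], [35.0]])
X_test = np.array([[np.nan], [50.0]])

# Learn the median (35.0) from the training column only
imputer = SimpleImputer(strategy='median')
X_train_filled = imputer.fit_transform(X_train)

# Test-set gaps are filled with the *training* median, not the test median
X_test_filled = imputer.transform(X_test)
```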

Normalization and Standardization

Different features live on different scales. Without scaling, many ML models (especially distance-based and gradient-descent-trained ones) are dominated by the features with the largest magnitudes.

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# Sample data
data = np.array([
    [100, 5],
    [200, 10],
    [150, 7],
    [300, 15]
])

# Standardization (zero mean, unit variance)
# Formula: (x - mean) / std
scaler = StandardScaler()
standardized = scaler.fit_transform(data)
print("Standardized (mean=0, std=1):")
print(standardized)
print(f"Mean: {standardized.mean(axis=0)}")
print(f"Std: {standardized.std(axis=0)}")

# Min-Max Scaling (range [0, 1])
# Formula: (x - min) / (max - min)
scaler = MinMaxScaler(feature_range=(0, 1))
normalized = scaler.fit_transform(data)
print("\nMin-Max scaled [0, 1]:")
print(normalized)

# Robust Scaling (resistant to outliers)
# Formula: (x - median) / IQR
scaler = RobustScaler()
robust = scaler.fit_transform(data)
print("\nRobust scaled:")
print(robust)

Critical tip for production ML: Always fit scalers on training data only, then apply to test data:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit ONLY on training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the SAME scaler to test data
X_test_scaled = scaler.transform(X_test)

# WRONG: fitting again on test data causes data leakage
# X_test_scaled = scaler.fit_transform(X_test)  # DON'T DO THIS!
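A Pipeline makes this discipline automatic: `fit` runs the scaler's fit_transform on training data, and `score`/`predict` reuse those stored statistics on new data. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipe = Pipeline([
    ('scale', StandardScaler()),     # statistics come from training data only
    ('model', LogisticRegression()),
])

pipe.fit(X_train, y_train)          # fits scaler, then model, on X_train
score = pipe.score(X_test, y_test)  # X_test is transformed, never refit
```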

Encoding Categorical Variables

ML models work with numbers. Categorical variables need encoding:

import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'blue', 'red', 'green', 'blue'],
    'size': ['small', 'large', 'medium', 'large', 'small']
})

# One-Hot Encoding (for unordered categories)
df_encoded = pd.get_dummies(df, columns=['color', 'size'], drop_first=False)
print(df_encoded)
# Creates binary columns: color_blue, color_green, color_red, size_large, size_medium, size_small

# Label Encoding (maps each category to an integer)
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
# Note: codes follow sorted order (large=0, medium=1, small=2), which does
# NOT match the natural small < medium < large ordering
df['size_encoded'] = le.fit_transform(df['size'])
print(df[['size', 'size_encoded']])
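When categories have a natural order, OrdinalEncoder lets you spell that order out instead of relying on alphabetical sorting. A minimal sketch:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'size': ['small', 'large', 'medium', 'large', 'small']})

# Declare the order explicitly: small=0, medium=1, large=2
enc = OrdinalEncoder(categories=[['small', 'medium', 'large']])
df['size_encoded'] = enc.fit_transform(df[['size']])
```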

For high-cardinality columns (many unique values), be careful:

# Problem: one-hot encoding creates explosion of columns
df = pd.DataFrame({'city': ['New York', 'London', 'Tokyo', ...]})  # imagine 1000 unique cities
df_encoded = pd.get_dummies(df)  # Creates 1000 new columns!

# Solution 1: Keep top N categories
top_cities = df['city'].value_counts().head(10).index
df['city'] = df['city'].apply(lambda x: x if x in top_cities else 'Other')
df_encoded = pd.get_dummies(df)  # Now only 11 columns

# Solution 2: Ordinal encoding (one integer column, handles unseen categories)
# (Target encoding is another option, but it requires careful cross-validation)
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
df_encoded = enc.fit_transform(df[['city']])

Feature Engineering Basics

Creating new features is often more valuable than using raw features:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5],
    'signup_date': pd.to_datetime(['2020-01-15', '2021-03-22', '2019-05-10', '2022-11-30', '2021-07-05']),
    'last_purchase_date': pd.to_datetime(['2023-10-20', '2023-09-15', '2023-12-01', '2023-01-10', '2023-08-30']),
    'purchase_amount': [100, 250, 50, 500, 175],
    'num_purchases': [5, 12, 2, 20, 8]
})

# Temporal features
df['days_since_signup'] = (pd.Timestamp.now() - df['signup_date']).dt.days
df['days_since_last_purchase'] = (pd.Timestamp.now() - df['last_purchase_date']).dt.days
df['signup_month'] = df['signup_date'].dt.month
df['signup_quarter'] = df['signup_date'].dt.quarter

# Aggregate features
df['avg_purchase_amount'] = df['purchase_amount'] / df['num_purchases']
df['purchase_frequency'] = df['num_purchases'] / df['days_since_signup']

# Polynomial features
df['purchase_amount_squared'] = df['purchase_amount'] ** 2
df['interaction'] = df['purchase_amount'] * df['num_purchases']

print(df)
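The hand-written squared and interaction terms above can also be generated systematically: scikit-learn's PolynomialFeatures expands every power and pairwise product for you. A small sketch:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[100, 5], [250, 12], [50, 2]])

# degree=2 expands [a, b] into [a, b, a^2, a*b, b^2]
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)
```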

Building a Complete Preprocessing Pipeline

In practice, you’ll chain many preprocessing steps:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load raw data
df_raw = pd.read_csv('raw_data.csv')

# Step 1: Handle missing values
df_raw['age'] = df_raw['age'].fillna(df_raw['age'].median())
df_raw['income'] = df_raw['income'].fillna(df_raw['income'].median())
df_raw['region'] = df_raw['region'].fillna('Unknown')

# Step 2: Remove duplicates
df = df_raw.drop_duplicates(subset=['customer_id'])

# Step 3: Feature engineering
df['income_per_age'] = df['income'] / (df['age'] + 1)
df['is_high_value'] = (df['income'] > df['income'].quantile(0.75)).astype(int)

# Step 4: Encode categorical variables
df = pd.get_dummies(df, columns=['region'], drop_first=True)

# Step 5: Normalize numeric features
numeric_cols = ['age', 'income', 'income_per_age']
scaler = StandardScaler()
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# Step 6: Remove low-variance features (near-constant columns)
variance = df.var(numeric_only=True)
low_var_cols = variance[variance < 0.01].index
df = df.drop(columns=low_var_cols)

print(f"Final shape: {df.shape}")
print(df.head())
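The steps above mutate one DataFrame eagerly. scikit-learn's ColumnTransformer bundles the same kind of scaling and encoding into a single object you can fit once and reapply to new data; a minimal sketch with made-up column names:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    'age': [25, 32, 47, 51],
    'income': [40_000, 55_000, 80_000, 62_000],
    'region': ['north', 'south', 'north', 'east'],
})

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'income']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['region']),
])

# 2 scaled numeric columns + 3 one-hot region columns = 5 features
X = preprocessor.fit_transform(df)
```

Calling `preprocessor.transform(new_df)` later applies the exact same scaling statistics and category set to unseen data.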

Key Takeaway

Quality preprocessing isn’t glamorous, but it’s the foundation of every successful ML project. Time spent understanding your data, handling missing values strategically, and engineering meaningful features usually yields far greater improvements in model performance than hyperparameter tuning.

Practical Exercise

You have a customer dataset with 100,000 rows and the following columns:

  • customer_id: Unique identifier
  • age: Customer age (some missing values)
  • income: Annual income (some missing values, right-skewed)
  • signup_date: When customer signed up
  • purchase_history: Comma-separated dates of purchases
  • region: Customer region (categorical)

Your preprocessing task:

  1. Load the data and identify missing values
  2. Impute age and income appropriately
  3. Create features from signup_date (days since signup, signup year)
  4. Parse purchase_history to create: total purchases, average time between purchases
  5. One-hot encode region
  6. Standardize numeric features
  7. Remove any near-constant features

Write the complete preprocessing function:

import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess_customers(filepath):
    """
    Load and preprocess customer data for ML modeling.

    Args:
        filepath: Path to the CSV file

    Returns:
        Processed DataFrame ready for modeling
    """
    # Your implementation here
    pass

# Test your preprocessing
df = preprocess_customers('customers.csv')
print(f"Output shape: {df.shape}")
print(f"Data types: {df.dtypes}")
print(f"Missing values: {df.isnull().sum().sum()}")

This exercise combines all the techniques covered in this lesson. The key is to write it as a reusable pipeline that you could apply to new data.