Privacy-Preserving AI Techniques
Advanced Methods for Secure Machine Learning
Privacy-preserving machine learning uses advanced cryptographic and mathematical techniques to enable AI training and inference without exposing raw data. These techniques are essential for regulated industries and sensitive applications.
Technique 1: Differential Privacy
Differential privacy adds noise to data or results to prevent individual data points from being identified.
How It Works
Before DP:
Training Data: [Alice's record, Bob's record, Carol's record, ...]
Model learns exact data → Can extract individual records
After DP:
Training Data + Noise: [record + noise, record + noise, ...]
Model can't distinguish individual contributions
Implementation
import numpy as np

def apply_differential_privacy(data, epsilon=1.0, delta=1e-5):
    """Apply differential privacy to data via the Laplace mechanism."""
    # epsilon (ε): privacy parameter (lower = more private)
    # delta (δ): failure probability (not used by the pure Laplace
    #            mechanism; kept for (ε, δ)-DP interface compatibility)
    sensitivity = 1.0  # Maximum change if one record is removed
    noise_scale = sensitivity / epsilon

    dp_data = []
    for record in data:
        # Add Laplace noise to each record
        noise = np.random.laplace(0, noise_scale)
        dp_data.append(record + noise)
    return np.array(dp_data)

class DifferentiallyPrivateModel:
    def __init__(self, epsilon=1.0):
        self.epsilon = epsilon            # Budget spent per training run
        self.privacy_budget_used = 0.0
        self.max_privacy_budget = 10.0    # Total privacy budget

    def train(self, training_data, learning_rate=0.01):
        """Train the model with differential privacy."""
        # Check the privacy budget before spending it
        if self.privacy_budget_used + self.epsilon > self.max_privacy_budget:
            raise ValueError("Privacy budget exhausted")

        # Apply DP to the inputs, then train normally on the noised data
        dp_data = apply_differential_privacy(training_data, epsilon=self.epsilon)
        self.model = self.build_model()  # build_model()/fit() assumed to exist
        self.model.fit(dp_data, learning_rate=learning_rate)

        # Record the budget spent
        self.privacy_budget_used += self.epsilon
        return {
            'privacy_budget_used': self.privacy_budget_used,
            'privacy_budget_remaining': self.max_privacy_budget - self.privacy_budget_used
        }

    def predict(self, x):
        """Make predictions (inference does not consume budget in this scheme)."""
        return self.model.predict(x)
Use Cases:
- Training on sensitive medical or financial data
- Public dataset release
- Government statistics
Tradeoff: More privacy = less model accuracy
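The tradeoff can be made concrete: for a fixed sensitivity, the Laplace noise scale grows as 1/ε, so halving ε roughly doubles the expected error of any released statistic. A minimal stdlib-only sketch of a differentially private mean (the helper names and the bounded-range assumption are illustrative, not from a DP library):

```python
import random

def laplace_noise(scale, rng=random.Random(0)):
    # Difference of two i.i.d. exponentials ~ Laplace(0, scale)
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_mean(values, epsilon, value_range=100.0):
    # Sensitivity of a mean over values bounded in [0, value_range]: range / n
    sensitivity = value_range / len(values)
    scale = sensitivity / epsilon
    return sum(values) / len(values) + laplace_noise(scale)

ages = [23, 35, 41, 52, 29, 60, 47, 38]
true_mean = sum(ages) / len(ages)
for eps in (0.1, 1.0, 10.0):
    err = abs(noisy_mean(ages, eps) - true_mean)
    print(f"epsilon={eps:5.1f}  |error| = {err:.2f}")
```

Running this repeatedly shows errors shrinking as ε grows: the privacy knob is directly an accuracy knob.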
Technique 2: Federated Learning
Train models without centralizing raw data.
How It Works
Instead of:
Hospital A sends data → Central server
Hospital B sends data → Central server
Hospital C sends data → Central server
Central server trains model on all data
Use Federated Learning:
Hospital A trains local model on local data → Sends only model weights
Hospital B trains local model on local data → Sends only model weights
Hospital C trains local model on local data → Sends only model weights
Central server averages model weights
Repeat until converged
Result: Hospital data never leaves hospital, only model updates are shared
Implementation
import numpy as np

class FederatedLearningCoordinator:
    def __init__(self, num_participants):
        self.num_participants = num_participants
        self.global_model = self.initialize_model()
        self.participant_models = {}
        self.round = 0

    def initialize_model(self):
        """Create the initial model architecture."""
        # SimpleNN is a placeholder for any model exposing
        # get_weights()/set_weights() and a layer_names attribute
        return SimpleNN(layers=[64, 32, 1])

    def send_model_to_participant(self, participant_id):
        """Package the current global model for a participant."""
        # Only model weights are sent, never data
        return {
            'model_weights': self.global_model.get_weights(),
            'round': self.round
        }

    def receive_updated_model(self, participant_id, updated_weights):
        """Receive locally trained weights from a participant."""
        self.participant_models[participant_id] = updated_weights

    def aggregate_models(self):
        """Aggregate all participant models into the global model."""
        # Simple unweighted averaging (the FedAvg algorithm)
        aggregated_weights = {}
        for layer_name in self.global_model.layer_names:
            all_weights = [
                self.participant_models[pid][layer_name]
                for pid in self.participant_models
            ]
            # Average weights across participants
            aggregated_weights[layer_name] = np.mean(all_weights, axis=0)
        self.global_model.set_weights(aggregated_weights)

    def federated_learning_round(self):
        """Execute one round of federated learning."""
        # 1. Send the global model to all participants (secure channel)
        for pid in range(self.num_participants):
            model_data = self.send_model_to_participant(pid)
        # 2. Participants train locally (on their own devices)
        # 3. Participants send updated weights back
        #    (delivered via receive_updated_model)
        # 4. Aggregate into a new global model
        self.aggregate_models()
        self.round += 1
        return {
            'round': self.round,
            'global_model_accuracy': self.evaluate_global_model()
        }
Use Cases:
- Training on data spread across organizations
- Mobile device machine learning
- Preserving data sovereignty
Advantages:
- Raw data never leaves participant
- Regulatory compliance
- Better for privacy-sensitive domains
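The aggregation step at the heart of FedAvg is just an element-wise mean over participants' weights. A stdlib-only sketch with made-up weight vectors for three hospitals (real models would have per-layer tensors, but the arithmetic is identical):

```python
def fedavg(updates):
    """Element-wise average of participants' flat weight vectors."""
    n = len(updates)
    return [sum(w[i] for w in updates) / n for i in range(len(updates[0]))]

# Hypothetical weight vectors after one round of local training
hospital_a = [0.2, 0.4, 0.6]
hospital_b = [0.4, 0.4, 0.2]
hospital_c = [0.6, 0.1, 0.1]

global_weights = [round(w, 6) for w in fedavg([hospital_a, hospital_b, hospital_c])]
print(global_weights)  # → [0.4, 0.3, 0.3]
```

Note that the coordinator only ever sees these weight vectors, never the patient records that produced them.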
Technique 3: Secure Enclaves
Use trusted execution environments (TEE) for sensitive processing.
How It Works
Untrusted Environment:
[Operating System, Other Applications]
├─ Can see encrypted data
├─ Can't see what's happening inside enclave
Trusted Enclave (Intel SGX, ARM TrustZone):
├─ AI model execution happens here
├─ Data is decrypted only inside enclave
├─ Attestation proves code hasn't been tampered
├─ Results encrypted before leaving enclave
Conceptual Implementation
import hashlib
from datetime import datetime

class SecurityError(Exception):
    pass

class SecureEnclaveMLProcessor:
    def __init__(self, enclave_binary_path):
        # In practice this would use Intel SGX or similar;
        # here we only simulate the concept
        self.enclave_verified = self.verify_enclave(enclave_binary_path)

    def compute_file_hash(self, binary_path):
        """SHA-256 hash of the enclave binary (its 'measurement')."""
        with open(binary_path, 'rb') as f:
            return hashlib.sha256(f.read()).hexdigest()

    def verify_enclave(self, binary_path):
        """Verify the enclave code hasn't been tampered with."""
        expected_hash = "abc123..."  # Hardcoded expected measurement
        actual_hash = self.compute_file_hash(binary_path)
        return actual_hash == expected_hash

    def process_sensitive_data(self, encrypted_data, encrypted_model):
        """Process data inside the secure enclave."""
        if not self.enclave_verified:
            raise SecurityError("Enclave verification failed")
        # In a real deployment, all of the following happens inside the TEE
        # 1. Decrypt inside the enclave only
        data = self.decrypt_inside_enclave(encrypted_data)
        model = self.decrypt_inside_enclave(encrypted_model)
        # 2. Process
        result = model.predict(data)
        # 3. Encrypt before anything leaves the enclave
        return self.encrypt_inside_enclave(result)

    def attestation_report(self):
        """Generate an attestation report proving enclave integrity."""
        return {
            'enclave_code_hash': "abc123...",
            'timestamp': datetime.now(),
            'signature': "..."  # Cryptographically signed by the hardware
        }
Use Cases:
- Processing highly sensitive data (government, medical)
- Cross-organization AI without data sharing
- Supply chain verification
Limitations:
- Hardware dependent (Intel SGX, ARM TrustZone)
- Side-channel attacks possible
- Performance overhead
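Attestation ultimately rests on comparing a cryptographic measurement of the loaded code against a known-good value recorded at build time. The core check can be sketched with stdlib hashing alone (the code bytes and version strings here are purely illustrative):

```python
import hashlib

def measure(code: bytes) -> str:
    """Return a SHA-256 'measurement' of enclave code."""
    return hashlib.sha256(code).hexdigest()

trusted_code = b"model-serving enclave v1.0"
expected = measure(trusted_code)  # Recorded at build time

# At load time, re-measure the binary and compare
assert measure(b"model-serving enclave v1.0") == expected       # untampered
assert measure(b"model-serving enclave v1.0-evil") != expected  # tampered
print("attestation check passed")
```

Real TEEs perform this measurement in hardware and sign the result, so a remote party can verify it without trusting the host OS.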
Technique 4: Homomorphic Encryption
Perform computation on encrypted data without decryption.
How It Works
Traditional:
Decrypt → Compute → Encrypt
Homomorphic Encryption:
Encrypt(data) → Compute on encrypted → Decrypt(result)
The decrypted result matches what you would get by running the same computation directly on the plaintext.
Conceptual Implementation
class HomomorphicEncryption:
    def __init__(self):
        # In practice, use a library such as Microsoft SEAL (BFV/CKKS schemes);
        # this is a simplified illustration with placeholder internals
        self.public_key, self.secret_key = self.generate_keys()

    def encrypt(self, plaintext):
        """Encrypt a number for HE computation."""
        # In real HE, the ciphertext supports addition/multiplication
        return self.apply_homomorphic_encryption(plaintext)

    def add_encrypted(self, encrypted_a, encrypted_b):
        """Add two encrypted numbers without decryption."""
        # Mathematical property of HE:
        # Decrypt(Encrypt(a) + Encrypt(b)) == a + b
        return encrypted_a + encrypted_b

    def multiply_encrypted(self, encrypted_a, encrypted_b):
        """Multiply two encrypted numbers without decryption."""
        return encrypted_a * encrypted_b

    def decrypt(self, ciphertext):
        """Decrypt a result with the secret key."""
        return self.apply_homomorphic_decryption(ciphertext)

# Usage example
he = HomomorphicEncryption()

# Encrypt data
patient_age_encrypted = he.encrypt(35)
treatment_cost_encrypted = he.encrypt(5000)

# Compute on encrypted data: adjusted_cost = cost * (1 + age/100)
# (HE schemes have no native division, so multiply by an encrypted 0.01)
one_encrypted = he.encrypt(1)
age_factor_encrypted = he.multiply_encrypted(patient_age_encrypted, he.encrypt(0.01))
multiplier_encrypted = he.add_encrypted(one_encrypted, age_factor_encrypted)
adjusted_cost_encrypted = he.multiply_encrypted(
    treatment_cost_encrypted,
    multiplier_encrypted
)

# Decrypt result
adjusted_cost = he.decrypt(adjusted_cost_encrypted)
# adjusted_cost = 5000 * (1 + 35/100) = 6750
Use Cases:
- Cloud processing of encrypted data
- Privacy-preserving analytics
- Secure computation across organizations
Current Limitations:
- Slow (thousands of times slower than unencrypted)
- Only some operations possible
- Rapidly improving
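The "compute on ciphertexts" property is easiest to see in textbook RSA, which is multiplicatively homomorphic: the product of two ciphertexts decrypts to the product of the plaintexts. A runnable toy sketch with tiny, deliberately insecure parameters (unpadded RSA must never be used in practice):

```python
# Toy textbook-RSA parameters (insecure, for illustration only)
p, q = 61, 53
n = p * q          # 3233
e, d = 17, 2753    # public / private exponents, e*d ≡ 1 (mod lcm(p-1, q-1))

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

# Multiplicative homomorphism: Enc(a) * Enc(b) decrypts to a * b (mod n)
a, b = 7, 6
product_ciphertext = (encrypt(a) * encrypt(b)) % n
print(decrypt(product_ciphertext))  # → 42
```

Fully homomorphic schemes (BGV, BFV, CKKS) extend this idea to support both addition and multiplication, at the performance cost noted above.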
Regulatory Compliance Integration
These techniques help meet regulatory requirements:
class PrivacyComplianceHelper:
    def __init__(self, regulations=('GDPR', 'CCPA', 'HIPAA')):
        self.regulations = list(regulations)
        self.privacy_techniques = {
            'differential_privacy': self.config_dp,
            'federated_learning': self.config_fl,
            'secure_enclaves': self.config_se,
            'homomorphic_encryption': self.config_he
        }

    def get_recommended_techniques(self, use_case):
        """Recommend privacy techniques for compliance with a use case."""
        recommendations = {
            'medical_records': ['secure_enclaves', 'differential_privacy'],
            'financial_data': ['homomorphic_encryption', 'secure_enclaves'],
            'cross_org_training': ['federated_learning'],
            'public_statistics': ['differential_privacy'],
        }
        return recommendations.get(use_case, [])

    def config_dp(self):
        """Configure differential privacy for GDPR."""
        return {
            'epsilon': 1.0,  # Strong privacy
            'delta': 1e-6,
            'mechanism': 'Laplace',
            'purpose': 'GDPR Article 32 - encryption/obfuscation'
        }

    def config_fl(self):
        """Configure federated learning for data minimization."""
        return {
            'purpose': 'GDPR Article 5 - data minimization',
            'raw_data_centralization': False,
            'secure_aggregation': True
        }

    def config_se(self):
        """Configure secure enclaves (placeholder)."""
        return {'attestation_required': True}

    def config_he(self):
        """Configure homomorphic encryption (placeholder)."""
        return {'scheme': 'CKKS'}
Key Takeaway
Privacy-preserving AI techniques (differential privacy, federated learning, secure enclaves, homomorphic encryption) enable training and inference on sensitive data without exposing the raw data. Each has tradeoffs among privacy, accuracy, and performance. Choose based on your use case and regulatory requirements.
Exercise: Implement Privacy-Preserving ML
- Implement differential privacy: Train a model with DP and measure accuracy loss
- Simulate federated learning: Train distributed models and aggregate weights
- Evaluate techniques: For your use case, which technique is most appropriate?
- Compliance mapping: Map your chosen technique to regulatory requirements
- Document tradeoffs: Privacy vs accuracy vs performance for your scenario
Next Phase: Intermediate - AI Red Teaming and Advanced Testing.