Privacy-Preserving AI Techniques
Advanced Methods for Secure Machine Learning
Privacy-preserving machine learning uses advanced cryptographic and mathematical techniques to enable AI training and inference without exposing raw data. These techniques are essential for regulated industries and sensitive applications.
Technique 1: Differential Privacy
Differential privacy adds noise to data or results to prevent individual data points from being identified.
How It Works
Before DP:
Training Data: [Alice's record, Bob's record, Carol's record, ...]
Model learns exact data → Can extract individual records
After DP:
Training Data + Noise: [record + noise, record + noise, ...]
Model can't distinguish individual contributions
Implementation
import numpy as np

def apply_differential_privacy(data, epsilon=1.0, delta=1e-5):
    """Apply differential privacy to data via the Laplace mechanism."""
    # epsilon (ε): privacy parameter (lower = more private)
    # delta (δ): failure probability (not used by the pure Laplace
    #            mechanism; kept for (ε, δ)-DP interface compatibility)
    sensitivity = 1.0  # Maximum change if one record is removed
    noise_scale = sensitivity / epsilon

    dp_data = []
    for record in data:
        # Add Laplace noise to each record
        noise = np.random.laplace(0, noise_scale)
        dp_data.append(record + noise)
    return np.array(dp_data)

class DifferentiallyPrivateModel:
    def __init__(self, epsilon=1.0):
        self.epsilon = epsilon            # Budget spent per training run
        self.privacy_budget_used = 0.0
        self.max_privacy_budget = 10.0    # Total privacy budget

    def train(self, training_data, learning_rate=0.01):
        """Train the model with differential privacy."""
        # Check the privacy budget before spending it
        if self.privacy_budget_used + self.epsilon > self.max_privacy_budget:
            raise ValueError("Privacy budget exhausted")

        # Apply DP to the inputs, then train normally on the noised data
        dp_data = apply_differential_privacy(training_data, epsilon=self.epsilon)
        self.model = self.build_model()  # build_model()/fit() assumed to exist
        self.model.fit(dp_data, learning_rate=learning_rate)

        # Record the budget spent
        self.privacy_budget_used += self.epsilon
        return {
            'privacy_budget_used': self.privacy_budget_used,
            'privacy_budget_remaining': self.max_privacy_budget - self.privacy_budget_used
        }

    def predict(self, x):
        """Make predictions (inference does not consume budget in this scheme)."""
        return self.model.predict(x)
Use Cases:
- Training on sensitive medical or financial data
- Public dataset release
- Government statistics
Tradeoff: More privacy = less model accuracy
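The tradeoff can be made concrete: for a fixed sensitivity, the Laplace noise scale grows as 1/ε, so halving ε roughly doubles the expected error of any released statistic. A minimal stdlib-only sketch of a differentially private mean (the helper names and the bounded-range assumption are illustrative, not from a DP library):

```python
import random

def laplace_noise(scale, rng=random.Random(0)):
    # Difference of two i.i.d. exponentials ~ Laplace(0, scale)
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_mean(values, epsilon, value_range=100.0):
    # Sensitivity of a mean over values bounded in [0, value_range]: range / n
    sensitivity = value_range / len(values)
    scale = sensitivity / epsilon
    return sum(values) / len(values) + laplace_noise(scale)

ages = [23, 35, 41, 52, 29, 60, 47, 38]
true_mean = sum(ages) / len(ages)
for eps in (0.1, 1.0, 10.0):
    err = abs(noisy_mean(ages, eps) - true_mean)
    print(f"epsilon={eps:5.1f}  |error| = {err:.2f}")
```

Running this repeatedly shows errors shrinking as ε grows: the privacy knob is directly an accuracy knob.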
Technique 2: Federated Learning
Train models without centralizing raw data.
How It Works
Instead of:
Hospital A sends data → Central server
Hospital B sends data → Central server
Hospital C sends data → Central server
Central server trains model on all data
Use Federated Learning:
Hospital A trains local model on local data → Sends only model weights
Hospital B trains local model on local data → Sends only model weights
Hospital C trains local model on local data → Sends only model weights
Central server averages model weights
Repeat until converged
Result: Hospital data never leaves hospital, only model updates are shared
Implementation
import numpy as np

class FederatedLearningCoordinator:
    def __init__(self, num_participants):
        self.num_participants = num_participants
        self.global_model = self.initialize_model()
        self.participant_models = {}
        self.round = 0

    def initialize_model(self):
        """Create the initial model architecture."""
        # SimpleNN is a placeholder for any model exposing
        # get_weights()/set_weights() and a layer_names attribute
        return SimpleNN(layers=[64, 32, 1])

    def send_model_to_participant(self, participant_id):
        """Package the current global model for a participant."""
        # Only model weights are sent, never data
        return {
            'model_weights': self.global_model.get_weights(),
            'round': self.round
        }

    def receive_updated_model(self, participant_id, updated_weights):
        """Receive locally trained weights from a participant."""
        self.participant_models[participant_id] = updated_weights

    def aggregate_models(self):
        """Aggregate all participant models into the global model."""
        # Simple unweighted averaging (the FedAvg algorithm)
        aggregated_weights = {}
        for layer_name in self.global_model.layer_names:
            all_weights = [
                self.participant_models[pid][layer_name]
                for pid in self.participant_models
            ]
            # Average weights across participants
            aggregated_weights[layer_name] = np.mean(all_weights, axis=0)
        self.global_model.set_weights(aggregated_weights)

    def federated_learning_round(self):
        """Execute one round of federated learning."""
        # 1. Send the global model to all participants (secure channel)
        for pid in range(self.num_participants):
            model_data = self.send_model_to_participant(pid)
        # 2. Participants train locally (on their own devices)
        # 3. Participants send updated weights back
        #    (delivered via receive_updated_model)
        # 4. Aggregate into a new global model
        self.aggregate_models()
        self.round += 1
        return {
            'round': self.round,
            'global_model_accuracy': self.evaluate_global_model()
        }
Use Cases:
- Training on data spread across organizations
- Mobile device machine learning
- Preserving data sovereignty
Advantages:
- Raw data never leaves participant
- Regulatory compliance
- Better for privacy-sensitive domains
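The aggregation step at the heart of FedAvg is just an element-wise mean over participants' weights. A stdlib-only sketch with made-up weight vectors for three hospitals (real models would have per-layer tensors, but the arithmetic is identical):

```python
def fedavg(updates):
    """Element-wise average of participants' flat weight vectors."""
    n = len(updates)
    return [sum(w[i] for w in updates) / n for i in range(len(updates[0]))]

# Hypothetical weight vectors after one round of local training
hospital_a = [0.2, 0.4, 0.6]
hospital_b = [0.4, 0.4, 0.2]
hospital_c = [0.6, 0.1, 0.1]

global_weights = [round(w, 6) for w in fedavg([hospital_a, hospital_b, hospital_c])]
print(global_weights)  # → [0.4, 0.3, 0.3]
```

Note that the coordinator only ever sees these weight vectors, never the patient records that produced them.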
Technique 3: Secure Enclaves
Use trusted execution environments (TEE) for sensitive processing.
How It Works
Untrusted Environment:
[Operating System, Other Applications]
├─ Can see encrypted data
├─ Can't see what's happening inside enclave
Trusted Enclave (Intel SGX, ARM TrustZone):
├─ AI model execution happens here
├─ Data is decrypted only inside enclave
├─ Attestation proves code hasn't been tampered
├─ Results encrypted before leaving enclave
Conceptual Implementation
import hashlib
from datetime import datetime

class SecurityError(Exception):
    pass

class SecureEnclaveMLProcessor:
    def __init__(self, enclave_binary_path):
        # In practice this would use Intel SGX or similar;
        # here we only simulate the concept
        self.enclave_verified = self.verify_enclave(enclave_binary_path)

    def compute_file_hash(self, binary_path):
        """SHA-256 hash of the enclave binary (its 'measurement')."""
        with open(binary_path, 'rb') as f:
            return hashlib.sha256(f.read()).hexdigest()

    def verify_enclave(self, binary_path):
        """Verify the enclave code hasn't been tampered with."""
        expected_hash = "abc123..."  # Hardcoded expected measurement
        actual_hash = self.compute_file_hash(binary_path)
        return actual_hash == expected_hash

    def process_sensitive_data(self, encrypted_data, encrypted_model):
        """Process data inside the secure enclave."""
        if not self.enclave_verified:
            raise SecurityError("Enclave verification failed")
        # In a real deployment, all of the following happens inside the TEE
        # 1. Decrypt inside the enclave only
        data = self.decrypt_inside_enclave(encrypted_data)
        model = self.decrypt_inside_enclave(encrypted_model)
        # 2. Process
        result = model.predict(data)
        # 3. Encrypt before anything leaves the enclave
        return self.encrypt_inside_enclave(result)

    def attestation_report(self):
        """Generate an attestation report proving enclave integrity."""
        return {
            'enclave_code_hash': "abc123...",
            'timestamp': datetime.now(),
            'signature': "..."  # Cryptographically signed by the hardware
        }
Use Cases:
- Processing highly sensitive data (government, medical)
- Cross-organization AI without data sharing
- Supply chain verification
Limitations:
- Hardware dependent (Intel SGX, ARM TrustZone)
- Side-channel attacks possible
- Performance overhead
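Attestation ultimately rests on comparing a cryptographic measurement of the loaded code against a known-good value recorded at build time. The core check can be sketched with stdlib hashing alone (the code bytes and version strings here are purely illustrative):

```python
import hashlib

def measure(code: bytes) -> str:
    """Return a SHA-256 'measurement' of enclave code."""
    return hashlib.sha256(code).hexdigest()

trusted_code = b"model-serving enclave v1.0"
expected = measure(trusted_code)  # Recorded at build time

# At load time, re-measure the binary and compare
assert measure(b"model-serving enclave v1.0") == expected       # untampered
assert measure(b"model-serving enclave v1.0-evil") != expected  # tampered
print("attestation check passed")
```

Real TEEs perform this measurement in hardware and sign the result, so a remote party can verify it without trusting the host OS.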
Technique 4: Homomorphic Encryption
Perform computation on encrypted data without decryption.
How It Works
Traditional:
Decrypt → Compute → Encrypt
Homomorphic Encryption:
Encrypt(data) → Compute on encrypted → Decrypt(result)
The decrypted result matches what you would get by running the same computation directly on the plaintext.
Conceptual Implementation
class HomomorphicEncryption:
    def __init__(self):
        # In practice, use a library such as Microsoft SEAL (BFV/CKKS schemes);
        # this is a simplified illustration with placeholder internals
        self.public_key, self.secret_key = self.generate_keys()

    def encrypt(self, plaintext):
        """Encrypt a number for HE computation."""
        # In real HE, the ciphertext supports addition/multiplication
        return self.apply_homomorphic_encryption(plaintext)

    def add_encrypted(self, encrypted_a, encrypted_b):
        """Add two encrypted numbers without decryption."""
        # Mathematical property of HE:
        # Decrypt(Encrypt(a) + Encrypt(b)) == a + b
        return encrypted_a + encrypted_b

    def multiply_encrypted(self, encrypted_a, encrypted_b):
        """Multiply two encrypted numbers without decryption."""
        return encrypted_a * encrypted_b

    def decrypt(self, ciphertext):
        """Decrypt a result with the secret key."""
        return self.apply_homomorphic_decryption(ciphertext)

# Usage example
he = HomomorphicEncryption()

# Encrypt data
patient_age_encrypted = he.encrypt(35)
treatment_cost_encrypted = he.encrypt(5000)

# Compute on encrypted data: adjusted_cost = cost * (1 + age/100)
# (HE schemes have no native division, so multiply by an encrypted 0.01)
one_encrypted = he.encrypt(1)
age_factor_encrypted = he.multiply_encrypted(patient_age_encrypted, he.encrypt(0.01))
multiplier_encrypted = he.add_encrypted(one_encrypted, age_factor_encrypted)
adjusted_cost_encrypted = he.multiply_encrypted(
    treatment_cost_encrypted,
    multiplier_encrypted
)

# Decrypt result
adjusted_cost = he.decrypt(adjusted_cost_encrypted)
# adjusted_cost = 5000 * (1 + 35/100) = 6750
Use Cases:
- Cloud processing of encrypted data
- Privacy-preserving analytics
- Secure computation across organizations
Current Limitations:
- Slow (thousands of times slower than unencrypted)
- Only some operations possible
- Rapidly improving
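The "compute on ciphertexts" property is easiest to see in textbook RSA, which is multiplicatively homomorphic: the product of two ciphertexts decrypts to the product of the plaintexts. A runnable toy sketch with tiny, deliberately insecure parameters (unpadded RSA must never be used in practice):

```python
# Toy textbook-RSA parameters (insecure, for illustration only)
p, q = 61, 53
n = p * q          # 3233
e, d = 17, 2753    # public / private exponents, e*d ≡ 1 (mod lcm(p-1, q-1))

def encrypt(m):
    return pow(m, e, n)

def decrypt(c):
    return pow(c, d, n)

# Multiplicative homomorphism: Enc(a) * Enc(b) decrypts to a * b (mod n)
a, b = 7, 6
product_ciphertext = (encrypt(a) * encrypt(b)) % n
print(decrypt(product_ciphertext))  # → 42
```

Fully homomorphic schemes (BGV, BFV, CKKS) extend this idea to support both addition and multiplication, at the performance cost noted above.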
Regulatory Compliance Integration
These techniques help meet regulatory requirements:
class PrivacyComplianceHelper:
    def __init__(self, regulations=('GDPR', 'CCPA', 'HIPAA')):
        self.regulations = list(regulations)
        self.privacy_techniques = {
            'differential_privacy': self.config_dp,
            'federated_learning': self.config_fl,
            'secure_enclaves': self.config_se,
            'homomorphic_encryption': self.config_he
        }

    def get_recommended_techniques(self, use_case):
        """Recommend privacy techniques for compliance with a use case."""
        recommendations = {
            'medical_records': ['secure_enclaves', 'differential_privacy'],
            'financial_data': ['homomorphic_encryption', 'secure_enclaves'],
            'cross_org_training': ['federated_learning'],
            'public_statistics': ['differential_privacy'],
        }
        return recommendations.get(use_case, [])

    def config_dp(self):
        """Configure differential privacy for GDPR."""
        return {
            'epsilon': 1.0,  # Strong privacy
            'delta': 1e-6,
            'mechanism': 'Laplace',
            'purpose': 'GDPR Article 32 - encryption/obfuscation'
        }

    def config_fl(self):
        """Configure federated learning for data minimization."""
        return {
            'purpose': 'GDPR Article 5 - data minimization',
            'raw_data_centralization': False,
            'secure_aggregation': True
        }

    def config_se(self):
        """Configure secure enclaves (placeholder)."""
        return {'attestation_required': True}

    def config_he(self):
        """Configure homomorphic encryption (placeholder)."""
        return {'scheme': 'CKKS'}
Key Takeaway
Privacy-preserving AI techniques (differential privacy, federated learning, secure enclaves, homomorphic encryption) enable training and inference on sensitive data without exposing the raw data. Each has tradeoffs among privacy, accuracy, and performance. Choose based on your use case and regulatory requirements.
Exercise: Implement Privacy-Preserving ML
- Implement differential privacy: Train a model with DP and measure accuracy loss
- Simulate federated learning: Train distributed models and aggregate weights
- Evaluate techniques: For your use case, which technique is most appropriate?
- Compliance mapping: Map your chosen technique to regulatory requirements
- Document tradeoffs: Privacy vs accuracy vs performance for your scenario
Next Phase: Intermediate - AI Red Teaming and Advanced Testing.