Post-Incident Recovery
Overview
After an AI incident is detected and contained, organizations must focus on recovery: restoring normal operations safely, remediating affected individuals, updating systems and policies to prevent recurrence, and learning from the incident.
Recovery Phases
Phase 1: Stabilization (0-24 hours)
Objectives:
- Stabilize systems in safe state
- Prevent further damage
- Maintain current state for forensics
- Determine the root cause well enough to choose interim measures
Key Activities:
Stabilization Phase:
System Assessment:
- "Verify system state and integrity"
- "Run diagnostic tests to confirm stability"
- "Compare system behavior to expected baselines"
- "Check for new alerts or anomalies"
- "Assess whether reverting changes is safe"
Forensic Readiness:
- "Preserve all logs and evidence"
- "Document current system state"
- "Create forensic snapshots"
- "Verify evidence integrity"
- "Prepare forensic analysis environment"
Interim Controls:
- "Maintain human oversight if system is back in production"
- "Increase monitoring frequency"
- "Set up escalation procedures for anomalies"
- "Prepare contingency procedures (human review, service slowdown)"
Root Cause Hypothesis:
- "Based on forensics, develop hypothesis about what happened"
- "Identify most likely root cause among possibilities"
- "Plan targeted investigation"
- "Design tests to confirm or refute hypothesis"
Stakeholder Communication:
- "Provide initial incident summary to leadership"
- "Outline recovery timeline and phases"
- "Establish regular update cadence"
- "Prepare for potential regulatory notifications"
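The interim-control activities above can be sketched as a simple anomaly-escalation monitor. This is a minimal illustration, not a production implementation: the threshold value and the idea of collecting anomalies into escalation batches are assumptions for the example.

```python
from dataclasses import dataclass, field

@dataclass
class InterimMonitor:
    """Heightened monitoring while a system runs under interim controls."""
    anomaly_threshold: int = 3                      # anomalies before escalation (illustrative)
    anomalies: list = field(default_factory=list)   # anomalies in the current window
    escalations: list = field(default_factory=list) # batches that triggered escalation

    def record_anomaly(self, description: str) -> None:
        """Record one anomaly; escalate once the threshold is reached."""
        self.anomalies.append(description)
        if len(self.anomalies) >= self.anomaly_threshold:
            self.escalate()

    def escalate(self) -> None:
        """In production this would page the on-call responder and trigger
        contingency procedures (human review, service slowdown)."""
        self.escalations.append(list(self.anomalies))
        self.anomalies.clear()

monitor = InterimMonitor(anomaly_threshold=2)
monitor.record_anomaly("latency spike")
monitor.record_anomaly("confidence drop")
print(len(monitor.escalations))  # 1 -- threshold reached, escalation recorded
```

A lower threshold than normal operations would use is the point: during interim monitoring, the cost of a false escalation is far smaller than the cost of a missed recurrence.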
Phase 2: Investigation and Root Cause Analysis (24-72 hours)
Objectives:
- Definitively identify root cause
- Determine full scope of impact
- Assess systemic vulnerabilities
- Plan comprehensive remediation
Root Cause Analysis Framework:
# Root Cause Analysis Framework
class RootCauseAnalyzer:
    def conduct_5why_analysis(self, immediate_cause: str) -> dict:
        """Conduct a 5-Why analysis to find the root cause"""
        # Example chain: model accuracy degradation
        # Why 1: A new version of the model was deployed
        # Why 2: The testing procedure failed to catch the degradation
        # Why 3: The test data didn't represent the recent user distribution
        # Why 4: The test data refresh procedure wasn't automated
        # Why 5: No one was accountable for keeping test data fresh
        causes = [
            immediate_cause,
            "New model version deployed without proper testing",
            "Testing procedure failed to detect performance regression",
            "Test data became stale and unrepresentative",
            "No process to refresh test data automatically",
            "No clear ownership of test data quality"
        ]
        why_analysis = {
            'level_1': {
                'question': 'Why did this problem occur?',
                'answer': immediate_cause,
                'evidence': self.gather_evidence_for_cause(immediate_cause)
            }
        }
        # Each subsequent level asks "why" of the previous answer
        for i, cause in enumerate(causes[1:], start=2):
            why_analysis[f'level_{i}'] = {
                'question': 'Why did that happen?',
                'answer': cause,
                'evidence': self.gather_evidence_for_cause(cause)
            }
        # The root cause is typically at level 4 or 5
        return {
            'why_analysis': why_analysis,
            'root_cause': causes[-1],
            'contributing_factors': causes[1:-1]
        }

    def gather_evidence_for_cause(self, cause: str) -> list:
        """Gather supporting evidence for an identified cause"""
        # Would search logs, interview notes, and documentation
        return []

    def assess_systemic_vulnerabilities(self, root_cause: str) -> dict:
        """Assess broader systemic issues revealed by this incident"""
        # Example: if the root cause is "no automated test data refresh",
        # the systemic vulnerability is "ad hoc data operations without
        # automation" -- similar risks exist in data labeling, model
        # training, monitoring setup, etc.
        return {
            'process_failures': [],
            'technical_gaps': [],
            'organizational_issues': [],
            'policy_gaps': []
        }
Impact Assessment:
Impact Assessment Checklist:
Scope of Impact:
System Impact:
- "Which AI systems/models were affected?"
- "How many decisions/predictions were incorrect?"
- "What percentage of operations were impacted?"
- "Were other systems impacted indirectly?"
User/Customer Impact:
- "How many users were affected?"
- "What was the nature of the impact (incorrect decisions)?"
- "What were the consequences for users?"
- "Can impacted users be identified?"
Organizational Impact:
- "Financial impact (lost transactions, refunds, fines)?"
- "Reputational impact?"
- "Regulatory consequences?"
- "Customer relationships affected?"
Individual/Subject Impact:
- "Were protected individuals discriminated against?"
- "Was personal data compromised?"
- "Could individuals suffer economic/legal harm?"
- "Are notifications required?"
Duration of Impact:
- "When did problem begin?"
- "When was it detected?"
- "When was it fixed?"
- "Timeline: ____ hours/days of incorrect operation"
Scope Quantification:
- "Total affected decisions: ____"
- "Affected users: ____"
- "Affected individuals: ____"
- "Financial impact: $ ____"
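The quantification fields above can be filled in mechanically from decision logs. The sketch below assumes each log record carries `timestamp`, `user_id`, `correct`, and `value` fields -- an illustrative schema, not a real one:

```python
from datetime import datetime, timedelta

def quantify_impact(decisions: list, incident_start: datetime,
                    incident_end: datetime) -> dict:
    """Aggregate incident impact figures from a list of decision records.

    Assumes each record is a dict with 'timestamp', 'user_id',
    'correct' (bool), and 'value' (monetary exposure) -- illustrative fields.
    """
    affected = [d for d in decisions
                if incident_start <= d['timestamp'] <= incident_end
                and not d['correct']]
    return {
        'total_affected_decisions': len(affected),
        'affected_users': len({d['user_id'] for d in affected}),
        'financial_impact': sum(d['value'] for d in affected),
        'impact_duration_hours': (incident_end - incident_start).total_seconds() / 3600,
    }

start = datetime(2024, 1, 1)
end = start + timedelta(hours=6)
log = [
    {'timestamp': start + timedelta(hours=1), 'user_id': 'u1', 'correct': False, 'value': 120.0},
    {'timestamp': start + timedelta(hours=2), 'user_id': 'u1', 'correct': True,  'value': 80.0},
    {'timestamp': start + timedelta(hours=3), 'user_id': 'u2', 'correct': False, 'value': 200.0},
]
print(quantify_impact(log, start, end))
# 2 affected decisions, 2 affected users, $320 exposure over 6 hours
```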
Phase 3: Remediation (1-2 weeks)
Objectives:
- Fix technical vulnerabilities
- Remediate affected individuals
- Update policies and processes
- Improve detection and prevention
Remediation Strategy:
Remediation Components:
Technical Fixes:
Code/Model Remediation:
- "Patch identified vulnerability in code"
- "Retrain model with corrected data/process"
- "Remove poisoned data from training set"
- "Update model with debiasing techniques"
- "Implement additional safety checks"
Data Remediation:
- "Clean corrupted or poisoned data"
- "Retrain models with clean data"
- "Validate data quality and integrity"
- "Implement data quality monitoring"
Process Remediation:
- "Implement automated testing that would have caught issue"
- "Add data freshness checks"
- "Improve CI/CD pipeline validation"
- "Add pre-deployment verification steps"
Monitoring/Detection:
- "Implement automated detection for similar issues"
- "Tighten alerting thresholds"
- "Add metrics to catch future degradation"
- "Increase monitoring frequency/coverage"
Remediation of Affected Individuals:
Assessment:
- "Identify all individuals harmed by incident"
- "Determine nature and scope of harm"
- "Calculate appropriate remedy"
Notification:
- "Notify affected individuals of incident and impact"
- "Explain what happened and why"
- "Describe steps taken to prevent recurrence"
- "Offer remedy (reversal of decision, compensation, etc.)"
Remediation Actions:
- "Reverse incorrect decisions (e.g., approve a wrongly denied loan)"
- "Offer the service or opportunity that was delayed or missed"
- "Provide compensation for damages"
- "Remove adverse records (e.g., correct credit bureau reports)"
Documentation:
- "Maintain records of notifications sent"
- "Document remediation actions taken"
- "Keep evidence of compliance with obligations"
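The assessment, notification, remediation, and documentation steps above imply one audit record per affected individual. A minimal sketch of such a record (field names are illustrative assumptions):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class RemediationRecord:
    """Audit record for one affected individual (illustrative fields)."""
    individual_id: str
    harm: str
    remedy: str
    notified_at: Optional[datetime] = None    # when the notification was sent
    remediated_at: Optional[datetime] = None  # when the remedy was delivered

    def is_complete(self) -> bool:
        """Both notification and remediation must be documented to close out."""
        return self.notified_at is not None and self.remediated_at is not None

record = RemediationRecord('subj-001', 'loan incorrectly denied', 'decision reversed')
record.notified_at = datetime(2024, 3, 1)
print(record.is_complete())  # False -- remedy not yet documented
record.remediated_at = datetime(2024, 3, 5)
print(record.is_complete())  # True
```

Keeping notification and remediation as separate timestamps preserves evidence that both obligations were met, not just one.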
Policy and Procedural Updates:
Root Cause Prevention:
- "Update policies to prevent root cause"
- "Clarify responsibilities and accountability"
- "Add approval requirements for high-risk changes"
- "Establish governance for model updates"
Detection and Response:
- "Add incident type to incident response playbook"
- "Update escalation procedures"
- "Improve monitoring and alerting"
- "Add detection rules to security tools"
Communication and Transparency:
- "Update disclosure statements to users"
- "Document incident in AI system documentation"
- "Prepare regulatory filings if required"
- "Update training materials"
Verification of Remediation:
- "Test that fix resolves issue"
- "Verify monitoring/detection is working"
- "Confirm policy updates are in effect"
- "Assess effectiveness of remediation"
Phase 4: Return to Normal Operations (1-4 weeks)
Objectives:
- Restore full normal operations
- Validate system stability
- Reduce emergency response level
- Transition to standard monitoring
Return to Normal Checklist:
Technical Readiness:
☐ "Technical fix deployed and tested"
☐ "Monitoring confirms normal behavior"
☐ "Performance metrics back to baseline"
☐ "No new anomalies detected"
☐ "System stability confirmed over 1 week"
Data Readiness:
☐ "Data quality validated"
☐ "Training/test data refreshed"
☐ "Data integrity confirmed"
☐ "No data quality issues detected"
Operational Readiness:
☐ "Human oversight levels normalized"
☐ "Standard monitoring reestablished"
☐ "Incident response status cleared"
☐ "Staff trained on changes"
Regulatory/Compliance Readiness:
☐ "Regulatory notifications sent if required"
☐ "Remediation of affected individuals complete"
☐ "Incident documentation finalized"
☐ "Audit trail maintained for review"
Communication Complete:
☐ "Affected users notified of resolution"
☐ "Public communication (if needed)"
☐ "Staff communication"
☐ "Executive briefing"
Approval for Return to Normal:
- "Incident Commander: _____"
- "Technical Lead: _____"
- "Compliance Officer: _____"
- "Date: _____"
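The approval step above is a gate: every checklist section must be complete and every required role must sign off. A minimal sketch (the role names mirror the sign-off lines above; the checklist keys are illustrative):

```python
REQUIRED_SIGNOFFS = {'incident_commander', 'technical_lead', 'compliance_officer'}

def approve_return_to_normal(checklist: dict, signoffs: set) -> bool:
    """Approve only if all checklist items are done and all roles signed off."""
    return all(checklist.values()) and REQUIRED_SIGNOFFS.issubset(signoffs)

checklist = {'technical': True, 'data': True, 'operational': True,
             'compliance': True, 'communication': True}

print(approve_return_to_normal(checklist, {'incident_commander', 'technical_lead'}))
# False -- compliance officer has not signed off
print(approve_return_to_normal(
    checklist,
    {'incident_commander', 'technical_lead', 'compliance_officer'}))
# True
```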
Data Cleanup and Management
Identifying Data to Clean or Delete
Post-incident data management involves:
Data Cleanup Categories:
Poisoned Training Data:
Identification:
- "Data that was maliciously modified or injected"
- "Data that caused incorrect model behavior"
- "Data outside normal distribution"
Action:
- "Remove from training dataset"
- "Retrain model without poisoned data"
- "Validate retraining effectiveness"
Stale or Unrepresentative Data:
Identification:
- "Test data that no longer represents user population"
- "Training data from outdated market conditions"
- "Data from process changes that went undocumented"
Action:
- "Refresh with current, representative data"
- "Retrain model with updated data"
- "Validate accuracy with new data"
Sensitive Data from Logs:
Identification:
- "Personal information captured in decision logs"
- "Sensitive inputs retained for debugging"
- "Credentials or secrets logged"
Action:
- "Encrypt sensitive data in logs"
- "Delete unnecessary sensitive log retention"
- "Implement privacy-preserving logging"
Data Used for Forensics:
Retention:
- "Keep forensic evidence long enough for investigation"
- "Retain for compliance requirements"
- "Archive for future audits"
Cleanup Timeline:
- "After investigation complete: archive to cold storage"
- "Per retention policy: delete when retention expires"
- "Document what was deleted and why"
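The retention and cleanup timeline above reduces to a small decision rule: retain while the investigation is open, archive after it closes, delete (with documentation) once the retention period expires. A sketch, with a 365-day retention period as an illustrative assumption:

```python
from datetime import datetime, timedelta

def forensic_cleanup_action(evidence_created: datetime,
                            investigation_closed: bool,
                            retention_days: int = 365,
                            now: datetime = None) -> str:
    """Decide the disposition of forensic evidence (retention period illustrative)."""
    now = now or datetime.now()
    if not investigation_closed:
        return 'RETAIN'  # never delete evidence mid-investigation
    if now - evidence_created >= timedelta(days=retention_days):
        return 'DELETE_AND_DOCUMENT'  # record what was deleted and why
    return 'ARCHIVE_COLD_STORAGE'

created = datetime(2023, 1, 1)
print(forensic_cleanup_action(created, False, now=datetime(2024, 6, 1)))  # RETAIN
print(forensic_cleanup_action(created, True, now=datetime(2023, 3, 1)))   # ARCHIVE_COLD_STORAGE
print(forensic_cleanup_action(created, True, now=datetime(2024, 6, 1)))   # DELETE_AND_DOCUMENT
```

The actual retention period should come from your compliance requirements, not a constant in code.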
Data Deletion Procedures
Safe deletion is critical to avoid loss of legitimate data:
# Safe Data Deletion Procedure
from datetime import datetime

class SafeDataDeletion:
    DELETION_SIZE_THRESHOLD = 10000  # Require approval for large deletions

    def delete_compromised_data(self, data_set_name: str, date_range: tuple):
        """Safely delete compromised data with verification"""
        deletion_record = {
            'dataset': data_set_name,
            'date_range': date_range,
            'timestamp': datetime.now(),
            'steps': []
        }
        # Step 1: Create backup
        backup_id = self.create_backup(data_set_name)
        deletion_record['steps'].append({
            'step': 'Backup Created',
            'backup_id': backup_id
        })
        # Step 2: Identify data to delete
        data_to_delete = self.identify_records(data_set_name, date_range)
        deletion_record['steps'].append({
            'step': 'Records Identified',
            'count': len(data_to_delete),
            'record_ids': data_to_delete[:100]  # Log the first 100
        })
        # Step 3: Verify deletion scope (safety check)
        if len(data_to_delete) > self.DELETION_SIZE_THRESHOLD:
            # Require additional approval for large deletions
            approval = self.request_approval(
                f"Delete {len(data_to_delete)} records from {data_set_name}?"
            )
            if not approval:
                return {'status': 'CANCELLED', 'reason': 'Approval Denied'}
        # Step 4: Delete records
        deleted_count = self.delete_records(data_set_name, data_to_delete)
        deletion_record['steps'].append({
            'step': 'Records Deleted',
            'count': deleted_count
        })
        # Step 5: Verify deletion
        remaining = self.verify_deletion(data_set_name, data_to_delete)
        if remaining > 0:
            raise Exception(f"Deletion verification failed: {remaining} records remain")
        deletion_record['steps'].append({
            'step': 'Deletion Verified',
            'remaining_count': remaining
        })
        # Step 6: Update related systems
        self.update_models(data_set_name, data_to_delete)
        deletion_record['steps'].append({'step': 'Models Updated'})
        # Step 7: Log deletion
        self.log_deletion(deletion_record)
        return {
            'status': 'SUCCESS',
            'deleted_count': deleted_count,
            'deletion_id': deletion_record['timestamp']
        }

    def create_backup(self, dataset_name):
        """Create a backup before deletion"""
        return f"backup_{dataset_name}_{datetime.now().isoformat()}"

    def identify_records(self, dataset_name, date_range):
        """Identify records within the date range"""
        # Query records within the date range
        return []

    def request_approval(self, message):
        """Request human approval for a large deletion"""
        # Would route to an approval workflow
        return False

    def delete_records(self, dataset_name, record_ids):
        """Delete the specified records"""
        return len(record_ids)

    def verify_deletion(self, dataset_name, deleted_ids):
        """Verify the records were deleted"""
        # Query to confirm deletion
        return 0

    def update_models(self, dataset_name, deleted_ids):
        """Update models that used the deleted data"""
        # Retrain or update models
        pass

    def log_deletion(self, deletion_record):
        """Log the deletion for audit"""
        # Write to the deletion log
        pass
Model Redeployment
Testing Before Redeployment
Remediated models must pass extensive testing:
Pre-Deployment Testing for Remediated Models:
Functionality Testing:
- "Basic predictions work correctly"
- "System doesn't crash with edge cases"
- "Latency and throughput meet requirements"
- "Inference succeeds for all input types"
Accuracy and Performance:
- "Accuracy meets minimum specification"
- "Performance is comparable to previous version"
- "Performance is stable across time"
- "No unexpected performance regressions"
Fairness and Bias:
- "No disparate impact detected"
- "Performance parity across demographic groups"
- "Fairness metrics meet regulatory requirements"
- "No known proxies for discrimination"
Robustness:
- "Handles adversarial examples gracefully"
- "Graceful degradation under load"
- "Resilience to data distribution shifts"
- "Appropriate confidence/uncertainty quantification"
Security:
- "Model integrity verified (checksum/signing)"
- "No known vulnerabilities in dependencies"
- "Resistant to known attack vectors"
- "Secure against prompt injection (if applicable)"
Compliance:
- "Meets all regulatory requirements"
- "Documentation complete and accurate"
- "Training for operators completed"
- "Monitoring configured correctly"
Production Validation:
- "Canary deployment to subset of users"
- "Monitor canary performance for 24 hours"
- "User feedback on canary deployment"
- "Gradual rollout upon success"
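The canary comparison in the production-validation step can be expressed as a simple health check against the control cohort. The metric names and thresholds below are illustrative assumptions, not recommended values:

```python
def canary_healthy(canary: dict, control: dict,
                   max_accuracy_drop: float = 0.01,
                   max_error_ratio: float = 1.2) -> bool:
    """Compare canary metrics against the control cohort (thresholds illustrative)."""
    if control['accuracy'] - canary['accuracy'] > max_accuracy_drop:
        return False  # accuracy regression beyond tolerance
    if canary['error_rate'] > control['error_rate'] * max_error_ratio:
        return False  # error rate spiked relative to control
    return True

control = {'accuracy': 0.945, 'error_rate': 0.010}
print(canary_healthy({'accuracy': 0.940, 'error_rate': 0.011}, control))  # True
print(canary_healthy({'accuracy': 0.920, 'error_rate': 0.011}, control))  # False
```

Comparing against a live control cohort rather than a fixed baseline matters here: it separates regressions caused by the remediated model from ambient shifts in traffic.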
Deployment Strategy
Safe Deployment of Remediated Model:
Phase 1: Pre-Deployment (1 week before)
- "Model training complete and tested"
- "Release notes and documentation ready"
- "Deployment plan reviewed and approved"
- "Rollback plan prepared"
- "Monitoring dashboards configured"
Phase 2: Canary Deployment (First 24 hours)
- "Deploy to 1-5% of traffic"
- "Monitor key metrics closely"
- "Compare canary vs control performance"
- "Measure user feedback/complaints"
- "Assess any anomalies"
Phase 3: Gradual Rollout (Days 2-7)
- "Increase traffic to remediated model gradually"
- "Monitor metrics at each step"
- "Be ready to halt and rollback if issues"
- "Maintain control group for comparison"
Phase 4: Full Deployment (Week 2)
- "100% traffic to remediated model"
- "Decommission old model"
- "Continue monitoring"
- "Archive previous version for rollback"
Rollback Triggers:
- "Accuracy drops below minimum spec"
- "Fairness metrics degrade"
- "User complaints spike"
- "System errors or crashes"
- "Performance degradation"
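The rollback triggers above can be evaluated automatically at each rollout step. A sketch, with illustrative metric names and thresholds (the "complaint spike" rule here assumes a 2x baseline multiplier):

```python
def tripped_rollback_triggers(metrics: dict, spec: dict) -> list:
    """Return which rollback triggers have fired (thresholds illustrative)."""
    triggers = []
    if metrics['accuracy'] < spec['min_accuracy']:
        triggers.append('accuracy_below_spec')
    if metrics['fairness_gap'] > spec['max_fairness_gap']:
        triggers.append('fairness_degraded')
    if metrics['complaint_rate'] > 2 * spec['baseline_complaint_rate']:
        triggers.append('complaint_spike')
    if metrics['error_rate'] > spec['max_error_rate']:
        triggers.append('system_errors')
    return triggers  # any non-empty result should halt the rollout

spec = {'min_accuracy': 0.90, 'max_fairness_gap': 0.05,
        'baseline_complaint_rate': 0.002, 'max_error_rate': 0.01}
metrics = {'accuracy': 0.88, 'fairness_gap': 0.03,
           'complaint_rate': 0.005, 'error_rate': 0.004}
print(tripped_rollback_triggers(metrics, spec))
# ['accuracy_below_spec', 'complaint_spike']
```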
Post-Mortem and Lessons Learned
Post-Mortem Framework
Post-Incident Review (Blameless Post-Mortem):
Review Timing:
- "Scheduled 1 week after incident resolution"
- "Before momentum of incident response fades"
- "After stabilization ensures clear perspective"
Participants:
- "Incident Commander"
- "Technical Lead"
- "Key team members involved in response"
- "Management (non-judgmental observer)"
- "Compliance/Legal (if relevant)"
Review Agenda:
1. Timeline Reconstruction:
- "When did incident begin?"
- "When was it detected?"
- "What was detection method?"
- "When was it contained?"
- "When was it resolved?"
- "Duration of impact: ___ hours"
2. Root Cause Analysis:
- "What was the root cause? (Not just symptoms)"
- "What were contributing factors?"
- "How could it have been prevented?"
- "Why weren't existing controls sufficient?"
3. Detection and Response:
- "How was incident detected?"
- "Did detection match planned procedure?"
- "What could have improved detection?"
- "Response execution: planned vs actual"
- "What worked well in response?"
- "What was challenging?"
4. Impact Analysis:
- "How many people/decisions affected?"
- "What was the harm?"
- "Regulatory implications?"
- "Reputational impact?"
5. Improvement Opportunities:
- "What prevention measures would help?"
- "How can detection be improved?"
- "Process changes needed?"
- "Training or staffing gaps?"
- "Technology improvements?"
6. Action Items:
- "Specific, measurable improvements"
- "Clear ownership and timeline"
- "Follow-up tracking"
7. Communication:
- "Summarize findings for leadership"
- "Lessons for other teams?"
- "Update incident response playbooks"
- "Training or awareness needed?"
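The timeline figures asked for in item 1 of the agenda reduce to three durations: time to detect, time to resolve, and total impact. A minimal helper (the example dates are made up):

```python
from datetime import datetime

def timeline_metrics(began: datetime, detected: datetime,
                     resolved: datetime) -> dict:
    """Standard post-mortem timeline figures, expressed in hours."""
    return {
        'time_to_detect_hours': (detected - began).total_seconds() / 3600,
        'time_to_resolve_hours': (resolved - detected).total_seconds() / 3600,
        'total_impact_hours': (resolved - began).total_seconds() / 3600,
    }

m = timeline_metrics(datetime(2024, 5, 1, 8, 0),   # incident began
                     datetime(2024, 5, 1, 20, 0),  # incident detected
                     datetime(2024, 5, 3, 8, 0))   # incident resolved
print(m)  # 12h to detect, 36h to resolve, 48h of total impact
```

Tracking these three numbers across incidents is what turns individual post-mortems into a measurable trend for detection and response improvement.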
Creating Actionable Improvements
Post-mortem findings must lead to concrete improvements:
# Tracking Improvements from Incidents
from datetime import datetime, timedelta

class IncidentImprovementTracker:
    def create_improvement_initiatives(self, postmortem):
        """Convert post-mortem findings into tracked improvements"""
        improvements = []
        for finding in postmortem['improvement_opportunities']:
            improvement = {
                'id': f"IMP-{datetime.now().strftime('%Y%m%d%H%M%S')}",
                'incident_id': postmortem['incident_id'],
                'finding': finding,
                'improvement_type': self.categorize_improvement(finding),
                'owner': self.assign_owner(finding),
                'target_date': self.calculate_target_date(finding),
                'status': 'PLANNED',
                'acceptance_criteria': self.define_criteria(finding),
                'tracking': {
                    'created': datetime.now(),
                    'status_updates': []
                }
            }
            improvements.append(improvement)
        return improvements

    def categorize_improvement(self, finding):
        """Categorize the improvement type by keyword match"""
        categories = {
            'Detection': ['monitoring', 'alerting', 'logging'],
            'Prevention': ['test', 'policy', 'process'],
            'Response': ['playbook', 'procedure', 'automation'],
            'Training': ['awareness', 'skill', 'training']
        }
        finding_lower = finding.lower()
        for category, keywords in categories.items():
            if any(keyword in finding_lower for keyword in keywords):
                return category
        return 'Prevention'

    def assign_owner(self, finding):
        """Determine who will own the improvement"""
        # Would implement assignment logic
        return 'Team Lead'

    def calculate_target_date(self, finding):
        """Calculate a reasonable target date for the improvement"""
        # Detection: 1-2 weeks; Prevention: 2-4 weeks; Response: 3-4 weeks
        return datetime.now() + timedelta(weeks=2)

    def define_criteria(self, finding):
        """Define acceptance criteria"""
        return [
            "Improvement implemented",
            "Testing completed",
            "Documentation updated",
            "Training completed if needed",
            "Monitoring in place"
        ]

    def track_improvements(self):
        """Track completion of improvement initiatives"""
        # Weekly review of in-progress improvements
        # Monthly reporting to leadership
        # Archive completed improvements
        pass
Key Takeaway
Key Takeaway: Post-incident recovery involves stabilization, investigation, remediation, and return to normal operations. Organizations must remediate affected individuals, update policies and processes to prevent recurrence, and learn from incidents through blameless post-mortems. Continuous improvement based on incident learnings is essential for building increasingly resilient AI systems.
Exercise: Create Incident Recovery Plan
- Recovery phases: Document stages for your organization
- Remediation procedures: How will you fix technical issues?
- Affected individual remediation: How will you compensate those harmed?
- Data cleanup: Procedures for safe deletion of compromised data
- Post-mortem template: Create standard post-mortem structure
- Improvement tracking: System for tracking and completing improvements
Incident Response Module Complete