Continuous AI Security Monitoring
Overview
Continuous monitoring detects security issues in production AI systems before they cause harm. Unlike traditional security monitoring, which focuses on access logs and system metrics, AI monitoring must also track model behavior, decision patterns, and anomalies that may indicate attacks or degradation.
Production Monitoring Architecture
Monitoring Stack Components
```yaml
AI Security Monitoring Stack:

  Data Collection Layer:
    Application Instrumentation:
      - "Input logging (prompts, feature vectors)"
      - "Output logging (predictions, decisions)"
      - "Confidence and uncertainty scores"
      - "Latency and resource usage"
      - "Errors and exceptions"

    System Instrumentation:
      - "Model loading and initialization"
      - "Code and model deployments"
      - "System resource usage (CPU, memory, GPU)"
      - "Network connections and API calls"
      - "Access logs and authentication events"

    Infrastructure Monitoring:
      - "Container and orchestration events"
      - "Database query logs"
      - "Cache hit rates"
      - "Storage capacity and growth"

  Data Aggregation:
    Log Aggregation:
      - "Collect logs from all sources"
      - "Parse and structure logs"
      - "Filter sensitive information"
      - "Forward to central repository"
    Tools: "ELK Stack, Splunk, Datadog, CloudWatch"

    Time-Series Metrics:
      - "Extract numeric metrics from logs"
      - "Aggregate by time window"
      - "Normalize and standardize metrics"
      - "Enable time-series queries"
    Tools: "Prometheus, InfluxDB, Grafana"

    Event Stream:
      - "Real-time event processing"
      - "Pattern detection in streams"
      - "Windowed aggregations"
      - "Alert triggering"
    Tools: "Kafka, Apache Flink, AWS Kinesis"

  Analysis Layer:
    Anomaly Detection:
      - "Statistical anomalies (mean/std deviation)"
      - "Isolation forests for multivariate anomalies"
      - "Autoencoders for pattern anomalies"
      - "Time-series forecasting with ARIMA/Prophet"

    Security Analysis:
      - "Pattern matching against known attacks"
      - "Behavioral analysis of access patterns"
      - "Graph analysis of data flows"
      - "Policy compliance checking"

    Fairness and Bias Monitoring:
      - "Approval/decision rate by demographic"
      - "False positive/negative rate disparity"
      - "Confidence calibration by group"
      - "Drift in decision patterns"

    Performance Monitoring:
      - "Accuracy drift detection"
      - "Latency increase detection"
      - "Error rate tracking"
      - "Data quality metrics"

  Alerting and Escalation:
    Alert Generation:
      - "Threshold-based alerts"
      - "Anomaly-based alerts"
      - "Composite alerts (multiple conditions)"
      - "Alert deduplication and grouping"

    Escalation:
      - "Severity levels (info, warning, critical)"
      - "Routing by alert type"
      - "On-call team notification"
      - "Ticket creation"

  Visualization and Reporting:
    Dashboards:
      - "Real-time system health"
      - "Key metric trends"
      - "Alert history"
      - "Incident timeline"

    Reports:
      - "Daily summary reports"
      - "Weekly trend analysis"
      - "Monthly security review"
      - "Regulatory compliance reporting"
```
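The analysis layer's simplest technique, flagging a metric that strays too far from its mean (the "statistical anomalies" item above), can be sketched as follows; `detect_anomalies` is a hypothetical helper, not part of any tool listed:

```python
import statistics

def detect_anomalies(baseline_values, current_value, n_sigma=3.0):
    """Flag a metric sample that deviates more than n_sigma standard
    deviations from its baseline window (simple mean/std-dev check)."""
    mean = statistics.fmean(baseline_values)
    std = statistics.pstdev(baseline_values)
    if std == 0:
        return current_value != mean  # degenerate baseline: any change is anomalous
    z_score = abs(current_value - mean) / std
    return z_score > n_sigma
```

An error rate hovering near 0.02% that jumps to 1% produces an enormous z-score and fires immediately; in practice the baseline window would roll (e.g., the last 24 hours of samples).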
Monitoring Dashboard Design
Example: Credit Scoring System Monitoring Dashboard:

```yaml
Real-Time Metrics Section:
  System Health:
    - "Model Serving: 99.98% uptime"
    - "P99 Latency: 45ms"
    - "Requests/second: 12,500"
    - "Error rate: 0.02%"
  Performance:
    - "Accuracy (24h): 87.2% (baseline: 87.5%)"
    - "False positive rate: 12.1% (baseline: 11.8%)"
    - "False negative rate: 8.3% (baseline: 8.2%)"
    - "Approved: 42.3% (baseline: 41.5%)"
  Fairness:
    - "Approval rate (White): 45.2%"
    - "Approval rate (Black): 38.1%"
    - "Disparate impact ratio: 0.84 (threshold: 0.80)"
    - "Alert: Approaching regulatory threshold"

Historical Trends Section:
  Performance Over Time:
    - "Accuracy trend (30 days)"
    - "Latency trend (24 hours)"
    - "Error rate trend (7 days)"
  Fairness Monitoring:
    - "Disparate impact ratio trend (30 days)"
    - "Approval rate by group (trend)"
    - "Decision volume trend (24 hours)"

Alert Status Section:
  Current Alerts:
    - "WARNING: Disparate impact ratio trending down"
    - "INFO: Model retrained yesterday"
    - "RESOLVED: High latency spike (15 min duration)"
  Alert History:
    - "Last 24 hours alert timeline"
    - "Alert resolution times"
    - "Incident correlation"
  Human Review Metrics:
    - "Decisions reviewed by humans: 2,341 (18.7%)"
    - "Human override rate: 4.2%"
    - "Average review time: 2.3 minutes"
    - "Inter-rater agreement: 94.3%"
```
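The disparate impact ratio on the dashboard is conventionally the lowest group approval rate divided by the highest. A minimal sketch (the function and the group counts are hypothetical, chosen to match the dashboard's 45.2%/38.1% figures):

```python
def disparate_impact_ratio(approvals_by_group, totals_by_group):
    """Ratio of the lowest group approval rate to the highest.
    Values below ~0.80 commonly trigger regulatory scrutiny
    (the 'four-fifths rule')."""
    rates = {group: approvals_by_group[group] / totals_by_group[group]
             for group in totals_by_group}
    return min(rates.values()) / max(rates.values())
```

With 452 approvals out of 1,000 decisions in one group and 381 out of 1,000 in another, the ratio is 0.381 / 0.452, about 0.84, matching the dashboard.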
Alerting Strategy
Alert Design
Effective alerts balance detection sensitivity with false positive rates:
Alert Tuning Framework:

```yaml
Alert Type 1: Accuracy Degradation
  Detection:
    - "Accuracy drops > 3% from baseline in 1 hour"
    - "Confirmed by second metric (F1 score drops)"
  Severity: "HIGH"
  Actions:
    - "Page on-call ML engineer"
    - "Create incident ticket"
    - "Begin investigation"
    - "Prepare incident response team"
  False Positive Rate Target: "< 1%"
  Calibration:
    - "Baseline calculated over 30 days"
    - "Anomaly threshold: mean - 2*std"
    - "Require persistence (> 5 minutes) to alert"
```
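The persistence requirement in Alert Type 1's calibration can be sketched as a small stateful check (a hypothetical class; a 5-sample window stands in for the 5-minute persistence rule, and the 3% drop threshold is taken from the detection criteria above):

```python
from collections import deque

class AccuracyDegradationAlert:
    """Fire only when accuracy stays below the threshold for a full
    window of consecutive samples, suppressing transient dips."""
    def __init__(self, baseline_accuracy, drop_threshold=0.03,
                 persistence_samples=5):
        self.threshold = baseline_accuracy - drop_threshold
        self.persistence = persistence_samples
        self.recent = deque(maxlen=persistence_samples)

    def observe(self, accuracy):
        self.recent.append(accuracy < self.threshold)
        # Alert only when every sample in the window breaches the threshold
        return len(self.recent) == self.persistence and all(self.recent)
```

With a 87.5% baseline, a single dip to 83% does not fire; five consecutive samples below 84.5% do.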
```yaml
Alert Type 2: Disparate Impact Detection
  Detection:
    - "Disparate impact ratio drops below 0.80"
    - "Significant shift in approval rates by demographic"
  Severity: "CRITICAL"
  Actions:
    - "Immediately page compliance officer"
    - "Create high-priority incident"
    - "Prepare regulatory notification template"
    - "Initiate investigation"
  False Positive Rate Target: "< 0.5%"
  Calibration:
    - "Statistical significance testing (chi-square)"
    - "Minimum sample size (100 decisions per group)"
    - "Confirm with manual analysis"
```
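The chi-square calibration step might look like this with scipy (the function name, counts, and significance level are illustrative):

```python
from scipy.stats import chi2_contingency

def approval_disparity_significant(approved_a, denied_a,
                                   approved_b, denied_b, alpha=0.01):
    """Chi-square test of independence on a 2x2 contingency table of
    approval outcomes for two demographic groups. Returns True when
    the approval-rate difference is statistically significant."""
    table = [[approved_a, denied_a],
             [approved_b, denied_b]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value < alpha
```

A 45.2% vs 38.1% split over 1,000 decisions per group is highly significant; a 42.0% vs 41.0% split is not. Pairing the test with the minimum-sample-size rule above avoids alerting on noise from small windows.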
```yaml
Alert Type 3: Input Anomalies
  Detection:
    - "Sudden spike in feature values"
    - "Bimodal distribution in single feature"
    - "New categorical values unseen in training"
  Severity: "MEDIUM"
  Actions:
    - "Log anomaly for investigation"
    - "Increase human review for affected inputs"
    - "Alert data quality team"
  False Positive Rate Target: "< 5%"
  Calibration:
    - "Isolation Forest anomaly score > 0.9"
    - "Affected > 1% of requests in 5-minute window"
```
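A sketch of the Isolation Forest calibration: fit on recent normal traffic, then alert when more than 1% of a window is flagged. The per-request flagged fraction stands in for the 0.9 score cutoff, since raw anomaly-score scaling is library-specific; the function name is hypothetical:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def input_anomaly_alert(normal_inputs, window_inputs, fraction_threshold=0.01):
    """Fit an Isolation Forest on recent normal traffic, then alert if
    more than fraction_threshold of the current window is flagged."""
    forest = IsolationForest(random_state=0)
    forest.fit(normal_inputs)
    flagged = forest.predict(window_inputs) == -1  # -1 marks anomalies
    return flagged.mean() > fraction_threshold
```

A window of requests far outside the training distribution flags nearly every sample, easily clearing the 1% threshold.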
```yaml
Alert Type 4: Latency Spike
  Detection:
    - "P99 latency > 500ms (baseline: 50ms)"
    - "Sustained for > 2 minutes"
  Severity: "MEDIUM"
  Actions:
    - "Alert on-call operations"
    - "Check infrastructure metrics"
    - "Assess user impact"
  False Positive Rate Target: "< 3%"
  Calibration:
    - "3-sigma threshold from rolling 24-hour baseline"
    - "Require sustained elevation"

Alert Type 5: Model Serving Errors
  Detection:
    - "Error rate > 1%"
    - "Specific error type spike"
  Severity: "HIGH"
  Actions:
    - "Page on-call engineer"
    - "Begin error investigation"
    - "Prepare incident response"
  False Positive Rate Target: "< 0.5%"
  Calibration:
    - "Baseline: 0.02% error rate"
    - "50x increase = alert"
```
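Alert Type 4's rolling-baseline calibration can be sketched as follows (a hypothetical class; the window length assumes one P99 sample every 5 minutes, so 288 samples cover 24 hours):

```python
import statistics
from collections import deque

class LatencySpikeDetector:
    """Flag P99 latency samples above a 3-sigma threshold computed
    from a rolling baseline window."""
    def __init__(self, window_size=288, n_sigma=3.0):
        self.window = deque(maxlen=window_size)
        self.n_sigma = n_sigma

    def observe(self, p99_latency_ms):
        if len(self.window) >= 30:  # need enough history for a stable baseline
            mean = statistics.fmean(self.window)
            std = statistics.pstdev(self.window)
            breached = p99_latency_ms > mean + self.n_sigma * std
        else:
            breached = False
        self.window.append(p99_latency_ms)
        return breached
```

The "require sustained elevation" step would wrap this in the same persistence logic as the accuracy alert, firing only after several consecutive breaches.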
Drift Detection
Concept Drift Monitoring
Model performance degrades when the input data distribution changes (data drift) or when the relationship between features and the target changes (concept drift):
```python
# Drift Detection Implementation
from scipy import stats
from sklearn.ensemble import IsolationForest


class DriftDetector:
    def __init__(self, baseline_window_days=30):
        self.baseline_window = baseline_window_days
        self.baseline_stats = {}

    def establish_baseline(self, training_data):
        """Establish baseline statistics for drift comparison.

        training_data is a DataFrame whose label column is named 'target'.
        """
        features = training_data.drop(columns=['target'])
        self.baseline_stats = {
            'feature_distributions': self.compute_distributions(features),
            'target_distribution': training_data['target'].values,
            'correlations': self.compute_correlations(training_data),
            'outlier_rate': self.estimate_outlier_rate(features),
        }

    def detect_data_drift(self, current_data):
        """Detect whether the input (feature) distribution has shifted."""
        drift_detected = {}
        # Two-sample Kolmogorov-Smirnov test per feature
        for feature in current_data.columns:
            ks_stat, p_value = stats.ks_2samp(
                self.baseline_stats['feature_distributions'][feature],
                current_data[feature].values,
            )
            drift_detected[feature] = {
                'ks_statistic': ks_stat,
                'p_value': p_value,
                'drifted': p_value < 0.05,  # statistically significant shift
            }
        return drift_detected

    def detect_label_drift(self, current_labels):
        """Detect whether the output (label) distribution has shifted."""
        ks_stat, p_value = stats.ks_2samp(
            self.baseline_stats['target_distribution'],
            current_labels.values,
        )
        return {
            'ks_statistic': ks_stat,
            'p_value': p_value,
            'drifted': p_value < 0.05,
        }

    def detect_concept_drift(self, current_data, current_predictions):
        """Detect whether the feature-target relationship has changed."""
        current = current_data.copy()
        current['target'] = current_predictions.values
        current_correlations = self.compute_correlations(current)
        correlation_drift = {}
        for feature in current_data.columns:
            old_corr = self.baseline_stats['correlations'].get(feature, 0)
            new_corr = current_correlations.get(feature, 0)
            correlation_change = abs(new_corr - old_corr)
            correlation_drift[feature] = {
                'old_correlation': old_corr,
                'new_correlation': new_corr,
                'change': correlation_change,
                'significant': correlation_change > 0.1,
            }
        return correlation_drift

    def compute_distributions(self, data):
        """Store raw feature values for later two-sample tests."""
        return {col: data[col].values for col in data.columns}

    def compute_correlations(self, data):
        """Compute each feature's correlation with the 'target' column."""
        return data.corr()['target'].drop('target').to_dict()

    def estimate_outlier_rate(self, data):
        """Estimate the fraction of outliers via an Isolation Forest."""
        iso_forest = IsolationForest(contamination=0.05, random_state=0)
        outlier_mask = iso_forest.fit_predict(data) == -1
        return outlier_mask.sum() / len(data)
```
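A quick usage sketch of the KS-based data drift test on synthetic data (the 0.5-sigma mean shift and sample sizes are illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
baseline = rng.normal(loc=0.0, scale=1.0, size=5000)  # training-era feature values
shifted = rng.normal(loc=0.5, scale=1.0, size=5000)   # production values, mean shift
stable = rng.normal(loc=0.0, scale=1.0, size=5000)    # production values, no shift

# The shifted window is flagged: at this sample size the KS test easily
# detects a half-sigma mean shift
ks_stat, p_value = stats.ks_2samp(baseline, shifted)
drifted = p_value < 0.05

# The stable window yields a much smaller KS statistic
ks_stable, p_stable = stats.ks_2samp(baseline, stable)
```

In production, the "current" sample would come from a rolling window of recent requests rather than a synthetic array, and the 0.05 significance level would be tuned to the acceptable false-alert rate.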
Abuse and Misuse Detection
Detecting Misuse Patterns
Monitoring can detect when systems are used for unintended purposes:
Misuse Detection Framework:

```yaml
Monitoring Pattern: Prompt Injection Attempts
  Indicators:
    - "Prompts containing keywords: 'ignore', 'system prompt', 'admin mode'"
    - "Sudden shift in prompt length distribution"
    - "Repeated similar prompts (same user)"
  Detection Method:
    - "Regex matching for known patterns"
    - "Machine learning classifier trained on injection examples"
    - "Rate limiting: alert on > N attempts/minute"
  Response:
    - "Log suspicious prompts"
    - "Alert security team"
    - "Increase monitoring for user"
    - "Block user if behavior escalates"
```
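The regex-matching detection method can be sketched as follows (a tiny illustrative pattern list; production filters maintain far larger, regularly updated ones and pair them with a learned classifier):

```python
import re

# Illustrative patterns only; real deployments use curated, evolving lists
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |previous |prior )*(instructions|rules)", re.I),
    re.compile(r"system prompt", re.I),
    re.compile(r"admin mode", re.I),
]

def flag_prompt_injection(prompt):
    """Return True if the prompt matches a known injection pattern."""
    return any(p.search(prompt) for p in INJECTION_PATTERNS)
```

Keyword matching alone is easy to evade (paraphrase, encoding tricks), which is why the framework above layers it with distribution-shift and per-user rate indicators.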
```yaml
Monitoring Pattern: Model Extraction Attempts
  Indicators:
    - "Single user making thousands of queries"
    - "Queries with minimal feature variation"
    - "Pattern: query -> receive output -> refine query"
    - "Queries attempting to find decision boundaries"
  Detection Method:
    - "Rate limiting monitoring (queries per user)"
    - "Query pattern analysis"
    - "Detection of systematic coverage attempts"
  Response:
    - "Rate limit user"
    - "Require API key verification"
    - "Alert security team"
    - "Block if persistent"
```
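The queries-per-user detection method can be sketched as a sliding-window counter (a hypothetical class; the window length and threshold are illustrative):

```python
import time
from collections import defaultdict, deque

class QueryRateMonitor:
    """Track per-user query volume in a sliding time window and flag
    users whose rate suggests systematic model extraction."""
    def __init__(self, window_seconds=3600, max_queries=1000):
        self.window = window_seconds
        self.max_queries = max_queries
        self.history = defaultdict(deque)

    def record(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        timestamps = self.history[user_id]
        timestamps.append(now)
        # Drop timestamps that have aged out of the sliding window
        while timestamps and timestamps[0] < now - self.window:
            timestamps.popleft()
        return len(timestamps) > self.max_queries  # True -> flag for review
```

Volume alone does not prove extraction, so a flag here would feed the query-pattern analysis above rather than trigger an automatic block.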
```yaml
Monitoring Pattern: Adversarial Input Generation
  Indicators:
    - "Inputs containing small perturbations of legitimate inputs"
    - "Inputs optimized to change predictions"
    - "Repeated similar inputs with small variations"
  Detection Method:
    - "Compare inputs to previous patterns"
    - "Detect systematic perturbation"
    - "ML model trained to detect adversarial patterns"
  Response:
    - "Log adversarial inputs"
    - "Alert security team"
    - "May require CAPTCHA or additional verification"
    - "Increase human review"

Monitoring Pattern: Jailbreak Attempts (LLMs)
  Indicators:
    - "Attempts to generate harmful content"
    - "Roleplay scenarios designed to circumvent safety"
    - "Attempts to extract system prompt"
  Detection Method:
    - "Keyword filtering for banned topics"
    - "Model-based toxicity detection"
    - "Pattern matching for known jailbreaks"
  Response:
    - "Refuse to process the request"
    - "Provide a safe response instead"
    - "Escalate alert if attempts repeat"
    - "Warn or block user on repeated attempts"
```
Key Takeaway
Continuous production monitoring enables early detection of performance degradation, fairness violations, adversarial attacks, and system failures. A comprehensive monitoring architecture with alerting, drift detection, and abuse detection protects both the organization and users from AI system failures.
Exercise: Design Your Monitoring Architecture
- Monitoring requirements: What metrics must you track?
- Alert design: What are critical alerts for your systems?
- Drift detection: How will you detect and respond to data drift?
- Dashboard design: What should your dashboard show?
- Architecture: What tools will you use for collection, aggregation, analysis?
- Testing: How will you test alerting before production?
Next: AI Security Metrics and Reporting