PII Detection and Protection

Lesson 2 of 4 · Estimated time: 50 min

Identifying and Safeguarding Sensitive Data

Personally Identifiable Information (PII) is any data that can identify an individual. As an AI security professional, you must be able to identify PII, understand its sensitivity levels, and implement protection mechanisms.

PII Classification

Different types of PII have different sensitivity levels and regulatory implications:

Tier 1: Highly Sensitive (Regulatory Risk)

These require maximum protection:

  • Social Security Numbers (SSN)
  • Bank account numbers
  • Credit card numbers
  • Driver’s license numbers
  • Passport numbers
  • Biometric data
  • Genetic information

Tier 2: Sensitive (Privacy Risk)

These require strong protection:

  • Full names
  • Home addresses
  • Phone numbers
  • Email addresses
  • Birth dates
  • Health information
  • Financial records

Tier 3: Semi-Sensitive (Context-Dependent)

Sensitivity depends on context:

  • Employer names
  • Job titles
  • Education history
  • Social media profiles
  • Zip codes
  • Demographic information
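
These tiers can drive policy in code. A minimal lookup sketch (the category keys are illustrative, not a standard taxonomy; unknown types default to the strictest tier):

```python
# Illustrative mapping of PII categories to sensitivity tiers.
PII_TIERS = {
    'ssn': 1, 'bank_account': 1, 'credit_card': 1,
    'drivers_license': 1, 'passport': 1, 'biometric': 1, 'genetic': 1,
    'name': 2, 'address': 2, 'phone': 2, 'email': 2,
    'birth_date': 2, 'health': 2, 'financial': 2,
    'employer': 3, 'job_title': 3, 'education': 3,
    'social_media': 3, 'zip_code': 3, 'demographic': 3,
}

def sensitivity_tier(pii_type):
    """Return the sensitivity tier for a PII category."""
    return PII_TIERS.get(pii_type, 1)  # unknown types default to Tier 1
```

Defaulting unknown categories to Tier 1 is deliberate: under-protecting PII is costlier than over-protecting it.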

PII Detection Methods

Method 1: Regex-Based Detection

Fast but imprecise:

import re

class RegexBasedPIIDetector:
    def __init__(self):
        self.patterns = {
            'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
            'credit_card': r'\b\d{4}[\s\-]?\d{4}[\s\-]?\d{4}[\s\-]?\d{4}\b',
            'phone': r'\b(?:\+?1[-.\s]?)?\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})\b',
            'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
            'zip_code': r'\b\d{5}(?:-\d{4})?\b',
            'ipv4': r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b',
            'date_dob': r'\b(?:0?[1-9]|1[0-2])[/-](?:0?[1-9]|[12][0-9]|3[01])[/-](?:19|20)?\d{2}\b',
        }

    def detect(self, text):
        """Find PII in text using regex patterns."""
        findings = {}

        for pii_type, pattern in self.patterns.items():
            matches = re.finditer(pattern, text)
            findings[pii_type] = [match.group() for match in matches]

        return findings

# Usage
detector = RegexBasedPIIDetector()
findings = detector.detect("My SSN is 123-45-6789 and my email is john@example.com")
# {'ssn': ['123-45-6789'], 'email': ['john@example.com'], ...}

Limitations: High false-positive rate; misses names and other context-dependent PII
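
One standard way to cut credit-card false positives is to validate each regex candidate with the Luhn checksum before reporting it:

```python
def luhn_valid(candidate):
    """Return True if the candidate's digits pass the Luhn checksum."""
    digits = [int(c) for c in candidate if c.isdigit()]
    if len(digits) < 13:  # shorter than any real card number
        return False
    checksum = 0
    # Double every second digit from the right, subtracting 9 when > 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

luhn_valid("4111 1111 1111 1111")  # True (a well-known Luhn-valid test number)
luhn_valid("1234 5678 9012 3456")  # False
```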

Method 2: Named Entity Recognition (NER)

Uses NLP to identify entities:

from transformers import pipeline

class NERBasedPIIDetector:
    def __init__(self):
        # General-purpose NER model (CoNLL-2003 labels); a PII-specific
        # model can be substituted here.
        self.ner_model = pipeline(
            "ner",
            model="dslim/bert-base-NER"
        )
        self.pii_entity_types = [
            'PER',   # Person
            'LOC',   # Location
            'ORG',   # Organization
            'MISC',  # Miscellaneous named entities
        ]

    def detect(self, text):
        """Find PII using NER model."""
        entities = self.ner_model(text)

        pii_findings = {
            'persons': [],
            'locations': [],
            'organizations': [],
            'other': []
        }

        for entity in entities:
            tag = entity['entity']  # e.g. 'B-PER' (begin) or 'I-PER' (inside)
            if tag.endswith('PER'):
                pii_findings['persons'].append(entity['word'])
            elif tag.endswith('LOC'):
                pii_findings['locations'].append(entity['word'])
            elif tag.endswith('ORG'):
                pii_findings['organizations'].append(entity['word'])

        return pii_findings

# Usage
detector = NERBasedPIIDetector()
findings = detector.detect("John Smith from Acme Corp sent me a message")
# {'persons': ['John', 'Smith'], 'organizations': ['Acme', 'Corp'], ...}

Advantages: Context-aware, handles variations
Limitations: Slower, benefits from a GPU, occasional tagging errors
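
The token-level output above reports word pieces separately ('John' and 'Smith' as two hits). Merging B-/I- tagged tokens back into whole entities fixes this; the grouping below is a sketch of what the transformers pipeline's aggregation_strategy option does for you:

```python
def group_entities(entities):
    """Merge B-/I- tagged sub-tokens from a token-level NER pipeline
    into whole entity strings, keyed by entity type."""
    findings = {}
    current_type, current_tokens = None, []

    def flush():
        if current_type and current_tokens:
            word = ''
            for tok in current_tokens:
                # '##' marks a word-piece continuation in BERT tokenizers
                if tok.startswith('##'):
                    word += tok[2:]
                else:
                    word += (' ' + tok) if word else tok
            findings.setdefault(current_type, []).append(word)

    for ent in entities:
        tag, _, etype = ent['entity'].partition('-')
        if tag == 'B':            # begin a new entity
            flush()
            current_type, current_tokens = etype, [ent['word']]
        elif tag == 'I' and etype == current_type:
            current_tokens.append(ent['word'])  # continue current entity
        else:                     # 'O' or mismatched tag: close any open entity
            flush()
            current_type, current_tokens = None, []
    flush()
    return findings
```

Feeding it the raw pipeline output turns `['John', 'Smith']` into `['John Smith']`, which makes downstream masking far more reliable.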

Method 3: Machine Learning-Based Detection

Train a model specifically for PII detection:

import math

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

class MLBasedPIIDetector:
    def __init__(self):
        self.model = self.build_model()

    def build_model(self):
        """Build PII detection model."""
        # Features: character patterns, length, position in text, entropy, etc.
        # TextFeatureExtractor is a placeholder transformer to implement
        # (e.g., by wrapping extract_features below in a sklearn transformer).
        pipeline = Pipeline([
            ('features', TextFeatureExtractor()),
            ('classifier', RandomForestClassifier(n_estimators=100))
        ])
        return pipeline

    def extract_features(self, text):
        """Extract features for each token."""
        tokens = text.split()
        features = []

        for token in tokens:
            feature_vector = {
                'length': len(token),
                'has_digits': any(c.isdigit() for c in token),
                'has_special': any(c in '-. @' for c in token),
                'is_capitalized': token[0].isupper(),
                'digit_ratio': sum(c.isdigit() for c in token) / len(token),
                'entropy': self.calculate_entropy(token),
            }
            features.append(feature_vector)

        return features

    def calculate_entropy(self, text):
        """Calculate Shannon entropy (randomness) of text, normalized to [0, 1]."""
        unique = set(text)
        if len(unique) <= 1:
            return 0  # empty or single-symbol strings carry no information
        entropy = 0
        for char in unique:
            p_char = text.count(char) / len(text)
            entropy -= p_char * math.log2(p_char)
        return entropy / math.log2(len(unique))

Advantages: Can be highly accurate, adaptive
Limitations: Requires labeled training data
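
The entropy feature deserves a closer look: after normalization, strings whose characters are evenly distributed (account numbers, random tokens) score near 1.0, while repetitive strings score near 0. A standalone version with a guard for single-symbol inputs:

```python
import math

def normalized_entropy(text):
    """Shannon entropy of text, normalized to [0, 1] by the maximum
    entropy possible for its alphabet size."""
    unique = set(text)
    if len(unique) <= 1:
        return 0.0  # empty or single-symbol strings carry no information
    entropy = 0.0
    for char in unique:
        p = text.count(char) / len(text)
        entropy -= p * math.log2(p)
    return entropy / math.log2(len(unique))

normalized_entropy("aaaa")  # 0.0 (no variation)
normalized_entropy("abcd")  # 1.0 (uniform over its alphabet)
```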

Method 4: Composite Approach

Combine multiple methods for best results:

class HybridPIIDetector:
    def __init__(self):
        self.regex_detector = RegexBasedPIIDetector()
        self.ner_detector = NERBasedPIIDetector()
        self.ml_detector = MLBasedPIIDetector()

    def detect(self, text):
        """Use multiple detection methods and combine results."""

        # Run all detectors
        regex_findings = self.regex_detector.detect(text)
        ner_findings = self.ner_detector.detect(text)
        ml_findings = self.ml_detector.detect(text)

        # Combine results with confidence scoring
        combined = self.combine_findings(
            regex_findings, ner_findings, ml_findings
        )

        return combined

    def combine_findings(self, regex, ner, ml):
        """Intelligently combine findings from multiple detectors."""
        # Policy: findings confirmed by multiple detectors are high
        # confidence; NER- or ML-only findings are medium; regex-only
        # findings are low (regex has a high false-positive rate).
        # The extract_* helpers below are left to implement that policy.
        return {
            'high_confidence': self.extract_high_confidence(regex, ner, ml),
            'medium_confidence': self.extract_medium_confidence(regex, ner, ml),
            'low_confidence': self.extract_low_confidence(regex, ner, ml),
        }
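
One concrete way to implement the combine_findings policy is span voting: normalize each detector's output to a flat list of detected strings, then bucket by agreement. A self-contained sketch (the thresholds are a policy choice, not a standard):

```python
def combine_by_vote(regex_spans, ner_spans, ml_spans):
    """Bucket detected PII strings by how many detectors reported them."""
    votes = {}
    for source in (regex_spans, ner_spans, ml_spans):
        for span in set(source):  # count each detector at most once per span
            votes[span] = votes.get(span, 0) + 1

    buckets = {'high_confidence': [], 'medium_confidence': [], 'low_confidence': []}
    for span, count in votes.items():
        if count >= 2:                               # detectors agree
            buckets['high_confidence'].append(span)
        elif span in ner_spans or span in ml_spans:  # context-aware single hit
            buckets['medium_confidence'].append(span)
        else:                                        # regex-only: false-positive prone
            buckets['low_confidence'].append(span)
    return buckets
```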

PII Redaction Strategies

Strategy 1: Masking

Replace PII with placeholders:

def mask_pii(text, pii_findings):
    """Replace PII with masked versions."""

    masked_text = text

    # Mask SSNs (show last 4 digits)
    for ssn in pii_findings.get('ssn', []):
        masked = f"SSN-****{ssn[-4:]}"
        masked_text = masked_text.replace(ssn, masked)

    # Mask credit cards (show last 4 digits)
    for cc in pii_findings.get('credit_card', []):
        digits_only = ''.join(filter(str.isdigit, cc))
        masked = f"CC-****{digits_only[-4:]}"
        masked_text = masked_text.replace(cc, masked)

    # Mask emails (show domain only)
    for email in pii_findings.get('email', []):
        domain = email.split('@')[1]
        masked = f"***@{domain}"
        masked_text = masked_text.replace(email, masked)

    return masked_text
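
The per-type branches above can also be collapsed into one generic helper that keeps the trailing characters of any value (the keep-last-4 policy is an assumption; tune it per PII type):

```python
def mask_value(value, label, keep=4):
    """Mask a PII value, keeping its trailing characters for reference."""
    tail = value[-keep:] if len(value) > keep else ''
    return f"{label}-****{tail}"

def mask_all(text, findings, labels=None):
    """Replace every detected PII value in text with its masked form."""
    labels = labels or {}
    for pii_type, values in findings.items():
        label = labels.get(pii_type, pii_type.upper())
        for value in values:
            text = text.replace(value, mask_value(value, label))
    return text
```

Usage: `mask_all("SSN is 123-45-6789", {'ssn': ['123-45-6789']})` returns `"SSN is SSN-****6789"`.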

Strategy 2: Tokenization

Replace PII with tokens:

class PIITokenizer:
    def __init__(self):
        self.token_map = {}
        self.reverse_map = {}
        self.token_counter = 0

    def tokenize_pii(self, text, pii_findings):
        """Replace PII with tokens."""

        tokenized = text

        for pii_type, values in pii_findings.items():
            for value in values:
                token = self.get_or_create_token(pii_type, value)
                tokenized = tokenized.replace(value, token)

        return tokenized

    def get_or_create_token(self, pii_type, value):
        """Get existing token or create new one."""

        key = (pii_type, value)

        if key not in self.token_map:
            token = f"[{pii_type.upper()}_{self.token_counter}]"
            self.token_map[key] = token
            self.reverse_map[token] = value
            self.token_counter += 1

        return self.token_map[key]

    def detokenize(self, text):
        """Replace tokens back with original values."""

        detokenized = text
        for token, original_value in self.reverse_map.items():
            detokenized = detokenized.replace(token, original_value)

        return detokenized
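
A round-trip check of the tokenize/detokenize idea, inlining a minimal token map instead of the class above (the sample findings are hand-written, not detector output):

```python
# Minimal round-trip sketch: PII out, token in, then restored.
text = "Contact john@example.com about account 123-45-6789"
findings = {'email': ['john@example.com'], 'ssn': ['123-45-6789']}

reverse_map, counter = {}, 0
tokenized = text
for pii_type, values in findings.items():
    for value in values:
        token = f"[{pii_type.upper()}_{counter}]"
        counter += 1
        reverse_map[token] = value
        tokenized = tokenized.replace(value, token)
# tokenized: "Contact [EMAIL_0] about account [SSN_1]"

detokenized = tokenized
for token, value in reverse_map.items():
    detokenized = detokenized.replace(token, value)

assert detokenized == text  # lossless round trip
```

Note that the reverse map holds the raw PII, so it must be stored with the same protections as the original data.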

Strategy 3: Synthetic Data Generation

Replace PII with realistic-looking fake data:

from faker import Faker

def replace_with_synthetic_data(text, pii_findings):
    """Replace PII with realistic synthetic data."""

    fake = Faker()
    synthetic_text = text

    # Replace names with fake names
    for name in pii_findings.get('person', []):
        synthetic_name = fake.name()
        synthetic_text = synthetic_text.replace(name, synthetic_name)

    # Replace emails with fake emails
    for email in pii_findings.get('email', []):
        synthetic_email = fake.email()
        synthetic_text = synthetic_text.replace(email, synthetic_email)

    # Replace phone numbers with fake numbers
    for phone in pii_findings.get('phone', []):
        synthetic_phone = fake.phone_number()
        synthetic_text = synthetic_text.replace(phone, synthetic_phone)

    return synthetic_text
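
A related option when the faker package is unavailable, or when replacements must stay stable across runs: derive deterministic pseudonyms with a keyed hash from the standard library (the key and output format here are illustrative):

```python
import hmac
import hashlib

SECRET_KEY = b"rotate-me"  # illustrative; store outside source control

def pseudonymize(value, label):
    """Deterministic, keyed pseudonym: the same input always maps to the
    same output, but is unlinkable to the original without the key."""
    digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:8]
    return f"{label}_{digest}"
```

Because the mapping is deterministic, the same person stays the same pseudonym across log lines, which preserves analytics value that random fakes would destroy.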

Implementation: Complete PII Detection & Protection

class PIIProtectionPipeline:
    def __init__(self):
        self.detector = HybridPIIDetector()
        self.tokenizer = PIITokenizer()

    def process_text(self, text, action='redact'):
        """Process text to detect and protect PII."""

        # 1. Detect PII
        findings = self.detector.detect(text)

        # 2. Classify by confidence
        high_confidence = findings['high_confidence']
        medium_confidence = findings['medium_confidence']

        # 3. Redact based on confidence
        processed_text = text

        if action == 'redact':
            # Always redact high-confidence findings
            processed_text = mask_pii(processed_text, high_confidence)

            # Optionally redact medium-confidence findings
            # (depends on your policy)

        elif action == 'tokenize':
            processed_text = self.tokenizer.tokenize_pii(processed_text, high_confidence)

        elif action == 'synthetic':
            processed_text = replace_with_synthetic_data(processed_text, high_confidence)

        return {
            'original_text': text,
            'processed_text': processed_text,
            'pii_found': high_confidence,
            'confidence': 'high'
        }
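
For a quick smoke test of the pipeline idea without the NER and ML stages, a self-contained regex-plus-masking pass looks like this (patterns trimmed to two types):

```python
import re

def detect_and_mask(text):
    """Tiny end-to-end pass: regex detection followed by masking."""
    patterns = {
        'SSN': r'\b\d{3}-\d{2}-\d{4}\b',
        'EMAIL': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
    }
    found = {}
    for label, pattern in patterns.items():
        for match in re.findall(pattern, text):
            found.setdefault(label, []).append(match)
            text = text.replace(match, f"[{label}]")
    return text, found

masked, found = detect_and_mask("Reach me at jane@corp.com, SSN 987-65-4321")
# masked: "Reach me at [EMAIL], SSN [SSN]"
```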

Compliance Implications

Different regulations require different PII handling:

  • GDPR: Grants data subjects the right to erasure of personal data on request
  • CCPA: Requires disclosure of collected data and honoring opt-out requests
  • HIPAA: Requires safeguards, including encryption, for protected health information
  • PCI DSS: Requires protection of stored and transmitted cardholder data
  • FERPA: Requires protection of student education records

Key Takeaway

PII detection and protection requires multiple methods working together. Use regex for fast scanning, NER for context awareness, and ML for accuracy; implement redaction through masking, tokenization, or synthetic data generation. Always classify findings by confidence before acting on them.

Exercise: Build a PII Detection System

  1. Implement all four detection methods
  2. Create a dataset of text with various PII types
  3. Benchmark each method’s precision and recall
  4. Build the hybrid detector
  5. Test each redaction strategy
  6. Document false positives and false negatives

Next Lesson: Securing the AI Data Pipeline—protecting data at rest and in transit.