PII Detection and Protection
Identifying and Safeguarding Sensitive Data
Personally Identifiable Information (PII) is any data that can identify an individual. As an AI security professional, you must be able to identify PII, understand its sensitivity levels, and implement protection mechanisms.
PII Classification
Different types of PII have different sensitivity levels and regulatory implications:
Tier 1: Highly Sensitive (Regulatory Risk)
These require maximum protection:
- Social Security Numbers (SSN)
- Bank account numbers
- Credit card numbers
- Driver’s license numbers
- Passport numbers
- Biometric data
- Genetic information
Tier 2: Sensitive (Privacy Risk)
These require strong protection:
- Full names
- Home addresses
- Phone numbers
- Email addresses
- Birth dates
- Health information
- Financial records
Tier 3: Semi-Sensitive (Context-Dependent)
Sensitivity depends on context:
- Employer names
- Job titles
- Education history
- Social media profiles
- Zip codes
- Demographic information
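The tiers above can be operationalized as a simple lookup when deciding how aggressively to redact. A minimal sketch (the type names, tier assignments, and `required_protection` helper here are illustrative, not from any standard):

```python
# Illustrative mapping of detected PII types to the tiers described above.
PII_TIERS = {
    'ssn': 1, 'credit_card': 1, 'passport': 1, 'biometric': 1,
    'name': 2, 'address': 2, 'phone': 2, 'email': 2, 'dob': 2,
    'employer': 3, 'job_title': 3, 'zip_code': 3,
}

def required_protection(pii_type):
    """Return the protection level for a detected PII type.

    Unknown types fall through to tier 3 (context-dependent), so new
    detectors fail safe rather than raising.
    """
    tier = PII_TIERS.get(pii_type, 3)
    return {1: 'maximum', 2: 'strong', 3: 'context-dependent'}[tier]
```

A policy engine could then branch on the returned level when choosing a redaction strategy.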
PII Detection Methods
Method 1: Regex-Based Detection
Fast but imprecise:
```python
import re

class RegexBasedPIIDetector:
    def __init__(self):
        self.patterns = {
            'ssn': r'\b\d{3}-\d{2}-\d{4}\b',
            'credit_card': r'\b\d{4}[\s\-]?\d{4}[\s\-]?\d{4}[\s\-]?\d{4}\b',
            'phone': r'\b(?:\+?1[-.\s]?)?\(?([0-9]{3})\)?[-.\s]?([0-9]{3})[-.\s]?([0-9]{4})\b',
            'email': r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
            'zip_code': r'\b\d{5}(?:-\d{4})?\b',
            'ipv4': r'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b',
            'date_dob': r'\b(?:0?[1-9]|1[0-2])[/-](?:0?[1-9]|[12][0-9]|3[01])[/-](?:19|20)?\d{2}\b',
        }

    def detect(self, text):
        """Find PII in text using regex patterns."""
        findings = {}
        for pii_type, pattern in self.patterns.items():
            matches = re.finditer(pattern, text)
            findings[pii_type] = [match.group() for match in matches]
        return findings

# Usage
detector = RegexBasedPIIDetector()
findings = detector.detect("My SSN is 123-45-6789 and my email is john@example.com")
# {'ssn': ['123-45-6789'], 'email': ['john@example.com'], ...}
```
Limitations: High false-positive rate (any nine digits in SSN format will match), and it misses names, addresses, and other context-dependent PII
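One way to cut regex false positives is to validate candidate matches before reporting them. For credit card numbers, the Luhn checksum rejects most random digit strings; a sketch (`luhn_valid` is our own helper, not part of any library):

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: true for most real card numbers, false for ~90%
    of random digit strings that happen to match the regex."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:  # card numbers are 13-19 digits
        return False
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0
```

Filtering the `credit_card` matches through a check like this before flagging them trims the noise considerably.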
Method 2: Named Entity Recognition (NER)
Uses NLP to identify entities:
```python
from transformers import pipeline

class NERBasedPIIDetector:
    def __init__(self):
        # Use a model trained to recognize named entities
        self.ner_model = pipeline(
            "ner",
            model="dslim/bert-base-NER"
        )
        self.pii_entity_types = [
            'PER',   # Person
            'LOC',   # Location
            'ORG',   # Organization
            'MISC',  # Miscellaneous (can be refined)
        ]

    def detect(self, text):
        """Find PII using NER model."""
        entities = self.ner_model(text)
        pii_findings = {
            'persons': [],
            'locations': [],
            'organizations': [],
        }
        for entity in entities:
            # Tags follow the BIO scheme: B-PER begins an entity,
            # I-PER continues one ('O' marks non-entity tokens)
            if entity['entity'] in ('B-PER', 'I-PER'):
                pii_findings['persons'].append(entity['word'])
            elif entity['entity'] in ('B-LOC', 'I-LOC'):
                pii_findings['locations'].append(entity['word'])
            elif entity['entity'] in ('B-ORG', 'I-ORG'):
                pii_findings['organizations'].append(entity['word'])
        return pii_findings

# Usage
detector = NERBasedPIIDetector()
findings = detector.detect("John Smith from Acme Corp sent me a message")
# e.g. {'persons': ['John', 'Smith'], 'organizations': ['Acme'], ...}
```
Advantages: Context-aware, handles variations
Limitations: Slower, benefits from a GPU, occasional errors
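Note that the token-level output above yields subword pieces (`'John'`, `'Smith'`, `'Ac'`, `'##me'`) rather than whole entities. Recent `transformers` versions can merge these for you via the pipeline's `aggregation_strategy` parameter; as an illustration of what that merging does, here is a manual merge of BIO-tagged tokens (the input format mirrors the pipeline's token dicts):

```python
def merge_bio_entities(token_entities):
    """Merge token-level B-/I- NER output into (type, text) entity spans.

    Each input item needs 'entity' and 'word' keys, as produced by a
    token-classification pipeline without aggregation.
    """
    merged, current_words, current_type = [], [], None
    for tok in token_entities:
        tag = tok['entity']
        piece = tok['word']
        if tag.startswith('B-'):          # a new entity begins
            if current_words:
                merged.append((current_type, ' '.join(current_words)))
            current_type = tag[2:]
            current_words = [piece.replace('##', '')]
        elif tag.startswith('I-') and current_type == tag[2:] and current_words:
            if piece.startswith('##'):    # WordPiece continuation of same word
                current_words[-1] += piece[2:]
            else:                         # next word of the same entity
                current_words.append(piece)
    if current_words:
        merged.append((current_type, ' '.join(current_words)))
    return merged
```

With merging in place, `detect` can report `'John Smith'` as one person instead of two fragments.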
Method 3: Machine Learning-Based Detection
Train a model specifically for PII detection:
```python
import math

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

class MLBasedPIIDetector:
    def __init__(self):
        self.model = self.build_model()

    def build_model(self):
        """Build PII detection model."""
        # Features: character patterns, length, position in text, entropy, etc.
        # TextFeatureExtractor is a custom transformer you supply; it turns
        # tokens into the feature vectors produced by extract_features below.
        pipeline = Pipeline([
            ('features', TextFeatureExtractor()),
            ('classifier', RandomForestClassifier(n_estimators=100))
        ])
        return pipeline

    def extract_features(self, text):
        """Extract features for each token."""
        tokens = text.split()
        features = []
        for token in tokens:
            feature_vector = {
                'length': len(token),
                'has_digits': any(c.isdigit() for c in token),
                'has_special': any(c in '-. @' for c in token),
                'is_capitalized': token[0].isupper(),
                'digit_ratio': sum(c.isdigit() for c in token) / len(token),
                'entropy': self.calculate_entropy(token),
            }
            features.append(feature_vector)
        return features

    def calculate_entropy(self, text):
        """Calculate normalized Shannon entropy (randomness) of text."""
        if not text or len(set(text)) < 2:
            return 0  # a single repeated character carries no information
        entropy = 0
        for char in set(text):
            p_char = text.count(char) / len(text)
            entropy -= p_char * math.log2(p_char)
        return entropy / math.log2(len(set(text)))
```
Advantages: Can be highly accurate, adaptive
Limitations: Requires labeled training data
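The entropy feature deserves a quick illustration: natural-language tokens repeat characters and score low, while random-looking strings (keys, account identifiers) score near the maximum. A standalone sketch of the unnormalized calculation:

```python
import math

def shannon_entropy(text):
    """Bits of information per character in `text`."""
    if not text:
        return 0.0
    return -sum(
        (text.count(c) / len(text)) * math.log2(text.count(c) / len(text))
        for c in set(text)
    )

# A dictionary word scores lower than a random-looking token:
shannon_entropy("password")      # ~2.75 bits/char
shannon_entropy("xK9#mQ2$vL7p")  # ~3.58 bits/char
```

A threshold on this score is a cheap signal for spotting secrets and identifiers that no regex anticipates.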
Method 4: Composite Approach
Combine multiple methods for best results:
```python
class HybridPIIDetector:
    def __init__(self):
        self.regex_detector = RegexBasedPIIDetector()
        self.ner_detector = NERBasedPIIDetector()
        self.ml_detector = MLBasedPIIDetector()

    def detect(self, text):
        """Use multiple detection methods and combine results."""
        # Run all detectors
        regex_findings = self.regex_detector.detect(text)
        ner_findings = self.ner_detector.detect(text)
        ml_findings = self.ml_detector.detect(text)
        # Combine results with confidence scoring
        combined = self.combine_findings(
            regex_findings, ner_findings, ml_findings
        )
        return combined

    def combine_findings(self, regex, ner, ml):
        """Intelligently combine findings from multiple detectors."""
        # If multiple detectors agree: high confidence.
        # If only one detector finds it: medium confidence.
        # Regex-only findings: low confidence (high false-positive rate).
        # The extract_*_confidence helpers implement that policy.
        return {
            'high_confidence': self.extract_high_confidence(regex, ner, ml),
            'medium_confidence': self.extract_medium_confidence(regex, ner, ml),
            'low_confidence': self.extract_low_confidence(regex, ner, ml),
        }
```
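The `extract_*_confidence` helpers are left unimplemented above; one plausible voting policy, sketched here over plain sets of matched strings (the function name and thresholds are our assumptions, not a fixed standard):

```python
def score_by_agreement(regex_hits, ner_hits, ml_hits):
    """Bucket detected PII strings by how many detectors agree on them.

    Two or more detectors: high confidence. A single non-regex detector:
    medium. Regex alone: low, reflecting its false-positive rate.
    """
    all_hits = regex_hits | ner_hits | ml_hits
    scored = {'high': set(), 'medium': set(), 'low': set()}
    for hit in all_hits:
        votes = sum(hit in s for s in (regex_hits, ner_hits, ml_hits))
        if votes >= 2:
            scored['high'].add(hit)
        elif hit in regex_hits:
            scored['low'].add(hit)
        else:
            scored['medium'].add(hit)
    return scored
```

In practice you would compare normalized spans (offsets, canonical forms) rather than raw strings, since detectors rarely emit byte-identical matches.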
PII Redaction Strategies
Strategy 1: Masking
Replace PII with placeholders:
```python
def mask_pii(text, pii_findings):
    """Replace PII with masked versions."""
    masked_text = text
    # Mask SSNs (show last 4 digits)
    for ssn in pii_findings.get('ssn', []):
        masked = f"SSN-****{ssn[-4:]}"
        masked_text = masked_text.replace(ssn, masked)
    # Mask credit cards (show last 4 digits)
    for cc in pii_findings.get('credit_card', []):
        digits_only = ''.join(filter(str.isdigit, cc))
        masked = f"CC-****{digits_only[-4:]}"
        masked_text = masked_text.replace(cc, masked)
    # Mask emails (show domain only)
    for email in pii_findings.get('email', []):
        domain = email.split('@')[1]
        masked = f"***@{domain}"
        masked_text = masked_text.replace(email, masked)
    return masked_text
```
Strategy 2: Tokenization
Replace PII with tokens:
```python
class PIITokenizer:
    def __init__(self):
        self.token_map = {}
        self.reverse_map = {}
        self.token_counter = 0

    def tokenize_pii(self, text, pii_findings):
        """Replace PII with tokens."""
        tokenized = text
        for pii_type, values in pii_findings.items():
            for value in values:
                token = self.get_or_create_token(pii_type, value)
                tokenized = tokenized.replace(value, token)
        return tokenized

    def get_or_create_token(self, pii_type, value):
        """Get existing token or create new one."""
        key = (pii_type, value)
        if key not in self.token_map:
            token = f"[{pii_type.upper()}_{self.token_counter}]"
            self.token_map[key] = token
            self.reverse_map[token] = value
            self.token_counter += 1
        return self.token_map[key]

    def detokenize(self, text):
        """Replace tokens back with original values."""
        detokenized = text
        for token, original_value in self.reverse_map.items():
            detokenized = detokenized.replace(token, original_value)
        return detokenized
```
Strategy 3: Synthetic Data Generation
Replace PII with realistic-looking fake data:
```python
from faker import Faker

def replace_with_synthetic_data(text, pii_findings):
    """Replace PII with realistic synthetic data."""
    fake = Faker()
    synthetic_text = text
    # Replace names with fake names
    for name in pii_findings.get('person', []):
        synthetic_name = fake.name()
        synthetic_text = synthetic_text.replace(name, synthetic_name)
    # Replace emails with fake emails
    for email in pii_findings.get('email', []):
        synthetic_email = fake.email()
        synthetic_text = synthetic_text.replace(email, synthetic_email)
    # Replace phone numbers with fake numbers
    for phone in pii_findings.get('phone', []):
        synthetic_phone = fake.phone_number()
        synthetic_text = synthetic_text.replace(phone, synthetic_phone)
    return synthetic_text
```
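One caveat with naive replacement: the same person should map to the same fake value everywhere, or cross-references in the text break. Faker alone generates a new name on every call; a deterministic mapping keeps replacements consistent. Sketched here with a hash-derived pseudonym (`consistent_pseudonym` is our own illustrative helper; a fuller version might seed Faker from the hash instead):

```python
import hashlib

def consistent_pseudonym(value, prefix="Person"):
    """Deterministic pseudonym: the same input always yields the same
    replacement, so repeated mentions stay linked across documents."""
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()[:8]
    return f"{prefix}-{digest}"
```

Note this preserves linkability by design, which is desirable for analytics but weaker than full anonymization; salt the hash if re-identification is a concern.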
Implementation: Complete PII Detection & Protection
```python
class PIIProtectionPipeline:
    def __init__(self):
        self.detector = HybridPIIDetector()
        self.tokenizer = PIITokenizer()
        self.redaction_strategy = 'mask'  # default; or 'tokenize' or 'synthetic'

    def process_text(self, text, action='redact'):
        """Process text to detect and protect PII."""
        # 1. Detect PII
        findings = self.detector.detect(text)
        # 2. Classify by confidence
        high_confidence = findings['high_confidence']
        medium_confidence = findings['medium_confidence']
        # 3. Redact based on confidence
        processed_text = text
        if action == 'redact':
            # Always redact high-confidence findings
            processed_text = mask_pii(processed_text, high_confidence)
            # Optionally redact medium-confidence findings
            # (depends on your policy)
        elif action == 'tokenize':
            processed_text = self.tokenizer.tokenize_pii(processed_text, high_confidence)
        elif action == 'synthetic':
            processed_text = replace_with_synthetic_data(processed_text, high_confidence)
        return {
            'original_text': text,
            'processed_text': processed_text,
            'pii_found': high_confidence,
            'confidence': 'high'
        }
```
Compliance Implications
Different regulations require different PII handling:
- GDPR: Requires deletion of personal data on request
- CCPA: Requires disclosure of collected data
- HIPAA: Requires encryption of health information
- PCI DSS: Requires protection of credit card data
- FERPA: Requires protection of student records
Key Takeaway
PII detection and protection requires multiple methods working together. Use regex for fast scanning, NER for context awareness, and ML for accuracy, and implement redaction through masking, tokenization, or synthetic data generation. Always classify findings by confidence before acting on them.
Exercise: Build a PII Detection System
- Implement all four detection methods
- Create a dataset of text with various PII types
- Benchmark each method’s precision and recall
- Build the hybrid detector
- Test each redaction strategy
- Document false positives and false negatives
Next Lesson: Securing the AI Data Pipeline—protecting data at rest and in transit.