Named Entity Recognition and Information Extraction
Named Entity Recognition (NER) identifies and classifies named entities (persons, organizations, locations) in text. This lesson covers the BIO tagging scheme, token classification with transformers, relation extraction, and practical entity linking applications.
Core Concepts
BIO Tagging Scheme
BIO (Begin-Inside-Outside) marks token-level entities:
Text: "John Smith works at Google"
Tokens: John Smith works at Google
Tags: B-PER I-PER O O B-ORG
Tag meanings:
- B-X: Beginning of entity type X
- I-X: Inside (continuation) of entity type X
- O: Outside any entity
Why BIO? It enables:
- Clear entity boundaries
- Multiple consecutive entities
- Token-level predictions
Alternative schemes:
- BIOES (Begin-Inside-Outside-End-Single): adds explicit End and Single-token tags, giving the model sharper boundary signals
- BILOU (Begin-Inside-Last-Outside-Unit): equivalent to BIOES under different tag names
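The tagging scheme is mechanical enough to sketch in a few lines. A minimal helper (illustrative; the span indices and type names are arbitrary) converts token-index entity spans to BIO tags:

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans (start, end_exclusive, type) over token
    indices to a BIO tag sequence."""
    tags = ['O'] * len(tokens)
    for start, end, ent_type in spans:
        tags[start] = f'B-{ent_type}'          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f'I-{ent_type}'          # continuation tokens
    return tags

tokens = ['John', 'Smith', 'works', 'at', 'Google']
print(spans_to_bio(tokens, [(0, 2, 'PER'), (4, 5, 'ORG')]))
# ['B-PER', 'I-PER', 'O', 'O', 'B-ORG']
```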
Token Classification Challenges
- Subword tokens: WordPiece tokenization splits words, e.g. "Washington" → ["Washing", "##ton"]. Which subword should carry the label?
- Class imbalance: roughly 95% of tokens are O and only 5% belong to entities, so an unweighted model is biased toward predicting O.
- Entity boundaries: defining exactly where each entity starts and ends.
Solutions:
- Label only the first subword of each word and ignore the rest in the loss
- Use weighted loss functions to counter the O-class imbalance
- Apply boundary-detection post-processing
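To make the weighted-loss idea concrete, here is a minimal sketch (the label set and frequencies are assumed for illustration) of inverse-frequency class weights; in practice these would be passed, in label-id order, as the weight tensor to a loss like CrossEntropyLoss:

```python
# Assumed label distribution: 95% O, remainder split across entity tags.
label_freqs = {'O': 0.95, 'B-PER': 0.0125, 'I-PER': 0.0125,
               'B-ORG': 0.0125, 'I-ORG': 0.0125}

# Inverse-frequency weighting: rare classes get larger weights, so the
# loss no longer rewards defaulting to the majority O class.
weights = {label: 1.0 / freq for label, freq in label_freqs.items()}
total = sum(weights.values())
weights = {label: w / total for label, w in weights.items()}

print(weights['O'] < weights['B-PER'])  # True: O is heavily down-weighted
```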
Relation Extraction
Extract relationships between entities:
Text: "John Smith, CEO of Google, was born in California"
Relations:
- (John Smith, works_for, Google)
- (John Smith, born_in, California)
Approaches:
- Pipeline: First NER, then relation classification
- Joint: Single model for both NER and relations
- Sequence-to-sequence: Generate relations as text
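A hedged sketch of the pipeline approach: given entities from the NER step and some relation classifier (classify_relation below is a hypothetical stand-in for a trained model), score every ordered entity pair and keep the non-trivial predictions as triples:

```python
from itertools import permutations

def pipeline_relations(entities, classify_relation):
    """entities: list of {'text': ..., 'type': ...} dicts from an NER step.
    classify_relation(head, tail) -> relation label or 'no_relation'."""
    triples = []
    for head, tail in permutations(entities, 2):
        relation = classify_relation(head, tail)
        if relation != 'no_relation':
            triples.append((head['text'], relation, tail['text']))
    return triples

# Toy rule-based classifier standing in for a trained relation model:
def toy_classifier(head, tail):
    if head['type'] == 'PER' and tail['type'] == 'ORG':
        return 'works_for'
    return 'no_relation'

entities = [{'text': 'John Smith', 'type': 'PER'},
            {'text': 'Google', 'type': 'ORG'}]
print(pipeline_relations(entities, toy_classifier))
# [('John Smith', 'works_for', 'Google')]
```

Note the quadratic cost in the number of entities; this is one reason joint models are attractive for entity-dense documents.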
Practical Implementation
Token Classification with BERT
from transformers import (
AutoTokenizer,
AutoModelForTokenClassification,
TrainingArguments,
Trainer,
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import classification_report
# Load NER dataset
dataset = load_dataset('conll2003')
# Label mapping
label_names = dataset['train'].features['ner_tags'].names
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}
# Tokenizer and model. Note: a cased checkpoint (e.g. 'bert-base-cased')
# usually performs better for NER, since capitalization is a strong entity
# cue; uncased is kept here for simplicity.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForTokenClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(label_names),
    id2label=id2label,
    label2id=label2id,
)
# Tokenize with proper label alignment
def tokenize_and_align_labels(examples):
tokenized_inputs = tokenizer(
examples['tokens'],
truncation=True,
is_split_into_words=True,
padding='max_length',
max_length=512,
)
labels = []
for i, label in enumerate(examples['ner_tags']):
word_ids = tokenized_inputs.word_ids(batch_index=i)
label_ids = []
previous_word_idx = None
for word_idx in word_ids:
if word_idx is None:
# Special tokens
label_ids.append(-100)
elif word_idx != previous_word_idx:
# First subword of word gets label
label_ids.append(label[word_idx])
else:
# Subsequent subwords get -100 (ignored in loss)
label_ids.append(-100)
previous_word_idx = word_idx
labels.append(label_ids)
tokenized_inputs['labels'] = labels
return tokenized_inputs
tokenized_datasets = dataset.map(
tokenize_and_align_labels,
batched=True,
remove_columns=dataset['train'].column_names,
)
# Metrics
def compute_metrics(eval_pred):
predictions, labels = eval_pred
predictions = np.argmax(predictions, axis=2)
# Remove ignored index (special tokens)
true_predictions = [
[id2label[p] for (p, l) in zip(prediction, label) if l != -100]
for prediction, label in zip(predictions, labels)
]
true_labels = [
[id2label[l] for (p, l) in zip(prediction, label) if l != -100]
for prediction, label in zip(predictions, labels)
]
    # Token-level scores (the weighted average includes the dominant O class);
    # strict entity-level evaluation, e.g. with seqeval, is the NER standard
    results = classification_report(
        np.concatenate(true_labels),
        np.concatenate(true_predictions),
        output_dict=True,
        zero_division=0,
    )
return {
'precision': results['weighted avg']['precision'],
'recall': results['weighted avg']['recall'],
'f1': results['weighted avg']['f1-score'],
}
# Training
training_args = TrainingArguments(
output_dir='./ner_model',
num_train_epochs=3,
per_device_train_batch_size=32,
per_device_eval_batch_size=32,
warmup_steps=500,
weight_decay=0.01,
logging_steps=100,
evaluation_strategy='epoch',
save_strategy='epoch',
load_best_model_at_end=True,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets['train'],
eval_dataset=tokenized_datasets['validation'],
compute_metrics=compute_metrics,
)
trainer.train()
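One caveat from the metrics above is worth making concrete: token-level scores give partial credit for partially-correct entities. Strict entity-level evaluation (what seqeval computes, sketched here in plain Python) counts a prediction as correct only when both the type and the boundaries match:

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence to (start, end_exclusive, type) spans."""
    spans, start, ent_type = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith('B-'):
            if start is not None:
                spans.append((start, i, ent_type))
            start, ent_type = i, tag[2:]
        elif tag.startswith('I-') and ent_type == tag[2:]:
            continue  # entity continues
        else:
            if start is not None:
                spans.append((start, i, ent_type))
            start, ent_type = None, None
    if start is not None:
        spans.append((start, len(tags), ent_type))
    return spans

def span_f1(true_tags, pred_tags):
    """Strict span-level F1: a span counts only on an exact match."""
    gold, pred = set(bio_to_spans(true_tags)), set(bio_to_spans(pred_tags))
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = ['B-PER', 'I-PER', 'O', 'B-ORG']
print(span_f1(gold, gold))                           # 1.0
print(span_f1(gold, ['B-PER', 'O', 'O', 'B-ORG']))   # boundary error: 0.5
```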
Inference and Post-processing
import torch

@torch.no_grad()
def extract_entities(text):
    """Extract entities from whitespace-tokenized text."""
    words = text.split()
    tokenized = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors='pt',
        truncation=True,
    )
    model.eval()
    outputs = model(**tokenized)
    predictions = torch.argmax(outputs.logits, dim=-1)
    word_ids = tokenized.word_ids()

    entities = []
    current_words = []
    current_type = None
    previous_word_idx = None
    for pred, word_idx in zip(predictions[0], word_ids):
        # Use only the first subword of each word: the model was trained
        # with labels on first subwords only, so later subwords are ignored
        if word_idx is None or word_idx == previous_word_idx:
            previous_word_idx = word_idx
            continue
        previous_word_idx = word_idx
        label = id2label[pred.item()]
        if label.startswith('B-'):
            # Start a new entity, closing any open one
            if current_words:
                entities.append({'text': ' '.join(current_words),
                                 'type': current_type})
            current_type = label[2:]
            current_words = [words[word_idx]]
        elif label.startswith('I-') and current_type == label[2:]:
            current_words.append(words[word_idx])
        else:
            # O tag, or an I- tag that does not match the open entity
            if current_words:
                entities.append({'text': ' '.join(current_words),
                                 'type': current_type})
            if label.startswith('I-'):
                # Treat a stray I- as the start of a new entity
                current_type = label[2:]
                current_words = [words[word_idx]]
            else:
                current_type = None
                current_words = []
    if current_words:
        entities.append({'text': ' '.join(current_words), 'type': current_type})
    return entities
text = "Steve Jobs founded Apple in Cupertino"
entities = extract_entities(text)
for ent in entities:
print(f"{ent['text']} ({ent['type']})")
Relation Extraction with Transformers
import torch
import torch.nn as nn
from transformers import AutoModel

class RelationExtractor(nn.Module):
    def __init__(self, num_relations=10):
        super().__init__()
        self.bert = AutoModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(0.1)
        # [CLS] plus two entity representations, each of hidden size 768
        self.relation_classifier = nn.Linear(768 * 3, num_relations)

    def forward(self, input_ids, attention_mask, entity1_pos, entity2_pos):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs.last_hidden_state
        # Gather the hidden state at each entity's first-token position
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        entity1_repr = sequence_output[batch_idx, entity1_pos]
        entity2_repr = sequence_output[batch_idx, entity2_pos]
        cls_repr = sequence_output[:, 0, :]
        # Concatenate representations and classify the relation
        combined = torch.cat([entity1_repr, entity2_repr, cls_repr], dim=-1)
        combined = self.dropout(combined)
        return self.relation_classifier(combined)
# Training with entity pairs. train_loader (not defined here) is assumed to
# yield tokenized inputs, each entity's first-token position, and a relation label
model = RelationExtractor(num_relations=10)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()
for epoch in range(3):
for batch in train_loader:
input_ids = batch['input_ids']
attention_mask = batch['attention_mask']
entity1_pos = batch['entity1_pos']
entity2_pos = batch['entity2_pos']
labels = batch['relation_labels']
optimizer.zero_grad()
logits = model(input_ids, attention_mask, entity1_pos, entity2_pos)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
Advanced Techniques
Bidirectional LSTM-CRF for NER
from torch.nn import LSTM, Linear
from torchcrf import CRF  # pip install pytorch-crf

class LSTMCRFModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_labels):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = LSTM(embedding_dim, hidden_dim // 2, num_layers=2,
                         batch_first=True, bidirectional=True)
        self.linear = Linear(hidden_dim, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, labels=None, mask=None):
        embeddings = self.embedding(input_ids)
        lstm_out, _ = self.lstm(embeddings)
        emissions = self.linear(lstm_out)
        if labels is not None:
            # CRF returns log-likelihood; negate for a loss, mask out padding
            return -self.crf(emissions, labels, mask=mask)
        return self.crf.decode(emissions, mask=mask)
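The CRF's contribution is learning transition scores that make invalid tag sequences (such as I-PER directly after O) unlikely at decode time. The same constraint can be checked in plain Python, useful as a post-processing sanity filter for any tagger:

```python
def is_valid_bio(tags):
    """Check the BIO transition constraint: I-X may only follow
    B-X or I-X of the same type."""
    prev = 'O'
    for tag in tags:
        if tag.startswith('I-') and prev not in (f'B-{tag[2:]}', tag):
            return False
        prev = tag
    return True

print(is_valid_bio(['B-PER', 'I-PER', 'O', 'B-ORG']))  # True
print(is_valid_bio(['O', 'I-ORG']))                    # False
```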
Entity Linking
from sentence_transformers import SentenceTransformer
import faiss
class EntityLinker:
def __init__(self, knowledge_base):
"""
knowledge_base: List of {'id': str, 'name': str, 'description': str}
"""
self.kb = knowledge_base
self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
# Encode all entities
descriptions = [e['description'] for e in knowledge_base]
self.embeddings = self.encoder.encode(descriptions, convert_to_numpy=True)
# Build FAISS index
self.index = faiss.IndexFlatL2(self.embeddings.shape[1])
self.index.add(self.embeddings)
def link_entity(self, entity_name, top_k=5):
"""Find best matching entities from knowledge base"""
query_embedding = self.encoder.encode([entity_name])
distances, indices = self.index.search(query_embedding, top_k)
results = []
for idx in indices[0]:
results.append(self.kb[idx])
return results
# Usage
kb = [
{'id': 'Q312', 'name': 'Apple', 'description': 'American technology company'},
{'id': 'Q93344', 'name': 'Apple', 'description': 'Fruit'},
]
linker = EntityLinker(kb)
results = linker.link_entity('Apple')
Production Considerations
Confidence Scoring
@torch.no_grad()
def extract_entities_with_confidence(text, confidence_threshold=0.5):
    """Extract entity tokens with softmax confidence scores"""
    words = text.split()
    tokenized = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors='pt',
        truncation=True,
    )
    outputs = model(**tokenized)
    probabilities = torch.softmax(outputs.logits, dim=-1)
    predictions = torch.argmax(probabilities, dim=-1)
    entities = []
    for pred, prob, word_id in zip(predictions[0], probabilities[0],
                                   tokenized.word_ids()):
        if word_id is None:
            continue  # skip special tokens ([CLS], [SEP])
        label = id2label[pred.item()]
        confidence = prob[pred].item()
        if confidence >= confidence_threshold and label != 'O':
            entities.append({
                'text': words[word_id],
                'label': label,
                'confidence': confidence,
            })
    return entities
Handling Long Documents
def extract_entities_long_doc(text, stride=256):
"""Process long documents with sliding window"""
sentences = text.split('.')
all_entities = []
for i in range(0, len(sentences), stride):
window = '.'.join(sentences[i:i+stride])
entities = extract_entities(window)
all_entities.extend(entities)
# Deduplicate
unique_entities = {e['text']: e for e in all_entities}
return list(unique_entities.values())
Key Takeaway
Token-level classification powers information extraction, from named entities to relations. Mastering BIO tagging, subword handling, and post-processing unlocks access to the semantic structure of text for knowledge graphs, question answering, and document understanding.
Practical Exercise
Task: Build an information extraction system for scientific papers.
Requirements:
- Fine-tune BERT for paper NER (authors, institutions, methods, results)
- Implement relation extraction (author-affiliation, method-dataset)
- Extract and link entities to external databases
- Handle multi-word entities and abbreviations
- Evaluate on annotated test set
Evaluation:
- Per-entity-type F1 scores
- Relation extraction accuracy
- Entity linking precision
- Robustness on edge cases (acronyms, abbreviations)
- Latency benchmarks on full papers