Named Entity Recognition and Information Extraction
Named Entity Recognition (NER) identifies and classifies named entities (persons, organizations, locations) in text. This lesson covers the BIO tagging scheme, token classification with transformers, relation extraction, and practical entity linking applications.
Core Concepts
BIO Tagging Scheme
BIO (Begin-Inside-Outside) marks token-level entities:
Text: "John Smith works at Google"
Tokens: John Smith works at Google
Tags: B-PER I-PER O O B-ORG
Tag meanings:
- B-X: Beginning of entity type X
- I-X: Inside (continuation) of entity type X
- O: Outside any entity
Why BIO? It enables:
- Clear entity boundaries
- Multiple consecutive entities
- Token-level predictions
Alternative schemes:
- BIOES (Begin-Inside-Outside-End-Single): adds explicit End and Single-token tags, giving the model sharper boundary signals
- BILOU (Begin-Inside-Last-Outside-Unit): equivalent to BIOES under different tag names
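The tagging scheme is mechanical enough to sketch in a few lines. A minimal helper (illustrative; the span indices and type names are arbitrary) converts token-index entity spans to BIO tags:

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans (start, end_exclusive, type) over token
    indices to a BIO tag sequence."""
    tags = ['O'] * len(tokens)
    for start, end, ent_type in spans:
        tags[start] = f'B-{ent_type}'          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f'I-{ent_type}'          # continuation tokens
    return tags

tokens = ['John', 'Smith', 'works', 'at', 'Google']
print(spans_to_bio(tokens, [(0, 2, 'PER'), (4, 5, 'ORG')]))
# ['B-PER', 'I-PER', 'O', 'O', 'B-ORG']
```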
Token Classification Challenges
- Subword tokens: WordPiece tokenization splits words, e.g. "Washington" → ["Washing", "##ton"]. Which subword should carry the label?
- Class imbalance: roughly 95% of tokens are O and only 5% belong to entities, so an unweighted model is biased toward predicting O.
- Entity boundaries: defining exactly where each entity starts and ends.
Solutions:
- Label only the first subword of each word and ignore the rest in the loss
- Use weighted loss functions to counter the O-class imbalance
- Apply boundary-detection post-processing
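To make the weighted-loss idea concrete, here is a minimal sketch (the label set and frequencies are assumed for illustration) of inverse-frequency class weights; in practice these would be passed, in label-id order, as the weight tensor to a loss like CrossEntropyLoss:

```python
# Assumed label distribution: 95% O, remainder split across entity tags.
label_freqs = {'O': 0.95, 'B-PER': 0.0125, 'I-PER': 0.0125,
               'B-ORG': 0.0125, 'I-ORG': 0.0125}

# Inverse-frequency weighting: rare classes get larger weights, so the
# loss no longer rewards defaulting to the majority O class.
weights = {label: 1.0 / freq for label, freq in label_freqs.items()}
total = sum(weights.values())
weights = {label: w / total for label, w in weights.items()}

print(weights['O'] < weights['B-PER'])  # True: O is heavily down-weighted
```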
Relation Extraction
Extract relationships between entities:
Text: "John Smith, CEO of Google, was born in California"
Relations:
- (John Smith, works_for, Google)
- (John Smith, born_in, California)
Approaches:
- Pipeline: First NER, then relation classification
- Joint: Single model for both NER and relations
- Sequence-to-sequence: Generate relations as text
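A hedged sketch of the pipeline approach: given entities from the NER step and some relation classifier (classify_relation below is a hypothetical stand-in for a trained model), score every ordered entity pair and keep the non-trivial predictions as triples:

```python
from itertools import permutations

def pipeline_relations(entities, classify_relation):
    """entities: list of {'text': ..., 'type': ...} dicts from an NER step.
    classify_relation(head, tail) -> relation label or 'no_relation'."""
    triples = []
    for head, tail in permutations(entities, 2):
        relation = classify_relation(head, tail)
        if relation != 'no_relation':
            triples.append((head['text'], relation, tail['text']))
    return triples

# Toy rule-based classifier standing in for a trained relation model:
def toy_classifier(head, tail):
    if head['type'] == 'PER' and tail['type'] == 'ORG':
        return 'works_for'
    return 'no_relation'

entities = [{'text': 'John Smith', 'type': 'PER'},
            {'text': 'Google', 'type': 'ORG'}]
print(pipeline_relations(entities, toy_classifier))
# [('John Smith', 'works_for', 'Google')]
```

Note the quadratic cost in the number of entities; this is one reason joint models are attractive for entity-dense documents.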
Practical Implementation
Token Classification with BERT
from transformers import (
AutoTokenizer,
AutoModelForTokenClassification,
TrainingArguments,
Trainer,
)
from datasets import load_dataset
import numpy as np
from sklearn.metrics import classification_report
# Load NER dataset
dataset = load_dataset('conll2003')
# Label mapping
label_names = dataset['train'].features['ner_tags'].names
id2label = {i: label for i, label in enumerate(label_names)}
label2id = {v: k for k, v in id2label.items()}
# Tokenizer and model. Note: a cased checkpoint (e.g. 'bert-base-cased')
# usually performs better for NER, since capitalization is a strong entity
# cue; uncased is kept here for simplicity.
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForTokenClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=len(label_names),
    id2label=id2label,
    label2id=label2id,
)
# Tokenize with proper label alignment
def tokenize_and_align_labels(examples):
tokenized_inputs = tokenizer(
examples['tokens'],
truncation=True,
is_split_into_words=True,
padding='max_length',
max_length=512,
)
labels = []
for i, label in enumerate(examples['ner_tags']):
word_ids = tokenized_inputs.word_ids(batch_index=i)
label_ids = []
previous_word_idx = None
for word_idx in word_ids:
if word_idx is None:
# Special tokens
label_ids.append(-100)
elif word_idx != previous_word_idx:
# First subword of word gets label
label_ids.append(label[word_idx])
else:
# Subsequent subwords get -100 (ignored in loss)
label_ids.append(-100)
previous_word_idx = word_idx
labels.append(label_ids)
tokenized_inputs['labels'] = labels
return tokenized_inputs
tokenized_datasets = dataset.map(
tokenize_and_align_labels,
batched=True,
remove_columns=dataset['train'].column_names,
)
# Metrics
def compute_metrics(eval_pred):
predictions, labels = eval_pred
predictions = np.argmax(predictions, axis=2)
# Remove ignored index (special tokens)
true_predictions = [
[id2label[p] for (p, l) in zip(prediction, label) if l != -100]
for prediction, label in zip(predictions, labels)
]
true_labels = [
[id2label[l] for (p, l) in zip(prediction, label) if l != -100]
for prediction, label in zip(predictions, labels)
]
    # Token-level scores (the weighted average includes the dominant O class);
    # strict entity-level evaluation, e.g. with seqeval, is the NER standard
    results = classification_report(
        np.concatenate(true_labels),
        np.concatenate(true_predictions),
        output_dict=True,
        zero_division=0,
    )
return {
'precision': results['weighted avg']['precision'],
'recall': results['weighted avg']['recall'],
'f1': results['weighted avg']['f1-score'],
}
# Training
training_args = TrainingArguments(
output_dir='./ner_model',
num_train_epochs=3,
per_device_train_batch_size=32,
per_device_eval_batch_size=32,
warmup_steps=500,
weight_decay=0.01,
logging_steps=100,
evaluation_strategy='epoch',
save_strategy='epoch',
load_best_model_at_end=True,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized_datasets['train'],
eval_dataset=tokenized_datasets['validation'],
compute_metrics=compute_metrics,
)
trainer.train()
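One caveat from the metrics above is worth making concrete: token-level scores give partial credit for partially-correct entities. Strict entity-level evaluation (what seqeval computes, sketched here in plain Python) counts a prediction as correct only when both the type and the boundaries match:

```python
def bio_to_spans(tags):
    """Convert a BIO tag sequence to (start, end_exclusive, type) spans."""
    spans, start, ent_type = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith('B-'):
            if start is not None:
                spans.append((start, i, ent_type))
            start, ent_type = i, tag[2:]
        elif tag.startswith('I-') and ent_type == tag[2:]:
            continue  # entity continues
        else:
            if start is not None:
                spans.append((start, i, ent_type))
            start, ent_type = None, None
    if start is not None:
        spans.append((start, len(tags), ent_type))
    return spans

def span_f1(true_tags, pred_tags):
    """Strict span-level F1: a span counts only on an exact match."""
    gold, pred = set(bio_to_spans(true_tags)), set(bio_to_spans(pred_tags))
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = ['B-PER', 'I-PER', 'O', 'B-ORG']
print(span_f1(gold, gold))                           # 1.0
print(span_f1(gold, ['B-PER', 'O', 'O', 'B-ORG']))   # boundary error: 0.5
```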
Inference and Post-processing
import torch

@torch.no_grad()
def extract_entities(text):
    """Extract entities from whitespace-tokenized text."""
    words = text.split()
    tokenized = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors='pt',
        truncation=True,
    )
    model.eval()
    outputs = model(**tokenized)
    predictions = torch.argmax(outputs.logits, dim=-1)
    word_ids = tokenized.word_ids()

    entities = []
    current_words = []
    current_type = None
    previous_word_idx = None
    for pred, word_idx in zip(predictions[0], word_ids):
        # Use only the first subword of each word: the model was trained
        # with labels on first subwords only, so later subwords are ignored
        if word_idx is None or word_idx == previous_word_idx:
            previous_word_idx = word_idx
            continue
        previous_word_idx = word_idx
        label = id2label[pred.item()]
        if label.startswith('B-'):
            # Start a new entity, closing any open one
            if current_words:
                entities.append({'text': ' '.join(current_words),
                                 'type': current_type})
            current_type = label[2:]
            current_words = [words[word_idx]]
        elif label.startswith('I-') and current_type == label[2:]:
            current_words.append(words[word_idx])
        else:
            # O tag, or an I- tag that does not match the open entity
            if current_words:
                entities.append({'text': ' '.join(current_words),
                                 'type': current_type})
            if label.startswith('I-'):
                # Treat a stray I- as the start of a new entity
                current_type = label[2:]
                current_words = [words[word_idx]]
            else:
                current_type = None
                current_words = []
    if current_words:
        entities.append({'text': ' '.join(current_words), 'type': current_type})
    return entities
text = "Steve Jobs founded Apple in Cupertino"
entities = extract_entities(text)
for ent in entities:
print(f"{ent['text']} ({ent['type']})")
Relation Extraction with Transformers
import torch
import torch.nn as nn
from transformers import AutoModel

class RelationExtractor(nn.Module):
    def __init__(self, num_relations=10):
        super().__init__()
        self.bert = AutoModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(0.1)
        # [CLS] plus two entity representations, each of hidden size 768
        self.relation_classifier = nn.Linear(768 * 3, num_relations)

    def forward(self, input_ids, attention_mask, entity1_pos, entity2_pos):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs.last_hidden_state
        # Gather the hidden state at each entity's first-token position
        batch_idx = torch.arange(input_ids.size(0), device=input_ids.device)
        entity1_repr = sequence_output[batch_idx, entity1_pos]
        entity2_repr = sequence_output[batch_idx, entity2_pos]
        cls_repr = sequence_output[:, 0, :]
        # Concatenate representations and classify the relation
        combined = torch.cat([entity1_repr, entity2_repr, cls_repr], dim=-1)
        combined = self.dropout(combined)
        return self.relation_classifier(combined)
# Training with entity pairs. train_loader (not defined here) is assumed to
# yield tokenized inputs, each entity's first-token position, and a relation label
model = RelationExtractor(num_relations=10)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()
for epoch in range(3):
for batch in train_loader:
input_ids = batch['input_ids']
attention_mask = batch['attention_mask']
entity1_pos = batch['entity1_pos']
entity2_pos = batch['entity2_pos']
labels = batch['relation_labels']
optimizer.zero_grad()
logits = model(input_ids, attention_mask, entity1_pos, entity2_pos)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
Advanced Techniques
Bidirectional LSTM-CRF for NER
from torch.nn import LSTM, Linear
from torchcrf import CRF  # pip install pytorch-crf

class LSTMCRFModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_labels):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = LSTM(embedding_dim, hidden_dim // 2, num_layers=2,
                         batch_first=True, bidirectional=True)
        self.linear = Linear(hidden_dim, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, labels=None, mask=None):
        embeddings = self.embedding(input_ids)
        lstm_out, _ = self.lstm(embeddings)
        emissions = self.linear(lstm_out)
        if labels is not None:
            # CRF returns log-likelihood; negate for a loss, mask out padding
            return -self.crf(emissions, labels, mask=mask)
        return self.crf.decode(emissions, mask=mask)
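The CRF's contribution is learning transition scores that make invalid tag sequences (such as I-PER directly after O) unlikely at decode time. The same constraint can be checked in plain Python, useful as a post-processing sanity filter for any tagger:

```python
def is_valid_bio(tags):
    """Check the BIO transition constraint: I-X may only follow
    B-X or I-X of the same type."""
    prev = 'O'
    for tag in tags:
        if tag.startswith('I-') and prev not in (f'B-{tag[2:]}', tag):
            return False
        prev = tag
    return True

print(is_valid_bio(['B-PER', 'I-PER', 'O', 'B-ORG']))  # True
print(is_valid_bio(['O', 'I-ORG']))                    # False
```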
Entity Linking
from sentence_transformers import SentenceTransformer
import faiss
class EntityLinker:
def __init__(self, knowledge_base):
"""
knowledge_base: List of {'id': str, 'name': str, 'description': str}
"""
self.kb = knowledge_base
self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
# Encode all entities
descriptions = [e['description'] for e in knowledge_base]
self.embeddings = self.encoder.encode(descriptions, convert_to_numpy=True)
# Build FAISS index
self.index = faiss.IndexFlatL2(self.embeddings.shape[1])
self.index.add(self.embeddings)
def link_entity(self, entity_name, top_k=5):
"""Find best matching entities from knowledge base"""
query_embedding = self.encoder.encode([entity_name])
distances, indices = self.index.search(query_embedding, top_k)
results = []
for idx in indices[0]:
results.append(self.kb[idx])
return results
# Usage
kb = [
{'id': 'Q312', 'name': 'Apple', 'description': 'American technology company'},
{'id': 'Q93344', 'name': 'Apple', 'description': 'Fruit'},
]
linker = EntityLinker(kb)
results = linker.link_entity('Apple')
Production Considerations
Confidence Scoring
@torch.no_grad()
def extract_entities_with_confidence(text, confidence_threshold=0.5):
    """Extract entity tokens with softmax confidence scores"""
    words = text.split()
    tokenized = tokenizer(
        words,
        is_split_into_words=True,
        return_tensors='pt',
        truncation=True,
    )
    outputs = model(**tokenized)
    probabilities = torch.softmax(outputs.logits, dim=-1)
    predictions = torch.argmax(probabilities, dim=-1)
    entities = []
    for pred, prob, word_id in zip(predictions[0], probabilities[0],
                                   tokenized.word_ids()):
        if word_id is None:
            continue  # skip special tokens ([CLS], [SEP])
        label = id2label[pred.item()]
        confidence = prob[pred].item()
        if confidence >= confidence_threshold and label != 'O':
            entities.append({
                'text': words[word_id],
                'label': label,
                'confidence': confidence,
            })
    return entities
Handling Long Documents
def extract_entities_long_doc(text, stride=256):
"""Process long documents with sliding window"""
sentences = text.split('.')
all_entities = []
for i in range(0, len(sentences), stride):
window = '.'.join(sentences[i:i+stride])
entities = extract_entities(window)
all_entities.extend(entities)
# Deduplicate
unique_entities = {e['text']: e for e in all_entities}
return list(unique_entities.values())
Key Takeaway
Token-level classification powers information extraction, from named entities to relations. Mastering BIO tagging, subword handling, and post-processing unlocks access to the semantic structure of text for knowledge graphs, question answering, and document understanding.
Practical Exercise
Task: Build an information extraction system for scientific papers.
Requirements:
- Fine-tune BERT for paper NER (authors, institutions, methods, results)
- Implement relation extraction (author-affiliation, method-dataset)
- Extract and link entities to external databases
- Handle multi-word entities and abbreviations
- Evaluate on annotated test set
Evaluation:
- Per-entity-type F1 scores
- Relation extraction accuracy
- Entity linking precision
- Robustness on edge cases (acronyms, abbreviations)
- Latency benchmarks on full papers