Pre-Training LLMs: Data and Architecture
Pre-training large language models on massive corpora is the foundation of modern NLP. This lesson covers data pipeline design, deduplication strategies, tokenization, the causal language modeling objective, and the scaling laws that govern model and data size.
Core Concepts
Data Pipeline for LLM Pre-training
Data Sources:
- Common Crawl: ~2 trillion usable tokens from raw web snapshots
- Wikipedia: ~20 billion tokens across all languages (English alone is a few billion)
- Books: 100+ billion tokens (Project Gutenberg, BookCorpus)
- Code: GitHub, StackOverflow
- Academic: ArXiv papers, scientific documents
Scale Considerations:
- GPT-2: 40GB of WebText (web pages linked from Reddit, not Common Crawl)
- GPT-3: ~300 billion training tokens
- PaLM 2: over 1 trillion tokens (reported)
- Rule of thumb: ~20 training tokens per parameter (Chinchilla-optimal)
Data Deduplication and Filtering
Challenges:
- Web data contains duplicates, low-quality content, toxic material
- Naive deduplication can lose valuable information
- Must balance quality and diversity
Techniques:
- Exact Deduplication: Remove identical documents
- Near-Duplicate Detection: MinHash/LSH over document shingles (see the sketch after the funnel below); Bloom filters give memory-efficient exact-membership tests
- Quality Filtering: Language models, heuristics (URL patterns, length)
- PII Removal: Detect and mask personal information
Example pipeline yield:
Input: 1 trillion tokens
├─ Remove exact duplicates: 800 billion
├─ Remove low quality: 600 billion
├─ Remove PII: 550 billion
└─ Final dataset: 550 billion high-quality tokens
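Exact hashing (as in the pipeline code later in this lesson) misses near-duplicates that differ by only a few words. A minimal near-duplicate detector, assuming the datasketch library (any MinHash/LSH implementation works the same way):

from datasketch import MinHash, MinHashLSH

def minhash_signature(text, num_perm=128):
    """MinHash signature over 3-word shingles."""
    m = MinHash(num_perm=num_perm)
    words = text.split()
    for i in range(max(1, len(words) - 2)):
        m.update(' '.join(words[i:i + 3]).encode('utf-8'))
    return m

# Query-then-insert keeps the first copy of each near-duplicate cluster
lsh = MinHashLSH(threshold=0.8, num_perm=128)  # ~80% estimated Jaccard similarity
docs = {
    'a': 'the quick brown fox jumps over the lazy dog every single day',
    'b': 'the quick brown fox jumps over the lazy dog every single day again',
}
kept = []
for key, text in docs.items():
    sig = minhash_signature(text)
    if not lsh.query(sig):          # no similar document indexed yet
        lsh.insert(key, sig)
        kept.append(key)
print(kept)                          # likely ['a']: 'b' is a near-duplicate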
Tokenization for LLMs
Byte-Pair Encoding (BPE):
- Iteratively merge most frequent byte pairs
- Effective for multiple languages
- Vocabulary size typically 50K-100K
SentencePiece:
- Language-agnostic: operates on raw text with no pre-tokenization
- Treats whitespace as part of tokens (encoded with the ▁ meta-symbol)
- Popular in multilingual models
WordPiece:
- Merges pairs that maximize likelihood
- Used in BERT and other models
Token Budget Allocation (illustrative):
Total vocab: 100,000
├─ Common words: 50,000
├─ Rare words/subwords: 40,000
└─ Special and reserved tokens ([CLS], [MASK], unused slots): 10,000
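Training a BPE tokenizer from scratch takes only a few lines with the Hugging Face tokenizers library; a minimal sketch, where 'corpus.txt' is a hypothetical raw-text file:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Start from an empty BPE model and learn merges from the corpus
tokenizer = Tokenizer(BPE(unk_token='[UNK]'))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(
    vocab_size=50_000,  # typical range is 50K-100K
    special_tokens=['[UNK]', '[CLS]', '[SEP]', '[MASK]', '[PAD]'],
)
tokenizer.train(files=['corpus.txt'], trainer=trainer)
print(tokenizer.encode('Pre-training large language models').tokens)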
Scaling Laws
Empirical findings (Kaplan et al. 2020; Hoffmann et al. 2022):
Loss(N, D) = E + A / N^α + B / D^β
Where:
- E: irreducible error (the entropy of natural text)
- N: parameter count
- D: dataset size in tokens
- A, B, α, β: fitted constants (Kaplan et al. report α ≈ 0.076, β ≈ 0.095; the Chinchilla refit gives α ≈ 0.34, β ≈ 0.28)
Practical implications:
- Doubling compute lowers loss by only a few percent (loss ∝ C^-0.05 in Kaplan et al.), so every gain is expensive
- Compute-optimal allocation (Chinchilla): grow parameters and tokens together, N ∝ C^0.5 and D ∝ C^0.5
- Larger models therefore need proportionally more data, not just more training steps
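To make the formula concrete, here is a sketch evaluating the parametric fit with the coefficients reported in Hoffmann et al. (2022): E = 1.69, A = 406.4, B = 410.7, α = 0.34, β = 0.28.

def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted pre-training loss for N parameters trained on D tokens."""
    return E + A / N**alpha + B / D**beta

# Chinchilla itself: 70B parameters, 1.4T tokens
print(f"{chinchilla_loss(70e9, 1.4e12):.2f}")    # ~1.94
# Same compute spent on a 4x larger, under-trained model
print(f"{chinchilla_loss(280e9, 0.35e12):.2f}")  # ~1.98: data-starved, higher loss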
Practical Implementation
Building a Data Pipeline
import datasets
from datasets import load_dataset, concatenate_datasets
import hashlib
from collections import defaultdict
class DataPipeline:
def __init__(self, output_path='preprocessed_data'):
self.output_path = output_path
self.duplicates = defaultdict(list)
    def load_raw_data(self, sources):
        """Load data from multiple sources and normalize to a single 'content' column"""
        # Hub dataset names are illustrative; 'c4' is a cleaned Common Crawl snapshot
        name_map = {
            'wikipedia': ('wikipedia', '20220301.en'),
            'common_crawl': ('c4', 'en'),
            'books': ('bookcorpus', None),
        }
        all_data = []
        for source in sources:
            name, config = name_map[source]
            data = load_dataset(name, config, split='train') if config else load_dataset(name, split='train')
            data = data.rename_column('text', 'content')
            # Drop source-specific columns so the datasets can be concatenated
            data = data.remove_columns([c for c in data.column_names if c != 'content'])
            all_data.append(data)
        return concatenate_datasets(all_data)
    def filter_quality(self, dataset):
        """Remove low-quality documents"""
        # Load the wordlist once, not per document ('english_words.txt' is a local wordlist)
        english_words = set(open('english_words.txt').read().split())
        def is_quality(example):
            words = example['content'].split()
            # Length check
            if len(words) < 50:
                return False
            # Language check (simple heuristic over the first 100 words)
            sample = words[:100]
            word_ratio = sum(1 for w in sample if w.lower() in english_words) / len(sample)
            if word_ratio < 0.3:
                return False
            # Toxicity check (simplified; production systems use trained classifiers)
            toxic_words = {'hate', 'kill', 'violence'}
            toxic_ratio = sum(1 for w in words if w.lower() in toxic_words) / len(words)
            if toxic_ratio > 0.05:
                return False
            return True
        # is_quality operates on single examples, so batched must stay False
        return dataset.filter(is_quality)
    def deduplicate(self, dataset):
        """Remove exact duplicates by content hash (near-duplicates need MinHash/LSH, as sketched earlier)"""
        seen_hashes = set()
        unique_docs = []
        for doc in dataset:
            doc_hash = hashlib.md5(doc['content'].encode()).hexdigest()
            if doc_hash not in seen_hashes:
                seen_hashes.add(doc_hash)
                unique_docs.append(doc)
        return datasets.Dataset.from_dict({
            'content': [d['content'] for d in unique_docs]
        })
def tokenize(self, dataset, tokenizer):
"""Tokenize documents"""
def tokenize_function(examples):
return tokenizer(
examples['content'],
truncation=False,
return_special_tokens_mask=True,
)
return dataset.map(
tokenize_function,
batched=True,
remove_columns=['content'],
)
def process(self, sources, tokenizer):
"""Full pipeline"""
print("Loading data...")
dataset = self.load_raw_data(sources)
print("Filtering quality...")
dataset = self.filter_quality(dataset)
print("Deduplicating...")
dataset = self.deduplicate(dataset)
print("Tokenizing...")
dataset = self.tokenize(dataset, tokenizer)
print(f"Final dataset: {len(dataset)} documents")
return dataset
# Usage
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')
pipeline = DataPipeline()
dataset = pipeline.process(['wikipedia', 'common_crawl'], tokenizer)
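At real pre-training scale the corpus will not fit in memory or on a single disk; the datasets library can stream sources lazily, and the same filter logic applies. A sketch:

from datasets import load_dataset

# streaming=True yields an IterableDataset: records arrive lazily over HTTP
streamed = load_dataset('wikipedia', '20220301.en', split='train', streaming=True)
streamed = streamed.filter(lambda ex: len(ex['text'].split()) >= 50)
for i, doc in enumerate(streamed):
    if i >= 3:      # peek at the first few documents
        break
    print(doc['text'][:80])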
Causal Language Modeling Training
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from datasets import load_dataset
class CLMTrainer:
def __init__(self, model_name='gpt2', block_size=1024):
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.model = AutoModelForCausalLM.from_pretrained(model_name)
self.block_size = block_size
def preprocess_dataset(self, dataset):
"""Tokenize and concatenate into fixed-size blocks"""
        def tokenize_function(examples):
            # No truncation here: group_texts below packs the token stream into fixed blocks
            return self.tokenizer(examples['text'])
tokenized = dataset.map(
tokenize_function,
batched=True,
remove_columns=['text'],
)
# Concatenate tokenized texts
def group_texts(examples):
concatenated = {k: sum(examples[k], []) for k in examples.keys()}
total_length = len(concatenated[list(examples.keys())[0]])
total_length = (total_length // self.block_size) * self.block_size
result = {
k: [t[i:i + self.block_size] for i in range(0, total_length, self.block_size)]
for k, t in concatenated.items()
}
result['labels'] = result['input_ids'].copy()
return result
return tokenized.map(
group_texts,
batched=True,
batch_size=1000,
)
def train(self, dataset, output_dir='./gpt2-pretrained', num_epochs=3):
"""Train causal LM"""
training_args = TrainingArguments(
output_dir=output_dir,
num_train_epochs=num_epochs,
per_device_train_batch_size=32,
per_device_eval_batch_size=32,
warmup_steps=500,
weight_decay=0.01,
logging_steps=100,
save_steps=1000,
gradient_accumulation_steps=4,
bf16=True,
ddp_find_unused_parameters=False,
)
trainer = Trainer(
model=self.model,
args=training_args,
train_dataset=dataset,
)
trainer.train()
return trainer
# Usage
trainer = CLMTrainer(model_name='gpt2')
dataset = trainer.preprocess_dataset(load_dataset('wikitext', 'wikitext-2-raw-v1')['train'])
trainer.train(dataset)
Multi-GPU Training with DeepSpeed
import deepspeed
import torch
from torch.utils.data import DataLoader
class DeepSpeedTrainer:
def __init__(self, model, model_engine):
self.model = model
self.model_engine = model_engine
    def train_epoch(self, train_loader):
        total_loss = 0
        for batch in train_loader:
            # DeepSpeed places the model; move each batch to the same device
            input_ids = batch['input_ids'].to(self.model_engine.device)
            attention_mask = batch['attention_mask'].to(self.model_engine.device)
            labels = batch['labels'].to(self.model_engine.device)
            outputs = self.model_engine(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            # DeepSpeed handles loss scaling, gradient accumulation, and the optimizer step
            self.model_engine.backward(loss)
            self.model_engine.step()
            total_loss += loss.item()
        return total_loss / len(train_loader)
# Initialize with DeepSpeed config
# (train_batch_size = micro-batch per GPU * gradient_accumulation_steps * num GPUs)
ds_config = {
    'train_batch_size': 32,
'gradient_accumulation_steps': 4,
'optimizer': {
'type': 'AdamW',
'params': {'lr': 2e-5, 'betas': [0.9, 0.95]}
},
'zero_optimization': {
'stage': 2,
'offload_optimizer': {'device': 'cpu'}
},
'fp16': {'enabled': True},
}
# A model to shard; any Hugging Face causal LM works here
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('gpt2')
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config,
    model_parameters=model.parameters(),
)
trainer = DeepSpeedTrainer(model, model_engine)
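A minimal driver, assuming dataset is a tokenized dataset of fixed-length blocks (as produced by CLMTrainer.preprocess_dataset above) and that the script is launched with the DeepSpeed CLI (deepspeed train.py) so distributed state is initialized:

from transformers import default_data_collator

# Micro-batch per GPU = train_batch_size / (grad_accum * num_gpus) = 32 / 4 = 8 on one GPU
train_loader = DataLoader(dataset, batch_size=8, collate_fn=default_data_collator)
for epoch in range(3):
    avg_loss = trainer.train_epoch(train_loader)
    print(f"epoch {epoch}: mean loss {avg_loss:.4f}")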
Advanced Techniques
Token Mixing and Curriculum Learning
class CurriculumLearning:
    def __init__(self, model, threshold=3.0, max_length=512):
        self.model = model
        self.threshold = threshold            # validation loss required to advance
        self.max_length = max_length
        self.current_difficulty = 0.25        # start with short sequences
    def adjust_difficulty(self, validation_loss, epoch):
        """Gradually increase difficulty (longer sequences, more complex data)"""
        if epoch % 10 == 0 and validation_loss < self.threshold:
            self.current_difficulty = min(1.0, self.current_difficulty + 0.1)
        return int(self.max_length * self.current_difficulty)
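Hypothetical usage inside the outer training loop; the returned value caps the sequence length for the next phase:

curriculum = CurriculumLearning(model, threshold=3.0)  # model and threshold are illustrative
for epoch in range(0, 40, 10):
    val_loss = 2.8                        # stand-in for a real validation pass
    max_len = curriculum.adjust_difficulty(val_loss, epoch)
    print(f"epoch {epoch}: sequences up to {max_len} tokens")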
Computing Optimal Allocation (Chinchilla)
def compute_optimal_allocation(total_flops):
    """
    Chinchilla scaling laws:
    FLOPs ≈ 6 * N * D
    Compute-optimal: D ≈ 20 * N (about 20 tokens per parameter)
    """
    flops_per_token_per_param = 6
    N_D_product = total_flops / flops_per_token_per_param  # N * D
    # Solve 6 * N * (20 * N) = total_flops
    N = int((N_D_product / 20) ** 0.5)
    D = 20 * N
    return N, D

# Example: 1e21 FLOPs
params, tokens = compute_optimal_allocation(1e21)
print(f"Parameters: {params/1e9:.1f}B, Tokens: {tokens/1e9:.0f}B")  # ~2.9B params, ~58B tokens
Production Considerations
Checkpointing Strategy
import torch

class CheckpointManager:
    def __init__(self, model_dir='checkpoints'):
        self.model_dir = model_dir
        self.best_loss = float('inf')
    def save_checkpoint(self, model, optimizer, epoch, loss):
        """Save when loss improves (production runs also keep periodic checkpoints for resumption)"""
if loss < self.best_loss:
self.best_loss = loss
checkpoint = {
'epoch': epoch,
'model': model.state_dict(),
'optimizer': optimizer.state_dict(),
'loss': loss,
}
torch.save(checkpoint, f'{self.model_dir}/best_model.pt')
def load_checkpoint(self, model, optimizer, checkpoint_path):
"""Resume training from checkpoint"""
checkpoint = torch.load(checkpoint_path)
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
return checkpoint['epoch']
Monitoring and Logging
import wandb
def setup_logging():
wandb.init(
project='llm-pretraining',
config={
'model': 'gpt2-large',
'batch_size': 32,
'learning_rate': 2e-5,
}
)
def log_metrics(epoch, loss, perplexity, learning_rate):
wandb.log({
'epoch': epoch,
'loss': loss,
'perplexity': perplexity,
'learning_rate': learning_rate,
})
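Perplexity is just exp(mean cross-entropy loss), so it can be derived at logging time; a sketch assuming trainer is the Hugging Face Trainer returned by CLMTrainer.train above, configured with an eval split:

import math

# trainer.evaluate() returns metrics including the mean eval loss
metrics = trainer.evaluate()
perplexity = math.exp(metrics['eval_loss'])
log_metrics(epoch=1, loss=metrics['eval_loss'],
            perplexity=perplexity, learning_rate=2e-5)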
Key Takeaway
Pre-training at scale requires careful orchestration of data pipelines, deduplication, and tokenization, all guided by scaling laws. Understanding Chinchilla scaling and compute-optimal allocation drives decisions about model and dataset size, ensuring efficient use of computational resources.
Practical Exercise
Task: Pre-train a small LLM from scratch on a curated corpus.
Requirements:
- Collect and preprocess 50GB+ of text data
- Implement deduplication and quality filtering
- Train a GPT-2-sized model (124M parameters)
- Apply Chinchilla scaling laws for optimal allocation
- Monitor training with perplexity and validation loss
- Save and evaluate checkpoints
Evaluation:
- Final validation perplexity < 25
- Reproduce scaling law predictions
- Analyze data efficiency (tokens vs loss curve)
- Compare with public pre-trained models
- Profile training speed and memory usage