Hugging Face Transformers Library
The Hugging Face Transformers library provides a unified interface to thousands of pre-trained models. This lesson covers the core APIs (pipeline, AutoModel, AutoTokenizer), fine-tuning with the Trainer API, and publishing models to the Hugging Face Model Hub.
Core Concepts
The Transformers Ecosystem
Core Components:
- Transformers Library: 50,000+ pre-trained models
- Model Hub: Central repository for sharing models
- Datasets Library: Easy access to 1000+ datasets
- Accelerate: Distributed training made simple
- PEFT: Parameter-efficient fine-tuning (LoRA, adapters)
Pipeline API: High-Level Interface
from transformers import pipeline
# Unified interface for common tasks
classifier = pipeline('sentiment-analysis')
result = classifier('This movie is absolutely wonderful!')
# Output: [{'label': 'POSITIVE', 'score': 0.999}]
Available pipelines:
- Text classification
- Token classification (NER)
- Question answering
- Fill-mask
- Text generation
- Translation
- Summarization
- Zero-shot classification
AutoModel and AutoTokenizer
Auto classes inspect a checkpoint's configuration and load the matching model/tokenizer classes automatically:
from transformers import AutoModel, AutoTokenizer
# No need to specify "BertModel", "GPT2LMHeadModel", etc.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
# Works for any model architecture
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name) # Different architecture, same API
model = AutoModel.from_pretrained(model_name)
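Under the hood, most of these tokenizers are sub-word based: unknown words are split into pieces that are in the vocabulary. The sketch below is a toy, dependency-free illustration of WordPiece-style greedy longest-match splitting with a made-up vocabulary; real tokenizers are more sophisticated, but the idea is the same.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first sub-word split, WordPiece-style.

    Continuation pieces are prefixed with '##'; if nothing matches,
    the whole word becomes the unknown token.
    """
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else '##' + word[start:end]
            if piece in vocab:
                tokens.append(piece)
                break
            end -= 1
        if end == start:  # no vocabulary piece matched
            return ['[UNK]']
        start = end
    return tokens

vocab = {'play', '##ing', '##ed', 'un', '##play'}
print(wordpiece_tokenize('playing', vocab))  # ['play', '##ing']
print(wordpiece_tokenize('played', vocab))   # ['play', '##ed']
```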
Task-Specific Model Classes
from transformers import (
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoModelForCausalLM,
    AutoModelForMaskedLM,
    AutoModelForQuestionAnswering,
)
# Load pre-configured for task
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# Automatically adds classification head
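Concretely, "adds a classification head" means a linear layer mapping the encoder's pooled hidden state (size 768 for `bert-base-uncased`) to `num_labels` logits. A minimal shape sketch in plain Python, with zero weights standing in for the learned head:

```python
# Sketch of a sequence-classification head: logits = pooled @ W + b.
# Zero weights here are purely to make the shapes concrete.
hidden_size, num_labels = 768, 2

def classification_head(pooled, W, b):
    """Map a pooled hidden state (hidden_size,) to num_labels logits."""
    return [sum(p * w for p, w in zip(pooled, row)) + bias
            for row, bias in zip(W, b)]

W = [[0.0] * hidden_size for _ in range(num_labels)]  # (num_labels, hidden_size)
b = [0.0] * num_labels
pooled = [0.1] * hidden_size  # stand-in for the encoder's [CLS] representation
logits = classification_head(pooled, W, b)
print(len(logits))  # 2 — one logit per label
```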
Practical Implementation
Quick Start: Text Classification
from transformers import pipeline
# Zero-shot: no fine-tuning needed
classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')
text = "I absolutely love this product!"
labels = ['positive', 'negative', 'neutral']
result = classifier(text, labels)
print(result)
# {'sequence': 'I absolutely love this product!',
# 'labels': ['positive', 'negative', 'neutral'],
# 'scores': [0.96, 0.03, 0.01]}
Fine-tuning with Trainer API
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
)
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
import numpy as np
# Load dataset and model
dataset = load_dataset('glue', 'sst2')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# Tokenize
def preprocess_function(examples):
    return tokenizer(
        examples['sentence'],
        padding='max_length',
        truncation=True,
        max_length=128,
    )
tokenized_datasets = dataset.map(preprocess_function, batched=True)
# Define metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_score(labels, predictions),
    }
# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    save_steps=1000,
    evaluation_strategy='steps',
    eval_steps=500,
)
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    compute_metrics=compute_metrics,
    data_collator=DataCollatorWithPadding(tokenizer),
)
# Train
trainer.train()
# Evaluate
results = trainer.evaluate()
print(results)
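Before launching a long run it is worth sanity-checking the metric function on toy logits. The sketch below mirrors the argmax-over-logits convention `compute_metrics` relies on, with a hand-rolled accuracy so it needs no dependencies; the logits and labels are invented for illustration.

```python
# Logits have shape (batch, num_labels); the prediction is the argmax per row.
def argmax_predictions(logits):
    return [max(range(len(row)), key=lambda j: row[j]) for row in logits]

def accuracy(labels, predictions):
    return sum(l == p for l, p in zip(labels, predictions)) / len(labels)

logits = [[2.0, -1.0], [0.1, 0.9], [-0.5, 0.5], [1.5, 1.0]]
labels = [0, 1, 0, 0]
preds = argmax_predictions(logits)
print(preds)                    # [0, 1, 1, 0]
print(accuracy(labels, preds))  # 0.75 — the third example is misclassified
```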
Named Entity Recognition
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
)
from datasets import load_dataset
# Load the NER dataset first so the label set can size the model head
dataset = load_dataset('conll2003')
# BIO tag names, e.g. 'O', 'B-PER', 'I-PER', ...
label_list = dataset['train'].features['ner_tags'].names
model = AutoModelForTokenClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=len(label_list),
)
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples['tokens'],
        truncation=True,
        is_split_into_words=True,
    )
    labels = []
    for i, label in enumerate(examples['ner_tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # Special tokens: ignored by the loss
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])  # First sub-token keeps the label
            else:
                label_ids.append(-100)  # Continuation sub-tokens: ignored
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs['labels'] = labels
    return tokenized_inputs
tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
# Training (same Trainer pattern as classification, but the collator
# must pad the label sequences as well as the inputs)
from transformers import DataCollatorForTokenClassification
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
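The -100 masking convention is easiest to see on a mock example: only a word's first sub-token keeps its label, and every position the loss should ignore gets -100. The sketch below uses a hand-written `word_ids` list standing in for what `tokenizer.word_ids()` returns when a word is split into several sub-tokens; the tag ids are made up.

```python
def align_labels(word_ids, word_labels):
    """Give each sub-token its word's label; -100 marks positions the loss ignores."""
    label_ids, previous = [], None
    for word_idx in word_ids:
        if word_idx is None:
            label_ids.append(-100)                   # special tokens ([CLS], [SEP])
        elif word_idx != previous:
            label_ids.append(word_labels[word_idx])  # first sub-token of the word
        else:
            label_ids.append(-100)                   # continuation sub-tokens
        previous = word_idx
    return label_ids

# "Washington" split into two sub-tokens -> word index 1 repeats
word_ids = [None, 0, 1, 1, 2, None]  # [CLS] George Wash ##ington slept [SEP]
word_labels = [1, 2, 0]              # B-PER, I-PER, O (toy tag ids)
print(align_labels(word_ids, word_labels))  # [-100, 1, 2, -100, 0, -100]
```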
Question Answering
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
)
from datasets import load_dataset
model = AutoModelForQuestionAnswering.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# SQuAD format dataset
dataset = load_dataset('squad')
def prepare_train_features(examples):
    # Tokenize question and context; long contexts are split into
    # overlapping chunks via the stride
    tokenized_examples = tokenizer(
        examples['question'],
        examples['context'],
        truncation='only_second',
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length',
    )
    sample_mapping = tokenized_examples.pop('overflow_to_sample_mapping')
    offset_mapping = tokenized_examples.pop('offset_mapping')
    tokenized_examples['start_positions'] = []
    tokenized_examples['end_positions'] = []
    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples['input_ids'][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)
        sequence_ids = tokenized_examples.sequence_ids(i)
        sample_index = sample_mapping[i]
        answers = examples['answers'][sample_index]
        answer_start = answers['answer_start'][0]
        answer_end = answer_start + len(answers['text'][0])
        # Find the span of tokens belonging to the context (sequence id 1)
        token_start_index = 0
        while sequence_ids[token_start_index] != 1:
            token_start_index += 1
        token_end_index = len(input_ids) - 1
        while sequence_ids[token_end_index] != 1:
            token_end_index -= 1
        # If the answer is not fully inside this chunk, point both labels at CLS
        if not (offsets[token_start_index][0] <= answer_start and answer_end <= offsets[token_end_index][1]):
            tokenized_examples['start_positions'].append(cls_index)
            tokenized_examples['end_positions'].append(cls_index)
        else:
            # Walk to the tokens containing the answer's first and last characters
            while token_start_index < len(offsets) and offsets[token_start_index][0] <= answer_start:
                token_start_index += 1
            tokenized_examples['start_positions'].append(token_start_index - 1)
            while offsets[token_end_index][1] >= answer_end:
                token_end_index -= 1
            tokenized_examples['end_positions'].append(token_end_index + 1)
    return tokenized_examples
train_dataset = dataset['train'].map(
    prepare_train_features,
    batched=True,
    remove_columns=dataset['train'].column_names,  # chunking changes the row count
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
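The character-to-token mapping is the subtle part, and it can be exercised without a model: given each token's (start_char, end_char) offsets, find the tokens containing the answer's first and last characters. The sketch below uses a simplified direct search rather than the two-pointer walk above, with invented offsets for a toy context.

```python
def char_span_to_token_span(offsets, answer_start, answer_end):
    """Return (start_token, end_token) covering characters [answer_start, answer_end)."""
    start_token = next(i for i, (s, e) in enumerate(offsets)
                       if s <= answer_start < e)
    end_token = next(i for i, (s, e) in enumerate(offsets)
                     if s < answer_end <= e)
    return start_token, end_token

# Context "Paris is nice": three tokens with their character offsets
offsets = [(0, 5), (6, 8), (9, 13)]  # 'Paris', 'is', 'nice'
# Answer "Paris" occupies characters [0, 5)
print(char_span_to_token_span(offsets, 0, 5))   # (0, 0)
# Answer "is nice" occupies characters [6, 13)
print(char_span_to_token_span(offsets, 6, 13))  # (1, 2)
```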
Advanced Techniques
Parameter-Efficient Fine-tuning with LoRA
from peft import get_peft_model, LoraConfig, TaskType
# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,  # Rank of the update matrices
    lora_alpha=32,
    lora_dropout=0.1,
    bias='none',
    target_modules=['query', 'value'],  # Which modules to adapt
)
# Load base model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# e.g.: trainable params: 592,130 || all params: 109,482,240 || trainable%: 0.54
# Train as usual with Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
)
trainer.train()
# Save adapter
model.save_pretrained('lora_checkpoint')
# Load adapter later
from peft import AutoPeftModelForSequenceClassification
model = AutoPeftModelForSequenceClassification.from_pretrained('lora_checkpoint')
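The parameter savings follow directly from the low-rank factorization: instead of training a full d_in × d_out weight update, LoRA trains two matrices of shapes d_in × r and r × d_out. A quick arithmetic sketch for one BERT-base-sized attention projection (d = 768, r = 8, matching the config above):

```python
def lora_param_counts(d_in, d_out, r):
    """Parameters of a full dense update vs. its rank-r LoRA factorization."""
    full = d_in * d_out
    lora = d_in * r + r * d_out  # A: (d_in, r), B: (r, d_out)
    return full, lora

full, lora = lora_param_counts(768, 768, r=8)
print(full)                  # 589824 weights in one query or value projection
print(lora)                  # 12288 trainable LoRA weights for the same module
print(f'{lora / full:.1%}')  # 2.1% — per-module fraction that is trained
```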
Pushing to Model Hub
# Authenticate with Hub
from huggingface_hub import notebook_login
notebook_login()
# Save locally
model.save_pretrained('./my-model')
tokenizer.save_pretrained('./my-model')
# Push to Hub
model.push_to_hub('my-awesome-model')
tokenizer.push_to_hub('my-awesome-model')
# Create model card
model_card = '''---
license: apache-2.0
language: en
datasets:
- glue
metrics:
- accuracy
- f1
---
# My Awesome Model
This is a BERT model fine-tuned on SST-2 for sentiment analysis.
## Usage
```python
from transformers import pipeline
classifier = pipeline('sentiment-analysis', model='myusername/my-awesome-model')
```
'''
with open('./my-model/README.md', 'w') as f:
    f.write(model_card)
# Push again with README
model.push_to_hub('my-awesome-model', commit_message='Add model card')
Custom Dataset and DataCollator
from datasets import Dataset
from transformers import DataCollatorWithPadding
# Create custom dataset
data = {
    'text': ['positive example 1', 'negative example 1'],
    'label': [1, 0],
}
dataset = Dataset.from_dict(data)
# Tokenize without padding; the collator pads each batch dynamically
def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=128)
tokenized_dataset = dataset.map(tokenize_function, batched=True)
# Custom data collator
collator = DataCollatorWithPadding(tokenizer)
# Use in trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=collator,
)
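The point of `DataCollatorWithPadding` is that padding happens per batch, to the longest sequence in that batch, rather than to a fixed global length. A dependency-free sketch of that behavior (token ids are made up; the real collator also returns tensors):

```python
def collate_with_padding(features, pad_id=0):
    """Pad each batch to its own longest sequence, DataCollatorWithPadding-style."""
    max_len = max(len(f['input_ids']) for f in features)
    batch = {'input_ids': [], 'attention_mask': []}
    for f in features:
        ids = f['input_ids']
        pad = [pad_id] * (max_len - len(ids))
        batch['input_ids'].append(ids + pad)
        batch['attention_mask'].append([1] * len(ids) + [0] * len(pad))
    return batch

batch = collate_with_padding([
    {'input_ids': [101, 2023, 102]},
    {'input_ids': [101, 2023, 2003, 2307, 102]},
])
print([len(ids) for ids in batch['input_ids']])  # [5, 5] — padded to batch max, not 128
```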
Production Considerations
Model Quantization
from transformers import BitsAndBytesConfig
# Load with 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # default outlier threshold
)
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    quantization_config=quantization_config,
    num_labels=2,
)
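The idea behind 8-bit loading: store each weight as an int8 value plus a shared scale factor, cutting memory roughly 4x versus float32. The sketch below shows basic absmax quantization in plain Python; the real bitsandbytes scheme is vector-wise with separate float handling for outliers (that is what `llm_int8_threshold` governs), so treat this only as an illustration of the principle.

```python
def absmax_quantize(weights):
    """Quantize floats to int8 range with a single absmax scale; return (ints, scale)."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.4, -1.27, 0.05, 0.9]
q, scale = absmax_quantize(weights)
print(q)  # [40, -127, 5, 90]
approx = dequantize(q, scale)
print(max(abs(w - a) for w, a in zip(weights, approx)) < scale)  # True: error below one step
```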
Distributed Training
# Just change training arguments
training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=4,
    bf16=True,  # bfloat16 mixed precision
    ddp_find_unused_parameters=False,
    # The Trainer automatically uses all visible GPUs
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
)
# Launch with: torchrun --nproc_per_node=4 train.py
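The effective batch size under data parallelism is per-device batch × gradient-accumulation steps × number of GPUs, and it is worth computing before choosing a learning rate. A quick sketch matching the arguments above, assuming the 4-GPU launch:

```python
def effective_batch_size(per_device, grad_accum, n_gpus):
    """Global batch size seen by each optimizer step under data parallelism."""
    return per_device * grad_accum * n_gpus

# per_device_train_batch_size=32, gradient_accumulation_steps=4, 4 GPUs
print(effective_batch_size(per_device=32, grad_accum=4, n_gpus=4))  # 512
```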
Inference API
from transformers import pipeline
import fastapi
app = fastapi.FastAPI()
classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')
@app.post('/predict')
def predict(text: str):
    return classifier(text)
Key Takeaway
Hugging Face Transformers democratizes deep learning by providing a unified interface to thousands of models. The Pipeline API handles common tasks instantly, while the Trainer API simplifies fine-tuning, making state-of-the-art NLP accessible to everyone.
Practical Exercise
Task: Build an end-to-end NLP application using Hugging Face: toxic comment classification with model hub deployment.
Requirements:
- Load Jigsaw Toxic Comments dataset
- Fine-tune BERT using Trainer API
- Implement multi-label classification (6 toxic categories)
- Optimize with class weights for imbalanced data
- Push best model to Hugging Face Hub
- Create API endpoint with FastAPI
Evaluation:
- Achieve 0.80+ ROC-AUC on validation
- Create interactive Gradio demo
- Write comprehensive model card
- Share the model publicly on the Hugging Face Hub with a working demo link
- Profile inference latency and memory usage