Intermediate

Hugging Face Transformers Library

Lesson 4 of 4 · Estimated Time: 50 min

The Hugging Face Transformers library provides a unified interface to thousands of pre-trained models. This lesson covers the core APIs (pipeline, AutoModel, AutoTokenizer), fine-tuning with the Trainer API, and publishing models to the Hugging Face Model Hub.

Core Concepts

The Transformers Ecosystem

Core Components:

  • Transformers Library: 50,000+ pre-trained models
  • Model Hub: Central repository for sharing models
  • Datasets Library: Easy access to 1000+ datasets
  • Accelerate: Distributed training made simple
  • PEFT: Parameter-efficient fine-tuning (LoRA, adapters)

Pipeline API: High-Level Interface

from transformers import pipeline

# Unified interface for common tasks
classifier = pipeline('sentiment-analysis')
result = classifier('This movie is absolutely wonderful!')
# Output: [{'label': 'POSITIVE', 'score': 0.999}]

Available pipelines:

  • Text classification
  • Token classification (NER)
  • Question answering
  • Fill-mask
  • Text generation
  • Translation
  • Summarization
  • Zero-shot classification

AutoModel and AutoTokenizer

Auto classes detect the architecture from a checkpoint's config and load the correct model/tokenizer classes automatically:

from transformers import AutoModel, AutoTokenizer

# No need to specify "BertModel", "GPT2LMHeadModel", etc.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Works for any model architecture
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)  # Different architecture, same API
model = AutoModel.from_pretrained(model_name)

Task-Specific Model Classes

from transformers import (
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoModelForCausalLM,
    AutoModelForMaskedLM,
    AutoModelForQuestionAnswering,
)

# Load pre-configured for task
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# Automatically adds classification head
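At inference time the classification head returns raw logits; a softmax turns them into class probabilities. A minimal NumPy sketch of that conversion, independent of any particular checkpoint:

```python
import numpy as np

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical logits from a 2-label classification head
logits = np.array([2.0, -1.0])
probs = softmax(logits)
print(probs)  # probabilities sum to 1; the larger logit wins
```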

Practical Implementation

Quick Start: Text Classification

from transformers import pipeline

# Zero-shot: no fine-tuning needed
classifier = pipeline('zero-shot-classification', model='facebook/bart-large-mnli')

text = "I absolutely love this product!"
labels = ['positive', 'negative', 'neutral']

result = classifier(text, labels)
print(result)
# {'sequence': 'I absolutely love this product!',
#  'labels': ['positive', 'negative', 'neutral'],
#  'scores': [0.96, 0.03, 0.01]}
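Under the hood, zero-shot classification frames each candidate label as an NLI hypothesis (roughly "This example is {label}.") and scores entailment with the NLI model; in the single-label setup the per-label entailment scores are then softmaxed against each other. A sketch of that final step, using made-up logits:

```python
import math

def zero_shot_scores(entailment_logits):
    """Softmax hypothetical per-label entailment logits into one distribution."""
    m = max(entailment_logits.values())
    exp = {k: math.exp(v - m) for k, v in entailment_logits.items()}
    total = sum(exp.values())
    return {k: v / total for k, v in exp.items()}

# Illustrative logits only, not real model output
scores = zero_shot_scores({'positive': 4.2, 'negative': 0.8, 'neutral': -0.5})
best = max(scores, key=scores.get)
print(best, round(scores[best], 2))
```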

Fine-tuning with Trainer API

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)
from datasets import load_dataset
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

# Load dataset and model
dataset = load_dataset('glue', 'sst2')
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Tokenize
def preprocess_function(examples):
    return tokenizer(
        examples['sentence'],
        padding='max_length',
        truncation=True,
        max_length=128,
    )

tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Define metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_score(labels, predictions),
    }

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    save_steps=1000,
    evaluation_strategy='steps',
    eval_steps=500,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    compute_metrics=compute_metrics,
    data_collator=DataCollatorWithPadding(tokenizer),
)

# Train
trainer.train()

# Evaluate
results = trainer.evaluate()
print(results)
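Because compute_metrics receives logits and labels as plain NumPy arrays, it can be sanity-checked offline with toy values before committing to a long training run:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {
        'accuracy': accuracy_score(labels, predictions),
        'f1': f1_score(labels, predictions),
    }

# Toy eval batch: argmax gives [1, 0, 1, 0], so 3 of 4 match the labels
logits = np.array([[0.1, 2.0], [3.0, 0.2], [0.5, 1.5], [1.0, 0.0]])
labels = np.array([1, 0, 0, 0])
metrics = compute_metrics((logits, labels))
print(metrics)  # accuracy 0.75
```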

Named Entity Recognition

from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
)
from datasets import load_dataset

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

# Load NER dataset
dataset = load_dataset('conll2003')

# BIO tag mapping
label_list = dataset['train'].features['ner_tags'].names

# num_labels must match the tag set
model = AutoModelForTokenClassification.from_pretrained(
    'distilbert-base-uncased',
    num_labels=len(label_list),
)

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples['tokens'],
        truncation=True,
        is_split_into_words=True,
    )

    labels = []
    for i, label in enumerate(examples['ner_tags']):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        label_ids = []
        previous_word_idx = None
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)  # Special tokens: ignored by the loss
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])  # First subword gets the word label
            else:
                label_ids.append(-100)  # Later subwords: mask from the loss
            previous_word_idx = word_idx
        labels.append(label_ids)

    tokenized_inputs['labels'] = labels
    return tokenized_inputs

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)

# Training (same Trainer pattern as classification, plus a collator
# that pads the labels alongside input_ids)
from transformers import DataCollatorForTokenClassification

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)

trainer.train()
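The subword alignment logic can be exercised without a tokenizer by faking a word_ids sequence. In this sketch, later subwords are masked with -100 (the common label_all_tokens=False choice); repeating the word's label on every subword instead is an alternative modeling decision:

```python
def align_labels(word_ids, word_labels, label_all_tokens=False):
    """Map word-level NER labels onto subword positions; -100 is ignored by the loss."""
    label_ids, previous = [], None
    for word_idx in word_ids:
        if word_idx is None:                # special tokens ([CLS], [SEP], padding)
            label_ids.append(-100)
        elif word_idx != previous:          # first subword of a word
            label_ids.append(word_labels[word_idx])
        else:                               # later subwords of the same word
            label_ids.append(word_labels[word_idx] if label_all_tokens else -100)
        previous = word_idx
    return label_ids

# Hypothetical split: [CLS], two subwords of word 0, word 1, [SEP]
print(align_labels([None, 0, 0, 1, None], [3, 0]))  # [-100, 3, -100, 0, -100]
```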

Question Answering

from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
)
from datasets import load_dataset

model = AutoModelForQuestionAnswering.from_pretrained('bert-base-uncased')
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# SQuAD format dataset
dataset = load_dataset('squad')

def prepare_train_features(examples):
    # Tokenize context and question
    tokenized_examples = tokenizer(
        examples['question'],
        examples['context'],
        truncation='only_second',
        max_length=384,
        stride=128,
        return_overflowing_tokens=True,
        return_offsets_mapping=True,
        padding='max_length',
    )

    sample_mapping = tokenized_examples.pop('overflow_to_sample_mapping')
    offset_mapping = tokenized_examples.pop('offset_mapping')

    tokenized_examples['start_positions'] = []
    tokenized_examples['end_positions'] = []

    for i, offsets in enumerate(offset_mapping):
        input_ids = tokenized_examples['input_ids'][i]
        cls_index = input_ids.index(tokenizer.cls_token_id)

        sequence_ids = tokenized_examples.sequence_ids(i)
        sample_index = sample_mapping[i]
        answers = examples['answers'][sample_index]
        answer_start = answers['answer_start'][0]
        answer_end = answer_start + len(answers['text'][0])

        # Find token indices
        token_start_index = 0
        while sequence_ids[token_start_index] != 1:
            token_start_index += 1

        token_end_index = len(input_ids) - 1
        while sequence_ids[token_end_index] != 1:
            token_end_index -= 1

        # If the answer is not fully inside this span, point both positions at [CLS]
        if not (offsets[token_start_index][0] <= answer_start and answer_end <= offsets[token_end_index][1]):
            tokenized_examples['start_positions'].append(cls_index)
            tokenized_examples['end_positions'].append(cls_index)
        else:
            # Last token whose character start is <= answer_start
            while token_start_index < len(offsets) and offsets[token_start_index][0] <= answer_start:
                token_start_index += 1
            tokenized_examples['start_positions'].append(token_start_index - 1)

            # First token (from the right) whose character end is >= answer_end
            while offsets[token_end_index][1] >= answer_end:
                token_end_index -= 1
            tokenized_examples['end_positions'].append(token_end_index + 1)

    return tokenized_examples

train_dataset = dataset['train'].map(
    prepare_train_features,
    batched=True,
    remove_columns=dataset['train'].column_names,  # overflow creates more features than examples
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()
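The offset arithmetic above boils down to one question: which tokens' character spans cover the answer's character span? A simplified, self-contained version of that lookup, with toy offsets not tied to any tokenizer:

```python
def char_span_to_token_span(offsets, answer_start, answer_end):
    """Find the first and last tokens whose character offsets cover [answer_start, answer_end)."""
    start = next(i for i, (s, e) in enumerate(offsets) if s <= answer_start < e)
    end = next(i for i in range(len(offsets) - 1, -1, -1)
               if offsets[i][0] < answer_end <= offsets[i][1])
    return start, end

# Four toy tokens with character offsets; the answer covers characters 4..11
offsets = [(0, 3), (4, 7), (8, 11), (12, 15)]
span = char_span_to_token_span(offsets, 4, 11)
print(span)  # tokens 1 through 2
```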

Advanced Techniques

Parameter-Efficient Fine-tuning with LoRA

from peft import get_peft_model, LoraConfig, TaskType

# LoRA configuration
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,  # Rank
    lora_alpha=32,
    lora_dropout=0.1,
    bias='none',
    target_modules=['query', 'value'],  # Which modules to adapt
)

# Load base model
model = AutoModelForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 592,130 || total params: 109,482,240 || trainable: 0.54%

# Train as usual with Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
    eval_dataset=tokenized_datasets['validation'],
)

trainer.train()

# Save adapter
model.save_pretrained('lora_checkpoint')

# Load adapter later
from peft import AutoPeftModelForSequenceClassification
model = AutoPeftModelForSequenceClassification.from_pretrained('lora_checkpoint')
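The small trainable fraction follows from LoRA's parameter count: adapting one d_out × d_in weight with rank r adds r · (d_in + d_out) parameters (the A and B matrices). Rough arithmetic for BERT-base query/value projections; the exact count printed by print_trainable_parameters also depends on target_modules and any saved classification head, so treat these numbers as illustrative:

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter pair (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)

hidden = 768    # BERT-base hidden size
layers = 12     # transformer layers
r = 8

per_matrix = lora_params(hidden, hidden, r)   # 12,288 per adapted matrix
total_added = layers * 2 * per_matrix         # query + value in every layer
base_params = 110_000_000                     # ~110M for BERT-base
print(total_added, f"{total_added / base_params:.2%}")  # well under 1% trainable
```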

Pushing to Model Hub

# Authenticate with Hub
from huggingface_hub import notebook_login
notebook_login()

# Save locally
model.save_pretrained('./my-model')
tokenizer.save_pretrained('./my-model')

# Push to Hub
model.push_to_hub('my-awesome-model')
tokenizer.push_to_hub('my-awesome-model')

# Create model card
model_card = '''---
license: apache-2.0
language: en
datasets:
- glue
metrics:
- accuracy
- f1
---

# My Awesome Model

This is a BERT model fine-tuned on SST-2 for sentiment analysis.

## Usage

```python
from transformers import pipeline
classifier = pipeline('sentiment-analysis', model='myusername/my-awesome-model')
```
'''

with open('./my-model/README.md', 'w') as f:
    f.write(model_card)

# Push again with README
model.push_to_hub('my-awesome-model', commit_message='Add model card')

Custom Dataset and DataCollator

from datasets import Dataset
from transformers import DataCollatorWithPadding

# Create custom dataset
data = {
    'text': ['positive example 1', 'negative example 1'],
    'label': [1, 0],
}
dataset = Dataset.from_dict(data)

# Tokenize
def tokenize_function(examples):
    # No padding here: DataCollatorWithPadding pads each batch dynamically
    return tokenizer(examples['text'], truncation=True, max_length=128)

tokenized_dataset = dataset.map(tokenize_function)

# Custom data collator
collator = DataCollatorWithPadding(tokenizer)

# Use in trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=collator,
)
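What DataCollatorWithPadding buys you is dynamic padding: each batch is padded only to its own longest sequence, not to a global max_length. A pure-Python sketch of the idea:

```python
def pad_batch(sequences, pad_id=0):
    """Pad variable-length token id lists to the longest sequence in the batch."""
    max_len = max(len(s) for s in sequences)
    input_ids = [s + [pad_id] * (max_len - len(s)) for s in sequences]
    attention_mask = [[1] * len(s) + [0] * (max_len - len(s)) for s in sequences]
    return {'input_ids': input_ids, 'attention_mask': attention_mask}

# Toy token ids: the shorter sequence is padded to length 3, and the
# attention mask marks the padding position as 0
batch = pad_batch([[101, 7592, 102], [101, 102]])
print(batch['input_ids'])  # [[101, 7592, 102], [101, 102, 0]]
```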

Production Considerations

Model Quantization

from transformers import BitsAndBytesConfig

# Load with 8-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,  # default outlier threshold
)

model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    quantization_config=quantization_config,
    num_labels=2,
)
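Conceptually, 8-bit quantization maps float weights to int8 via a scale factor; bitsandbytes adds per-row scaling and outlier handling on top of this. A minimal absmax sketch of the core idea, not the library's actual kernel:

```python
import numpy as np

def absmax_quantize(w):
    """Symmetric int8 quantization: scale by max |w| / 127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.4, -1.27, 0.05, 1.0], dtype=np.float32)
q, scale = absmax_quantize(w)
w_hat = dequantize(q, scale)
print(np.abs(w - w_hat).max())  # reconstruction error bounded by the scale
```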

Distributed Training

# Just change training arguments
training_args = TrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=4,
    bf16=True,  # bfloat16 precision
    ddp_find_unused_parameters=False,
    # Will automatically use all GPUs
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets['train'],
)

# Run: torchrun --nproc_per_node=4 train.py
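With data parallelism, each optimizer step sees per_device_train_batch_size × number of processes × gradient_accumulation_steps examples. Quick arithmetic for the configuration above, assuming 4 GPUs:

```python
def effective_batch_size(per_device, n_gpus, grad_accum):
    """Global batch size seen by each optimizer step under data parallelism."""
    return per_device * n_gpus * grad_accum

# 32 per device, 4 GPUs, 4 accumulation steps
print(effective_batch_size(32, 4, 4))  # 512
```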

Inference API

from transformers import pipeline
from fastapi import FastAPI

app = FastAPI()

classifier = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')

@app.post('/predict')
def predict(text: str):
    result = classifier(text)
    return result

Key Takeaway

Hugging Face Transformers democratizes deep learning by providing a unified interface to thousands of models. The Pipeline API handles common tasks instantly, while the Trainer API simplifies fine-tuning, making state-of-the-art NLP accessible to everyone.

Practical Exercise

Task: Build an end-to-end NLP application using Hugging Face: toxic comment classification with model hub deployment.

Requirements:

  1. Load Jigsaw Toxic Comments dataset
  2. Fine-tune BERT using Trainer API
  3. Implement multi-label classification (6 toxic categories)
  4. Optimize with class weights for imbalanced data
  5. Push best model to Hugging Face Hub
  6. Create API endpoint with FastAPI
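For requirement 4, one common recipe is to weight each label by its negative-to-positive ratio and pass the result as pos_weight to a BCE-with-logits loss. A minimal sketch of that computation on toy labels:

```python
def pos_weights(label_matrix):
    """Per-class n_negative / n_positive weights for a multi-label BCE loss."""
    n = len(label_matrix)
    weights = []
    for c in range(len(label_matrix[0])):
        pos = sum(row[c] for row in label_matrix)
        weights.append((n - pos) / max(pos, 1))
    return weights

# Toy multi-label matrix: class 0 is common, class 1 is rare
labels = [[1, 0], [1, 0], [1, 1], [0, 0]]
weights = pos_weights(labels)
print(weights)  # the rare class gets the larger weight
```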

Evaluation:

  • Achieve 0.80+ ROC-AUC on validation
  • Create interactive Gradio demo
  • Write comprehensive model card
  • Share on the Hugging Face Hub with a goal of 50+ likes
  • Profile inference latency and memory usage