Nitish Kumar

Before Fine-Tuning: Deep Dive into Transformers

Published on Sunday, Jul 6, 2025

1 min read


Recently, I started exploring fine-tuning models but realized I needed a solid understanding of transformers to know what exactly I was fine-tuning. This post captures my structured notes, learnings, and explanations of transformers, gathered from readings and videos, to build a strong foundation.


🚩 Why Transformers Matter

Transformers are the backbone of modern NLP, forming the architecture behind models like BERT, GPT, LLaMA, and T5. Introduced in the paper “Attention Is All You Need” (2017), transformers replaced the sequential processing of RNNs/LSTMs with attention mechanisms, enabling parallelization, scalability, and efficiency.


πŸ› οΈ The Architecture: Encoder-Decoder Structure

Transformers are composed of stacked encoder and decoder layers:

  • Encoder layers: Encode the input sequence into context-rich representations.
  • Decoder layers: Generate output sequences (useful in tasks like translation).

Each encoder layer contains:

  1. Multi-head self-attention
  2. Feed Forward Neural Network (FFNN)
  3. Layer Normalization + Residual Connections

Each decoder layer includes:

  1. Masked multi-head self-attention
  2. Encoder-decoder (cross) attention
  3. FFNN + Layer Normalization + Residuals

This structure enables deep representation learning while retaining interpretability.
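
To make this concrete, here is a minimal sketch using PyTorch's built-in nn.Transformer, just to show the stacked encoder-decoder structure described above (all hyperparameters are toy values I picked for illustration, not the paper's defaults):

```python
# A toy encoder-decoder transformer to illustrate the stacked structure.
# All hyperparameters below are illustrative, not the paper's defaults.
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=32,             # embedding dimension
    nhead=4,                # attention heads per layer
    num_encoder_layers=2,   # stack of encoder layers
    num_decoder_layers=2,   # stack of decoder layers
    batch_first=True,
)

src = torch.randn(1, 7, 32)  # encoder input: (batch, source length, d_model)
tgt = torch.randn(1, 5, 32)  # decoder input: (batch, target length, d_model)

out = model(src, tgt)
print(out.shape)  # torch.Size([1, 5, 32]) -- one vector per target position
```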


✨ Why Transformers > RNNs/LSTMs

From my notes and DataCamp:

  • RNNs/LSTMs process tokens sequentially, making them slow to train and hard to parallelize.
  • They struggle with long-term dependencies due to vanishing/exploding gradients, despite LSTM’s gating.
  • Transformers use self-attention, allowing:
    ✅ Better long-range dependency handling
    ✅ Parallel training
    ✅ Scalability to large datasets

[Figure: Encoder-Decoder Architecture]

🧩 The Core: Self-Attention and Q, K, V

Self-attention enables each token to attend to every other token and compute weighted representations.

Key steps:

  1. For each token, compute three vectors as linear projections of its embedding:

    • Query (Q)
    • Key (K)
    • Value (V)
  2. Compute attention scores using dot product:

    $$ \text{Attention Score} = Q \cdot K^T $$

  3. Scale scores to stabilize gradients:

    $$ \text{Scaled Score} = \frac{Q \cdot K^T}{\sqrt{d_k}} $$

    where $d_k$ is the dimension of the key vectors.

  4. Apply softmax for normalized attention weights:

    $$ \alpha = \text{softmax}\left(\frac{Q \cdot K^T}{\sqrt{d_k}}\right) $$

  5. Multiply with value vectors:

    $$ \text{Output} = \alpha \cdot V $$

This enables contextual representations, where tokens dynamically decide what to focus on.
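
Here is a minimal PyTorch sketch of those five steps on random toy embeddings (all dimensions, weights, and inputs are arbitrary and only for illustration):

```python
# Scaled dot-product self-attention, following steps 1-5 above.
import torch
import torch.nn.functional as F

d_model, d_k, seq_len = 8, 8, 5
x = torch.randn(seq_len, d_model)   # toy token embeddings

# Step 1: Q, K, V as linear projections of the embeddings
W_q = torch.randn(d_model, d_k)
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Steps 2-3: dot-product scores, scaled by sqrt(d_k)
scores = (Q @ K.T) / d_k ** 0.5     # shape: (seq_len, seq_len)

# Step 4: softmax over each row -> attention weights alpha
alpha = F.softmax(scores, dim=-1)

# Step 5: weighted sum of the value vectors
output = alpha @ V                  # shape: (seq_len, d_k)
print(output.shape)                 # torch.Size([5, 8])
```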


🔎 Multi-Head Attention: Detective Analogy

Limitation of a single attention head: it may capture only one type of relationship between tokens.

Multi-head attention:

  • Projects Q, K, V into multiple subspaces.
  • Performs attention in parallel across heads.
  • Concatenates and projects outputs.

Analogy from the Medium blog:

Think of multiple detectives solving a case:

  • One checks fingerprints.
  • Another investigates the timeline.
  • Another interviews witnesses.

Each focuses on different clues, combining them for a holistic understanding, similar to how multi-head attention captures various dependencies simultaneously.
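
To see the "project, attend in parallel, concatenate" steps without writing them by hand, here is a small sketch using PyTorch's nn.MultiheadAttention (the sizes are toy values I picked, not anything from the post):

```python
# Multi-head self-attention via PyTorch's built-in module.
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len = 16, 4, 6
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(1, seq_len, embed_dim)  # (batch, sequence, embedding)
out, weights = mha(x, x, x)             # self-attention: query = key = value = x

print(out.shape)      # torch.Size([1, 6, 16]) -- heads concatenated and projected
print(weights.shape)  # torch.Size([1, 6, 6])  -- attention weights (averaged over heads)
```

Each of the 4 heads here works in a 16 / 4 = 4-dimensional subspace, attends independently, and the results are concatenated and projected back to the 16-dimensional model space.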


⚡ Positional Encoding: Bringing Order to Tokens

Transformers lack recurrence and process tokens in parallel, requiring positional encodings to inject order information.

They:

✅ Help differentiate “Nitish loves AI.” vs. “AI loves Nitish.”
✅ Allow the model to learn position-based relationships.

Sinusoidal positional encoding:

$$ PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$

$$ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) $$

where:

  • $pos$: position in the sequence
  • $i$: dimension index
  • $d_{model}$: embedding dimension

These are added to embeddings before feeding into attention layers.
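
Here is a short sketch that builds those sinusoidal encodings directly from the two formulas (PyTorch is just my choice here, and the sequence length and $d_{model}$ are toy values):

```python
# Sinusoidal positional encodings, following the formulas above.
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # (seq_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)       # even dimension indices (2i)
    angle = pos / (10000 ** (two_i / d_model))                     # (seq_len, d_model/2)

    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(angle)  # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
# embeddings = token_embeddings + pe   # added before the attention layers
```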


πŸͺ Feed Forward Networks and Layer Normalization

After multi-head attention, each layer includes:

  • Position-wise FFNN:

    $$ \text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2 $$

  • Layer normalization and residual connections to stabilize and speed up training.
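
A minimal sketch of this FFNN + residual + layer-norm sub-layer (the sizes here are toy values; the original paper uses $d_{model} = 512$ with an inner dimension of 2048):

```python
# Position-wise feed-forward sub-layer with residual connection and layer norm.
import torch
import torch.nn as nn

d_model, d_ff = 16, 64   # toy sizes
ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),  # xW1 + b1
    nn.ReLU(),                 # max(0, .)
    nn.Linear(d_ff, d_model),  # (.)W2 + b2
)
norm = nn.LayerNorm(d_model)

x = torch.randn(1, 6, d_model)  # output of the attention sub-layer (toy input)
out = norm(x + ffn(x))          # residual connection, then layer normalization
```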

[Figure: Transformer Architecture]

🪄 Putting It All Together

A single transformer encoder layer:

  1. Input + positional encodings
  2. Multi-head self-attention + residual + layer norm
  3. FFNN + residual + layer norm

Stacking multiple layers allows deep, hierarchical understanding of input sequences.
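
Putting those three steps into code, here is a rough sketch of one encoder layer in PyTorch (hyperparameters are illustrative, and dropout and masking are omitted for brevity):

```python
# One transformer encoder layer: attention -> add & norm -> FFNN -> add & norm.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=16, num_heads=4, d_ff=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)  # multi-head self-attention
        x = self.norm1(x + attn_out)      # residual + layer norm
        x = self.norm2(x + self.ffn(x))   # FFNN + residual + layer norm
        return x

layer = EncoderLayer()
x = torch.randn(1, 10, 16)  # (batch, sequence, d_model), positional encodings already added
print(layer(x).shape)       # torch.Size([1, 10, 16])
```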


💡 Practical Insights (From My Notes and Readings)

✅ Transformers are parallelizable, allowing faster GPU training.
✅ They handle long dependencies efficiently.
✅ Core building blocks:

  • Self-attention
  • Multi-head attention
  • Positional encodings
  • FFNN layers

They are the foundation of models like GPT, BERT, LLaMA, and T5, which are pre-trained on large corpora and fine-tuned for downstream tasks.


📚 Resources I Used

These resources helped me clarify transformers intuitively:

  1. 📘 How Transformers Work - DataCamp
  2. ✍️ Transformers & Attention - Rakshit Kalra, Medium

If you found this helpful, let’s connect! I’m learning in public and would love to hear how you understood transformers or which part you found tricky while starting.


Happy Learning! 🚀