Stanford CME295 — Lec 4: Guide to LLM Training, Optimization, and Efficient Fine-Tuning

Published on Saturday, Jan 10, 2026



This weekend, I dove into Stanford CME295 (Lecture 4) to build a deeper understanding of how Large Language Models (LLMs) are trained, optimized, and fine-tuned in practice.
These lectures bridge the gap between theory and real-world systems, covering everything from next-token prediction to FlashAttention, LoRA, and QLoRA.


1. Foundations: How LLMs Generate Text

At their core, LLMs are trained to predict the next token given a sequence of previous tokens.
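To make the objective concrete, here is a toy next-token-prediction loss in PyTorch (the vocabulary size, logits, and token IDs are made-up placeholders, not anything from the lecture):

```python
# Toy next-token-prediction loss: shift the sequence by one so each position
# is trained to predict the token that follows it.
import torch
import torch.nn.functional as F

vocab, seq_len = 100, 16
logits = torch.randn(1, seq_len, vocab)          # model outputs at each position (placeholder)
tokens = torch.randint(0, vocab, (1, seq_len))   # the training sequence (placeholder)

loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),   # predictions at positions 0..n-2
    tokens[:, 1:].reshape(-1),           # targets are the next tokens (positions 1..n-1)
)
```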

Decoding Strategies (Inference-Time)

Once a model outputs a probability distribution over the vocabulary, we must decide how to pick the next token.

1. Greedy Decoding

  • Always selects the highest-probability token
  • Fast and deterministic
  • Often leads to repetitive or dull outputs

2. Beam Search

  • Keeps track of the top-k most probable sequences
  • Explores multiple paths simultaneously
  • Better quality than greedy, but more expensive and less diverse

3. Sampling-Based Decoding

  • Samples the next token from the probability distribution
  • Introduces randomness and diversity
  • Key hyperparameter: Temperature
    • Low temperature → more deterministic
    • High temperature → more creative, more random
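A minimal sketch of greedy decoding and temperature sampling from a single logits vector (beam search is omitted for brevity; the toy logits are made up):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])      # toy scores over a 4-token vocabulary

# Greedy: always pick the argmax token
greedy_token = torch.argmax(logits).item()

# Temperature sampling: divide logits by T before softmax, then sample
def sample(logits, temperature=1.0):
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

creative_token = sample(logits, temperature=1.5)   # flatter distribution, more random
focused_token = sample(logits, temperature=0.2)    # sharper distribution, near-greedy
```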

2. Inference Optimization (Production Reality)

Inference is often the dominant cost in real-world LLM systems.

Key Techniques

KV Cache (Very Important)

  • Stores Key and Value tensors from previous tokens
  • Avoids recomputing attention for the entire context every step
  • Crucial for fast autoregressive decoding
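A conceptual single-head sketch of the KV cache, purely illustrative (real systems cache K/V per layer and per head inside the model, usually as preallocated tensors):

```python
import torch

d_model, d_head = 512, 64
w_q = torch.randn(d_model, d_head)   # placeholder projection weights
w_k = torch.randn(d_model, d_head)
w_v = torch.randn(d_model, d_head)
k_cache, v_cache = [], []

def decode_step(x_new):
    """x_new: (1, d_model) hidden state of the newest token."""
    q = x_new @ w_q                      # query only for the new token
    k_cache.append(x_new @ w_k)          # append new K/V instead of recomputing them all
    v_cache.append(x_new @ w_v)
    K = torch.cat(k_cache, dim=0)        # (t, d_head) over all tokens so far
    V = torch.cat(v_cache, dim=0)
    attn = torch.softmax(q @ K.T / d_head ** 0.5, dim=-1)
    return attn @ V                      # (1, d_head) attention output for the new token

for _ in range(5):                        # simulate 5 decoding steps
    out = decode_step(torch.randn(1, d_model))
```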

PagedAttention

  • Memory-efficient attention mechanism
  • Helps avoid large contiguous memory allocations
  • Widely used in production inference engines
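As a usage-level example, vLLM is one such engine built around PagedAttention; the snippet below is illustrative only and the model name is a placeholder:

```python
# Serving with vLLM, which manages the KV cache in fixed-size pages under the hood.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-hf")          # placeholder model name
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain the KV cache in one sentence."], params)
print(outputs[0].outputs[0].text)
```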

3. Paradigm Shift in Machine Learning

Old ML Era

  • Train a model from scratch for each task

Transfer Learning Era

  • Reuse pretrained models
  • Fine-tune them for specific tasks

LLM Training Philosophy

  1. Pretraining: Learn general language understanding
  2. Tuning: Adapt the model to a specific behavior or task

4. Stage 1: Pretraining Large Language Models

What Is Pretraining?

  • Train on massive datasets
  • Objective: next-token prediction
  • Computationally and financially expensive

Data Mixtures

  • Web-scraped data (Common Crawl, Wikipedia)
  • Code (GitHub, StackOverflow)

Scale

  • Training data size: trillions of tokens
| Model   | Pretraining Size (Tokens) |
|---------|---------------------------|
| GPT-3   | 300 billion               |
| LLaMA 3 | ~15 trillion              |

5. Compute Notation (Used Everywhere)

FLOPs

  • Floating Point Operations
  • Measure of total compute
  • Training LLMs ≈ O(10²⁵ FLOPs)
  • Function of:
    • Number of parameters
    • Number of training tokens

FLOPS (FLOPs/sec)

  • Measures hardware speed
  • How fast your GPU executes operations
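To make the FLOPs-versus-FLOPS distinction concrete, here is a toy back-of-the-envelope calculation (all numbers, including the 40% utilization figure, are illustrative assumptions, not figures from the lecture):

```python
# Rough training-time estimate: total compute / effective cluster throughput.
total_flops = 1e25                 # total training compute (FLOPs)
gpu_flops   = 1e15                 # ~1 PFLOPS per GPU in low precision (assumed)
num_gpus    = 10_000
utilization = 0.40                 # fraction of peak actually achieved (assumed)

seconds = total_flops / (gpu_flops * num_gpus * utilization)
print(f"{seconds / 86_400:.1f} days")   # ~29 days under these assumptions
```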

6. Chinchilla Scaling Law

Given a fixed compute budget:

  • Optimal training occurs when
    #training tokens ≈ 20 × #model parameters

This showed that:

  • Many earlier models were over-parameterized and under-trained
  • Data quality and quantity matter as much as model size
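As a quick sanity check, here is the Chinchilla arithmetic for a hypothetical 70B-parameter model, together with the common C ≈ 6·N·D approximation for training FLOPs (that approximation is my addition, not something stated above):

```python
# Back-of-the-envelope Chinchilla-optimal sizing for a hypothetical 70B model.
params = 70e9
tokens = 20 * params            # Chinchilla-optimal token count ≈ 1.4 trillion
flops  = 6 * params * tokens    # ≈ 5.9e23 FLOPs of training compute (6·N·D rule of thumb)
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")
```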

7. Architectural Assumption

Most modern LLMs are decoder-only Transformer models


8. Challenges of Pretraining

Practical Challenges

  1. Extremely high cost (millions of dollars)
  2. Long training times
  3. Environmental impact (energy consumption)

Model-Level Challenges

  1. Knowledge cutoff
  2. Hard to edit or remove knowledge
  3. Risk of plagiarism and memorization

9. Training Optimization Techniques

LLMs are heavy on matrix multiplications, so efficiency is critical.


9.1 Data Parallelism

  • Split batch across multiple devices
  • Each device holds a full copy of the model
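A minimal sketch of data parallelism with PyTorch DDP, assuming the script is launched with torchrun so each process drives one GPU with a full model replica and its own slice of the batch:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")                   # one process per GPU
local_rank = int(os.environ["LOCAL_RANK"])        # set by torchrun
torch.cuda.set_device(local_rank)
model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
ddp_model = DDP(model, device_ids=[local_rank])   # gradients are all-reduced across replicas
```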

9.2 Data Parallelism with ZeRO

ZeRO (Zero Redundancy Optimizer):

  • Shards optimizer states, gradients, and parameters
  • Dramatically reduces memory usage
  • Enables training much larger models
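As a sketch, PyTorch FSDP implements the same sharding idea as DeepSpeed ZeRO (this assumes the process group is already initialized, as in the DDP sketch above):

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = torch.nn.Linear(4096, 4096).cuda()   # placeholder model
sharded_model = FSDP(model)   # params, grads, and optimizer state are sharded across ranks
```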

9.3 Model Parallelism

Split the model itself across devices.

Common Variants

  • Tensor Parallelism (TP): Split individual matrix multiplications across devices
  • Context Parallelism (CP): Split along the sequence dimension
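A toy illustration of the tensor-parallel idea: a weight matrix is split column-wise across two "devices" (simulated here as two tensors on one device), each computes its slice of the output, and the slices are concatenated:

```python
import torch

w = torch.randn(1024, 512)                # full weight (in_features, out_features)
w0, w1 = w[:, :256], w[:, 256:]           # column split across two "devices"
x = torch.randn(8, 1024)

y_parallel = torch.cat([x @ w0, x @ w1], dim=-1)   # each device computes a partial output
assert torch.allclose(y_parallel, x @ w, atol=1e-5)  # same result as the unsplit matmul
```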

10. FlashAttention: A Major Breakthrough

GPU Memory Hierarchy

  • HBM: Large, slow
  • SRAM: Small, fast

The goal: maximize SRAM usage, minimize HBM reads/writes


Vanilla Self-Attention

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Naive implementation:

  1. Load Q, K from HBM
  2. Compute $S = QK^T$, write S back to HBM
  3. Read S, compute $P = \text{softmax}(S)$, write P back to HBM
  4. Read P and V, compute the output $O = PV$, write O back to HBM

➡️ Every step round-trips through HBM, so HBM bandwidth becomes the bottleneck.


FlashAttention Idea #1: Tiling

  • Load small blocks of Q, K, V into SRAM
  • Perform end-to-end attention computation
  • Write final output to HBM

Key mathematical trick: the softmax over a full row of scores can be assembled from per-block softmaxes using rescaling factors:

$$ \text{softmax}([S_1, S_2, \dots, S_n]) = [\alpha_1 \, \text{softmax}(S_1),\ \alpha_2 \, \text{softmax}(S_2),\ \dots,\ \alpha_n \, \text{softmax}(S_n)] $$

where each $\alpha_i$ is a correction factor computed from running row maxima and sums.

No approximation — exact computation.
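To convince myself the tiling trick really is exact, here is a plain-PyTorch sketch of block-wise attention with online softmax rescaling. It is purely illustrative (the real FlashAttention is a fused GPU kernel), but the block-by-block result matches the naive computation:

```python
import torch

def tiled_attention(q, k, v, block_size=128):
    """q: (n, d), k/v: (m, d). Computes softmax(q k^T / sqrt(d)) v one K/V block at a time."""
    d = q.shape[-1]
    scale = d ** -0.5
    out = torch.zeros_like(q)                             # running (unnormalized) output
    row_max = torch.full((q.shape[0], 1), float("-inf"))  # running max per query row
    row_sum = torch.zeros(q.shape[0], 1)                  # running softmax denominator

    for start in range(0, k.shape[0], block_size):
        kb, vb = k[start:start + block_size], v[start:start + block_size]
        s = (q @ kb.T) * scale                            # scores for this block
        new_max = torch.maximum(row_max, s.max(dim=-1, keepdim=True).values)
        alpha = torch.exp(row_max - new_max)              # rescale factor for previous blocks
        p = torch.exp(s - new_max)                        # this block's softmax numerator
        row_sum = alpha * row_sum + p.sum(dim=-1, keepdim=True)
        out = alpha * out + p @ vb
        row_max = new_max

    return out / row_sum

q, k, v = torch.randn(4, 64), torch.randn(256, 64), torch.randn(256, 64)
ref = torch.softmax(q @ k.T / 64 ** 0.5, dim=-1) @ v
assert torch.allclose(tiled_attention(q, k, v), ref, atol=1e-4)   # exact up to float error
```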


FlashAttention Idea #2: Recompute Instead of Store

  • Don’t store intermediate tensors (S, P)
  • Recompute them when needed
  • More FLOPs, less memory traffic
  • Net result: faster runtime
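In practice you rarely write these kernels yourself. PyTorch's fused scaled_dot_product_attention can dispatch to a FlashAttention-style kernel when the hardware, dtypes, and shapes allow it; the shapes below are placeholders and a CUDA GPU is assumed:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) in FP16 on GPU, a layout the fused kernels expect
q = torch.randn(1, 8, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```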

11. Precision & Quantization

Floating-Point Representation

  • Sign
  • Exponent
  • Mantissa

Common formats:

  • FP64
  • FP32
  • FP16
  • BFLOAT16

Example:

  • FP16 → (1 sign, 5 exponent, 10 mantissa bits)

Using FP16 instead of FP32:

  • Roughly 2× higher throughput on GPUs such as the NVIDIA H100
  • Half the memory footprint
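A quick way to see the range/precision tradeoff between these formats is torch.finfo (the bit layouts in the comments follow from the IEEE FP16 and bfloat16 definitions):

```python
import torch

for dtype in (torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "smallest normal:", info.tiny, "eps:", info.eps)

# FP16: 5 exponent bits -> max ~65504 (narrow range, finer precision near 1)
# BF16: 8 exponent bits -> same range as FP32, but only 7 mantissa bits
```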

12. Mixed Precision Training

Objective

  • Reduce memory usage
  • Increase training speed

Strategy

  • Forward pass: low precision (FP16)
  • Backward pass: gradients in low precision
  • Weight updates: a master copy of the weights is kept in high precision (FP32)

Result:

  • Minimal accuracy loss
  • Significant performance gains
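Here is a minimal mixed-precision training step with PyTorch AMP, assuming a CUDA GPU; the model, data, and learning rate are placeholders. FP16 training normally also uses loss scaling (GradScaler) to avoid gradient underflow, which is included below:

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()            # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()

def train_step(x, y):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast("cuda", dtype=torch.float16):  # low-precision forward pass
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()                      # scaled low-precision backward pass
    scaler.step(optimizer)                             # unscales grads, FP32 weight update
    scaler.update()
    return loss.item()

x, y = torch.randn(32, 1024, device="cuda"), torch.randn(32, 1024, device="cuda")
loss = train_step(x, y)
```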

13. Supervised Fine-Tuning (SFT)

Core Idea

  • Start from pretrained weights
  • Further train on task-specific labeled data

Steps

  1. Collect (input, desired output) pairs
  2. Train using next-token prediction conditioned on input

Instruction Tuning (Special Case of SFT)

  • Data consists of instructions + responses
  • Goal: make the model a helpful assistant
  • “Graduates” the model into chat-style behavior
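Here is a sketch of how an (instruction, response) pair might be turned into a next-token-prediction example; the prompt template, the GPT-2 tokenizer, and the -100 label-masking convention follow common Hugging Face practice and are assumptions, not something prescribed by the lecture:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")   # placeholder tokenizer

def build_sft_example(instruction: str, response: str):
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response + tokenizer.eos_token, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + response_ids
    labels = [-100] * len(prompt_ids) + response_ids  # loss only on the response tokens
    return {"input_ids": input_ids, "labels": labels}

example = build_sft_example("Summarize: The cat sat on the mat.", "A cat sat on a mat.")
```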

14. Parameter-Efficient Fine-Tuning (PEFT)

Motivation

  • Full fine-tuning is GPU-intensive
  • Not accessible to everyone

LoRA (Low-Rank Adaptation)

Core Idea

Instead of updating the full weight matrix, LoRA learns a low-rank additive update:

$$ W = W_0 + B A $$

Where:

  • $W_0$: frozen pretrained weights
  • $A, B$: low-rank trainable matrices (rank $r$ much smaller than the weight dimensions)

Only a small fraction of parameters are trained.
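A bare-bones LoRA layer in PyTorch, purely to make the $W = W_0 + BA$ idea concrete (this is my own sketch, not the peft library's implementation; the rank and scaling values are arbitrary):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # freeze W0
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # (r, k), Gaussian init
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # (d, r), zero init
        self.scale = alpha / r

    def forward(self, x):
        # Frozen path plus the low-rank update B @ A applied to the input
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)  # only A and B
```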


Practical Observations

  • Similar performance to full fine-tuning
  • Most effective in feed-forward layers of decoder models

Fun Facts

  1. LoRA needs a higher learning rate
  2. LoRA performs worse than full fine-tuning at large batch sizes

15. QLoRA: Pushing Efficiency Further

Key Idea

  • Quantize frozen weights
  • Train LoRA adapters in full precision

$$ W_0 \rightarrow \text{quantized (frozen)}, \qquad B A \rightarrow \text{floating point (trainable)} $$

Key Technique

  • 4-bit NormalFloat (NF4)
  • Assumes weights follow a normal distribution

Advantages

  1. Massive VRAM savings
  2. Enables fine-tuning on smaller GPUs
  3. Excellent memory–quality tradeoff
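Here is a sketch of what a QLoRA-style setup looks like with the Hugging Face transformers, peft, and bitsandbytes stack; the model name is a placeholder and exact argument names can vary across library versions:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # 4-bit NormalFloat for the frozen weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # higher-precision compute for the matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",             # placeholder model name
    quantization_config=bnb_config,
)
lora_config = LoraConfig(r=16, lora_alpha=32,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)   # only the LoRA adapters are trainable
model.print_trainable_parameters()
```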

Closing Thoughts

These lectures clearly show that LLM performance is not just about bigger models — it’s about:

  • Efficient training
  • Smart memory usage
  • Transfer learning
  • Parameter-efficient adaptation

Understanding these concepts is essential if you’re working with LLMs beyond toy demos and into production systems.


If you’re exploring LLM systems, this is foundational knowledge worth revisiting multiple times.