Stanford CME295 — Lec 4: Guide to LLM Training, Optimization, and Efficient Fine-Tuning
Published on Saturday, Jan 10, 2026
This weekend, I dove into Stanford CME295 (Lecture 4) to build a deeper understanding of how Large Language Models (LLMs) are trained, optimized, and fine-tuned in practice.
These lectures bridge the gap between theory and real-world systems, covering everything from next-token prediction to FlashAttention, LoRA, and QLoRA.
1. Foundations: How LLMs Generate Text
At their core, LLMs are trained to predict the next token given a sequence of previous tokens.
Decoding Strategies (Inference-Time)
Once a model outputs a probability distribution over the vocabulary, we must decide how to pick the next token.
1. Greedy Decoding
- Always selects the highest-probability token
- Fast and deterministic
- Often leads to repetitive or dull outputs
2. Beam Search
- Keeps track of the top-k most probable sequences
- Explores multiple paths simultaneously
- Better quality than greedy, but expensive and less diverse
3. Sampling-Based Decoding
- Samples the next token from the probability distribution
- Introduces randomness and diversity
- Key hyperparameter: Temperature
- Low temperature → more deterministic
- High temperature → more creative, more random
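Here is a minimal sketch of these strategies in plain NumPy. The toy `logits` vector and the 5-token vocabulary are made up for illustration; a real model would produce logits over tens of thousands of tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def greedy(logits):
    # Always pick the single highest-probability token.
    return int(np.argmax(logits))

def sample_with_temperature(logits, temperature=1.0):
    # Low temperature sharpens the distribution (more deterministic),
    # high temperature flattens it (more random / creative).
    probs = softmax(logits / temperature)
    return int(rng.choice(len(probs), p=probs))

# Toy "next-token" logits over a 5-token vocabulary.
logits = np.array([2.0, 1.5, 0.3, -1.0, -2.0])

print(greedy(logits))                          # always token 0
print(sample_with_temperature(logits, 0.5))    # usually token 0 or 1
print(sample_with_temperature(logits, 2.0))    # much more varied
```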
2. Inference Optimization (Production Reality)
Inference is often the dominant cost in real-world LLM systems.
Key Techniques
KV Cache (Very Important)
- Stores Key and Value tensors from previous tokens
- Avoids recomputing attention for the entire context every step
- Crucial for fast autoregressive decoding
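A rough sketch of why the cache helps, using single-head attention in NumPy. The shapes and variable names are mine, not from the lecture; the point is that each decoding step only computes K and V for the newly generated token and reuses everything else.

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []

def decode_step(x_new):
    """x_new: embedding of the newly generated token, shape (d,)."""
    # Only the NEW token's K and V are computed; older ones come from the cache.
    k_cache.append(x_new @ Wk)
    v_cache.append(x_new @ Wv)
    q = x_new @ Wq                      # query for the current position only
    K = np.stack(k_cache)               # (t, d)
    V = np.stack(v_cache)               # (t, d)
    scores = K @ q / np.sqrt(d)         # attention over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                  # context vector for the new token

for _ in range(5):                      # 5 decoding steps, no recomputation of old K/V
    out = decode_step(rng.standard_normal(d))
print(out.shape)                        # (64,)
```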
PagedAttention
- Manages the KV cache in fixed-size blocks ("pages"), inspired by virtual memory
- Avoids large contiguous memory allocations and fragmentation
- Widely used in production inference engines (e.g., vLLM)
3. Paradigm Shift in Machine Learning
Old ML Era
- Train a model from scratch for each task
Transfer Learning Era
- Reuse pretrained models
- Fine-tune them for specific tasks
LLM Training Philosophy
- Pretraining: Learn general language understanding
- Tuning: Adapt the model to a specific behavior or task
4. Stage 1: Pretraining Large Language Models
What Is Pretraining?
- Train on massive datasets
- Objective: next-token prediction
- Computationally and financially expensive
Data Mixtures
- Web-scraped data (Common Crawl, Wikipedia)
- Code (GitHub, StackOverflow)
Scale
- Training data size: trillions of tokens
| Model | Pretraining Size (Tokens) |
|---|---|
| GPT-3 | 300 Billion |
| LLaMA 3 | ~15 Trillion |
5. Compute Notation (Used Everywhere)
FLOPs
- Floating Point Operations
- Measure of total compute
- Training a frontier LLM is on the order of 10²⁵ FLOPs
- Function of:
- Number of parameters
- Number of training tokens
FLOPS (FLOPs/sec)
- Measures hardware speed
- How fast your GPU executes operations
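A back-of-the-envelope illustration of the difference between the two. The `6 × parameters × tokens` approximation for training compute and the 40% utilization figure are common rules of thumb, not numbers from the lecture, and the model/cluster sizes are hypothetical.

```python
# Rough training-compute estimate: FLOPs ≈ 6 * parameters * tokens.
params = 70e9          # hypothetical 70B-parameter model
tokens = 1.4e12        # hypothetical 1.4T training tokens
total_flops = 6 * params * tokens                  # ≈ 5.9e23 FLOPs

# FLOPS (per second) is a hardware property. Suppose ~1e15 FLOPS per GPU
# at ~40% utilization, across 1,000 GPUs:
effective_flops_per_sec = 1e15 * 0.4 * 1000
days = total_flops / effective_flops_per_sec / 86400
print(f"{total_flops:.2e} FLOPs, ~{days:.0f} days")    # ≈ 17 days
```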
6. Chinchilla Scaling Law
Given a fixed compute budget:
- Optimal training occurs when
#training tokens ≈ 20 × #model parameters
This showed that:
- Many earlier models were over-parameterized and under-trained
- Data quality and quantity matter as much as model size
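Applying the rule to a concrete (hypothetical) model size, which is also where the 1.4T-token figure in the earlier compute estimate came from:

```python
# Chinchilla rule of thumb: compute-optimal token count ≈ 20 × parameter count.
params = 70e9                     # hypothetical 70B-parameter model
optimal_tokens = 20 * params      # ≈ 1.4e12, i.e. ~1.4 trillion tokens
print(f"{optimal_tokens:.1e} tokens")
```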
7. Architectural Assumption
Most modern LLMs are decoder-only Transformer models
8. Challenges of Pretraining
Practical Challenges
- Extremely high cost (millions of dollars)
- Long training times
- Environmental impact (energy consumption)
Model-Level Challenges
- Knowledge cutoff
- Hard to edit or remove knowledge
- Risk of plagiarism and memorization
9. Training Optimization Techniques
LLMs are heavy on matrix multiplications, so efficiency is critical.
9.1 Data Parallelism
- Split batch across multiple devices
- Each device holds a full copy of the model
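A toy simulation of the idea: no real multi-GPU setup, just four in-process "replicas" that each see a shard of the batch, with gradient averaging standing in for the all-reduce step.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 1)
replicas = [torch.nn.Linear(16, 1) for _ in range(4)]
for r in replicas:                          # every "device" holds a full copy of the weights
    r.load_state_dict(model.state_dict())

x, y = torch.randn(32, 16), torch.randn(32, 1)
shards = list(zip(x.chunk(4), y.chunk(4)))  # split the batch across 4 "devices"

grads = []
for r, (xs, ys) in zip(replicas, shards):
    loss = torch.nn.functional.mse_loss(r(xs), ys)
    loss.backward()
    grads.append([p.grad.clone() for p in r.parameters()])

# All-reduce: average gradients across replicas, then every copy applies the same update.
avg = [torch.stack(gs).mean(0) for gs in zip(*grads)]
with torch.no_grad():
    for p, g in zip(model.parameters(), avg):
        p -= 0.1 * g
```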
9.2 Data Parallelism with ZeRO
ZeRO (Zero Redundancy Optimizer):
- Shards optimizer states, gradients, and parameters
- Dramatically reduces memory usage
- Enables training much larger models
9.3 Model Parallelism
Split the model itself across devices.
Common Variants
- Tensor Parallelism (TP): Split matrix multiplications
- Context Parallelism (CP): Split along the sequence (context) dimension
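A minimal illustration of tensor parallelism for a single matmul. The two "devices" are just two weight slices here; the names and sizes are mine.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 512)          # activations
W = torch.randn(512, 2048)       # weight matrix of one layer

# Column-wise tensor parallelism: each device holds half of W's output columns.
W0, W1 = W.chunk(2, dim=1)       # "device 0" and "device 1" shards
y0 = x @ W0                      # computed on device 0
y1 = x @ W1                      # computed on device 1
y = torch.cat([y0, y1], dim=1)   # gather/concatenate the partial outputs

assert torch.allclose(y, x @ W, atol=1e-5)
```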
10. FlashAttention: A Major Breakthrough
GPU Memory Hierarchy
- HBM: Large, slow
- SRAM: Small, fast
The goal: maximize SRAM usage, minimize HBM reads/writes
Vanilla Self-Attention
$$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Naive implementation:
- Load Q and K from HBM
- Compute $S = QK^\top$, write S back to HBM
- Read S, compute $P = \text{softmax}(S)$, write P back to HBM
- Read P and V, compute the output $O = PV$
➡️ All these reads and writes of large intermediate matrices make HBM bandwidth the bottleneck.
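The vanilla computation in code, as a minimal single-head PyTorch sketch of my own (no masking or batching), just to make the materialized intermediates visible:

```python
import torch

def vanilla_attention(Q, K, V):
    # Materializes the full (n, n) score and probability matrices:
    # these are the intermediates that get shuttled to and from HBM.
    d_k = Q.shape[-1]
    S = Q @ K.transpose(-2, -1) / d_k**0.5    # (n, n) scores
    P = torch.softmax(S, dim=-1)              # (n, n) probabilities
    return P @ V                              # (n, d) output

n, d = 1024, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
O = vanilla_attention(Q, K, V)
print(O.shape)    # torch.Size([1024, 64])
```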
FlashAttention Idea #1: Tiling
- Load small blocks of Q, K, V into SRAM
- Perform end-to-end attention computation
- Write final output to HBM
Key mathematical trick (online softmax): the softmax of a row split into blocks can be computed block by block, rescaling earlier partial results as each new block arrives:
$$\text{softmax}([S_1, S_2, \dots, S_n]) = [\alpha_1\,\text{softmax}(S_1), \; \dots, \; \alpha_n\,\text{softmax}(S_n)]$$
where each $\alpha_i$ is a correction factor derived from the running max and normalizer.
No approximation — exact computation.
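A sketch of the blockwise computation, processing K/V in tiles while keeping a running max, normalizer, and output. This is a simplified version of the idea in plain PyTorch, not the real fused kernel, but it reproduces the vanilla result exactly:

```python
import torch

def tiled_attention(Q, K, V, block=128):
    # Process K/V in blocks, keeping a running max (m), normalizer (l) and output (O).
    n, d = Q.shape
    m = torch.full((n, 1), float("-inf"))
    l = torch.zeros(n, 1)
    O = torch.zeros(n, d)
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start+block], V[start:start+block]
        S = Q @ Kb.T / d**0.5                      # scores for this block only
        m_new = torch.maximum(m, S.max(dim=-1, keepdim=True).values)
        alpha = torch.exp(m - m_new)               # correction factor for previous blocks
        P = torch.exp(S - m_new)
        l = alpha * l + P.sum(dim=-1, keepdim=True)
        O = alpha * O + P @ Vb
        m = m_new
    return O / l

n, d = 1024, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
ref = torch.softmax(Q @ K.T / d**0.5, dim=-1) @ V
assert torch.allclose(tiled_attention(Q, K, V), ref, atol=1e-4)
```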
FlashAttention Idea #2: Recompute Instead of Store
- Don’t store intermediate tensors (S, P)
- Recompute them when needed
- More FLOPs, less memory traffic
- Net result: faster runtime
11. Precision & Quantization
Floating-Point Representation
- Sign
- Exponent
- Mantissa
Common formats:
- FP64
- FP32
- FP16
- BFLOAT16
Example:
- FP16 → (1 sign, 5 exponent, 10 mantissa bits)
Using FP16 instead of FP32:
- Roughly 2× higher throughput on GPUs such as the NVIDIA H100
- Half the memory usage
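A quick way to see the memory side of this in PyTorch (the sizes in the comment are what I'd expect from the element sizes, not lecture numbers):

```python
import torch

x = torch.randn(1024, 1024)                 # FP32 by default
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    t = x.to(dtype)
    mib = t.element_size() * t.numel() / 2**20
    print(dtype, f"{mib:.1f} MiB")          # 4.0 MiB for FP32, 2.0 MiB for FP16/BF16
```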
12. Mixed Precision Training
Objective
- Reduce memory usage
- Increase training speed
Strategy
- Forward pass: run in low precision (FP16)
- Backward pass: gradients computed in low precision, with loss scaling to avoid underflow
- Weight updates: master weights stored and updated in high precision (FP32)
Result:
- Minimal accuracy loss
- Significant performance gains
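In PyTorch, this pattern is what `torch.cuda.amp` automates. A minimal sketch under the assumption that a CUDA GPU is available; the model, data, and loss here are placeholders of my own.

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()         # handles loss scaling for FP16 gradients

for _ in range(10):
    x = torch.randn(32, 512, device="cuda")
    y = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # forward pass runs in low precision
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()            # scaled backward to avoid gradient underflow
    scaler.step(optimizer)                   # weights themselves stay in FP32
    scaler.update()
```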
13. Supervised Fine-Tuning (SFT)
Core Idea
- Start from pretrained weights
- Further train on task-specific labeled data
Steps
- Collect (input, desired output) pairs
- Train using next-token prediction conditioned on input
Instruction Tuning (Special Case of SFT)
- Data consists of instructions + responses
- Goal: make the model a helpful assistant
- “Graduates” the model into chat-style behavior
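One detail worth making concrete: SFT still trains with next-token prediction, but the loss is typically computed only on the response tokens. A toy sketch with made-up token IDs; the `-100` masking convention is PyTorch's default `ignore_index` for cross-entropy, not something specific to the lecture.

```python
import torch
import torch.nn.functional as F

# Pretend tokenized (instruction, response) pair.
prompt_ids   = torch.tensor([101, 42, 7, 13])     # e.g. "Translate to French: cat"
response_ids = torch.tensor([88, 9, 102])         # e.g. "chat </s>"

input_ids = torch.cat([prompt_ids, response_ids])
labels = input_ids.clone()
labels[: len(prompt_ids)] = -100                  # ignore prompt positions in the loss

vocab_size = 200
logits = torch.randn(len(input_ids), vocab_size)  # stand-in for the model's output

# Standard next-token shift: position t predicts token t+1.
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)
print(loss)
```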
14. Parameter-Efficient Fine-Tuning (PEFT)
Motivation
- Full fine-tuning is GPU-intensive
- Not accessible to everyone
LoRA (Low-Rank Adaptation)
Core Idea
Instead of updating the full weight matrix, learn a low-rank update:
$$W = W_0 + BA$$
Where:
- $W_0$: frozen pretrained weights
- $A, B$: low-rank trainable matrices (rank $r$ much smaller than the original dimensions)
Only a small fraction of parameters are trained.
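A minimal LoRA layer of my own to make the parameter count concrete; real implementations (e.g., the PEFT library) add dropout, weight merging, and more careful initialization.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)               # W0 stays frozen
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero init so the update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(4096, 4096, r=8)
total = sum(p.numel() for p in layer.parameters())
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable fraction: {trainable / total:.4%}")        # well under 1%
```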
Practical Observations
- Similar performance to full fine-tuning
- Most effective in feed-forward layers of decoder models
Fun Facts
- LoRA needs a higher learning rate
- LoRA performs worse than full fine-tuning at large batch sizes
15. QLoRA: Pushing Efficiency Further
Key Idea
- Quantize the frozen pretrained weights to 4-bit
- Train the LoRA adapters in regular floating-point precision (e.g., BF16)
$$W_0 \rightarrow \text{quantized (4-bit)}, \qquad BA \rightarrow \text{floating point}$$
Key Technique
- 4-bit NormalFloat (NF4)
- Assumes weights follow a normal distribution
Advantages
- Massive VRAM savings
- Enables fine-tuning on smaller GPUs
- Excellent memory–quality tradeoff
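Back-of-the-envelope memory math for why this matters. These are my own rough numbers for a hypothetical 7B model, ignoring activations, adapter optimizer states, and quantization overhead:

```python
# Hypothetical 7B-parameter model fine-tuned with QLoRA.
params = 7e9

fp16_weights_gb = params * 2 / 1e9       # 16-bit weights: ~14 GB
nf4_weights_gb  = params * 0.5 / 1e9     # 4-bit weights:  ~3.5 GB

lora_params = 40e6                       # hypothetical adapter size (~0.6% of 7B)
lora_gb = lora_params * 2 / 1e9          # adapters kept in 16-bit: ~0.08 GB

print(f"FP16 base: {fp16_weights_gb:.1f} GB, "
      f"NF4 base + adapters: {nf4_weights_gb + lora_gb:.1f} GB")
```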
Closing Thoughts
These lectures clearly show that LLM performance is not just about bigger models — it’s about:
- Efficient training
- Smart memory usage
- Transfer learning
- Parameter-efficient adaptation
Understanding these concepts is essential if you’re working with LLMs beyond toy demos and into production systems.
If you’re exploring LLM systems, this is foundational knowledge worth revisiting multiple times.