Stanford CME295 — Lec 4: Guide to LLM Training, Optimization, and Efficient Fine-Tuning
Published on Saturday, Jan 10, 2026
This weekend, I dove into Stanford CME295 (Lecture 4) to build a deeper understanding of how Large Language Models (LLMs) are trained, optimized, and fine-tuned in practice.
These lectures bridge the gap between theory and real-world systems, covering everything from next-token prediction to FlashAttention, LoRA, and QLoRA.
1. Foundations: How LLMs Generate Text
At their core, LLMs are trained to predict the next token given a sequence of previous tokens.
Decoding Strategies (Inference-Time)
Once a model outputs a probability distribution over the vocabulary, we must decide how to pick the next token.
1. Greedy Decoding
- Always selects the highest-probability token
- Fast and deterministic
- Often leads to repetitive or dull outputs
2. Beam Search
- Keeps track of the top-k most probable sequences
- Explores multiple paths simultaneously
- Better quality than greedy, but expensive and less diverse
3. Sampling-Based Decoding
- Samples the next token from the probability distribution
- Introduces randomness and diversity
- Key hyperparameter: Temperature
- Low temperature → more deterministic
- High temperature → more creative, more random
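Here is a minimal sketch of these strategies in plain NumPy. The toy `logits` vector and the 5-token vocabulary are made up for illustration; a real model would produce logits over tens of thousands of tokens.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()        # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def greedy(logits):
    # Always pick the single highest-probability token.
    return int(np.argmax(logits))

def sample_with_temperature(logits, temperature=1.0):
    # Low temperature sharpens the distribution (more deterministic),
    # high temperature flattens it (more random / creative).
    probs = softmax(logits / temperature)
    return int(rng.choice(len(probs), p=probs))

# Toy "next-token" logits over a 5-token vocabulary.
logits = np.array([2.0, 1.5, 0.3, -1.0, -2.0])

print(greedy(logits))                          # always token 0
print(sample_with_temperature(logits, 0.5))    # usually token 0 or 1
print(sample_with_temperature(logits, 2.0))    # much more varied
```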
2. Inference Optimization (Production Reality)
Inference is often the dominant cost in real-world LLM systems.
Key Techniques
KV Cache (Very Important)
- Stores Key and Value tensors from previous tokens
- Avoids recomputing attention for the entire context every step
- Crucial for fast autoregressive decoding
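A rough sketch of why the cache helps, using single-head attention in NumPy. The shapes and variable names are mine, not from the lecture; the point is that each decoding step only computes K and V for the newly generated token and reuses everything else.

```python
import numpy as np

d = 64
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

k_cache, v_cache = [], []

def decode_step(x_new):
    """x_new: embedding of the newly generated token, shape (d,)."""
    # Only the NEW token's K and V are computed; older ones come from the cache.
    k_cache.append(x_new @ Wk)
    v_cache.append(x_new @ Wv)
    q = x_new @ Wq                      # query for the current position only
    K = np.stack(k_cache)               # (t, d)
    V = np.stack(v_cache)               # (t, d)
    scores = K @ q / np.sqrt(d)         # attention over all cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                  # context vector for the new token

for _ in range(5):                      # 5 decoding steps, no recomputation of old K/V
    out = decode_step(rng.standard_normal(d))
print(out.shape)                        # (64,)
```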
PagedAttention
- Manages the KV cache in fixed-size blocks ("pages"), inspired by virtual memory
- Avoids large contiguous memory allocations and fragmentation
- Widely used in production inference engines (e.g., vLLM)
3. Paradigm Shift in Machine Learning
Old ML Era
- Train a model from scratch for each task
Transfer Learning Era
- Reuse pretrained models
- Fine-tune them for specific tasks
LLM Training Philosophy
- Pretraining: Learn general language understanding
- Tuning: Adapt the model to a specific behavior or task
4. Stage 1: Pretraining Large Language Models
What Is Pretraining?
- Train on massive datasets
- Objective: next-token prediction
- Computationally and financially expensive
Data Mixtures
- Web-scraped data (Common Crawl, Wikipedia)
- Code (GitHub, StackOverflow)
Scale
- Training data size: trillions of tokens
| Model | Pretraining Size (Tokens) |
|---|---|
| GPT-3 | 300 Billion |
| LLaMA 3 | ~15 Trillion |
5. Compute Notation (Used Everywhere)
FLOPs
- Floating Point Operations
- Measure of total compute
- Training a frontier LLM is on the order of 10²⁵ FLOPs
- Function of:
- Number of parameters
- Number of training tokens
FLOPS (FLOPs/sec)
- Measures hardware speed
- How fast your GPU executes operations
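A back-of-the-envelope illustration of the difference between the two. The `6 × parameters × tokens` approximation for training compute and the 40% utilization figure are common rules of thumb, not numbers from the lecture, and the model/cluster sizes are hypothetical.

```python
# Rough training-compute estimate: FLOPs ≈ 6 * parameters * tokens.
params = 70e9          # hypothetical 70B-parameter model
tokens = 1.4e12        # hypothetical 1.4T training tokens
total_flops = 6 * params * tokens                  # ≈ 5.9e23 FLOPs

# FLOPS (per second) is a hardware property. Suppose ~1e15 FLOPS per GPU
# at ~40% utilization, across 1,000 GPUs:
effective_flops_per_sec = 1e15 * 0.4 * 1000
days = total_flops / effective_flops_per_sec / 86400
print(f"{total_flops:.2e} FLOPs, ~{days:.0f} days")    # ≈ 17 days
```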
6. Chinchilla Scaling Law
Given a fixed compute budget:
- Optimal training occurs when
#training tokens ≈ 20 × #model parameters
This showed that:
- Many earlier models were over-parameterized and under-trained
- Data quality and quantity matter as much as model size
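Applying the rule to a concrete (hypothetical) model size, which is also where the 1.4T-token figure in the earlier compute estimate came from:

```python
# Chinchilla rule of thumb: compute-optimal token count ≈ 20 × parameter count.
params = 70e9                     # hypothetical 70B-parameter model
optimal_tokens = 20 * params      # ≈ 1.4e12, i.e. ~1.4 trillion tokens
print(f"{optimal_tokens:.1e} tokens")
```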
7. Architectural Assumption
Most modern LLMs are decoder-only Transformer models
8. Challenges of Pretraining
Practical Challenges
- Extremely high cost (millions of dollars)
- Long training times
- Environmental impact (energy consumption)
Model-Level Challenges
- Knowledge cutoff
- Hard to edit or remove knowledge
- Risk of plagiarism and memorization
9. Training Optimization Techniques
LLMs are heavy on matrix multiplications, so efficiency is critical.
9.1 Data Parallelism
- Split batch across multiple devices
- Each device holds a full copy of the model
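A toy simulation of the idea: no real multi-GPU setup, just four in-process "replicas" that each see a shard of the batch, with gradient averaging standing in for the all-reduce step.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 1)
replicas = [torch.nn.Linear(16, 1) for _ in range(4)]
for r in replicas:                          # every "device" holds a full copy of the weights
    r.load_state_dict(model.state_dict())

x, y = torch.randn(32, 16), torch.randn(32, 1)
shards = list(zip(x.chunk(4), y.chunk(4)))  # split the batch across 4 "devices"

grads = []
for r, (xs, ys) in zip(replicas, shards):
    loss = torch.nn.functional.mse_loss(r(xs), ys)
    loss.backward()
    grads.append([p.grad.clone() for p in r.parameters()])

# All-reduce: average gradients across replicas, then every copy applies the same update.
avg = [torch.stack(gs).mean(0) for gs in zip(*grads)]
with torch.no_grad():
    for p, g in zip(model.parameters(), avg):
        p -= 0.1 * g
```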
9.2 Data Parallelism with ZeRO
ZeRO (Zero Redundancy Optimizer):
- Shards optimizer states, gradients, and parameters
- Dramatically reduces memory usage
- Enables training much larger models
9.3 Model Parallelism
Split the model itself across devices.
Common Variants
- Tensor Parallelism (TP): Split matrix multiplications
- Context Parallelism (CP): Split along the sequence (context) dimension
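A minimal illustration of tensor parallelism for a single matmul. The two "devices" are just two weight slices here; the names and sizes are mine.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 512)          # activations
W = torch.randn(512, 2048)       # weight matrix of one layer

# Column-wise tensor parallelism: each device holds half of W's output columns.
W0, W1 = W.chunk(2, dim=1)       # "device 0" and "device 1" shards
y0 = x @ W0                      # computed on device 0
y1 = x @ W1                      # computed on device 1
y = torch.cat([y0, y1], dim=1)   # gather/concatenate the partial outputs

assert torch.allclose(y, x @ W, atol=1e-5)
```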
10. FlashAttention: A Major Breakthrough
GPU Memory Hierarchy
- HBM: Large, slow
- SRAM: Small, fast
The goal: maximize SRAM usage, minimize HBM reads/writes
Vanilla Self-Attention
$$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
Naive implementation:
- Load Q and K from HBM
- Compute $S = QK^\top$, write S back to HBM
- Read S, compute $P = \text{softmax}(S)$, write P back to HBM
- Read P and V, compute the output $O = PV$
➡️ All these reads and writes of large intermediate matrices make HBM bandwidth the bottleneck.
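The vanilla computation in code, as a minimal single-head PyTorch sketch of my own (no masking or batching), just to make the materialized intermediates visible:

```python
import torch

def vanilla_attention(Q, K, V):
    # Materializes the full (n, n) score and probability matrices:
    # these are the intermediates that get shuttled to and from HBM.
    d_k = Q.shape[-1]
    S = Q @ K.transpose(-2, -1) / d_k**0.5    # (n, n) scores
    P = torch.softmax(S, dim=-1)              # (n, n) probabilities
    return P @ V                              # (n, d) output

n, d = 1024, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
O = vanilla_attention(Q, K, V)
print(O.shape)    # torch.Size([1024, 64])
```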
FlashAttention Idea #1: Tiling
- Load small blocks of Q, K, V into SRAM
- Perform end-to-end attention computation
- Write final output to HBM
Key mathematical trick (online softmax): the softmax of a row split into blocks can be computed block by block, rescaling earlier partial results as each new block arrives:
$$\text{softmax}([S_1, S_2, \dots, S_n]) = [\alpha_1\,\text{softmax}(S_1), \; \dots, \; \alpha_n\,\text{softmax}(S_n)]$$
where each $\alpha_i$ is a correction factor derived from the running max and normalizer.
No approximation — exact computation.
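A sketch of the blockwise computation, processing K/V in tiles while keeping a running max, normalizer, and output. This is a simplified version of the idea in plain PyTorch, not the real fused kernel, but it reproduces the vanilla result exactly:

```python
import torch

def tiled_attention(Q, K, V, block=128):
    # Process K/V in blocks, keeping a running max (m), normalizer (l) and output (O).
    n, d = Q.shape
    m = torch.full((n, 1), float("-inf"))
    l = torch.zeros(n, 1)
    O = torch.zeros(n, d)
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start+block], V[start:start+block]
        S = Q @ Kb.T / d**0.5                      # scores for this block only
        m_new = torch.maximum(m, S.max(dim=-1, keepdim=True).values)
        alpha = torch.exp(m - m_new)               # correction factor for previous blocks
        P = torch.exp(S - m_new)
        l = alpha * l + P.sum(dim=-1, keepdim=True)
        O = alpha * O + P @ Vb
        m = m_new
    return O / l

n, d = 1024, 64
Q, K, V = (torch.randn(n, d) for _ in range(3))
ref = torch.softmax(Q @ K.T / d**0.5, dim=-1) @ V
assert torch.allclose(tiled_attention(Q, K, V), ref, atol=1e-4)
```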
FlashAttention Idea #2: Recompute Instead of Store
- Don’t store intermediate tensors (S, P)
- Recompute them when needed
- More FLOPs, less memory traffic
- Net result: faster runtime
11. Precision & Quantization
Floating-Point Representation
- Sign
- Exponent
- Mantissa
Common formats:
- FP64
- FP32
- FP16
- BFLOAT16
Example:
- FP16 → (1 sign, 5 exponent, 10 mantissa bits)
Using FP16 instead of FP32:
- Roughly 2× higher throughput on GPUs such as the NVIDIA H100
- Half the memory usage
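A quick way to see the memory side of this in PyTorch (the sizes in the comment are what I'd expect from the element sizes, not lecture numbers):

```python
import torch

x = torch.randn(1024, 1024)                 # FP32 by default
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    t = x.to(dtype)
    mib = t.element_size() * t.numel() / 2**20
    print(dtype, f"{mib:.1f} MiB")          # 4.0 MiB for FP32, 2.0 MiB for FP16/BF16
```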
12. Mixed Precision Training
Objective
- Reduce memory usage
- Increase training speed
Strategy
- Forward pass: run in low precision (FP16)
- Backward pass: gradients computed in low precision, with loss scaling to avoid underflow
- Weight updates: master weights stored and updated in high precision (FP32)
Result:
- Minimal accuracy loss
- Significant performance gains
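In PyTorch, this pattern is what `torch.cuda.amp` automates. A minimal sketch under the assumption that a CUDA GPU is available; the model, data, and loss here are placeholders of my own.

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
scaler = torch.cuda.amp.GradScaler()         # handles loss scaling for FP16 gradients

for _ in range(10):
    x = torch.randn(32, 512, device="cuda")
    y = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # forward pass runs in low precision
        loss = loss_fn(model(x), y)
    scaler.scale(loss).backward()            # scaled backward to avoid gradient underflow
    scaler.step(optimizer)                   # weights themselves stay in FP32
    scaler.update()
```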
13. Supervised Fine-Tuning (SFT)
Core Idea
- Start from pretrained weights
- Further train on task-specific labeled data
Steps
- Collect (input, desired output) pairs
- Train using next-token prediction conditioned on input
Instruction Tuning (Special Case of SFT)
- Data consists of instructions + responses
- Goal: make the model a helpful assistant
- “Graduates” the model into chat-style behavior
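One detail worth making concrete: SFT still trains with next-token prediction, but the loss is typically computed only on the response tokens. A toy sketch with made-up token IDs; the `-100` masking convention is PyTorch's default `ignore_index` for cross-entropy, not something specific to the lecture.

```python
import torch
import torch.nn.functional as F

# Pretend tokenized (instruction, response) pair.
prompt_ids   = torch.tensor([101, 42, 7, 13])     # e.g. "Translate to French: cat"
response_ids = torch.tensor([88, 9, 102])         # e.g. "chat </s>"

input_ids = torch.cat([prompt_ids, response_ids])
labels = input_ids.clone()
labels[: len(prompt_ids)] = -100                  # ignore prompt positions in the loss

vocab_size = 200
logits = torch.randn(len(input_ids), vocab_size)  # stand-in for the model's output

# Standard next-token shift: position t predicts token t+1.
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)
print(loss)
```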
14. Parameter-Efficient Fine-Tuning (PEFT)
Motivation
- Full fine-tuning is GPU-intensive
- Not accessible to everyone
LoRA (Low-Rank Adaptation)
Core Idea
Instead of updating the full weight matrix, learn a low-rank update:
$$W = W_0 + BA$$
Where:
- $W_0$: frozen pretrained weights
- $A, B$: low-rank trainable matrices (rank $r$ much smaller than the original dimensions)
Only a small fraction of parameters are trained.
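A minimal LoRA layer of my own to make the parameter count concrete; real implementations (e.g., the PEFT library) add dropout, weight merging, and more careful initialization.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)               # W0 stays frozen
        self.A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, r))  # zero init so the update starts at zero
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(4096, 4096, r=8)
total = sum(p.numel() for p in layer.parameters())
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable fraction: {trainable / total:.4%}")        # well under 1%
```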
Practical Observations
- Similar performance to full fine-tuning
- Most effective in feed-forward layers of decoder models
Fun Facts
- LoRA needs a higher learning rate
- LoRA performs worse than full fine-tuning at large batch sizes
15. QLoRA: Pushing Efficiency Further
Key Idea
- Quantize the frozen pretrained weights to 4-bit
- Train the LoRA adapters in regular floating-point precision (e.g., BF16)
$$W_0 \rightarrow \text{quantized (4-bit)}, \qquad BA \rightarrow \text{floating point}$$
Key Technique
- 4-bit NormalFloat (NF4)
- Assumes weights follow a normal distribution
Advantages
- Massive VRAM savings
- Enables fine-tuning on smaller GPUs
- Excellent memory–quality tradeoff
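Back-of-the-envelope memory math for why this matters. These are my own rough numbers for a hypothetical 7B model, ignoring activations, adapter optimizer states, and quantization overhead:

```python
# Hypothetical 7B-parameter model fine-tuned with QLoRA.
params = 7e9

fp16_weights_gb = params * 2 / 1e9       # 16-bit weights: ~14 GB
nf4_weights_gb  = params * 0.5 / 1e9     # 4-bit weights:  ~3.5 GB

lora_params = 40e6                       # hypothetical adapter size (~0.6% of 7B)
lora_gb = lora_params * 2 / 1e9          # adapters kept in 16-bit: ~0.08 GB

print(f"FP16 base: {fp16_weights_gb:.1f} GB, "
      f"NF4 base + adapters: {nf4_weights_gb + lora_gb:.1f} GB")
```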
Closing Thoughts
These lectures clearly show that LLM performance is not just about bigger models — it’s about:
- Efficient training
- Smart memory usage
- Transfer learning
- Parameter-efficient adaptation
Understanding these concepts is essential if you’re working with LLMs beyond toy demos and into production systems.
If you’re exploring LLM systems, this is foundational knowledge worth revisiting multiple times.