Notes on LLM
Preparation
Visit the official website: Stanford CS336 | Language Modeling from Scratch
Watch the recordings on YouTube: CS336_Spring_2026
Lecture materials on GitHub: CS336_Spring_2026
Online Lecture Materials Trace - lecture_01
Lecture 1. Overview, Tokenization
1. One bitter lesson
Wrong interpretation: Scale is all that matters, algorithms don’t matter.
Right interpretation: Algorithms and scale are what matter.
Accuracy = Efficiency × Resources
2. History of Language Models
2.1 Pre-neural (before 2010s)
- Language model to measure the entropy of English, shannon_1950
- N-gram language models (used in machine translation and speech recognition systems), brants_2007
2.2 Neural ingredients (2010s)
- Long Short-Term Memory (LSTM), lstm_1997
- First neural language model, bengio_2003
- Sequence-to-sequence modeling (for machine translation), seq2seq_2014
- Adam optimizer, adam_2014
- Attention mechanism (for machine translation), bahdanau_2015_attention
- Transformer architecture (for machine translation), transformer_2017
- Mixture of experts, moe_2017
- Model parallelism, gpipe_2018, zero_2019 and megatron_lm_2019
2.3 Early foundation models (late 2010s)
- ELMo: pretraining with LSTMs, fine-tuning improves downstream tasks, elmo_2018
- BERT: pretraining with Transformer, fine-tuning improves downstream tasks, bert_2018
- Google’s T5 (11B): cast everything as text-to-text, t5_2019
2.4 Embracing scaling
- OpenAI’s GPT-2 (1.5B): fluent text, first signs of zero-shot, gpt2_2019
- Scaling laws: provide hope / predictability for scaling, kaplan_scaling_laws_2020
- OpenAI’s GPT-3 (175B): in-context learning, gpt_3_2020
- Google’s PaLM (540B): massive scale, undertrained, palm_2022
- DeepMind’s Chinchilla (70B): compute-optimal scaling laws, chinchilla_2022
2.5 Open models
- Early attempts (attempts to replicate GPT-3)
- EleutherAI’s open datasets (The Pile) and models (GPT-J), the_pile_2020 and gpt_j_2021
- Meta’s OPT (175B): GPT-3 replication, lots of hardware issues, opt_175b_2022
- Hugging Face / BigScience’s BLOOM (176B): focused on data sourcing, bloom_2022
2.6 Credible open-weight models (weights + paper)
- Meta’s Llama models, llama_2023, llama_2_2023 and llama_3_2024
- Mistral's models, mistral_7b_2023 and mixtral_2024
- DeepSeek's models, deepseek_67b_2024, deepseek_v2_2024 and deepseek_v3_2024
- Alibaba's Qwen models, qwen_2_5_2024 and qwen_3_2025
- Moonshot’s Kimi models, kimi_1_5_2025 and kimi_k2_5_2026
- Z.ai’s GLM models, glm_4_5_2025 and glm_5_2026
- MiniMax's models, minimax_m2_5_2026
- Xiaomi’s MiMo models, xiaomi_mimo_v2_2026
- These models are approaching closed models (GPT, Claude, Gemini, etc.).
3. Course Syllabus
| Part | Assignment |
|---|---|
| basics | Assignment 1: tokenization, model architecture, training |
| systems | Assignment 2: kernels, parallelism, inference |
| scaling_laws | Assignment 3: scaling laws |
| data | Assignment 4: evaluation, curation, transformation, filtering, deduplication, mixing |
| alignment | Assignment 5: RLHF, RL algorithms, RL systems |
3. Tokenization
3.1 String, Token and Indices
Raw input text is generally represented as Unicode strings.
e.g.
`string = "Hello, world!"`
A string can be cut into smaller units called tokens. For example: `Hello`, `,`, ` world`, and `!`.
A language model places a probability distribution over sequences of tokens (usually represented by integer indices).
`indices = [15496, 11, 995, 0]`
To bridge strings and integers, we use a vocabulary (a mapping table) that converts tokens to integer IDs. Note that a token is not necessarily a whole word, and it is not a single character either.
| Operation | Direction | Example |
|---|---|---|
| Encode | string → tokens → indices | "Hello" → 15496 |
| Decode | indices → tokens → string | 15496 → "Hello" |
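The encode/decode round trip can be sketched with a toy vocabulary. This is only an illustration, not a real tokenizer: the segmentation is hard-coded, the table pairs each token from the example above with the ID at the same position (an assumption), and the names `vocab`, `encode`, and `decode` are made up for this sketch.

```python
# Toy vocabulary: token string -> integer ID.
# Assumption: each token is paired with the ID at the same position
# in the "Hello, world!" example above.
vocab = {"Hello": 15496, ",": 11, " world": 995, "!": 0}

# Inverse table for decoding: integer ID -> token string.
inverse_vocab = {index: token for token, index in vocab.items()}


def encode(tokens: list[str]) -> list[int]:
    """Map already-segmented tokens to their integer IDs."""
    return [vocab[token] for token in tokens]


def decode(indices: list[int]) -> str:
    """Map integer IDs back to tokens and concatenate them into a string."""
    return "".join(inverse_vocab[index] for index in indices)


tokens = ["Hello", ",", " world", "!"]
indices = encode(tokens)
print(indices)          # [15496, 11, 995, 0]
print(decode(indices))  # Hello, world!
```

A real tokenizer also has to decide how to segment the string into tokens, which is the hard part and is skipped here.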
To get a feel for how tokenizers work, try: interplay
Observations
- A word and its preceding space are part of the same token (e.g., " world").
- The same word is represented differently at the beginning of a string and in the middle (e.g., “hello hello”). This will be discussed later.
- Numbers are split into groups of a few digits.
GPT-5 tokenizer from OpenAI (tiktoken) in action:
`tokenizer = get_gpt5_tokenizer()`
Test code:
`import tiktoken`
Run result:
`Token IDs: [9906, 1917, 22691]`
3.2 Compression Ratio
Compression ratio: the number of bytes per token.
UTF-8 encodes each of the 128 ASCII characters in a single byte (8 bits), so the 23-character string “artificial intelligence” takes 23 bytes.
e.g.
| Input String | Vocabulary Size | byte_num | token[] | token_num | Compression Ratio |
|---|---|---|---|---|---|
| “artificial intelligence” | small | 23 | [art, i, fic, i, al, , in, t, el, lig, en, ce] | 12 | 1.92 |
| “artificial intelligence” | large | 23 | [artificial, , intelligence] | 3 | 7.67 |
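The ratios in the table can be reproduced with a few lines of Python. This is a minimal sketch: the function name `compression_ratio` is chosen for illustration, and the token lists are typed in by hand from the table rather than produced by a real tokenizer.

```python
def compression_ratio(string: str, tokens: list[str]) -> float:
    """Return the number of UTF-8 bytes per token."""
    num_bytes = len(string.encode("utf-8"))
    return num_bytes / len(tokens)


string = "artificial intelligence"
# Hand-written segmentations taken from the table (the " " entry is the space token).
small_vocab_tokens = ["art", "i", "fic", "i", "al", " ", "in", "t", "el", "lig", "en", "ce"]
large_vocab_tokens = ["artificial", " ", "intelligence"]

# Sanity check: both segmentations reassemble to the original string.
assert "".join(small_vocab_tokens) == string
assert "".join(large_vocab_tokens) == string

print(len(string.encode("utf-8")))                              # 23
print(round(compression_ratio(string, small_vocab_tokens), 2))  # 1.92
print(round(compression_ratio(string, large_vocab_tokens), 2))  # 7.67
```

A larger vocabulary lets each token cover more bytes, so the same string needs fewer tokens.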
There are also two related notions: the character-level compression ratio and the byte-level compression ratio.
In UTF-8, an ordinary English character (a, b, c…) takes 1 byte, while other characters take 2 to 4 bytes; most Chinese characters take 3 bytes.
For instance, ‘a’ is 1 character and 1 byte, while ‘你’ is 1 character but 3 bytes. So the character-level and byte-level compression ratios can differ.
| Input String | Vocabulary Size | byte_num | char_num | token[] | token_num | Compression Ratio(Char) | Compression Ratio(Byte) |
|---|---|---|---|---|---|---|---|
| “artificial intelligence” | small | 23 | 23 | [art, i, fic, i, al, , in, t, el, lig, en, ce] | 12 | 23/12 = 1.92 | 23/12 = 1.92 |
| “artificial intelligence” | large | 23 | 23 | [artificial, , intelligence] | 3 | 23/3 = 7.67 | 23/3 = 7.67 |
| “你好” | small | 6 | 2 | [你, 好] | 2 | 2/2 = 1 | 6/2 = 3 |
| “你好” | large | 6 | 2 | [你好] | 1 | 2/1 = 2 | 6/1 = 6 |
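The gap between the two ratios comes from how many UTF-8 bytes each character needs. The sketch below checks this for the “你好” rows; the function names are illustrative, and the token lists are again written by hand rather than produced by a tokenizer.

```python
def char_compression_ratio(string: str, tokens: list[str]) -> float:
    """Characters (Unicode code points) per token."""
    return len(string) / len(tokens)


def byte_compression_ratio(string: str, tokens: list[str]) -> float:
    """UTF-8 bytes per token."""
    return len(string.encode("utf-8")) / len(tokens)


# len() counts characters; encoding to UTF-8 first counts bytes.
print(len("a"), len("a".encode("utf-8")))    # 1 1
print(len("你"), len("你".encode("utf-8")))  # 1 3

string = "你好"
small_vocab_tokens = ["你", "好"]  # one token per character
large_vocab_tokens = ["你好"]      # one token for the whole string

print(char_compression_ratio(string, small_vocab_tokens))  # 1.0
print(byte_compression_ratio(string, small_vocab_tokens))  # 3.0
print(char_compression_ratio(string, large_vocab_tokens))  # 2.0
print(byte_compression_ratio(string, large_vocab_tokens))  # 6.0
```

For pure ASCII text the two functions return the same value, which is why the “artificial intelligence” rows show identical ratios.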
There are approximately 150K Unicode characters, so a tokenizer that assigns one token per character needs a very large vocabulary in which most entries are quite rare (inefficient!). Its compression ratio is also poor, because each token covers only a single character. (Note: I don’t get it now.)
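One way to make this concrete is a character-level tokenizer that uses each character's Unicode code point as its token ID. This is a sketch under that assumption; the names `char_encode` and `char_decode` are made up here, and the sample string is chosen only for illustration.

```python
def char_encode(string: str) -> list[int]:
    """One token per character: the token ID is the Unicode code point."""
    return [ord(char) for char in string]


def char_decode(indices: list[int]) -> str:
    """Invert char_encode by mapping each code point back to its character."""
    return "".join(chr(index) for index in indices)


string = "Hello, 你好!"
indices = char_encode(string)
print(indices)  # [72, 101, 108, 108, 111, 44, 32, 20320, 22909, 33]
assert char_decode(indices) == string

# The IDs for rare characters are large, so the vocabulary must be large
# even though most of its entries are almost never used.
print(max(indices))  # 22909

# 14 UTF-8 bytes spread over 10 tokens: each token carries very little text.
num_bytes = len(string.encode("utf-8"))
print(num_bytes / len(indices))  # 1.4
```

The ratio stays close to 1 for English text, since every ASCII character becomes its own token.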
To improve the compression ratio, we can use a word-based tokenizer (word_tokenizer), which is closer to what was done classically in NLP.
It splits the string into chunks with methods such as a regular expression.
`string = "I'll say supercalifragilisticexpialidocious!"`
Output:
`string is: I'll say supercalifragilisticexpialidocious!`
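A regex-based split of this string might look like the sketch below. The pattern `\w+|.` and the function name `word_tokenize` are assumptions made for illustration; they are not necessarily what the code above used.

```python
import re


def word_tokenize(string: str) -> list[str]:
    """Split a string into word-like chunks with a regular expression."""
    # \w+ matches a run of letters, digits, or underscores;
    # "." matches any other single character (except a newline).
    return re.findall(r"\w+|.", string)


string = "I'll say supercalifragilisticexpialidocious!"
segments = word_tokenize(string)
print(segments)
# ['I', "'", 'll', ' ', 'say', ' ', 'supercalifragilisticexpialidocious', '!']
print(len(segments))  # 8
```

Each distinct chunk would then need its own integer ID, so the vocabulary grows with every new word seen, and a long rare word such as the one here takes up a whole entry.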