Notes on LLM

Posted on 2026-06-26 in tech

Preparation

Visit the official website: Stanford CS336 | Language Modeling from Scratch

Watch the recordings on YouTube: CS336_Spring_2026

Lecture materials on GitHub: CS336_Spring_2026

Online Lecture Materials Trace - lecture_01

Lecture 1. Overview, Tokenization

1. The bitter lesson

Wrong interpretation: Scale is all that matters, algorithms don’t matter.

Right interpretation: Algorithms that scale are what matter.

Accuracy = Efficiency * Resources

2. History of Language Models

2.1 Pre-neural (before 2010s)

Language model to measure the entropy of English, shannon_1950
N-gram language models (used in machine translation and speech recognition systems), brants_2007

2.2 Neural ingredients (2010s)

Long Short-Term Memory (LSTM), lstm_1997
First neural language model, bengio_2003
Sequence-to-sequence modeling (for machine translation), seq2seq_2014
Adam optimizer, adam_2014
Attention mechanism (for machine translation), bahdanau_2015_attention
Transformer architecture (for machine translation), transformer_2017
Mixture of experts, moe_2017
Model parallelism, gpipe_2018, zero_2019 and megatron_lm_2019

2.3 Early foundation models (late 2010s)

ELMo: pretraining with LSTMs, fine-tuning improves downstream tasks, elmo_2018
BERT: pretraining with Transformer, fine-tuning improves downstream tasks, bert_2018
Google’s T5 (11B): cast everything as text-to-text, t5_2019

2.4 Embracing scaling

OpenAI’s GPT-2 (1.5B): fluent text, first signs of zero-shot, gpt2_2019
Scaling laws: provide hope / predictability for scaling, kaplan_scaling_laws_2020
OpenAI’s GPT-3 (175B): in-context learning, gpt_3_2020
Google’s PaLM (540B): massive scale, undertrained, palm_2022
DeepMind’s Chinchilla (70B): compute-optimal scaling laws, chinchilla_2022

2.5 Open models

Early attempts (attempts to replicate GPT-3)
EleutherAI’s open datasets (The Pile) and models (GPT-J), the_pile_2020 and gpt_j_2021
Meta’s OPT (175B): GPT-3 replication, lots of hardware issues, opt_175b_2022
Hugging Face / BigScience’s BLOOM (176B): focused on data sourcing, bloom_2022

2.6 Credible open-weight models (weights + paper)

Meta’s Llama models, llama_2023, llama_2_2023 and llama_3_2024
Mistral's models, mistral_7b_2023 and mixtral_2024
DeepSeek's models, deepseek_67b_2024, deepseek_v2_2024 and deepseek_v3_2024
Alibaba's Qwen models, qwen_2_5_2024 and qwen_3_2025
Moonshot’s Kimi models, kimi_1_5_2025 and kimi_k2_5_2026
Z.ai’s GLM models, glm_4_5_2025 and glm_5_2026
MiniMax's models, minimax_m2_5_2026
Xiaomi’s MiMo models, xiaomi_mimo_v2_2026
These models are approaching the capabilities of closed models (GPT, Claude, Gemini, etc.).

3. Course Syllabus

Part	Assignment
basics	Assignment 1: tokenization, model architecture, training
systems	Assignment 2: kernels, parallelism, inference
scaling_laws	Assignment 3: scaling laws
data	Assignment 4: evaluation, curation, transformation, filtering, deduplication, mixing
alignment	Assignment 5: RLHF, RL algorithms, RL systems

3. Tokenization

3.1 Strings, Tokens and Indices

Raw input text is generally represented as Unicode strings.

e.g.

string = "Hello, world!"

A string can be split into smaller units called tokens, for example "Hello", ",", " world" and "!".

A language model places a probability distribution over sequences of tokens (usually represented by integer indices).

indices = [15496, 11, 995, 0]

To bridge strings and integers, we use a vocabulary (a mapping table) that converts tokens (not necessarily whole words, and not single characters either) to integer IDs.

Operation	Direction	Example
Encode	string → tokens → indices	`"Hello"` → `15496`
Decode	indices → tokens → string	`15496` → `"Hello"`
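
The encode and decode directions in the table can be sketched with a toy vocabulary. This is a minimal illustration, not a real tokenizer: the `vocab` table is hypothetical (its IDs simply reuse the example indices above), and the string is assumed to be already split into tokens.

```python
# Toy vocabulary (hypothetical): token string -> integer ID.
# The IDs reuse the example indices from the notes; the table is illustrative only.
vocab = {"Hello": 15496, ",": 11, " world": 995, "!": 0}
id_to_token = {index: token for token, index in vocab.items()}


def encode(tokens):
    """Map a list of token strings to integer indices."""
    return [vocab[token] for token in tokens]


def decode(indices):
    """Map integer indices back to token strings and join them into one string."""
    return "".join(id_to_token[index] for index in indices)


tokens = ["Hello", ",", " world", "!"]
indices = encode(tokens)
print(indices)          # [15496, 11, 995, 0]
print(decode(indices))  # Hello, world!
```

A real tokenizer also has to decide how to split the string into tokens in the first place, which this sketch takes as given.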

Get a feel for how tokenizers work: interplay

Observations

A word and its preceding space are part of the same token (e.g., " world").
The same word is represented differently at the beginning of a string and in the middle (e.g., “hello hello”). This will be discussed later.
Numbers are split into chunks of a few digits each.

GPT-5 tokenizer from OpenAI (tiktoken) in action:

tokenizer = get_gpt5_tokenizer()
string = "Hello, 🌍! 你好!"

indices = tokenizer.encode(string)
reconstructed_string = tokenizer.decode(indices)

assert string == reconstructed_string

Test code (using the GPT-4 encoding from tiktoken):

import tiktoken

def main():
    originString = "Hello world Hello"
    encoder = tiktoken.encoding_for_model("gpt-4")

    tokens = encoder.encode(originString)
    print("Token IDs:", tokens)
    
    rebuiltString = encoder.decode(tokens)
    print("The originString is:", originString)
    print("The rebuiltString is:", rebuiltString)
    assert originString == rebuiltString

if __name__ == "__main__":
    main()

Run result:

Token IDs: [9906, 1917, 22691]
The originString is: Hello world Hello
The rebuiltString is: Hello world Hello

3.2 Compression Ratio

Compression ratio: the average number of bytes of text represented by one token (higher means fewer tokens for the same text).

UTF-8 encodes each of the 128 ASCII characters in a single byte (8 bits), so plain English text takes 1 byte per character. The string “artificial intelligence” has 23 characters and therefore takes 23 bytes.

e.g.

Input String	Vocabulary Size	byte_num	token[]	token_num	Compression Ratio
“artificial intelligence”	small	23	[art, i, fic, i, al, , in, t, el, lig, en, ce]	12	23/12 = 1.92
“artificial intelligence”	large	23	[artificial, , intelligence]	3	23/3 = 7.67

There are two related notions: the character-level compression ratio (characters per token) and the byte-level compression ratio (bytes per token).

In plain English text, each character (a, b, c…) takes one byte in UTF-8, while characters from other scripts take more: a Chinese character typically takes 3 bytes.

For instance, ‘a’ is 1 character and 1 byte, while ‘你’ is 1 character but 3 bytes. So the character-level and byte-level compression ratios can differ.
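
The difference between characters and bytes is easy to check in Python. A small sketch, assuming UTF-8 as the encoding; `char_and_byte_counts` is a helper name made up for this note.

```python
def char_and_byte_counts(text):
    """Return (number of Unicode characters, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


for sample in ["a", "你", "artificial intelligence", "你好"]:
    chars, num_bytes = char_and_byte_counts(sample)
    print(f"{sample!r}: {chars} chars, {num_bytes} bytes")
```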

Input String	Vocabulary Size	byte_num	char_num	token[]	token_num	Compression Ratio(Char)	Compression Ratio(Byte)
“artificial intelligence”	small	23	23	[art, i, fic, i, al, , in, t, el, lig, en, ce]	12	23/12 = 1.92	23/12 = 1.92
“artificial intelligence”	large	23	23	[artificial, , intelligence]	3	23/3 = 7.67	23/3 = 7.67
“你好”	small	6	2	[你, 好]	2	2/2 = 1	6/2 = 3
“你好”	large	6	2	[你好]	1	2/1 = 2	6/1 = 6
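
The ratios in the table can be reproduced with a few lines of Python. This is a sketch under the assumption that the token lists are given by hand (they are the illustrative splits from the table, not the output of a real tokenizer); `compression_ratios` is a hypothetical helper name.

```python
def compression_ratios(text, tokens):
    """Return (characters per token, UTF-8 bytes per token) for a tokenization."""
    # The tokens must cover the text exactly, otherwise the ratio is meaningless.
    assert "".join(tokens) == text
    num_tokens = len(tokens)
    char_ratio = len(text) / num_tokens
    byte_ratio = len(text.encode("utf-8")) / num_tokens
    return char_ratio, byte_ratio


english = "artificial intelligence"
small_vocab_tokens = ["art", "i", "fic", "i", "al", " ", "in", "t", "el", "lig", "en", "ce"]
large_vocab_tokens = ["artificial", " ", "intelligence"]

print(compression_ratios(english, small_vocab_tokens))
print(compression_ratios(english, large_vocab_tokens))
print(compression_ratios("你好", ["你", "好"]))
print(compression_ratios("你好", ["你好"]))
```

For ASCII-only text the two ratios coincide. They only diverge once multi-byte characters appear.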

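There are approximately 150K Unicode characters. If every character were its own token (a character-level tokenizer), the vocabulary would have to be very large, yet most of those characters are quite rare, so much of the vocabulary would be used inefficiently. The compression ratio would also be poor: each token covers only a single character, which is just 1 byte for English text.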
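
To make the character-level case concrete, here is a minimal sketch of a tokenizer that treats each Unicode code point as one token. The class name `CharacterTokenizer` is an assumption of this note, and the code uses Python's built-in `ord` and `chr` rather than any real tokenizer library.

```python
class CharacterTokenizer:
    """Represent a string as a sequence of Unicode code points (one token per character)."""

    def encode(self, text):
        return [ord(ch) for ch in text]

    def decode(self, indices):
        return "".join(chr(index) for index in indices)


tokenizer = CharacterTokenizer()
text = "Hello, 🌍! 你好!"
indices = tokenizer.encode(text)
assert tokenizer.decode(indices) == text  # round trip is lossless

num_bytes = len(text.encode("utf-8"))
print("number of tokens:", len(indices))
print("number of bytes:", num_bytes)
print("largest index needed:", max(indices))
print("bytes per token:", num_bytes / len(indices))
```

Even this short string needs an index above 100,000 for the emoji, while the compression ratio stays close to 1.5 bytes per token.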

To improve the compression ratio, one option is a word-based tokenizer (closer to what was done classically in NLP).

It uses a method such as a regular expression to split the string into chunks.

import regex  # third-party package: pip install regex

string = "I'll say supercalifragilisticexpialidocious!"
chunks = regex.findall(r"\w+|.", string)

print("string is:", string)
print("chunks are:", chunks)

The output:

string is: I'll say supercalifragilisticexpialidocious!
chunks are: ['I', "'", 'll', ' ', 'say', ' ', 'supercalifragilisticexpialidocious', '!']
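
Splitting into chunks is only half of a tokenizer: each chunk still has to be mapped to an integer. The sketch below shows one way this could look, under several assumptions: it uses the standard-library `re` module instead of `regex` (the pattern behaves the same for this input), the vocabulary is built from a tiny made-up training string, and `<UNK>` is a hypothetical placeholder token for chunks that were never seen.

```python
import re


def split_into_chunks(text):
    """Split text into runs of word characters or single other characters."""
    return re.findall(r"\w+|.", text)


def build_vocab(training_text):
    """Assign an integer ID to every distinct chunk; ID 0 is reserved for unknown chunks."""
    vocab = {"<UNK>": 0}
    for chunk in split_into_chunks(training_text):
        vocab.setdefault(chunk, len(vocab))
    return vocab


def encode(text, vocab):
    """Map each chunk to its ID, falling back to <UNK> for unseen chunks."""
    return [vocab.get(chunk, vocab["<UNK>"]) for chunk in split_into_chunks(text)]


vocab = build_vocab("I'll say hello!")
print(vocab)
print(encode("I'll say supercalifragilisticexpialidocious!", vocab))
# [1, 2, 3, 4, 5, 4, 0, 7]  (the long word was never seen, so it maps to <UNK>)
```

The compression ratio is good because a whole word becomes one token, but any word missing from the vocabulary collapses to the same unknown ID and cannot be decoded back.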