Notes on LLM

Preparation

Visit the official website: Stanford CS336 | Language Modeling from Scratch

Watch the recordings on YouTube: CS336_Spring_2026

Lecture materials on GitHub: CS336_Spring_2026

Online Lecture Materials Trace - lecture_01

Lecture 1. Overview, Tokenization

1. One bitter lesson

Wrong interpretation: Scale is all that matters, algorithms don’t matter.

Right interpretation: Algorithms and scale are what matter.

Accuracy = Efficiency * Resources

2. History of Language Models

2.1 Pre-neural (before 2010s)

  • Language model to measure the entropy of English, shannon_1950
  • N-gram language models (used in machine translation and speech recognition systems), brants_2007

2.2 Neural ingredients (2010s)

  • Long Short-Term Memory (LSTM), lstm_1997
  • First neural language model, bengio_2003
  • Sequence-to-sequence modeling (for machine translation), seq2seq_2014
  • Adam optimizer, adam_2014
  • Attention mechanism (for machine translation), bahdanau_2015_attention
  • Transformer architecture (for machine translation), transformer_2017
  • Mixture of experts, moe_2017
  • Model parallelism, gpipe_2018, zero_2019 and megatron_lm_2019

2.3 Early foundation models (late 2010s)

  • ELMo: pretraining with LSTMs, fine-tuning improves downstream tasks, elmo_2018
  • BERT: pretraining with Transformer, fine-tuning improves downstream tasks, bert_2018
  • Google’s T5 (11B): cast everything as text-to-text, t5_2019

2.4 Embracing scaling

  • OpenAI’s GPT-2 (1.5B): fluent text, first signs of zero-shot, gpt2_2019
  • Scaling laws: provide hope / predictability for scaling, kaplan_scaling_laws_2020
  • OpenAI’s GPT-3 (175B): in-context learning, gpt_3_2020
  • Google’s PaLM (540B): massive scale, undertrained, palm_2022
  • DeepMind’s Chinchilla (70B): compute-optimal scaling laws, chinchilla_2022

2.5 Open models

  • Early attempts (attempts to replicate GPT-3)
  • EleutherAI’s open datasets (The Pile) and models (GPT-J), the_pile_2020 and gpt_j_2021
  • Meta’s OPT (175B): GPT-3 replication, lots of hardware issues, opt_175b_2022
  • Hugging Face / BigScience’s BLOOM (176B): focused on data sourcing, bloom_2022

2.6 Credible open-weight models (weights + paper)

  • Meta’s Llama models, llama_2023, llama_2_2023 and llama_3_2024
  • Mistral's models, mistral_7b_2023 and mixtral_2024
  • DeepSeek's models, deepseek_67b_2024, deepseek_v2_2024 and deepseek_v3_2024
  • Alibaba's Qwen models, qwen_2_5_2024 and qwen_3_2025
  • Moonshot’s Kimi models, kimi_1_5_2025 and kimi_k2_5_2026
  • Z.ai’s GLM models, glm_4_5_2025 and glm_5_2026
  • Minimax's models, minimax_m2_5_2026
  • Xiaomi’s MIMO models, xiaomi_mimo_v2_2026
  • These models are approaching closed models (GPT, Claude, Gemini, etc.).

3. Course Syllabus

Part | Assignment
basics | Assignment 1: tokenization, model architecture, training
systems | Assignment 2: kernels, parallelism, inference
scaling_laws | Assignment 3: scaling laws
data | Assignment 4: evaluation, curation, transformation, filtering, deduplication, mixing
alignment | Assignment 5: RLHF, RL algorithms, RL systems

3. Tokenization

3.1 String, Token and Indices

Raw input text is generally represented as Unicode strings.

e.g.

string = "Hello, world!"

A string can be split into smaller units called tokens, for example "Hello", ",", " world" and "!".

A language model places a probability distribution over sequences of tokens (usually represented by integer indices).

indices = [15496, 11, 995, 0]

To bridge strings and integers, we use a vocabulary (a mapping table) that converts each token to an integer ID. A token is not a single character, and it is not necessarily a whole word either.

Operation | Direction | Example
Encode | string → tokens → indices | "Hello" → 15496
Decode | indices → tokens → string | 15496 → "Hello"
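A minimal sketch of the two directions in the table, assuming the string has already been split into tokens. The four-entry `vocab` is a toy table that reuses the IDs from the example above; the helper names `encode` and `decode` are made up for illustration and are not a real tokenizer API.

```python
# Toy vocabulary: token string -> integer ID (IDs reused from the example above).
vocab = {"Hello": 15496, ",": 11, " world": 995, "!": 0}
# Reverse mapping: integer ID -> token string.
inverse_vocab = {index: token for token, index in vocab.items()}


def encode(tokens: list[str]) -> list[int]:
    """Map already-split tokens to their integer IDs."""
    return [vocab[token] for token in tokens]


def decode(indices: list[int]) -> str:
    """Map integer IDs back to tokens and join them into a string."""
    return "".join(inverse_vocab[index] for index in indices)


tokens = ["Hello", ",", " world", "!"]
indices = encode(tokens)
print(indices)          # [15496, 11, 995, 0]
print(decode(indices))  # Hello, world!
```

The hard part of a real tokenizer is the step this sketch skips: deciding where to cut the string.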

To get a feel for how tokenizers work, try: interplay

Observations

  • A word and its preceding space are part of the same token (e.g., " world").
  • The same word is represented differently at the beginning of a string and in the middle of it (e.g., “hello hello”). This will be discussed later.
  • Numbers are split into chunks of a few digits each.

GPT-5 tokenizer from OpenAI (tiktoken) in action:

# get_gpt5_tokenizer() is not defined in these notes; presumably a helper from the lecture code.
tokenizer = get_gpt5_tokenizer()
string = "Hello, 🌍! 你好!"

indices = tokenizer.encode(string)
reconstructed_string = tokenizer.decode(indices)

assert string == reconstructed_string

Test code:

import tiktoken


def main():
    originString = "Hello world Hello"
    # Load the encoding that tiktoken associates with the "gpt-4" model name.
    encoder = tiktoken.encoding_for_model("gpt-4")

    # Encode: string -> list of integer token IDs.
    tokens = encoder.encode(originString)
    print("Token IDs:", tokens)

    # Decode: token IDs -> string; the round trip should be lossless.
    rebuiltString = encoder.decode(tokens)
    print("The originString is:", originString)
    print("The rebuiltString is:", rebuiltString)
    assert originString == rebuiltString


if __name__ == "__main__":
    main()

Output:

Token IDs: [9906, 1917, 22691]
The originString is: Hello world Hello
The rebuiltString is: Hello world Hello

3.2 Compression Ratio

Compression Ratio: number of bytes per token

In UTF-8, each of the 128 ASCII characters is encoded as a single byte (8 bits). The string “artificial intelligence” has 23 characters (including the space), so it takes 23 bytes.

e.g.

Input String | Vocabulary Size | byte_num | token[] | token_num | Compression Ratio
“artificial intelligence” | small | 23 | [art, i, fic, i, al, " ", in, t, el, lig, en, ce] | 12 | 23/12 = 1.92
“artificial intelligence” | large | 23 | [artificial, " ", intelligence] | 3 | 23/3 = 7.67
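The ratios in the table can be checked with a few lines of Python. This is only a sketch: the function name `compression_ratio` is chosen here for illustration, and the token lists are copied by hand from the table rather than produced by a real tokenizer.

```python
def compression_ratio(string: str, tokens: list[str]) -> float:
    """Number of UTF-8 bytes in the string per token."""
    num_bytes = len(string.encode("utf-8"))
    return num_bytes / len(tokens)


string = "artificial intelligence"
# Hand-copied segmentations from the table above (not real tokenizer output).
small_vocab_tokens = ["art", "i", "fic", "i", "al", " ", "in", "t", "el", "lig", "en", "ce"]
large_vocab_tokens = ["artificial", " ", "intelligence"]

# Both segmentations must reassemble into the original string.
assert "".join(small_vocab_tokens) == string
assert "".join(large_vocab_tokens) == string

print(len(string.encode("utf-8")))                              # 23
print(round(compression_ratio(string, small_vocab_tokens), 2))  # 1.92
print(round(compression_ratio(string, large_vocab_tokens), 2))  # 7.67
```

A larger vocabulary can cover the same text with fewer tokens, which is why its ratio is higher.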

There are also two related notions: character-level compression ratio and byte-level compression ratio.

In UTF-8, an ASCII character (a, b, c…) takes one byte, while characters from other scripts take 2 to 4 bytes; most common Chinese characters take 3.

For instance, ‘a’ is 1 character and 1 byte, while ‘你’ is 1 character but 3 bytes. So the character-level and byte-level compression ratios can differ.

Input String | Vocabulary Size | byte_num | char_num | token[] | token_num | Compression Ratio (Char) | Compression Ratio (Byte)
“artificial intelligence” | small | 23 | 23 | [art, i, fic, i, al, " ", in, t, el, lig, en, ce] | 12 | 23/12 = 1.92 | 23/12 = 1.92
“artificial intelligence” | large | 23 | 23 | [artificial, " ", intelligence] | 3 | 23/3 = 7.67 | 23/3 = 7.67
“你好” | small | 6 | 2 | [你, 好] | 2 | 2/2 = 1 | 6/2 = 3
“你好” | large | 6 | 2 | [你好] | 1 | 2/1 = 2 | 6/1 = 6
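The gap between the two ratios comes from counting characters versus UTF-8 bytes. A small sketch, where `char_and_byte_ratio` is a hypothetical helper and the token counts are taken from the table rather than from a real tokenizer:

```python
def char_and_byte_ratio(string: str, num_tokens: int) -> tuple[float, float]:
    """Return (characters per token, UTF-8 bytes per token)."""
    num_chars = len(string)                  # number of Unicode code points
    num_bytes = len(string.encode("utf-8"))  # number of UTF-8 bytes
    return num_chars / num_tokens, num_bytes / num_tokens


# One ASCII letter is 1 byte; one common Chinese character is 3 bytes.
print(len("a".encode("utf-8")), len("你".encode("utf-8")))  # 1 3

# "你好" split into 2 tokens (small vocabulary) and 1 token (large vocabulary).
print(char_and_byte_ratio("你好", 2))  # (1.0, 3.0)
print(char_and_byte_ratio("你好", 1))  # (2.0, 6.0)

# For pure ASCII text the two ratios coincide.
char_ratio, byte_ratio = char_and_byte_ratio("artificial intelligence", 12)
print(char_ratio == byte_ratio)  # True
```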

There are approximately 150K Unicode characters, so a tokenizer that gives every character its own token needs a very large vocabulary in which most entries are quite rare (inefficient!). Its compression ratio is also poor, since each token covers only one character. (Note: I don’t fully get this yet.)
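To make the character-level case concrete, here is a minimal sketch of a tokenizer that uses each character's Unicode code point as its index. The class name `CharacterTokenizer` is an assumption for illustration; it is not a tiktoken API.

```python
class CharacterTokenizer:
    """Represent a string as a sequence of Unicode code points."""

    def encode(self, string: str) -> list[int]:
        # ord() gives the code point of a single character.
        return [ord(char) for char in string]

    def decode(self, indices: list[int]) -> str:
        # chr() is the inverse of ord().
        return "".join(chr(index) for index in indices)


tokenizer = CharacterTokenizer()
string = "Hello, 🌍! 你好!"
indices = tokenizer.encode(string)
assert tokenizer.decode(indices) == string  # lossless round trip

num_bytes = len(string.encode("utf-8"))
print(len(indices), num_bytes)             # 13 20
print(round(num_bytes / len(indices), 2))  # 1.54 bytes per token
print(max(indices))                        # 127757, the code point of the emoji
```

One rare emoji already pushes the largest index above 127K, while the compression ratio stays close to 1 for mostly-ASCII text.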

To improve the compression ratio, we can use a word-based tokenizer (closer to what was done classically in NLP).

It uses a regular expression to split the string into chunks.

import regex  # third-party package (pip install regex)

string = "I'll say supercalifragilisticexpialidocious!"
chunks = regex.findall(r"\w+|.", string)

print("string is:", string)
print("chunks are:", chunks)

Output:

string is: I'll say supercalifragilisticexpialidocious!
chunks are: ['I', "'", 'll', ' ', 'say', ' ', 'supercalifragilisticexpialidocious', '!']
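Once the string is split into chunks, each distinct chunk still needs an integer ID. The following is a rough sketch of that step, not the lecture's implementation: `build_vocab` is a hypothetical helper, and the standard-library `re` module stands in for `regex` because this pattern works the same way in both.

```python
import re


def build_vocab(chunks: list[str]) -> dict[str, int]:
    """Assign an integer ID to each distinct chunk, in order of first appearance."""
    vocab: dict[str, int] = {}
    for chunk in chunks:
        if chunk not in vocab:
            vocab[chunk] = len(vocab)
    return vocab


string = "I'll say supercalifragilisticexpialidocious!"
chunks = re.findall(r"\w+|.", string)
vocab = build_vocab(chunks)

# Encode: look up each chunk in the vocabulary.
indices = [vocab[chunk] for chunk in chunks]
print(indices)  # [0, 1, 2, 3, 4, 3, 5, 6]

# Bytes per token for this string.
print(len(string.encode("utf-8")) / len(indices))  # 5.5

# A word that was never seen while building the vocabulary has no ID.
print("hello" in vocab)  # False
```

The ratio is much better than the character-level one, but every new word requires a new vocabulary entry.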

3.3 Byte Pair Encoding (BPE)

Article link: https://wangyier.top/Notes-on-LLM/

Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. Please credit The Great Library when reposting.