nanoGPT: rebuilding GPT from scratch in 300 lines of PyTorch

The previous 12 lessons built the pieces one at a time. Lesson 1: the overall mental model (tokenize, embed, attention, sample). Lessons 2-4: math foundations (linear algebra, calculus, probability). Lesson 5: a neural network from zero. Lessons 6-8: tokenization (BPE, byte-level, Vietnamese). Lessons 9-11: attention (scaled dot-product, multi-head, causal mask). Lesson 12: the Transformer block (LayerNorm, residual connections, FFN).

Now we put them together.

Karpathy's nanoGPT is a single Python file of about 300 lines that implements the GPT-2 architecture. No HuggingFace, no hidden abstractions, just plain PyTorch. We will read through that code piece by piece, train it on tinyshakespeare (it runs on CPU, roughly 15-20 minutes), then generate Shakespeare-like text from the model we just trained.

By the end, you will know exactly what GPT does at each line of code.

Mental model: the full nanoGPT stack

Before writing code, get the whole picture clear:

text input
    |
    v
[ char-level tokenize ]   ->   sequence of integer IDs
    |
    v
[ token embedding ]   +   [ position embedding ]   ->   matrix [B, T, C]
    |
    v
[ Transformer Block x N ]     (attention + FFN + residual + LayerNorm)
    |
    v
[ LayerNorm final ]
    |
    v
[ Linear projection ]         matrix [B, T, vocab_size]
    |
    v
[ logits ]   ->   cross-entropy with targets  ->   loss
                  softmax + sample            ->   next token

Training:  minimize the loss over thousands of steps
Generate:  sample the next token, append it to the sequence, repeat

Two modes: training uses the loss to update the weights, generation uses the model to predict the next token. This is the autoregressive loop from lesson 1, now implemented for real.

Setup, environment, and data

Requirements: Python 3.9+, PyTorch 2.x, numpy, requests. Install with pip:

pip install torch numpy requests

With an NVIDIA GPU and a CUDA-enabled PyTorch build, torch.cuda.is_available() returns True and the code below runs on the GPU. Without a GPU the CPU still works, roughly 5-10x slower; with the small config used here, expect about 15-20 minutes, though the exact time depends on your hardware.

Dataset: tinyshakespeare, a selection of Shakespeare's plays in a single text file of about 1 MB. Enough for a demo without being too large.

Tokenization: this lesson uses char-level tokens instead of BPE (covered in lesson 7). The vocab is much smaller (65 characters), simple to implement, and enough to see the model learn the structure of the language. The real GPT-2 uses BPE with 50,257 tokens; after this lesson you can swap it in.

import torch
import torch.nn as nn
from torch.nn import functional as F
import requests

# Download tinyshakespeare
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text = requests.get(url).text
print(f"Dataset: {len(text):,} characters")  # ~1,115,394

# Char-level vocabulary
chars = sorted(set(text))
vocab_size = len(chars)  # 65
print(f"Vocab size: {vocab_size}")
print(f"Chars: {''.join(chars)}")

# Encoder / decoder
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for i, c in enumerate(chars)}
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: ''.join(itos[i] for i in ids)

# Train/val split 90/10
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data = data[:n]
val_data   = data[n:]

print(f"Train: {len(train_data):,} tokens")
print(f"Val:   {len(val_data):,} tokens")

A quick test:

sample = "To be, or not to be"
encoded = encode(sample)
decoded = decode(encoded)
print(encoded)   # [32, 53, 1, 40, 43, 6, 1, 53, 56, 1, 52, 53, 58, 1, 58, 53, 1, 40, 43]
print(decoded)   # "To be, or not to be"

The round trip works. The dataset is ready.

Hyperparameters

Config for a “small” model that runs on CPU:

# Training
batch_size    = 64       # sequences per batch
block_size    = 256      # context length (max tokens the model can see)
max_iters     = 5000     # total training steps
eval_iters    = 200      # batches used to estimate the loss at each evaluation
learning_rate = 3e-4     # AdamW learning rate

# Architecture
n_embd   = 384           # embedding dimension (C)
n_head   = 6             # attention heads per layer
n_layer  = 6             # stacked Transformer blocks
dropout  = 0.2           # dropout rate (regularization)

device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f"Using: {device}")

This model has about 10.8 million parameters, roughly 12x fewer than GPT-2 Small (124M), yet enough to generate structured text.

Why n_embd = 384 and n_head = 6? Because 384 / 6 = 64: each head works on 64 dimensions. That is the head size. n_embd must be divisible by n_head.

Attention head

This is the smallest unit of attention: a single head. Lessons 9-10 explained the mechanism; here is the code:

class Head(nn.Module):
    """Single self-attention head."""

    def __init__(self, head_size):
        super().__init__()
        self.key   = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # causal mask: token i attends only to tokens 0..i (no looking at the future)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        B, T, C = x.shape                           # batch, time (seq len), channels

        k = self.key(x)                             # (B, T, head_size)
        q = self.query(x)                           # (B, T, head_size)

        # Attention scores: Q * K^T / sqrt(head_size)
        # C is n_embd here, so scale by the head dimension k.shape[-1], not by C
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5   # (B, T, T)

        # Causal mask: put -inf at the future positions
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))

        # Softmax -> attention weights (each row sums to 1)
        wei = F.softmax(wei, dim=-1)                # (B, T, T)
        wei = self.dropout(wei)

        # Weighted sum of the values
        v = self.value(x)                           # (B, T, head_size)
        return wei @ v                              # (B, T, head_size)

A note on register_buffer: tril is not a parameter (it is not updated during training), but it has to move to CUDA together with the model. register_buffer takes care of that.

k.shape[-1]**-0.5 is 1/sqrt(head_size), the scale factor from lesson 9: without it, the dot products grow large, softmax saturates, and the gradients vanish. Scaling by C**-0.5 would be a bug here, because C in forward is n_embd (384), not head_size (64).
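
The effect of the scale factor can be checked numerically. The sketch below is not part of the model; it assumes unit-variance random queries and keys (the names q, k, raw, and scaled are illustrative) and compares the scores and the resulting softmax with and without scaling:

```python
import torch

torch.manual_seed(0)
head_size = 64
T = 8

# Random queries and keys with unit variance (an assumption for this demo)
q = torch.randn(T, head_size)
k = torch.randn(T, head_size)

raw = q @ k.T                          # score variance grows with head_size
scaled = raw * head_size ** -0.5       # same scores divided by sqrt(head_size)

print(f"var raw:    {raw.var().item():.2f}")
print(f"var scaled: {scaled.var().item():.2f}")

# Softmax over one row: larger scores push the distribution toward one-hot
p_raw = torch.softmax(raw[0], dim=-1)
p_scaled = torch.softmax(scaled[0], dim=-1)
print(f"max prob raw:    {p_raw.max().item():.3f}")
print(f"max prob scaled: {p_scaled.max().item():.3f}")
```

Dividing by sqrt(head_size) shrinks the variance of the scores by a factor of head_size, which keeps the softmax away from the saturated, near one-hot regime.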

MultiHead Attention, FeedForward, Block

These three classes assemble into one Transformer Block, the structure that gets stacked N times:

class MultiHeadAttention(nn.Module):
    """N heads run in parallel; their outputs are concatenated."""

    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads   = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj    = nn.Linear(num_heads * head_size, n_embd)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Run all heads, concat along the last dim
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, n_embd)
        return self.dropout(self.proj(out))


class FeedForward(nn.Module):
    """FFN: 2 Linear layers + ReLU, expand 4x then project back."""

    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return self.net(x)


class Block(nn.Module):
    """Transformer block: communication (attention) then computation (FFN)."""

    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size  = n_embd // n_head
        self.sa    = MultiHeadAttention(n_head, head_size)
        self.ffwd  = FeedForward(n_embd)
        self.ln1   = nn.LayerNorm(n_embd)
        self.ln2   = nn.LayerNorm(n_embd)

    def forward(self, x):
        # Pre-norm + residual connection (explained in lesson 12)
        x = x + self.sa(self.ln1(x))
        x = x + self.ffwd(self.ln2(x))
        return x

Two important points in Block.forward:

Pre-norm: LayerNorm sits before attention and the FFN (not after, as in the original paper). This is a change introduced with GPT-2 that makes training more stable with many layers.
Residual connection: x = x + ...: as lesson 12 explained, the gradient gets a direct path back to the start of the network, which avoids vanishing gradients in deep stacks.

FeedForward expands 4x (n_embd -> 4*n_embd) and then projects back. The 4x ratio is a convention from the paper “Attention Is All You Need” (d_model = 512, d_ff = 2048): empirically this ratio works well, and there is no special theory behind it. One difference from GPT-2 itself: GPT-2 uses GELU where this code uses ReLU.

GPT model

Put everything together in the main GPT class:

class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        # Token and position embeddings
        self.token_embedding    = nn.Embedding(vocab_size, n_embd)
        self.position_embedding = nn.Embedding(block_size, n_embd)
        # Stack N transformer blocks
        self.blocks = nn.Sequential(*[Block(n_embd, n_head) for _ in range(n_layer)])
        # Final LayerNorm + projection to the vocab
        self.ln_f    = nn.LayerNorm(n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        B, T = idx.shape

        # Token + position embeddings, summed (explained in lesson 9)
        tok_emb = self.token_embedding(idx)                                        # (B, T, C)
        pos_emb = self.position_embedding(torch.arange(T, device=idx.device))     # (T, C)
        x = tok_emb + pos_emb                                                      # broadcast: (B, T, C)

        # Through the N transformer blocks
        x = self.blocks(x)                                                         # (B, T, C)
        x = self.ln_f(x)                                                           # (B, T, C)

        # Project to vocab size -> logits
        logits = self.lm_head(x)                                                   # (B, T, vocab_size)

        # If targets are given, compute the cross-entropy loss
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            logits  = logits.view(B * T, C)     # flatten batch and time
            targets = targets.view(B * T)
            loss    = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        for _ in range(max_new_tokens):
            # Crop the context to block_size if it is too long
            idx_cond = idx[:, -block_size:]
            # Forward pass (no loss needed)
            logits, _ = self(idx_cond)
            # Keep the logits of the last token only
            logits = logits[:, -1, :]              # (B, vocab_size)
            # Softmax -> probabilities
            probs  = F.softmax(logits, dim=-1)     # (B, vocab_size)
            # Sample one token
            next_id = torch.multinomial(probs, num_samples=1)   # (B, 1)
            # Append to the sequence and repeat
            idx = torch.cat([idx, next_id], dim=1)              # (B, T+1)
        return idx

generate is the autoregressive loop: each pass produces exactly one token, appends it to the sequence, and feeds the sequence back into the model. Lesson 1 described it in words; this is the real code.

torch.multinomial samples from the probability distribution rather than always taking the most likely token. That is why the output differs on every run.

Count the parameters:
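
To make the difference between sampling and always picking the top token concrete, here is a small sketch. The logits are made up for illustration (a hypothetical 4-token vocab), not taken from the trained model:

```python
import torch

torch.manual_seed(0)

# Hypothetical logits over a 4-token vocab for one sequence (B=1)
logits = torch.tensor([[2.0, 1.0, 0.5, -1.0]])
probs = torch.softmax(logits, dim=-1)

# Greedy decoding: always the same token
greedy_id = torch.argmax(probs, dim=-1, keepdim=True)      # (1, 1)

# Sampling: draw many times and compare empirical frequencies with probs
draws = torch.multinomial(probs[0], num_samples=10_000, replacement=True)
freqs = torch.bincount(draws, minlength=4).float() / draws.numel()

print("probs :", [round(p, 3) for p in probs[0].tolist()])
print("freqs :", [round(f, 3) for f in freqs.tolist()])
print("greedy:", greedy_id.item())
```

Greedy decoding would return token 0 every time, while multinomial picks each token in proportion to its probability, so less likely tokens still show up occasionally.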

model = GPT().to(device)
total = sum(p.numel() for p in model.parameters())
print(f"Params: {total:,}")  # ~10,788,929 (~10.8M)
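
The same number can be reproduced by hand from the config, which is a useful check that you know where every weight lives. This is a plain arithmetic sketch (no PyTorch needed); the variable names are illustrative and the layer shapes are the ones defined in the classes above:

```python
vocab_size, block_size = 65, 256
n_embd, n_head, n_layer = 384, 6, 6
head_size = n_embd // n_head

# Token + position embedding tables
embeddings = vocab_size * n_embd + block_size * n_embd

# One Transformer block
attn_qkv = n_head * 3 * n_embd * head_size             # key/query/value, bias=False
attn_proj = n_embd * n_embd + n_embd                   # output projection with bias
ffn = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
layernorms = 2 * (2 * n_embd)                          # ln1 + ln2, weight and bias each
per_block = attn_qkv + attn_proj + ffn + layernorms

# Final LayerNorm + lm_head (with bias)
final = 2 * n_embd + (n_embd * vocab_size + vocab_size)

total = embeddings + n_layer * per_block + final
print(f"{per_block:,} per block, {total:,} total")
```

Almost all of the parameters sit in the six blocks; the embeddings and the output head are tiny because the char-level vocab has only 65 entries.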

Training loop

def get_batch(split):
    """Return one random batch from the train or val data."""
    data = train_data if split == 'train' else val_data
    # Random start positions
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x  = torch.stack([data[i     : i + block_size    ] for i in ix])
    y  = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])
    return x.to(device), y.to(device)

@torch.no_grad()
def estimate_loss():
    """Estimate the loss on train and val (averaged over many batches for stability)."""
    out = {}
    model.train(False)   # inference mode: dropout off
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            _, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train(True)    # back to training mode
    return out

# Create the model and the optimizer
model     = GPT().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Training loop
for it in range(max_iters):

    # Print the loss every 500 steps
    if it % 500 == 0:
        losses = estimate_loss()
        print(f"step {it:5d}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # Get a batch
    xb, yb = get_batch('train')

    # Forward + backward + update
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

print("Training done.")

optimizer.zero_grad(set_to_none=True) sets the gradients to None instead of filling them with zeros, which is slightly faster. Since PyTorch 2.0 this is already the default, so the argument here only makes the behavior explicit.

model.train(False) and model.train(True) switch between inference mode and training mode. They are equivalent to the pair model.eval() / model.train(); the explicit flag just spells out the intent: dropout off during evaluation, back on for training.

If you are on CPU and want faster feedback, you can lower max_iters to 2000 to finish in about 5-6 minutes; the loss still drops far enough to generate structured text.
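
What the mode switch does to dropout can be seen on a standalone layer. A minimal sketch, using a single nn.Dropout with the same p = 0.2 as the config (the names drop, y_train, and y_eval are illustrative):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.2)
x = torch.ones(1000)

drop.train(True)
y_train = drop(x)        # about 20% zeros, survivors scaled by 1/(1-p) = 1.25

drop.train(False)        # same effect as drop.eval()
y_eval = drop(x)         # identity

print("zeros in train mode:", (y_train == 0).sum().item())
print("surviving value in train mode:", y_train.max().item())
print("eval output equals input:", torch.equal(y_eval, x))
```

This is why estimate_loss switches modes: with dropout still on, the reported loss would be noisier and systematically higher than what the model achieves at inference time.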
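
One detail of get_batch that is worth checking by hand is the one-position offset between x and y. The sketch below uses a toy tensor instead of the real corpus (toy, bs, and i are illustrative stand-ins for data, block_size, and one sampled start index):

```python
import torch

toy = torch.arange(10)       # stand-in for the encoded corpus
bs = 4                       # stand-in for block_size
i = 3                        # one start position

x = toy[i : i + bs]          # inputs
y = toy[i + 1 : i + bs + 1]  # targets: the same window shifted by one

# Each prefix of x is a training example whose label is the matching entry of y
for t in range(bs):
    print(f"context {x[: t + 1].tolist()} -> target {y[t].item()}")
```

So a single window of block_size tokens yields block_size training examples at once, one per context length, and the causal mask keeps each of them from seeing its own target.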

Generate and results

After training finishes:

# Initial context: a single newline character (token index 0)
context = torch.zeros((1, 1), dtype=torch.long, device=device)

# Generate 500 tokens
output = model.generate(context, max_new_tokens=500)
print(decode(output[0].tolist()))

How the loss and the output progress:

Step	Train loss	Val loss	Text output
0	4.17	4.17	`kXJ!sMq?Z;hTp...` completely random
500	2.46	2.50	`tha the the the...` has learned bigrams
1000	2.07	2.14	`thate the hath...` words start to appear
2000	1.77	1.92	`ROMEO: The king...` recognizes names
5000	1.48	1.65	Passages with a clear dramatic structure

After 5000 steps, the output looks like this:

DUKE VINCENTIO:
The nature of our people,
Our city's institutions, and the terms
For common justice, you're as pregnant in
As art and practice hath enriched any
That we remember. There is our commission,
From which we would not have you warp. Call hither,
I say, bid come before us Angelo.

ESCALUS:
I shall desire you, sir, to give me leave
To have free speech with you; and it concerns me
To look into the bottom of my place:
A power I have, but of what strength and nature
I am not yet instructed.

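Samples at this loss are not real Shakespeare, and a typical one still contains misspelled or invented words (the excerpt above is cleaner than that: it closely follows the opening of Measure for Measure). But the structure is right: capitalized character names, commas in plausible places, sentences that roughly make sense. A 10M-parameter model, trained for 15 minutes, starting from a pile of random characters.

What does a loss of 1.48 mean? A cross-entropy loss of 1.48 corresponds to a perplexity of e^1.48 ≈ 4.4. In other words, on average the model is as uncertain as if it were choosing among 4-5 equally likely options per token, which is not bad for a 65-character vocab and a small model.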
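
The loss-to-perplexity conversion is a one-liner, and it also explains the starting loss of 4.17: a model that guesses uniformly over 65 characters has loss ln(65). A small sketch using the loss values from the table above:

```python
import math

vocab_size = 65
uniform_loss = math.log(vocab_size)     # loss of a uniform guess over the vocab

for name, loss in [("init", uniform_loss), ("train", 1.48), ("val", 1.65)]:
    print(f"{name:5s} loss {loss:.2f} -> perplexity {math.exp(loss):.1f}")
```

Perplexity falls from 65 (every character equally likely) to roughly 4-5, which is the "4-5 options per token" reading of the final loss.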

From nanoGPT to GPT-3

Same architecture, same code, only the config scales:

Model	n_layer	n_embd	n_head	Params	Train time
nanoGPT toy	6	384	6	10M	15 minutes, CPU
GPT-2 Small	12	768	12	124M	A few days, 1 GPU
GPT-2 Medium	24	1024	16	345M	A few weeks, 4 GPUs
GPT-2 XL	48	1600	25	1.5B	A few weeks, 8 GPUs
GPT-3	96	12288	96	175B	Months, 1000+ GPUs

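Four things change when scaling: the number of layers (n_layer), the embedding dim (n_embd), the number of heads (n_head), and the dataset. The code barely changes. This is the most interesting property of the Transformer: the same architecture, just with more resources.

Why does scaling work? That is the subject of scaling laws (lesson 14 goes deeper): Kaplan et al. 2020 and Chinchilla 2022 found empirically that loss falls as a power law as parameters and data grow. Within the ranges they measured there was no “breaking point” or threshold: more resources gave a better model, with diminishing returns, as long as parameters and data grow together.

And if you want to load pretrained GPT-2 Small weights into nanoGPT? Karpathy already implemented this in the original repo, as the method GPT.from_pretrained('gpt2'). Because the architecture in the repo's model.py matches GPT-2, the weights load directly. The simplified classes in this lesson lay out their layers differently, so they would need a remapping first. Fine-tuning from pretrained weights is much easier than training from scratch.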
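
To see what "power law" means in practice, here is a sketch of the functional form L(N) = (N_c / N) ** alpha. Both constants are illustrative assumptions in the range reported by Kaplan et al. for parameter scaling on their own dataset and tokenizer; they are not fitted to the char-level model in this lesson, so only the shape of the curve carries over:

```python
# Power-law form L(N) = (N_c / N) ** alpha.
# alpha and N_c are illustrative assumptions, not values measured in this lesson.
alpha = 0.076
N_c = 8.8e13

def loss_at(n_params: float) -> float:
    """Loss predicted by the power law for a model with n_params parameters."""
    return (N_c / n_params) ** alpha

for n in [1e7, 1e8, 1e9, 1e10]:
    print(f"N = {n:.0e}: loss ~ {loss_at(n):.2f}")

# Every 10x in parameters multiplies the loss by the same factor, 10 ** -alpha
ratio = loss_at(1e9) / loss_at(1e8)
print(f"loss ratio per 10x params: {ratio:.3f}")
```

The curve never hits a wall, but each 10x in parameters buys the same relative improvement, so the absolute gains keep shrinking.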

Quick notes

nanoGPT toy config (CPU-friendly):

Param	Value	Meaning
`block_size`	256	Context window
`n_embd`	384	Embedding dim C
`n_head`	6	Attention heads
`n_layer`	6	Transformer blocks
`dropout`	0.2	Regularization
`learning_rate`	3e-4	AdamW LR
`max_iters`	5000	Training steps
`batch_size`	64	Sequences/batch

6 points to remember:

Tokenization: char-level is enough for a demo; swap in BPE when you need a larger vocab
Token + position embeddings are summed: this is where the model gets position information (learned absolute positions)
Pre-norm (LayerNorm before attention): more stable than the original post-norm
Residual connection: x = x + sublayer(x), the gradient has a direct path
Causal mask: token i attends only to tokens 0..i, never to the future
Cross-entropy loss needs flattening: logits are reshaped to [B*T, vocab_size] before the loss is computed
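
The causal mask point can be verified directly: if token i really cannot see the future, changing the last token must leave the outputs at all earlier positions untouched. A minimal sketch with a simplified single head (it uses q = k = v = x and no learned weights, an assumption made only to isolate the mask):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
T, C = 5, 8
tril = torch.tril(torch.ones(T, T))

def causal_attention(x):
    # Simplified head with q = k = v = x, to isolate the effect of the mask
    wei = x @ x.transpose(-2, -1) * C ** -0.5          # (T, T)
    wei = wei.masked_fill(tril == 0, float("-inf"))    # hide future positions
    wei = F.softmax(wei, dim=-1)
    return wei @ x                                      # (T, C)

x = torch.randn(T, C)
out = causal_attention(x)

# Change only the last token and recompute
x2 = x.clone()
x2[-1] = torch.randn(C)
out2 = causal_attention(x2)

print("earlier positions unchanged:", torch.allclose(out[:-1], out2[:-1]))
print("last position changed:      ", not torch.allclose(out[-1], out2[-1]))
```

Without the masked_fill line the first check fails, because every position would then mix in information from the token that was changed.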

From model to generation:

context = torch.zeros((1, 1), dtype=torch.long, device=device)  # batch=1, length=1
output  = model.generate(context, max_new_tokens=500)
text    = decode(output[0].tolist())

Loss progression reference (tinyshakespeare, char-level):

init:      ~4.17  (random, log(65) ≈ 4.17)
step 1000: ~2.0
step 3000: ~1.6
step 5000: ~1.5   (about where this model size levels off)

Wrapping up

Part 3 is complete. It started in lesson 9 with attention one head at a time, went through multi-head attention and the Transformer block, and now ends with a complete GPT that runs.

Next practical step: don't just read, clone it and run it:

git clone https://github.com/karpathy/nanoGPT
# Read all of train.py and model.py
# Try training on another dataset: Python code, song lyrics, Reddit comments

Replace text = requests.get(url).text with any corpus you have. The model will learn the style of that corpus. This is the fastest way to build intuition for “what a model learns from data”.

Video worth watching: Karpathy's “Let’s build GPT: from scratch, in code, spelled out”, about 2 hours, walking through the whole code in real time. This lesson is based on that video. Watch it again after you have written the code once; it will make much more sense.

Part 4: Training deep dive (lessons 14-17): scaling laws, mixed precision training, gradient clipping, distributed training. The big questions: how do you train a 10-billion-parameter model when it does not fit in the memory of a single GPU? And why is more data sometimes better than a bigger model?