Build BPE tokenizer từ đầu (theo Karpathy minbpe)

Bài 6 giải thích BPE hoạt động thế nào trên lý thuyết. Bạn đọc xong và gật đầu: “à, cứ tìm cặp byte phổ biến nhất rồi merge, lặp lại cho đến khi đủ vocab size”. Hiểu vậy là hiểu được 60%.

40% còn lại chỉ đến khi bạn code nó.

Andrej Karpathy — cựu Director of AI tại Tesla, một trong những người xây GPT-2 tại OpenAI — có repo minbpe: BPE từ zero, sạch, không magic, dưới 400 dòng. Bài này implement core algorithm từ đó xuống còn 150 dòng Python thuần, không dependency ngoài standard library. Train trên một đoạn Shakespeare thật, encode/decode câu thật, thấy merges diễn ra từng bước.

Sau bài này, tokenizer không còn là hộp đen.

Recap: BPE algorithm

Trước khi code, nhắc lại pseudocode từ bài 6 để đồng bộ:

Input:  raw text (string)
Step 1: encode sang UTF-8 bytes -> list of integer IDs (0-255)
Step 2: đếm tất cả các consecutive pairs (a, b) trong list
Step 3: tìm pair (a, b) có count cao nhất
Step 4: gán token ID mới = 256 + số merges đã làm
Step 5: thay tất cả (a, b) trong list bằng ID mới
Step 6: lưu merge rule: (a, b) -> new_id
Step 7: lặp lại từ Step 2 cho đến khi vocab_size đạt mục tiêu
Output: vocab (id -> bytes) + merges (pair -> id)

Đơn giản theo nghĩa đen: greedy, không backtrack, không toán fancy. Độ phức tạp thuật toán cao hơn naive một chút nhưng implement straight-forward.

Ba điểm quan trọng hay bị bỏ qua khi đọc paper:

Bắt đầu từ bytes, không phải ký tự: UTF-8 encode → 0-255. Đây là lý do tokenizer handle được mọi ngôn ngữ và emoji, kể cả tiếng Việt có dấu.
Vocab là bytes ban đầu cộng thêm merge results: không cần khai báo ký tự ban đầu, 256 byte IDs đã là vocab nền tảng.
Decode luôn khả thi: mỗi token ID map về một chuỗi bytes xác định, concat lại là ra text gốc.

Phần 1: Utility functions

Hai function chạy lặp đi lặp lại trong toàn bộ algorithm — đáng viết clean.

def get_stats(ids):
    """Đếm tất cả consecutive pairs trong list ids."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def merge(ids, pair, idx):
    """Thay tất cả occurrences của pair trong ids bằng idx."""
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

get_stats dùng zip(ids, ids[1:]) để tạo ra các cặp liên tiếp. Với [72, 101, 108, 108, 111], zip tạo ra [(72,101), (101,108), (108,108), (108,111)]. Đếm count từng cặp, trả về dict.

merge đi qua list một lần, kiểm tra từng vị trí: nếu ids[i], ids[i+1] khớp với pair cần merge, thay bằng idx và nhảy i += 2. Không khớp thì giữ nguyên ids[i], nhảy i += 1. Linear scan, O(n).

Hai function này không có gì kỳ diệu, nhưng viết đúng là quan trọng — một bug nhỏ ở đây sẽ khiến toàn bộ tokenizer train sai mà không có error message nào.

Phần 2: Train — học merge rules từ text

class BasicTokenizer:
    def __init__(self):
        self.merges = {}   # (int, int) -> int
        self.vocab = {idx: bytes([idx]) for idx in range(256)}

    def train(self, text, vocab_size):
        assert vocab_size >= 256, "vocab_size phải >= 256 (đủ cho base bytes)"
        num_merges = vocab_size - 256

        # Bước 1: encode text sang list byte integers
        ids = list(text.encode("utf-8"))

        # Bước 2: train loop
        for i in range(num_merges):
            stats = get_stats(ids)
            pair = max(stats, key=stats.get)
            idx = 256 + i

            # Thực hiện merge
            ids = merge(ids, pair, idx)

            # Lưu kết quả
            self.merges[pair] = idx
            self.vocab[idx] = self.vocab[pair[0]] + self.vocab[pair[1]]

            print(f"merge {i+1}/{num_merges}: {pair} -> {idx} "
                  f"({self.vocab[idx]}) count={stats[pair]}")

Giải thích từng bước trong train:

ids = list(text.encode("utf-8")) — đây là toàn bộ text đã được flat thành một list số nguyên dài. Ký tự ASCII như "A" (65) thành một integer. Ký tự tiếng Việt như "ê" thành 2 integers (UTF-8 multi-byte). Emoji thành 3-4 integers. Mọi thứ đều handle được.

pair = max(stats, key=stats.get) — greedy choice: chọn pair phổ biến nhất. Nếu có tie (hai pair cùng count), Python max trả về pair nào đến trước khi iteration — không ảnh hưởng đến correctness, chỉ ảnh hưởng đến thứ tự merge nếu train lại.

idx = 256 + i — token IDs 0-255 đã được reserve cho 256 byte values. Merge đầu tiên là 256, thứ hai là 257, cứ vậy. Nếu train với vocab_size=512, cuối cùng có 256 merges, IDs từ 256 đến 511.

self.vocab[idx] = self.vocab[pair[0]] + self.vocab[pair[1]] — concatenate bytes. Nếu pair là (65, 66) (bytes cho A và B), thì vocab của token mới là b"AB". Nếu pair là (256, 67) và token 256 đã là b"AB", thì token mới là b"ABC". Vocab tự-consistent theo cách này.

Sau khi train xong, self.merges là một dict với num_merges entries. Đây là tất cả kiến thức tokenizer học được. Vocab có thể rebuild từ merges + 256 base bytes bất cứ lúc nào.

Phần 3: Encode — text mới thành token IDs

Encode một string chưa từng thấy sử dụng merges đã học, theo đúng thứ tự đã train:

    def encode(self, text):
        ids = list(text.encode("utf-8"))
        while len(ids) >= 2:
            stats = get_stats(ids)
            # Tìm pair có merge priority thấp nhất (= merge sớm nhất khi train)
            pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
            if pair not in self.merges:
                break  # Không còn pair nào có thể merge
            idx = self.merges[pair]
            ids = merge(ids, pair, idx)
        return ids

Tại sao lại min thay vì max?

Khi train, merge xảy ra theo thứ tự: merge đầu tiên (idx=256) là pair phổ biến nhất lúc bắt đầu. Merge thứ hai (idx=257) là pair phổ biến nhất sau khi đã apply merge đầu tiên. Vân vân.

Khi encode text mới, phải áp dụng merges theo đúng thứ tự đó. self.merges[pair] cho biết pair này có idx bao nhiêu — idx thấp hơn nghĩa là merge xảy ra sớm hơn trong training. min với key là idx sẽ luôn chọn merge “ưu tiên cao nhất” (xảy ra sớm nhất).

Nếu không có pair nào trong self.merges — tức là text chứa những cặp byte chưa từng merge trong training — break ra. Các byte đó giữ nguyên ID của chúng (0-255).

Đây là chi tiết quan trọng: tokenizer không crash khi gặp text lạ. Worst case, mỗi byte là một token riêng. Điều này đảm bảo bất kỳ string Unicode nào cũng decode được.

Phần 4: Decode — token IDs về text

    def decode(self, ids):
        tokens = b"".join(self.vocab[idx] for idx in ids)
        return tokens.decode("utf-8", errors="replace")

Đơn giản đến mức ngạc nhiên. Mỗi ID trong vocab map về một bytes object. Join tất cả lại thành một bytes sequence dài, rồi decode UTF-8.

errors="replace" xử lý edge case: nếu merge bị cắt giữa một multi-byte UTF-8 sequence (ví dụ encode một nửa ký tự tiếng Việt), decode sẽ thay bằng ký tự placeholder � thay vì raise exception. Trong thực tế với text lành mạnh điều này không xảy ra, nhưng defensive coding.

Quan sát quan trọng: decode không cần merges, chỉ cần vocab. Đây là một trong những điểm thiết kế đẹp của BPE — vocab và merges là hai artifacts tách biệt. Để inference một tokenizer đã train xong, chỉ cần load vocab (và merges nếu cần encode).

Phần 5: Hands-on — train trên Shakespeare thật

Đủ rồi, chạy thực tế:

import urllib.request

# Download tinyshakespeare (~1MB)
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
text = urllib.request.urlopen(url).read().decode("utf-8")

print(f"Total chars: {len(text):,}")
print(f"First 100 chars: {text[:100]!r}")

# Train trên 100,000 ký tự đầu với vocab_size=512 (256 merges)
tokenizer = BasicTokenizer()
tokenizer.train(text[:100_000], vocab_size=512)

# Test encode/decode
test_sentences = [
    "To be or not to be",
    "All the world's a stage",
    "What's in a name?",
]

for sentence in test_sentences:
    encoded = tokenizer.encode(sentence)
    decoded = tokenizer.decode(encoded)
    print(f"\nOriginal : {sentence!r}")
    print(f"Encoded  : {encoded}")
    print(f"Decoded  : {decoded!r}")
    print(f"Compress : {len(sentence.encode('utf-8'))} bytes -> {len(encoded)} tokens "
          f"({len(encoded)/len(sentence.encode('utf-8')):.2f}x)")

Khi chạy, bạn sẽ thấy output merge log như:

merge 1/256: (101, 32) -> 256 (b'e ') count=5765
merge 2/256: (116, 104) -> 257 (b'th') count=5435
merge 3/256: (257, 101) -> 258 (b'the') count=4464
merge 4/256: (32, 258) -> 259 (b' the') count=3543
merge 5/256: (105, 110) -> 260 (b'in') count=3221
...
merge 100/256: (32, 119) -> 355 (b' w') count=1089
...
merge 256/256: (97, 108) -> 511 (b'al') count=450

Đọc merge log này là cực kỳ thú vị. Merge đầu tiên: (101, 32) là e (byte 101) + space (byte 32) = b'e '. Trong Shakespeare (và English nói chung), "e " ở cuối từ cực kỳ phổ biến — "the ", "be ", "have ", "make ". Hợp lý hoàn toàn.

Merge thứ 3: (257, 101) = token th + e = b'the'. Merge thứ 4: space + the = b' the'. Theo dõi được cách tokenizer “học” từ phổ biến nhất trong tiếng Anh.

Sau test_sentences, output sẽ tương tự:

Original : 'To be or not to be'
Encoded  : [84, 111, 32, 98, 259, 111, 114, 32, 110, 111, 116, 32, 116, 111, 32, 98, 259]
Decoded  : 'To be or not to be'
Compress : 18 bytes -> 17 tokens (0.94x)

Compression ratio gần 1.0 vì vocab_size=512 còn nhỏ — chỉ có 256 merges. GPT-4 tokenizer có vocab_size=100,277 (gần 100,000 merges), compress ratio thực tế khoảng 3-4x cho English text. Tức là 1000 ký tự English thường cho ra khoảng 250-333 tokens.

Thử tăng vocab_size lên 1024 hoặc 2048, chạy lại — xem compression ratio cải thiện thế nào.

Phần 6: Regex pre-tokenization — trick của GPT-2

Tokenizer vừa code hoạt động đúng nhưng còn một vấn đề tinh tế.

Xét câu: "dog! Dog. dog". Ba occurrences của "dog" sẽ được xử lý như một pool chung — merge của d, o, g sẽ bao gồm tất cả. Nhưng token "dog" khi xuất hiện sau dấu chấm (". dog") có ngữ cảnh khác với "dog" ở đầu câu hay "dog!" trước dấu chấm than.

GPT-2 fix điều này bằng cách split text bằng regex trước khi BPE:

import re

# Regex pattern của GPT-2 (simplified)
GPT2_PATTERN = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\w+| ?\d+| ?[^\s\w\d]+|\s+(?!\S)|\s+"""
)

def pretokenize(text):
    """Tách text thành chunks, BPE chạy riêng trên từng chunk."""
    return re.findall(GPT2_PATTERN, text)

Pattern này đảm bảo:

Contractions như "don't" → ["don", "'t"] — tách riêng, không merge qua ranh giới
Punctuation "dog!" → ["dog", "!"] — dấu câu không bao giờ merge vào từ
Leading space được giữ nguyên với từ theo sau: " the" là một chunk, "the" là chunk khác — đây là lý do GPT-2 tokenizer có cả "the" lẫn " the" như hai tokens riêng

BPE sau đó chạy độc lập trên từng chunk. Merges không vượt qua ranh giới do regex tạo ra.

Full GPT-4 regex (gọi là cl100k_base pattern) phức tạp hơn nhiều — handle Unicode categories, handle số khác ký tự, handle emoji sequences. Không implement ở đây, nhưng biết tại sao nó tồn tại là quan trọng.

Bài học: BPE “pure” như đã implement là foundation. Production tokenizer như tiktoken thêm regex pre-tokenization lên trên đó để kiểm soát granularity tốt hơn.

Cheatsheet

BPE algorithm — 5 bước:

Bước	Thao tác	Code tương ứng
1. Khởi tạo	Text → UTF-8 bytes → list integers	`ids = list(text.encode("utf-8"))`
2. Thống kê	Đếm tất cả consecutive pairs	`get_stats(ids)`
3. Chọn pair	Pair có count cao nhất	`max(stats, key=stats.get)`
4. Merge	Thay pair bằng new ID	`merge(ids, pair, idx)`
5. Lưu	Ghi merge rule + update vocab	`self.merges[pair] = idx`

Phải nhớ sau bài này:

BPE bắt đầu từ bytes (0-255), không phải characters — handle Unicode tự nhiên
Training sinh ra hai artifacts: vocab (id → bytes) và merges (pair → id)
Encode text mới phải áp dụng merges theo đúng thứ tự train (ưu tiên merge có idx thấp hơn)
Decode chỉ cần vocab — lookup + concat + UTF-8 decode
Regex pre-tokenization là layer nằm trước BPE, kiểm soát ranh giới merge
Tăng vocab_size → nhiều merges hơn → compression ratio tốt hơn → ít tokens hơn cho cùng một đoạn text → inference nhanh hơn và context window chứa được nhiều text hơn

Kích thước vocab trong thực tế:

Tokenizer	Vocab size	Dùng trong
GPT-2	50,257	GPT-2
GPT-4 / cl100k_base	100,277	GPT-3.5, GPT-4
Llama 3	128,256	Llama 3
Gemma	256,000	Gemma

Xu hướng: vocab ngày càng lớn để tăng compression và cover nhiều ngôn ngữ tốt hơn.

Toàn bộ code — copy-paste ready

Để tiện tham khảo, đây là toàn bộ 150 dòng trong một file:

"""
BasicTokenizer — BPE từ zero, không dependency.
Inspired by Andrej Karpathy's minbpe: https://github.com/karpathy/minbpe
"""

import urllib.request


def get_stats(ids):
    """Đếm tất cả consecutive pairs trong list ids."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts


def merge(ids, pair, idx):
    """Thay tất cả occurrences của pair trong ids bằng idx."""
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids


class BasicTokenizer:
    def __init__(self):
        self.merges = {}
        self.vocab = {idx: bytes([idx]) for idx in range(256)}

    def train(self, text, vocab_size):
        assert vocab_size >= 256
        num_merges = vocab_size - 256
        ids = list(text.encode("utf-8"))

        for i in range(num_merges):
            stats = get_stats(ids)
            pair = max(stats, key=stats.get)
            idx = 256 + i
            ids = merge(ids, pair, idx)
            self.merges[pair] = idx
            self.vocab[idx] = self.vocab[pair[0]] + self.vocab[pair[1]]
            print(f"merge {i+1}/{num_merges}: {pair} -> {idx} "
                  f"({self.vocab[idx]}) count={stats[pair]}")

    def encode(self, text):
        ids = list(text.encode("utf-8"))
        while len(ids) >= 2:
            stats = get_stats(ids)
            pair = min(stats, key=lambda p: self.merges.get(p, float("inf")))
            if pair not in self.merges:
                break
            idx = self.merges[pair]
            ids = merge(ids, pair, idx)
        return ids

    def decode(self, ids):
        tokens = b"".join(self.vocab[idx] for idx in ids)
        return tokens.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Download Shakespeare
    url = ("https://raw.githubusercontent.com/karpathy/char-rnn/"
           "master/data/tinyshakespeare/input.txt")
    print("Downloading tinyshakespeare...")
    text = urllib.request.urlopen(url).read().decode("utf-8")
    print(f"Downloaded {len(text):,} characters")

    # Train
    tokenizer = BasicTokenizer()
    tokenizer.train(text[:100_000], vocab_size=512)

    # Test
    test_cases = [
        "To be or not to be",
        "All the world's a stage",
        "Shall I compare thee to a summer's day?",
    ]

    print("\n" + "=" * 60)
    for sentence in test_cases:
        encoded = tokenizer.encode(sentence)
        decoded = tokenizer.decode(encoded)
        original_bytes = len(sentence.encode("utf-8"))
        ratio = len(encoded) / original_bytes
        print(f"Text    : {sentence!r}")
        print(f"Encoded : {encoded}")
        print(f"Decoded : {decoded!r}")
        print(f"Ratio   : {original_bytes} bytes -> {len(encoded)} tokens ({ratio:.2f}x)")
        print(f"Match   : {sentence == decoded}")
        print()

Lưu thành basic_tokenizer.py, chạy python basic_tokenizer.py. Không cần install gì cả.

Lời kết

Bạn vừa viết một BPE tokenizer hoạt động thực tế. Không phải demo giả, không phải stub — đây là đúng thuật toán GPT-2 dùng, thiếu regex pre-tokenization và một số optimization nhỏ.

Bước tiếp theo nếu muốn đi sâu hơn: clone repo gốc của Karpathy (git clone https://github.com/karpathy/minbpe), đọc minbpe/base.py và minbpe/regex.py. File regex.py implement regex pre-tokenization và GPT-4 pattern đầy đủ. So sánh với tiktoken bằng cách encode cùng một string rồi print ra hai bên — nếu tokenizer của bạn match tiktoken output, bạn đã implement đúng.

Bài 8 sẽ bước sang territory mới: Embeddings. Từ "king" thành vector, tại sao king - man + woman ≈ queen, rồi tiến đến contextual embeddings (mỗi token có vector khác nhau tùy ngữ cảnh) và RoPE — positional encoding model hiện đại dùng. Nếu bài này dạy bạn cách text trở thành token IDs, bài 8 dạy cách token IDs trở thành thứ mà Transformer có thể “suy nghĩ” được.