
There’s a widespread claim among Vietnamese developers using LLMs: “Writing prompts in Vietnamese costs more than 2x the tokens of English, so just write English to save money.” The community doesn’t say “exactly 2x”; the claim is “>2x”, and public benchmarks do show a Vietnamese multiplier around 2.8x, clearly above 2x. The technical reasoning is sound: tokenizers are trained on English-dominant corpora, and Vietnamese diacritics force BPE to split text into smaller chunks.

But when I scanned my entire Claude Code session history (5626 user messages across 555 sessions) and re-measured, the result was counterintuitive: my default style of Vietnamese-without-diacritics plus English technical terms was actually more token-efficient than pure English on longer messages (39% fewer tokens per character), accounted for 49% of all prompts, and was the dominant style in 71% of sessions. The “>2x token” claim only holds for pure diacritic Vietnamese, which made up just 2.9% of my actual usage.

This post walks through that benchmark: dataset, methodology, results, the why, and recommendations for anyone debating where to optimize LLM token usage. Though Vietnamese is the case study here, the pattern likely generalizes to any non-English developer audience that naturally code-switches.

Where the “>2x” claim comes from

The technical basis for the claim is real. BPE (Byte Pair Encoding) tokenizers used by GPT-4 (cl100k_base) and Claude are trained on English-dominant internet corpora. Consequences:

  • Vietnamese words with diacritics get split into many more tokens (the UTF-8 byte sequences for diacritics don’t match pre-existing vocabulary entries).
  • For the same meaning, a Vietnamese message with diacritics typically costs 2.5-3x more tokens than the English equivalent.

Public benchmarks show Vietnamese multipliers around 2.8x on the same test sentence, Thai 3.7x, Tamil/Bengali 6-7x. Latin languages close to English (Spanish, Polish) hover around 1.5-2x.

But papers and blogs share a blind spot: nobody tests on real usage. Those benchmarks translate one sentence across many languages and count tokens. Do developers using LLMs daily actually write like that? I suspected not. I noticed myself mixing languages, dropping diacritics when typing fast, and stuffing English technical terms into Vietnamese sentences. That pattern hadn’t been measured.

There’s a more interesting finding in the Lost in the Mix paper (arXiv 2506.14012): when researchers embedded English tokens into non-English matrix languages (Arabic, German, French, Chinese), LLM comprehension improved compared to pure non-English text. The reverse (embedding foreign tokens into English) hurt performance. In other words, the “native matrix + English technical terms” mixing pattern is one the model actually handles better, not a workaround.

Benchmark pipeline

My dataset is the full ~/.claude/projects/ directory. Claude Code stores each session as JSONL, with one event per line (user message, assistant message, tool result, system reminder). I scanned all of it, filtered to user messages, classified by language style, tokenized the text, and aggregated per-style metrics.

Token benchmark pipeline: four stages, raw sessions → classified metrics.

  • [1/4] Ingest: JSONL session files (1811 files, full event log).
  • [2/4] Filter: drop tool_result and system messages; keep 5626 user-role messages.
  • [3/4] Classify: signals = diacritic presence + marker words → 5 styles.
  • [4/4] Metrics: tokens/char, output/prompt ratio, distribution %.

Pipeline notes:

  • Tokenizer: tiktoken cl100k_base (GPT-4); Claude uses its own tokenizer, but BPE behavior is within ±5%.
  • Rule-based classifier: diacritic detection + VN marker word list (~180 words) + EN tech list (~140 words).
  • 5 styles: pure_english, mixed_no_diacritic, vn_no_diacritic, mixed_diacritic, pure_vn_diacritic.

The classifier logic is simple: for each message, check for Vietnamese diacritic characters, count Vietnamese no-diacritic marker words (toi, la, cua, khong, de, lam, …), and English technical terms (api, deploy, config, agent, …). From those signals, assign one of 5 classes.
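The classifier can be sketched in a few lines. This is an illustrative reconstruction, not the exact script: the marker sets below are tiny samples (the real lists have ~180 VN and ~140 EN entries), and the class-assignment rules are my reading of the description above.

```python
import re

# Tiny sample marker lists; the real classifier uses ~180 VN and ~140 EN terms.
VN_MARKERS = {"toi", "la", "cua", "khong", "de", "lam", "giup", "sang"}
EN_TECH = {"api", "deploy", "config", "agent", "bucket", "endpoint", "migrate"}
# Vietnamese-specific codepoints: any hit means the message uses diacritics.
DIACRITICS = set("àáảãạăằắẳẵặâầấẩẫậèéẻẽẹêềếểễệìíỉĩị"
                 "òóỏõọôồốổỗộơờớởỡợùúủũụưừứửữựỳýỷỹỵđ")

def classify(text: str) -> str:
    lower = text.lower()
    has_diacritic = any(ch in DIACRITICS for ch in lower)
    words = re.findall(r"[a-z]+", lower)          # ASCII word runs only
    vn = sum(w in VN_MARKERS for w in words)      # Vietnamese marker hits
    en = sum(w in EN_TECH for w in words)         # English tech-term hits
    if has_diacritic:
        return "mixed_diacritic" if en else "pure_vn_diacritic"
    if vn and en:
        return "mixed_no_diacritic"
    if vn:
        return "vn_no_diacritic"
    if en or words:
        return "pure_english"
    return "unknown"                              # slash commands, images, etc.
```

For example, “Spawn 2 agents de migrate dev bucket sang prod region” has no diacritics, hits both marker lists, and classifies as mixed_no_diacritic.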

The tokenizer is cl100k_base from GPT-4, not Claude. Anthropic’s tokenizer isn’t fully public, but BPE behavior is close enough for relative comparisons (within ±5%). The key is comparing styles within the same tokenizer, not absolute cost.

Result 1: Distribution, how do I actually write?

Breakdown by style across 5626 user messages:

| Style | N messages | Share | Avg chars/msg |
|---|---|---|---|
| mixed_no_diacritic | 2763 | 49.1% | 1167 |
| pure_english | 1018 | 18.1% | 120 |
| vn_no_diacritic | 994 | 17.7% | 50 |
| mixed_diacritic | 373 | 6.6% | 1417 |
| pure_vn_diacritic | 163 | 2.9% | 86 |
| unknown (slash commands, images) | 315 | 5.6% | 5 |

Nearly half of messages are mixed_no_diacritic: Vietnamese without diacritics combined with English technical terms. This is the main working mode. Pure English was just 18%, and pure diacritic Vietnamese (the “most expensive” style per the “>2x” claim) was just 2.9%, barely used in real coding work.

Worth noting: mixed_no_diacritic messages averaged 1167 characters (substantive, context-heavy). pure_english averaged 120 chars (mostly short commands). These are two different usage modes, not just language variants.

Result 2: Length-matched tokens per character

This is the most important table. Metric: tokens per character, lower is more efficient. Comparing within the same length bucket removes length bias (longer text benefits from BPE merges, independent of language choice).
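The length-matched metric is easy to reproduce. A minimal sketch, with bucket edges taken from this section and the tokenizer left as a pluggable function (swap in tiktoken’s cl100k_base for real counts):

```python
from collections import defaultdict

# Length buckets used in this section; upper edge exclusive.
BUCKETS = [(0, 50), (50, 150), (150, 500),
           (500, 1500), (1500, 5000), (5000, float("inf"))]

def bucket_of(n_chars):
    for lo, hi in BUCKETS:
        if lo <= n_chars < hi:
            return (lo, hi)

def tokens_per_char(messages, count_tokens):
    """messages: iterable of (style, text) pairs.
    count_tokens: any tokenizer function, e.g. a tiktoken wrapper."""
    tok, chars = defaultdict(int), defaultdict(int)
    for style, text in messages:
        key = (style, bucket_of(len(text)))
        tok[key] += count_tokens(text)
        chars[key] += len(text)
    # Lower is more token-efficient.
    return {key: tok[key] / chars[key] for key in tok}
```

With tiktoken installed, `count_tokens` would be `lambda t: len(enc.encode(t))` for `enc = tiktoken.get_encoding("cl100k_base")`.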

| Length (chars) | pure_english | mixed_no_dia | vn_no_dia | pure_vn_dia |
|---|---|---|---|---|
| 0-50 | 0.262 | 0.298 | 0.325 | 0.425 |
| 50-150 | 0.383 | 0.290 | 0.308 | 0.392 |
| 150-500 | 0.318 | 0.297 | 0.304 | 0.376 |
| 500-1500 | 0.338 | 0.309 | 0.300 | - |
| 1500-5000 | 0.448 | 0.275 | - | 0.476 |
| 5000+ | 0.255 | 0.266 | - | - |

Reading the table:

  • Very short messages (<50 chars): pure_english wins (0.262) vs mixed_no_dia (0.298). Gap = 14%. Reason: short commands don’t have enough prose for BPE merges, so languages compete at the byte level directly.
  • Medium & long (50-5000 chars): mixed_no_diacritic beats pure English in every bucket. At the 1500-5000 char bucket (substantive prompts), mixed is 39% more efficient than pure English (0.275 vs 0.448).
  • pure_vn_diacritic is always worst, range 0.376-0.476, worse than pure English by 1.2-1.5x. This is the “>2x token” style from the original claim. But remember this style was only 2.9% of actual usage.

Why does mixed beat English on longer prompts? A few reasons:

  1. English technical terms tokenize densely. config, deploy, endpoint, file paths, command names: BPE already merged these to 1-2 tokens per word. The more technical the prompt, the more efficient English tokens become.
  2. Vietnamese no-diacritic connector words merge cheaply. la, cua, de, tu, cho, con, thi: typically 1 token each after BPE. No tax.
  3. Long coherent text enables better BPE merges. Longer sequences give the tokenizer more pattern-matching opportunities. Short commands can’t benefit from this.
  4. Pure English from non-native devs often includes unusual paths/IDs, which fragment heavily. A 50-char prompt like “show me config X-Y-Z-internal-id” can fragment into more tokens than plain prose of the same length, regardless of language.

Result 3: Session-level signal-to-cost

Each session was assigned a “dominant style” if >60% of its user messages shared that style. I then measured average prompt/output tokens per session plus the output-to-prompt ratio:
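The dominant-style rule can be written down directly. A minimal sketch (the >60% threshold is from the text; tie and empty-session handling are my assumptions):

```python
from collections import Counter

def dominant_style(message_styles, threshold=0.6):
    """Return the session's dominant style if one style covers more than
    `threshold` of its user messages, else None (session unclassifiable)."""
    if not message_styles:
        return None
    style, count = Counter(message_styles).most_common(1)[0]
    return style if count / len(message_styles) > threshold else None
```

For example, a 10-message session with 7 mixed_no_diacritic messages classifies as mixed_no_diacritic, while a 50/50 split session returns None and drops out of the table.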

| Dominant style | Sessions | Avg prompt tok/sess | Avg output tok/sess | Out÷Prompt |
|---|---|---|---|---|
| mixed_no_diacritic | 240 | 1259 | 1388 | 1.1x |
| vn_no_diacritic | 41 | 52 | 963 | 18.6x |
| pure_english | 31 | 50 | 230 | 4.6x |
| mixed_diacritic | 13 | 395 | 1293 | 3.3x |
| pure_vn_diacritic | 5 | 41 | 907 | 22.2x |

The Out÷Prompt metric is interesting: it shows how much the assistant needed to generate relative to what the user wrote.

  • 1.0x: balanced dialogue, user provided enough context, assistant answered with purpose. Efficient.
  • 20x+: user wrote short, assistant had to interpret and expand. Inefficient in terms of “signal per turn.”

mixed_no_diacritic is the only style with ratio ≈ 1.0x, and it’s also the dominant style for 71% of sessions (240/340 classifiable). That’s empirical evidence this is the actual working pattern: substantive prompts with context, targeted responses.

Worked examples (fabricated, but preserving the real pattern)

Same intent “migrate an S3 bucket,” expressed in 5 styles:

[pure_english]                    ~14 tokens
Migrate dev bucket to prod region.

[mixed_no_diacritic]              ~42 tokens
Spawn 2 agents de migrate dev bucket sang prod region,
giu nguyen lifecycle rule, report lai status dang bang.

[vn_no_diacritic]                 ~16 tokens
Di migrate bucket dev sang prod giup.

[mixed_diacritic]                 ~56 tokens
Migrate bucket dev sang prod region giúp, giữ nguyên
lifecycle rule và báo lại trạng thái dạng bảng nhé.

[pure_vn_diacritic]               ~50 tokens
Bạn có thể giúp tôi di chuyển bucket từ môi trường
phát triển sang sản xuất không?

Observations:

  • pure_vn_diacritic spends 50 tokens for ~35 chars of meaning. Genuinely expensive, and semantically vague (no technical terms, so the model has to infer “môi trường phát triển” means “dev”).
  • pure_english uses 14 tokens, cheapest but also shortest, missing context (“lifecycle rule,” “report as table”).
  • mixed_no_diacritic uses 42 tokens, 3x the English token count, but contains all requirements (agents, lifecycle rule, output format). The assistant can complete in one shot, no clarify rounds needed.

Per-unit-of-intent, mixed is cheaper: one complete round-trip at ~42 tokens beats English’s 14 tokens + N clarify rounds.

Why mix-lang is the sweet spot

Combining signals from data + research:

  1. Tokenizer level: English technical terms are pre-merged; Vietnamese no-diacritic connectors also merge cheaply. The two components complement each other.
  2. Semantic level: technical concepts (deploy, migrate, endpoint, agent) preserve meaning exactly when kept in English, no translation loss. Vietnamese provides context and intent, native precision.
  3. Research level: embedding English into non-English matrix languages improves LLM comprehension (Lost in the Mix, Mohamed et al. 2025, confirmed with Arabic/German/French/Chinese). Vietnamese wasn’t tested specifically, but it belongs to the same analytic-language family, so the result should extrapolate.
  4. Training distribution level: modern LLMs pre-train on massive web corpora. Common Crawl (300B+ pages, the seed for most modern pretrain datasets) and multilingual variants like mC4 (Xue et al. 2020, 101 languages, 6.3 trillion tokens, used for mT5). Developer-heavy content in those corpora is naturally code-switched: Stack Overflow answers with native-language comments + English code, GitHub issues from Asian/European devs, non-English technical blogs, forums like Qiita, r/LocalLlama, regional tech communities. All mix native + English technical terms. The model has seen this pattern millions of times during pre-training. Mix-lang from non-English devs isn’t a workaround; it’s the surface form closest to the training distribution. Methodology note: no public paper measures the exact code-switching ratio in the training corpora of Claude/GPT (labs don’t publish detailed compositions), but the public source components (Common Crawl, OSCAR, mC4, code repos) are transparent and empirically mixed.
  5. Cognitive level: native speakers think faster in their native language. Fewer round-trips, less clarification.

Zoom out: tokens are a means, not an end

This post goes deep into per-character, per-message, per-session benchmarks. Easy to get lost in the numbers. The anchor question to keep asking: why did I open the LLM in the first place?

Not to write the cheapest prompt possible. To finish the work faster and more correctly than I could alone. Tokens are a means; they matter only in two cases:

  1. Pay-per-token API at high volume: every 1M input tokens becomes real money, ROI of optimization is clear.
  2. Context window filling up: current prompt would push conversation past the limit, needs compression to fit.

Outside those two cases:

  • Flat-rate subscriptions (Claude Pro/Max, Codex Pro, GLM Coding Plan, other coding plans): tokens don’t directly become money. Optimizing tokens there is optimizing a metric unrelated to real cost. Quotas exist but are much wider than typical single-dev usage.
  • Daily personal usage: output quality + round-trip count are the real costs. A 300-token prompt that produces correct output in one shot is cheaper than a 50-token prompt that needs 3 clarification rounds (cumulative context duplication + your time + 3x model calls).

A common failure mode: developers optimize for prompt length, write terse, and drop context; the model guesses wrong, clarification rounds multiply, and total tokens end up higher than the “long” prompt would have cost. You’re trading one dimension (prompt size) for another (round-trip count) without being aware of it.
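That trade can be made concrete with back-of-envelope arithmetic. A toy model, where every figure (output size, clarification length, no prompt caching) is an illustrative assumption: each API turn resends the whole conversation so far, so a short prompt that triggers clarification rounds pays compounding input costs.

```python
def total_input_tokens(prompt_tokens, output_per_turn, rounds, clarify_tokens=30):
    """Cumulative input tokens when each turn resends the whole conversation
    so far (naive API usage, no prompt caching). All numbers illustrative."""
    history = 0   # tokens already in the conversation
    total = 0     # cumulative input tokens billed
    for turn in range(rounds):
        sent = history + (prompt_tokens if turn == 0 else clarify_tokens)
        total += sent
        history = sent + output_per_turn
    return total

# One detailed 300-token prompt, answered in one shot:
one_shot = total_input_tokens(300, output_per_turn=200, rounds=1)   # 300
# A terse 50-token prompt that needs 3 clarification rounds:
terse = total_input_tokens(50, output_per_turn=200, rounds=4)       # 1580
```

Under these assumptions the terse prompt costs roughly 5x the detailed one in input tokens alone, before counting the extra output tokens and your own time.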

Summary: the goal is the outcome, tokens are the accounting. Don’t let accounting drive the goal. Optimize tokens when they’re the actual bottleneck, not as a reflex default.

Recommendation: where’s the real leverage?

Based on the same dataset, tiered by real impact:

| Tier | Action | Savings |
|---|---|---|
| 1. Structural | Match model to task (Haiku/Sonnet/Opus) | 40-80% |
| | Delegate heavy reads to subagents | 40-70% |
| | One topic per session + /clear | 20-40% |
| | /compact before context bloats | 30-50% |
| 2. Scoping | Point to specific files | 25-50% |
| | .claudeignore + trim CLAUDE.md | 15-30% |
| | CLI commands vs MCP (scoped output) | 5-20% |
| 3. Prose | Structured output (JSON) | 30-50% |
| | System prompt tightening | 15-25% |
| | Switch VN → English prompt | ~13% on short msgs |

Language choice lives in Tier 3 and affects only short prompts. Structural choices sit in Tier 1 with 2-5x more leverage. Prioritize in the wrong order and you lose: writing perfect English while loading the entire repo into context is more expensive than writing mixed-language and scoping correctly.

For the debate

If you’re hearing “Vietnamese costs more than 2x, just write English to save money”, it can be true in narrow cases (pure diacritic Vietnamese, short prompts, high-volume pay-per-token API). But for typical non-English dev workflow:

  • Only 2.9% of my messages were the “expensive” style, the rest don’t pay that tax.
  • Mixed no-diacritic (49% of usage) is more efficient than pure English on long prompts, with length-matched data to back this up.
  • Session-level, mixed has a signal-to-cost ratio of 1.0x, best of all styles for dialogue efficiency.
  • Subscription plans (Claude Pro/Max, Codex Pro, other coding plans) are flat-rate, token count doesn’t directly become money, so optimizing tokens there is optimizing the wrong metric.

Tokens are the means, correctness in few round-trips is the goal. A 42-token prompt that works in one shot beats a 14-token prompt that needs 3 clarifications.

Methodology notes & caveats

  • Single-user dataset. 5626 messages from one person, me specifically, doing DevOps-heavy AWS work. The pattern might differ for frontend devs, data scientists, or primarily coding-task users. If you run the same benchmark on yourself, results may shift.
  • Tokenizer is cl100k_base. Claude uses its own tokenizer; absolute token counts differ by ±5%. Relative ranking across styles is still valid.
  • Rule-based classifier. Marker word list ~180 VN + ~140 EN tech. 5.6% of messages fall into unknown (slash commands, attachments). Doesn’t affect the main findings.
  • “Out/Prompt ratio” is a proxy, not a perfect metric. It can be biased by task complexity. But across 555 sessions the signal is strong enough.

The script is reproducible, you can run it on your own sessions. Main logic:

# 1. Scan ~/.claude/projects/**/*.jsonl
# 2. Filter events where type == "user" AND role == "user"
# 3. Drop tool_result + system_reminder
# 4. Classify by diacritic + marker words
# 5. Tokenize with tiktoken cl100k_base
# 6. Aggregate per style: tok/char, output tokens, session-level stats
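A fleshed-out sketch of that skeleton. The event field names (`type`, `message.role`, `content`) follow the layout described earlier but should be treated as assumptions about Claude Code’s log schema; the classifier and tokenizer are pluggable stubs so the sketch runs without tiktoken.

```python
import json
from collections import defaultdict
from pathlib import Path

def scan_sessions(root, classify, count_tokens):
    """Walk JSONL session logs under `root` and aggregate tokens/char per
    style. Schema assumptions: user events have type == "user" and a
    message dict with role == "user" and string content."""
    tok, chars = defaultdict(int), defaultdict(int)
    for path in Path(root).glob("**/*.jsonl"):
        for line in path.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            event = json.loads(line)
            if event.get("type") != "user":
                continue
            msg = event.get("message", {})
            if msg.get("role") != "user":
                continue
            content = msg.get("content")
            if not isinstance(content, str):   # skip tool_result blocks etc.
                continue
            style = classify(content)
            tok[style] += count_tokens(content)
            chars[style] += len(content)
    return {s: tok[s] / chars[s] for s in tok if chars[s]}
```

Point it at `~/.claude/projects/`, plug in the real classifier and a cl100k_base wrapper, and you get the per-style tokens/char figures from the tables above.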

If you test and see different patterns, I’d like to know. The Vietnamese (or broader non-English) dev user base may have distinct cohorts, and “don’t switch to English for token reasons” might apply only to some of them.

Closing

Three main conclusions from the real-usage benchmark:

  1. The “>2x token” claim for Vietnamese is technically correct but misframed: it applies to only 2.9% of actual usage for a developer doing real coding work.
  2. Mix-lang (Vietnamese no-diacritic + English technical terms) is optimal for medium/long prompts. Length-matched data confirms it; code-switching research supports it.
  3. Token optimization lives in the structural layer (model, scope, session hygiene), not language choice. Leverage is 40-80% vs 10-15%.

If you’re planning to switch to English just to save tokens, measure on your own sessions first. You might be optimizing a smaller number than what you’re already losing elsewhere.