Cost and latency: token budget, streaming, prompt caching

Part 1 described an incident: an agent looped 100 times on a file permission denied error and burned $12 in 8 minutes. That was a lesson about max_iterations. This part goes deeper: why that number was $12, and how to bring it under $1 for the same workload.

Cost and latency in an agent are not purely a code problem. They are an architecture problem: which model you call, at which step, how many tokens you send, whether caching is actually active, whether the response streams. Each of those decisions has a price, and that price usually only shows up once real users arrive.

Why agent token cost differs from a chatbot

A chatbot is request-response. One call, one charge. An agent is a loop: each iteration is an API call, and each call resends the entire history.

With Claude Sonnet 4.6 (prices as of May 2026):

Input: $3/MTok (MTok = one million tokens)
Output: $15/MTok

Haiku 4.5:

Input: $0.80/MTok
Output: $4/MTok

Consider a simple 5-iteration agent loop:

Iteration	System prompt	Accumulated history	Tool schemas	Input tokens	Output tokens
1	500 tok	100 tok	300 tok	900	200
2	500 tok	600 tok	300 tok	1400	150
3	500 tok	1050 tok	300 tok	1850	180
4	500 tok	1380 tok	300 tok	2180	120
5	500 tok	1680 tok	300 tok	2480	250

Total: 8,810 input tokens + 900 output tokens.

Sonnet cost: $0.0264 input + $0.0135 output = $0.04 for one 5-step run.
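The arithmetic behind these totals can be reproduced in a few lines. This is a small sketch that uses the per-iteration numbers from the table and the Sonnet prices quoted above; the variable names are illustrative.

```python
# Per-iteration (input_tokens, output_tokens), taken from the table above
ITERATIONS = [(900, 200), (1400, 150), (1850, 180), (2180, 120), (2480, 250)]

# Sonnet prices in USD per million tokens, as quoted in this article
INPUT_PRICE = 3.0
OUTPUT_PRICE = 15.0

total_input = sum(inp for inp, _ in ITERATIONS)
total_output = sum(out for _, out in ITERATIONS)

input_cost = total_input * INPUT_PRICE / 1_000_000
output_cost = total_output * OUTPUT_PRICE / 1_000_000
run_cost = input_cost + output_cost

print(f"input:  {total_input} tok -> ${input_cost:.4f}")
print(f"output: {total_output} tok -> ${output_cost:.4f}")
print(f"total:  ${run_cost:.4f}")
```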

Sounds cheap. But multiply it out:

1,000 user requests/day: $40/day, $1,200/month
Each request loops 10 steps instead of 5: more than double, because every extra step resends a longer history
A bug makes the agent loop 50 steps: at least 10x, $12,000/month or more
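Because every call resends the history, cost does not grow linearly with the number of steps. A rough model makes this visible. It assumes a fixed 800-token prefix, about 400 tokens of history added per step, and 180 output tokens per step; all three are illustrative assumptions loosely based on the table above, not measurements.

```python
def run_cost(
    iterations: int,
    prefix_tokens: int = 800,    # system prompt + tool schemas (assumed fixed)
    history_start: int = 100,    # first user message
    history_growth: int = 400,   # assumed tokens added to history per step
    output_per_step: int = 180,  # assumed average output per step
    input_price: float = 3.0,    # USD per million tokens
    output_price: float = 15.0,
) -> float:
    """Estimate the cost in USD of one agent run without caching."""
    input_tokens = 0
    for step in range(iterations):
        # Each call resends the prefix plus all history accumulated so far
        input_tokens += prefix_tokens + history_start + step * history_growth
    output_tokens = iterations * output_per_step
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

baseline = run_cost(5)
for n in (5, 10, 50):
    print(f"{n:>2} steps: ${run_cost(n):.4f} ({run_cost(n) / baseline:.1f}x the 5-step run)")
```

Under these assumptions, 10 steps cost nearly 3x the 5-step run and 50 steps cost more than 40x, which is why a runaway loop hurts far more than its iteration count suggests.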

Two things jump out from the table above. First: input tokens dominate. The output of one step (120-250 tokens) is much smaller than its input (900 to nearly 2,500 tokens), and even at $15/MTok the output side is only about a third of the run's cost. Second: input grows with every step because history accumulates. Step 5 sends about 2.8 times the input of step 1.

Conclusion: a cost reduction strategy should focus on input tokens, not output.

This is also why the max_iterations limit covered in part 3 on the control loop is not just a safety net. Every extra iteration is real money.

Prompt caching is the biggest saving

The system prompt and tool schemas do not change between iterations. In the example above, 800 tokens (500 system + 300 tools) are sent 5 times. That is 4,000 tokens billed at full price, of which 3,200 are pure repetition.

Anthropic offers prompt caching: when the beginning of the prompt is identical across requests, Anthropic keeps that part in a KV cache. The next request pays the cache read price for it, which is much cheaper.

Cache pricing for Sonnet 4.6:

Cache write: $3.75/MTok (1.25x the regular input price)
Cache read: $0.30/MTok (10x cheaper than regular input)

The default cache TTL is 5 minutes, refreshed each time the cache is read. If the next request arrives after the TTL has expired, it is a cache miss and the prefix is written again at the cache write price.
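To see what these prices mean for the 5-iteration example, here is a sketch that compares the cost of the shared prefix with and without caching. It treats the 800-token prefix as cacheable for the sake of the arithmetic and assumes all five calls land within the TTL; in practice the API only caches prefixes above a model-specific minimum length.

```python
PREFIX_TOKENS = 800   # 500 system prompt + 300 tool schemas, from the earlier table
CALLS = 5             # iterations in one run, assumed to fall within the cache TTL

INPUT_PRICE = 3.00        # USD per million tokens, regular input
CACHE_WRITE_PRICE = 3.75  # the first call writes the prefix to the cache
CACHE_READ_PRICE = 0.30   # later calls read it back

uncached = PREFIX_TOKENS * CALLS * INPUT_PRICE / 1_000_000
cached = (
    PREFIX_TOKENS * CACHE_WRITE_PRICE
    + PREFIX_TOKENS * (CALLS - 1) * CACHE_READ_PRICE
) / 1_000_000
saving = 1 - cached / uncached

print(f"uncached prefix cost: ${uncached:.5f}")
print(f"cached prefix cost:   ${cached:.5f}")
print(f"saving on the prefix: {saving:.0%}")
```

With only five calls, the one-time write premium keeps the saving at roughly two thirds; it approaches 90% as more calls share the same cached prefix.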

1-hour TTL

Anthropic also has a 1-hour TTL option, enabled here with an env var:

ENABLE_PROMPT_CACHING_1H=1

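For agents running long background tasks or overnight batch jobs, a 5-minute TTL is not enough. A 1-hour TTL lets the cache survive across many jobs.

In Claude Code, setting this variable switches the cache to the 1-hour TTL instead of 5 minutes with no code change. When calling the API directly through the Anthropic SDK, do not count on the env var being picked up: the TTL is requested on the cache breakpoint itself, with "cache_control": {"type": "ephemeral", "ttl": "1h"}. One-hour cache writes are billed at a higher rate than 5-minute writes (2x the base input price instead of 1.25x), so the longer TTL pays off only when the prefix really is reused within the hour.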
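At the API level, the TTL is a per-breakpoint setting carried in the `ttl` field of `cache_control`. A minimal sketch of building such a block follows; `cached_text_block` is a hypothetical helper name, and it is worth confirming against the current prompt caching docs that your model and SDK version accept `"ttl": "1h"`.

```python
from typing import Optional

def cached_text_block(text: str, ttl: Optional[str] = None) -> dict:
    """Build a system-prompt text block with a cache breakpoint.

    ttl=None keeps the default 5-minute cache; ttl="1h" requests the 1-hour cache.
    """
    cache_control = {"type": "ephemeral"}
    if ttl is not None:
        cache_control["ttl"] = ttl
    return {"type": "text", "text": text, "cache_control": cache_control}

# The resulting list can be passed as the `system` argument of a Messages API call
system_blocks = [cached_text_block("You are a Kubernetes log analysis agent.", ttl="1h")]
print(system_blocks[0]["cache_control"])
```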

Caching in code: `cache_control`

To declare which part gets cached, use the cache_control marker:

import anthropic

client = anthropic.Anthropic()

# Long system prompt + tool schemas: mark them for caching
system_with_cache = [
    {
        "type": "text",
        "text": """
You are a Kubernetes log analysis agent.
Task: read logs from pods, identify errors, propose fixes.
...
(500+ tokens of system prompt)
""",
        "cache_control": {"type": "ephemeral"}
    }
]

tools_with_cache = [
    {
        "name": "read_pod_logs",
        "description": "Read logs from a specific Kubernetes pod",
        "input_schema": {
            "type": "object",
            "properties": {
                "namespace": {"type": "string"},
                "pod_name": {"type": "string"},
                "tail_lines": {"type": "integer", "default": 100}
            },
            "required": ["namespace", "pod_name"]
        },
        # cache_control goes on the last tool in the list (set below)
    },
    # ... nhiều tools khác ...
]

# Put cache_control on the last tool to cache the entire list
tools_with_cache[-1]["cache_control"] = {"type": "ephemeral"}

def agent_loop(user_input: str, max_iterations: int = 10):
    messages = [{"role": "user", "content": user_input}]
    total_cache_read = 0
    total_cache_write = 0
    total_input = 0
    total_output = 0

    for i in range(max_iterations):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system=system_with_cache,
            tools=tools_with_cache,
            messages=messages,
        )

        # Track cache usage (the cache fields can be None, hence "or 0")
        usage = response.usage
        total_cache_read += getattr(usage, "cache_read_input_tokens", 0) or 0
        total_cache_write += getattr(usage, "cache_creation_input_tokens", 0) or 0
        total_input += usage.input_tokens
        total_output += usage.output_tokens

        messages.append({"role": "assistant", "content": response.content})

        # Stop on anything other than a tool request (end_turn, max_tokens, ...)
        if response.stop_reason != "tool_use":
            break

        if response.stop_reason == "tool_use":
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    # execute_tool is your own dispatcher, assumed to be defined elsewhere
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": str(result),
                    })
            messages.append({"role": "user", "content": tool_results})

    # Cost breakdown
    cost_input = total_input * 3.0 / 1_000_000
    cost_cache_write = total_cache_write * 3.75 / 1_000_000
    cost_cache_read = total_cache_read * 0.30 / 1_000_000
    cost_output = total_output * 15.0 / 1_000_000
    total_cost = cost_input + cost_cache_write + cost_cache_read + cost_output

    print(f"Input: {total_input} tok (${cost_input:.4f})")
    print(f"Cache write: {total_cache_write} tok (${cost_cache_write:.4f})")
    print(f"Cache read: {total_cache_read} tok (${cost_cache_read:.4f})")
    print(f"Output: {total_output} tok (${cost_output:.4f})")
    print(f"Total: ${total_cost:.4f}")

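First iteration: cache write, billed at $3.75/MTok. From iteration 2 onward (within 5 minutes): cache read, billed at $0.30/MTok. That is a 90% saving on the cached portion compared with the regular $3/MTok input price. Note that the API only caches prefixes above a model-specific minimum length (on the order of a thousand tokens or more, depending on the model), so a very short system prompt is not cached at all; check the prompt caching docs for the threshold of the model you use.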

Pitfall: changing one character in the system prompt loses the cache

This bug took me 3 hours to debug.

Setup: an agent running a batch job, a 2,000-token system prompt, an expected cache hit rate of 90%. Reality: 0% cache hits, and a bill double the estimate.

The cause: the system prompt contained this line:

f"Current time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}"

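The timestamp changes every second. Every request carries a different timestamp, Anthropic cannot recognize the prefix as the same one, and the cache misses 100% of the time.

Anthropic's cache is an exact prefix match. If any character in the cached part changes, everything from that point onward is a cache miss. A timestamp near the top of the system prompt therefore invalidates the whole prefix.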

Common sources of cache invalidation:

Source	Fix
Timestamp in the system prompt	Move it into a conversation turn, not the system prompt
User ID or session ID in the system prompt	Put it at the start of messages instead of in system
Random seed, nonce	Keep it out of the cacheable prefix
Dynamic instructions that change per request	Split into static prefix + dynamic suffix, cache only the prefix
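The fixes in the table share one idea: keep the cacheable prefix byte-identical and push anything per-request into the conversation turn. A sketch of that split follows; `build_request` and `prefix_fingerprint` are hypothetical helper names, and the fingerprint is just a cheap way to spot an accidentally changing prefix in your logs.

```python
import hashlib
from datetime import datetime

# Static, cacheable part: no timestamps, IDs, or other per-request values
STATIC_SYSTEM = (
    "You are a Kubernetes log analysis agent. "
    "Read pod logs, identify errors, propose fixes."
)

def build_request(user_input: str, user_id: str, now: datetime) -> dict:
    """Keep the system prompt identical; put dynamic values in the user turn."""
    system = [
        {"type": "text", "text": STATIC_SYSTEM, "cache_control": {"type": "ephemeral"}}
    ]
    dynamic_context = f"Current time: {now:%Y-%m-%d %H:%M:%S}\nUser ID: {user_id}"
    messages = [{"role": "user", "content": f"{dynamic_context}\n\n{user_input}"}]
    return {"system": system, "messages": messages}

def prefix_fingerprint(request: dict) -> str:
    """Hash the cacheable prefix so a changed prefix shows up in logs."""
    text = "".join(block["text"] for block in request["system"])
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

a = build_request("Why is pod api-1 crashing?", "u-1", datetime(2026, 5, 1, 9, 0, 0))
b = build_request("Check pod worker-7", "u-2", datetime(2026, 5, 1, 9, 0, 1))

# The two requests differ in time and user, but share the same cacheable prefix
print(prefix_fingerprint(a) == prefix_fingerprint(b))
```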

The simple rule: the part marked with cache_control must be byte-for-byte identical across requests that should share the cache. Debug by logging cache_creation_input_tokens against cache_read_input_tokens. If every request reports cache_creation_input_tokens > 0 while cache_read_input_tokens stays at 0, you are missing the cache.
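That check is easy to automate. The sketch below works on a list of per-iteration usage numbers; `UsageRecord` is a hypothetical stand-in for the two cache fields of a response's usage object, not an SDK type.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class UsageRecord:
    """Stand-in for the cache-related usage fields of one API response."""
    cache_creation_input_tokens: int
    cache_read_input_tokens: int

def find_cache_misses(records: List[UsageRecord]) -> List[int]:
    """Return the 1-based iterations, from the 2nd on, that read nothing from cache."""
    return [
        i
        for i, rec in enumerate(records, start=1)
        if i > 1 and rec.cache_read_input_tokens == 0
    ]

healthy = [UsageRecord(2000, 0), UsageRecord(0, 2000), UsageRecord(0, 2000)]
broken = [UsageRecord(2000, 0), UsageRecord(2000, 0), UsageRecord(2000, 0)]

print(find_cache_misses(healthy))  # no misses after the first write
print(find_cache_misses(broken))   # every later iteration rewrote the cache
```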

Part 4 on memory explains why the conversation prefix (the entire history up to a given point) can be cached too, not only the system prompt. This technique is called prefix caching and is especially powerful for agents that accumulate a lot of context.
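One common way to cache the conversation prefix is to move a single cache breakpoint to the last content block of the latest message before each call. The sketch below assumes messages are plain dicts (not SDK response objects); `with_history_breakpoint` is a hypothetical helper name. Older markers are removed because the API limits the number of cache breakpoints per request (4 at the time of writing).

```python
import copy
from typing import List

def with_history_breakpoint(messages: List[dict]) -> List[dict]:
    """Return a copy of messages with one cache breakpoint on the last content block."""
    result = copy.deepcopy(messages)
    for message in result:
        # Normalize plain-string content into a list of text blocks
        if isinstance(message["content"], str):
            message["content"] = [{"type": "text", "text": message["content"]}]
        # Drop breakpoints left over from earlier iterations
        for block in message["content"]:
            block.pop("cache_control", None)
    if result and result[-1]["content"]:
        result[-1]["content"][-1]["cache_control"] = {"type": "ephemeral"}
    return result

history = [
    {"role": "user", "content": [
        {"type": "text", "text": "Check pod api-1", "cache_control": {"type": "ephemeral"}}
    ]},
    {"role": "assistant", "content": [{"type": "text", "text": "Reading logs."}]},
    {"role": "user", "content": "Here are the logs you asked for."},
]
marked = with_history_breakpoint(history)
print(marked[-1]["content"][-1]["cache_control"])
```

The original list is left untouched, so the same history can be reused and re-marked on the next iteration.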

Streaming makes the wait feel shorter

Streaming does not reduce total tokens or total cost. It reduces Time To First Token (TTFT): the time from sending the request until the user sees the first word.

For an agent, streaming matters when:

The agent answers the user directly (conversational agent, chatbot with tools)
The task is long and the user needs to see progress
A debug session needs to see what the LLM is “thinking”

For a batch processing job running in the background, streaming does nothing for UX and only adds unnecessary code complexity.

import anthropic

client = anthropic.Anthropic()

def agent_stream(user_input: str):
    messages = [{"role": "user", "content": user_input}]

    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        system="You are an assistant...",
        messages=messages,
    ) as stream:
        full_response = []

        for event in stream:
            if event.type == "content_block_delta":
                if event.delta.type == "text_delta":
                    # Print the text as soon as it arrives
                    print(event.delta.text, end="", flush=True)
                    full_response.append(event.delta.text)

        print()  # newline when done
        final = stream.get_final_message()
        return "".join(full_response), final

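With FastAPI, exposed as a server-sent events endpoint:

import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import anthropic

app = FastAPI()
client = anthropic.Anthropic()

@app.post("/agent/stream")
async def agent_stream_endpoint(user_input: str):
    # user_input is read as a query parameter here; use a request model for a JSON body
    def generate():
        with client.messages.stream(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            system="You are an assistant...",
            messages=[{"role": "user", "content": user_input}],
        ) as stream:
            for text in stream.text_stream:
                # JSON-encode each chunk so newlines in the text cannot break SSE framing
                yield f"data: {json.dumps(text)}\n\n"

    # The sync generator is iterated in a worker thread by StreamingResponse
    return StreamingResponse(generate(), media_type="text/event-stream")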

Important note: while the agent is still in the tool call loop (not yet returning a final result to the user), streaming matters little because the tokens coming back are tool call JSON, not text. Since stop_reason is only known once a response has finished, you cannot switch streaming on after seeing "end_turn"; instead, stream the call you expect to produce the user-facing answer (for example, a dedicated final answer step), not the whole loop.
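When it is hard to know in advance which call will be the last, an alternative is to stream but forward only text deltas to the user and silently drop tool-call fragments. The sketch below filters on the event shapes used in the streaming example above (`content_block_delta` with `text_delta`); the fake events built with `SimpleNamespace` are for illustration only and merely mimic those shapes.

```python
from types import SimpleNamespace
from typing import Iterable, Iterator

def text_deltas_only(events: Iterable) -> Iterator[str]:
    """Yield only user-visible text, skipping tool-call JSON fragments."""
    for event in events:
        if event.type != "content_block_delta":
            continue
        if event.delta.type == "text_delta":
            yield event.delta.text
        # input_json_delta events carry partial tool arguments; not shown to the user

# Fake events shaped like stream events, for illustration only
fake_events = [
    SimpleNamespace(
        type="content_block_delta",
        delta=SimpleNamespace(type="text_delta", text="Checking "),
    ),
    SimpleNamespace(
        type="content_block_delta",
        delta=SimpleNamespace(type="input_json_delta", partial_json='{"pod'),
    ),
    SimpleNamespace(
        type="content_block_delta",
        delta=SimpleNamespace(type="text_delta", text="logs."),
    ),
    SimpleNamespace(type="message_stop"),
]

print("".join(text_deltas_only(fake_events)))
```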

Parallel tool calls: well worth it when done right

Claude supports parallel tool calls: in a single response, the LLM can decide to call several tools at once. Instead of:

step 1: call get_user_info → wait → step 2: call get_order_history → wait → step 3: call get_payment_status → wait

Claude can return:

[
  {"type": "tool_use", "name": "get_user_info", "id": "t1", ...},
  {"type": "tool_use", "name": "get_order_history", "id": "t2", ...},
  {"type": "tool_use", "name": "get_payment_status", "id": "t3", ...}
]

Three tools, one LLM call. You run them in parallel and send all three results back in a single message.

import asyncio
import anthropic

# Async client so the API call does not block the event loop
client = anthropic.AsyncAnthropic()

# TOOLS is assumed to be your list of tool schemas, defined elsewhere

async def execute_tool_async(name: str, args: dict) -> str:
    # Simulated tool execution
    await asyncio.sleep(0.1)
    return f"Result of {name}"

async def agent_with_parallel_tools(user_input: str, max_iterations: int = 10):
    messages = [{"role": "user", "content": user_input}]

    for _ in range(max_iterations):
        response = await client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            tools=TOOLS,
            messages=messages,
        )

        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            return next(
                (b.text for b in response.content if hasattr(b, "text")), ""
            )

        if response.stop_reason == "tool_use":
            tool_calls = [b for b in response.content if b.type == "tool_use"]

            # Run all requested tools concurrently. gather preserves input order,
            # so results line up with tool_calls (a single call works the same way).
            results = await asyncio.gather(
                *(execute_tool_async(tc.name, tc.input) for tc in tool_calls)
            )

            tool_results = [
                {
                    "type": "tool_result",
                    "tool_use_id": tc.id,
                    "content": str(result),
                }
                for tc, result in zip(tool_calls, results)
            ]
            messages.append({"role": "user", "content": tool_results})

    return "Max iterations exceeded"

When tool calls are I/O bound (API call, DB query, file read), parallel execution cuts latency significantly. Three API calls at 200ms each: 600ms sequentially, about 200ms in parallel.
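The 600ms versus 200ms claim is easy to check with simulated I/O. In this sketch, `fake_api_call` is a hypothetical stand-in for a real tool, and the exact timings will vary slightly from run to run.

```python
import asyncio
import time

async def fake_api_call(name: str, delay: float = 0.2) -> str:
    """Stand-in for an I/O-bound tool such as an HTTP request."""
    await asyncio.sleep(delay)
    return f"{name} done"

async def run_sequential(names):
    # One call at a time: total time is the sum of all delays
    return [await fake_api_call(n) for n in names]

async def run_parallel(names):
    # All calls at once: total time is roughly the longest single delay
    return await asyncio.gather(*(fake_api_call(n) for n in names))

async def main():
    names = ["get_user_info", "get_order_history", "get_payment_status"]

    start = time.perf_counter()
    await run_sequential(names)
    sequential = time.perf_counter() - start

    start = time.perf_counter()
    await run_parallel(names)
    parallel = time.perf_counter() - start
    return sequential, parallel

sequential, parallel = asyncio.run(main())
print(f"sequential: {sequential:.2f}s, parallel: {parallel:.2f}s")
```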

When Claude decides to go parallel on its own: when the tools are described as independent of each other and there is no clear dependency. If tool B needs the output of tool A, Claude calls them in sequence (A first, then B). Help Claude see which tools are independent by writing clear descriptions: “Fetch user info from the DB” is clearer than “Query DB”.

Not every step needs a strong model

Not every step in an agent loop needs the top-tier model.

Example: a Kubernetes log analysis agent with this flow:

Plan: decide which namespace and which pods to read logs from (needs reasoning)
Execute: call the log-reading tool, parse the output (needs accuracy)
Summarize: condense the results into a short report (no complex reasoning needed)

Step 1 uses Sonnet 4.6 ($3/MTok input). Step 3 uses Haiku 4.5 ($0.80/MTok input): 3.75x cheaper, lower latency, and good enough for summarization.

import anthropic

client = anthropic.Anthropic()

def planning_step(context: str) -> str:
    """Use Sonnet for the step that needs reasoning."""
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=1024,
        system="You are a planner. Analyze the context and list the steps to carry out.",
        messages=[{"role": "user", "content": context}],
    )
    return response.content[0].text

def summarize_step(raw_results: str) -> str:
    """Use Haiku for the simple summarization step."""
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=512,
        system="Summarize the following results into 3-5 concise key points.",
        messages=[{"role": "user", "content": raw_results}],
    )
    return response.content[0].text

def tiered_agent(user_input: str) -> str:
    # Plan step: Sonnet
    plan = planning_step(user_input)

    # Execute step: Sonnet with tools (needs high accuracy).
    # execution_loop is assumed to be a tool loop like the ones above, defined elsewhere.
    results = execution_loop(plan, model="claude-sonnet-4-6")

    # Summarize step: Haiku (saves cost)
    summary = summarize_step(results)

    return summary

The actual cost saving depends on how tokens are distributed across the steps. If summarization accounts for 30% of total input tokens and runs on Haiku instead of Sonnet, the saving is roughly 30% × (3.0 - 0.8) / 3.0 = 22% of total input cost.
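The same estimate as a small function, so you can plug in your own token distribution. `tier_saving` is an illustrative name, and the prices are the input prices quoted in this article.

```python
def tier_saving(share: float, strong_price: float, cheap_price: float) -> float:
    """Fraction of input cost saved by moving `share` of input tokens to the cheaper model."""
    return share * (strong_price - cheap_price) / strong_price

saving = tier_saving(share=0.30, strong_price=3.0, cheap_price=0.80)
print(f"{saving:.0%} of input cost saved")
```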

Not as large as caching, but easy to implement and with no risk of cache invalidation.

The token budget must live inside the loop

Part 3 on the control loop introduced max_iterations as a safety net. A token budget is a second, more specific safety net.

from dataclasses import dataclass

import anthropic

client = anthropic.Anthropic()

@dataclass
class TokenBudget:
    max_input_tokens: int = 50_000
    max_output_tokens: int = 10_000
    input_used: int = 0
    output_used: int = 0
    cache_read: int = 0
    cache_write: int = 0

    def update(self, usage) -> None:
        self.input_used += usage.input_tokens
        self.output_used += usage.output_tokens
        # The cache fields can be None on some responses, hence "or 0"
        self.cache_read += getattr(usage, "cache_read_input_tokens", 0) or 0
        self.cache_write += getattr(usage, "cache_creation_input_tokens", 0) or 0

    def is_exceeded(self) -> bool:
        return (
            self.input_used > self.max_input_tokens
            or self.output_used > self.max_output_tokens
        )

    def cost_usd(self) -> float:
        return (
            self.input_used * 3.0 / 1_000_000
            + self.cache_write * 3.75 / 1_000_000
            + self.cache_read * 0.30 / 1_000_000
            + self.output_used * 15.0 / 1_000_000
        )

def agent_with_budget(user_input: str, budget: TokenBudget) -> str:
    messages = [{"role": "user", "content": user_input}]

    for i in range(20):  # max_iterations is still needed
        if budget.is_exceeded():
            raise RuntimeError(
                f"Token budget exceeded after {i} iterations. "
                f"Cost so far: ${budget.cost_usd():.4f}"
            )

        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            messages=messages,
        )

        budget.update(response.usage)
        messages.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            return next(
                (b.text for b in response.content if hasattr(b, "text")), ""
            )

        # handle tool_use...

    return "Max iterations exceeded"

Token budget and max_iterations are not redundant. max_iterations catches loops that never terminate. The token budget catches loops whose tokens per iteration are abnormally high (for example, a tool returns 10KB of logs instead of the expected 1KB; multiplied across 20 iterations, that is a different problem entirely).
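A complementary guard is to clip oversized tool results before they enter the history, so one bad tool call cannot inflate every later iteration. This is a sketch: `clip_tool_result` and the 4,000-character limit are illustrative choices, and keeping the tail assumes the most recent log lines matter most.

```python
def clip_tool_result(text: str, max_chars: int = 4000) -> str:
    """Keep the tail of an oversized tool result and note how much was dropped."""
    if len(text) <= max_chars:
        return text
    dropped = len(text) - max_chars
    return f"[truncated {dropped} leading characters]\n{text[-max_chars:]}"

log = "line\n" * 2000  # 10,000 characters of fake log output
clipped = clip_tool_result(log)
print(len(log), len(clipped))
```

In the loops above, this would wrap the tool output before it is stored as the tool_result content.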

The optimization table I would check first

Technique	Impact	Complexity	When to use
Prompt caching (system + tools)	Cost -70% to -90% on the cached portion	Low	Always, from the start
1-hour TTL (`ENABLE_PROMPT_CACHING_1H=1`)	Cache survives longer than 5 minutes	Very low (env var)	Batch jobs, background tasks
Small `max_iterations`	Cost capped	Low	Always
Token budget tracking	Cost predictable	Low	Production, billing-sensitive
Model tier mix (Sonnet plan, Haiku summarize)	Cost -15% to -25%	Medium	When there are clearly simple steps
Parallel tool calls	Latency -50% to -70% on I/O	Medium	When tools are independent
Streaming	Lower TTFT, no cost reduction	Low	Conversational agent
History truncation	Cost capped on long-running agents	Medium	Agents that regularly run > 10 iterations

The three traps I have seen most often

Trap 1: Cache misses from a dynamic system prompt. Described above. Debug: log cache_creation_input_tokens and cache_read_input_tokens on every iteration. If cache_read_input_tokens is still 0 from iteration 2 onward, you are missing the cache.

Trap 2: Underestimating output token cost. Output volume per step is small, but each output token costs 5x an input token ($15 vs $3 per MTok on Sonnet), and the LLM can generate more than you expect when tools take complex arguments. A tool call JSON with nested arguments can run to 500 tokens. Times 20 iterations, that is 10k output tokens, or $0.15. Not much in isolation, but across 10k requests it is $1,500.
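The numbers in this trap, spelled out. The variable names are illustrative, and the price is the Sonnet output price quoted earlier.

```python
OUTPUT_PRICE = 15.0         # USD per million output tokens
tokens_per_tool_call = 500  # one tool call with nested arguments
iterations = 20
requests = 10_000

per_request = tokens_per_tool_call * iterations * OUTPUT_PRICE / 1_000_000
total = per_request * requests
print(f"per request: ${per_request:.2f}, across {requests} requests: ${total:.0f}")
```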

Trap 3: Streaming enabled for the whole loop instead of only the final response. A tool call response is JSON and does not need streaming. Streaming across the entire loop adds connection overhead without helping UX. Stream only the final iteration, when you know the LLM is answering the user directly.

Wrap up by looking at the bill

Cost and latency in an agent are a direct consequence of architecture. The three biggest decisions:

Caching: implement it from the start, do not leave it for “optimize later”. It is hard to add afterwards because it requires refactoring how the system prompt is structured.
max_iterations and token budget: both, not just one. Each catches a different kind of runaway.
Model tier: you do not need Sonnet at every step. Haiku 4.5 is enough for 30-40% of the workload if you design for it.

Do not wait for the bill to climb before adding guardrails. Log tokens per iteration, set a budget, and check cache hits from the first version that has real users. Then read Failure modes: hallucination, infinite loop, hijacking, because an expensive agent is one problem, but an agent that is both expensive and wrong is the real production problem.