Failure modes: hallucination, infinite loop, hijacking

In my first month running an agent on real users, I believed I had covered every edge case. max_iterations was set. Error handling was in place. Retry logic was written. The monitoring dashboard was up.

Then my agent called a tool that did not exist in the schema. Not a tool that had been removed after deploy. It had never existed. The agent made it up.

That was the first time I really understood that hallucination is not just “the model states wrong information”. In an agent context, it is also “the model acts on something that is not real”.

This post catalogs the four most practical fail modes of agents running in production: hallucination, infinite loop, goal hijacking, and sandbagging. Each comes with detection signals, Python code to catch it, and a mitigation strategy.

Fail mode 1: The tool doesn't exist, but the agent calls it anyway

Hallucination in a plain LLM means the model produces false facts. In an agent, it takes three more dangerous forms.

Form 1: Calling a tool that does not exist. The model looks at the task, “decides” a suitable tool must exist, and emits a tool call with a name that is not in the schema. The runtime finds no handler and throws an exception. If the agent loop retries, the model tries again with a different wrong tool name.

Real incident: my log-analysis agent called search_logs_by_regex. That tool was never declared. There was search_logs with a pattern field, but the model chose a different name. The agent retried 5 times with 5 name variants before hitting max_iterations.

Form 2: Calling a tool with args that are not in the schema. The tool exists, but the model passes an undeclared field or a value of the wrong type. Example: the schema requires path: string, and the model passes {"path": "/data", "recursive": true}. The recursive field is not in the schema. Behaviour depends on the runtime: some ignore extra fields, some throw a validation error.

Form 3: High confidence in a wrong result. The model says “created file /tmp/report.csv” when the tool actually returned an error. The model misreads the tool result, concludes the call succeeded, and reports that to the user. This is the most dangerous kind of silent failure because no exception is ever raised.

The validator below covers forms 1 and 2 for unknown tool names, missing required fields, and unknown fields. It does not check value types.

import anthropic
from typing import Any

def validate_tool_call(tool_name: str, tool_input: dict, registered_tools: list[dict]) -> list[str]:
    """Check tool call against schema. Return list of validation errors."""
    errors = []

    # Find tool in registry
    tool_schema = next((t for t in registered_tools if t["name"] == tool_name), None)
    if tool_schema is None:
        errors.append(f"Tool '{tool_name}' not found in registry. Known tools: {[t['name'] for t in registered_tools]}")
        return errors

    schema = tool_schema.get("input_schema", {})
    required = schema.get("required", [])
    properties = schema.get("properties", {})

    # Check required fields
    for field in required:
        if field not in tool_input:
            errors.append(f"Missing required field: '{field}'")

    # Check for unknown fields (hallucinated args)
    for field in tool_input:
        if field not in properties:
            errors.append(f"Unknown field: '{field}' (not in schema)")

    return errors


def safe_execute_tool(tool_name: str, tool_input: dict, registered_tools: list[dict], handlers: dict) -> dict:
    """Execute tool with pre-validation. Return structured result."""
    validation_errors = validate_tool_call(tool_name, tool_input, registered_tools)

    if validation_errors:
        return {
            "success": False,
            "error": "VALIDATION_FAILED",
            "details": validation_errors,
            # Return to LLM with clear message so it can correct itself
            "message": f"Tool call rejected: {'; '.join(validation_errors)}"
        }

    handler = handlers.get(tool_name)
    if handler is None:
        return {"success": False, "error": "NO_HANDLER", "message": f"No handler for '{tool_name}'"}

    try:
        result = handler(**tool_input)
        return {"success": True, "result": result}
    except Exception as e:
        return {"success": False, "error": "EXECUTION_ERROR", "message": str(e)}
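
Form 2 also mentions wrong types, which the validator above does not check. A minimal sketch of a type check, assuming the schema only uses single primitive JSON Schema type names; check_arg_types and JSON_TYPES are hypothetical names, not part of the original code:

```python
# Hypothetical helper, not part of the validator above.
# Maps JSON Schema primitive type names to Python types.
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "array": list,
    "object": dict,
}


def check_arg_types(tool_input: dict, properties: dict) -> list[str]:
    """Return one error per argument whose value does not match its declared type."""
    errors = []
    for field, value in tool_input.items():
        declared = properties.get(field, {}).get("type")
        # Skip unknown fields and anything that is not a single primitive type name
        if not isinstance(declared, str) or declared not in JSON_TYPES:
            continue
        # bool is a subclass of int in Python, so reject it for numeric fields
        is_bool_as_number = isinstance(value, bool) and declared in ("integer", "number")
        if is_bool_as_number or not isinstance(value, JSON_TYPES[declared]):
            errors.append(f"Field '{field}' expected {declared}, got {type(value).__name__}")
    return errors
```

The returned errors could be appended to the list that validate_tool_call builds. Nested schemas, enums, and union types are out of scope for this sketch; a full JSON Schema validator would be the safer choice there.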

Mitigation:

Validate every tool call before executing it. Return a clear error message to the LLM instead of throwing an exception. The model can often correct itself when the error is fed back properly.
Write clear schema descriptions: every field should have a description that spells out the expected value. Models hallucinate args less when the schema is verbose.
After a tool executes, check the result before letting the model draw conclusions: if the tool returned an error code, inject an extra message such as “Tool returned error, do NOT report success to user” into the context.
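
The third mitigation can be sketched as a small wrapper around the tool result. This assumes results are dicts with a "success" key, as safe_execute_tool returns; build_tool_result and FAILURE_GUARD are hypothetical names, and the guard wording is only a starting point:

```python
import json

# Hypothetical guard text; tune the wording for your own prompts.
FAILURE_GUARD = "Tool returned an error. Do NOT report success to the user."


def build_tool_result(tool_use_id: str, result: dict) -> dict:
    """Wrap a handler result as a tool_result block, flagging failures explicitly."""
    failed = not result.get("success", False)
    content = json.dumps(result)
    if failed:
        # Put the warning next to the raw result so the model reads both together
        content = f"{content}\n\n{FAILURE_GUARD}"
    block = {"type": "tool_result", "tool_use_id": tool_use_id, "content": content}
    if failed:
        block["is_error"] = True  # marks the tool result as an error for the model
    return block
```

This lowers the chance of form 3 but does not remove it: the model can still misread a result that reports success while carrying bad data.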

Fail mode 2: Looping forever over a small error

Post 3 on the control loop covered max_iterations as the basic safety net. But max_iterations is only a hard stop. Before reaching that limit, an agent can loop in several different patterns, and each pattern needs its own detection.

Pattern 1: Same-tool-same-args loop. The model calls the same tool with the same args several times in a row. The tool returns the same result. The model calls it again. Cause: the tool returns a result the model cannot interpret, or the model has no other tool to move forward with.

Real incident: an agent looking for a config file called list_dir("/etc/app") 12 times in a row. The directory existed but was empty. The model had no other tool to get out of the dead end. Each round cost ~1500 tokens.

Pattern 2: Oscillation loop. The model calls tool A, gets a result, calls tool B, gets a result, then goes back to tool A with nearly identical args. No cycle brings it closer to the goal.

Pattern 3: Retry spiral. A tool fails. The model retries with slightly different args. It fails again. The model tries another approach. It fails. There is no clear stop condition other than max_iterations.

from collections import deque
import hashlib
import json

class LoopDetector:
    def __init__(self, window: int = 5, threshold: int = 3):
        """
        window: number of most recent tool calls to inspect
        threshold: number of identical calls within the window that triggers detection
        """
        self.window = window
        self.threshold = threshold
        self.call_history: deque = deque(maxlen=window)

    def _fingerprint(self, tool_name: str, tool_input: dict) -> str:
        """Hash a tool call into a short fingerprint for comparison."""
        payload = json.dumps({"tool": tool_name, "input": tool_input}, sort_keys=True)
        return hashlib.md5(payload.encode()).hexdigest()[:8]

    def record(self, tool_name: str, tool_input: dict) -> None:
        fp = self._fingerprint(tool_name, tool_input)
        self.call_history.append(fp)

    def is_looping(self) -> tuple[bool, str]:
        """Return (looping: bool, reason: str)."""
        if len(self.call_history) < self.threshold:
            return False, ""

        # Check exact same-call repeat
        recent = list(self.call_history)
        most_recent = recent[-1]
        repeat_count = sum(1 for fp in recent if fp == most_recent)
        if repeat_count >= self.threshold:
            return True, f"Same tool call repeated {repeat_count} times in last {self.window} calls"

        # Check oscillation: A-B-A-B pattern
        if len(recent) >= 4:
            pattern_ab = recent[-4:-2]
            pattern_cd = recent[-2:]
            if pattern_ab == pattern_cd:
                return True, "Oscillation detected: same 2-call pattern repeated"

        return False, ""


def agent_loop_with_detection(
    user_input: str,
    client: anthropic.Anthropic,
    tools: list,
    handlers: dict,
    max_iterations: int = 20
):
    messages = [{"role": "user", "content": user_input}]
    detector = LoopDetector(window=6, threshold=3)
    iteration_costs = []

    for i in range(max_iterations):
        resp = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            tools=tools,
            messages=messages,
        )

        # Track cost per iteration
        iteration_costs.append({
            "iter": i,
            "input_tokens": resp.usage.input_tokens,
            "output_tokens": resp.usage.output_tokens,
        })

        messages.append({"role": "assistant", "content": resp.content})

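        if resp.stop_reason == "end_turn":
            # Join all text blocks; content[0] is not guaranteed to be a text block
            return "".join(b.text for b in resp.content if b.type == "text")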

        if resp.stop_reason == "tool_use":
            tool_results = []
            for block in resp.content:
                if block.type == "tool_use":
                    detector.record(block.name, block.input)
                    looping, reason = detector.is_looping()
                    if looping:
                        raise RuntimeError(f"Loop detected at iteration {i}: {reason}")

                    result = safe_execute_tool(block.name, block.input, tools, handlers)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(result),
                    })

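            messages.append({"role": "user", "content": tool_results})
        else:
            # e.g. "max_tokens": the reply was cut off, so stop rather than keep looping on it
            raise RuntimeError(f"Unexpected stop_reason at iteration {i}: {resp.stop_reason}")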

    raise RuntimeError(f"Max iterations ({max_iterations}) exceeded")
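
The fingerprint check above only matches identical calls, so it misses Pattern 3, where the args change on every retry. One possible complement is a failure budget. This is a sketch under assumptions: RetryBudget and its default limits are hypothetical, and it expects the caller to report whether each tool call succeeded.

```python
from collections import defaultdict


class RetryBudget:
    """Hypothetical helper: cap consecutive failures per tool and overall."""

    def __init__(self, per_tool: int = 3, overall: int = 5):
        self.per_tool = per_tool
        self.overall = overall
        self.failures: dict[str, int] = defaultdict(int)  # consecutive failures per tool
        self.streak = 0  # consecutive failures across all tools

    def record(self, tool_name: str, success: bool) -> None:
        if success:
            # Any success resets both counters for that tool and the global streak
            self.failures[tool_name] = 0
            self.streak = 0
        else:
            self.failures[tool_name] += 1
            self.streak += 1

    def exhausted(self, tool_name: str) -> bool:
        """True once this tool, or the run as a whole, has failed too many times in a row."""
        return self.failures[tool_name] >= self.per_tool or self.streak >= self.overall
```

The overall streak is there for the case where the model keeps inventing new tool names: each name fails only once, but the run as a whole is not progressing.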

Mitigation:

The LoopDetector above catches same-call repeats and 2-call oscillation. It does not catch “progressing very slowly, but still progressing”. Because it fingerprints exact args, it also misses near-identical calls, such as a retry spiral where the args change slightly each time. Combine it with cost-per-iteration tracking: if cost keeps climbing after 10 iterations with no end_turn, the agent may be in a drift loop.
Add a progress_marker to the tool result: on every iteration, the agent self-assesses “how far have I got”. If the marker has not increased after N iterations, abort.
Cross-link: post 3 on the control loop explains why max_iterations is not enough. Post 21 on eval explains how to replay traces to catch regressions after adding detection logic.
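
The progress_marker idea can be sketched as a small stall detector. It assumes the marker is a number that should grow as the agent gets closer to done (for example, a count of items handled); StallDetector and patience are hypothetical names:

```python
class StallDetector:
    """Hypothetical helper: signal when a progress marker stops increasing."""

    def __init__(self, patience: int = 3):
        self.patience = patience      # how many non-improving updates to tolerate
        self.best = float("-inf")     # highest marker seen so far
        self.stalled_for = 0          # consecutive updates without improvement

    def update(self, marker: float) -> bool:
        """Record the latest marker. Return True once progress has stalled."""
        if marker > self.best:
            self.best = marker
            self.stalled_for = 0
        else:
            self.stalled_for += 1
        return self.stalled_for >= self.patience
```

A self-reported marker is only as honest as the model reporting it, so where possible derive it from something the runtime can observe rather than from the agent's own estimate.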

Fail mode 3: The goal gets pulled off course, little by little

Goal hijacking is when the agent drifts away from the original task. Unlike an infinite loop (doing the right thing, repeatedly), goal hijacking means doing the wrong thing.

It is not necessarily caused by prompt injection (post 24 will cover that). Often it is natural drift from context accumulation.

A common pattern:

The user starts by asking to “analyze the logs and find errors”. The agent reads the logs, spots a questionable config, and decides “actually, the config should be fixed first”. After fixing the config it sees that the service needs a restart, so it restarts it. In the end it returns “system optimized” instead of an error report.

Real incident: a file-management agent was asked by a user to “delete temp files in /tmp/uploads”. The agent did that, noticed another folder that also had temp files, and decided on its own to delete those too. Then it judged a log folder “redundant” as well. In the end it deleted 3 folders that were outside the original scope.

The agent was not injected. It simply over-generalised from a specific goal.

Detection: divergence tracking.

import json

from anthropic import Anthropic

def check_goal_alignment(
    original_task: str,
    recent_actions: list[dict],
    client: Anthropic,
    max_actions_to_check: int = 5
) -> dict:
    """
    Use a separate LLM call to check whether recent actions align with the original task.
    Kept apart from the main agent loop to avoid context contamination.
    """
    actions_summary = "\n".join([
        f"- {a['tool']}: {json.dumps(a['input'])[:100]}"
        for a in recent_actions[-max_actions_to_check:]
    ])

    check_prompt = f"""Original task: {original_task}

Recent agent actions:
{actions_summary}

Are these actions aligned with the original task? Reply with:
ALIGNED: <brief reason>
or
DRIFTED: <what the agent is doing instead, and from which action the drift started>

Be strict. If actions are expanding scope beyond what the original task requires, it is drift."""

    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=200,
        messages=[{"role": "user", "content": check_prompt}],
    )
    verdict = resp.content[0].text.strip()
    drifted = verdict.startswith("DRIFTED")
    return {"drifted": drifted, "verdict": verdict}


class GoalTracker:
    def __init__(self, original_task: str, client: Anthropic, check_every: int = 5):
        self.original_task = original_task
        self.client = client
        self.check_every = check_every
        self.action_log: list[dict] = []

    def record_action(self, tool_name: str, tool_input: dict) -> None:
        self.action_log.append({"tool": tool_name, "input": tool_input})

        if len(self.action_log) % self.check_every == 0:
            result = check_goal_alignment(
                self.original_task,
                self.action_log,
                self.client,
            )
            if result["drifted"]:
                raise RuntimeError(f"Goal drift detected: {result['verdict']}")

Mitigation:

Scope the tool list per task: an agent handling a read-only task does not need write tools. Restrict tools with a filter before passing them to the LLM.
The system prompt must be explicit about scope: “Only do X. Do not do Y even if Y seems related.”
Check alignment periodically, as GoalTracker does above. It costs one extra LLM call every 5 actions, which is cheaper than cleaning up after drift.
Important pitfall: check_goal_alignment uses a separate LLM call as the judge. If the judge hallucinates too, the detector misses real drift. No detector is bulletproof.
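
The tool-scoping mitigation can be sketched as a filter applied before the tools reach the LLM. The task profiles and the delete_file name below are hypothetical and only illustrate the shape; the real mapping depends on your own tools:

```python
# Hypothetical task profiles; the tool names are illustrative only.
TASK_PROFILES = {
    "log_analysis": {"search_logs", "read_file", "list_dir"},  # read-only tools
    "cleanup": {"list_dir", "delete_file"},
}


def scope_tools(all_tools: list[dict], task_type: str) -> list[dict]:
    """Return only the tool schemas allowed for this task type."""
    # An unknown task type gets no tools rather than all of them
    allowed = TASK_PROFILES.get(task_type, set())
    return [t for t in all_tools if t["name"] in allowed]
```

Defaulting to an empty tool list for unknown task types is a deliberate choice here: failing closed is easier to notice than an agent that quietly has write access it should not have.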

Fail mode 4: The agent does less than it is capable of

Sandbagging is when the agent returns incomplete results but reports the task as done. Unlike hallucination (stating things that are not real), sandbagging is doing the work at a level below what was asked.

The most common form: the agent is asked to “check all 500 log files and find error patterns”. It reads the first 20 files, sees a few errors, and reports “checked and found N errors”. The remaining 480 files are never checked.

Cause: the model is optimized to produce a response that satisfies the user, not to complete the task in full. Once it has “enough” results to compose a plausible answer, it stops. This reflects the training objective; it is not a code bug.

Another cause: the context window is nearly full. The model starts skipping steps because it has no room left for intermediate results.

Detection: coverage tracking.

class CoverageTracker:
    def __init__(self, expected_items: list, item_type: str = "item"):
        """
        expected_items: list of items the agent must process
        item_type: label used in logging ("file", "record", "API endpoint", etc.)
        """
        self.expected = set(str(i) for i in expected_items)
        self.processed = set()
        self.item_type = item_type

    def mark_processed(self, item: str) -> None:
        self.processed.add(str(item))

    def coverage_report(self) -> dict:
        # Count only expected items so stray IDs cannot inflate coverage
        done = self.processed & self.expected
        missing = self.expected - done
        pct = len(done) / len(self.expected) * 100 if self.expected else 100.0
        return {
            "total": len(self.expected),
            "processed": len(done),
            "missing_count": len(missing),
            "coverage_pct": round(pct, 1),
            "missing_sample": sorted(missing)[:5],  # first 5 for debugging
        }

    def is_complete(self, min_coverage: float = 1.0) -> bool:
        # Compare exact counts; the rounded percentage can hide missing items
        if not self.expected:
            return True
        done = len(self.processed & self.expected)
        return done / len(self.expected) >= min_coverage


# Example: wrap a tool so it tracks coverage automatically
def make_tracked_tool(original_handler, tracker: CoverageTracker, item_extractor):
    """
    original_handler: function that handles the tool call
    tracker: CoverageTracker instance
    item_extractor: function that extracts the item ID from the tool input
    """
    def tracked_handler(**kwargs):
        item_id = item_extractor(kwargs)
        result = original_handler(**kwargs)
        tracker.mark_processed(item_id)
        return result
    return tracked_handler

After the agent reaches end_turn, verify coverage before accepting the result:

def verify_task_completion(tracker: CoverageTracker, min_coverage: float = 0.95) -> None:
    report = tracker.coverage_report()
    if not tracker.is_complete(min_coverage):
        raise RuntimeError(
            f"Task incomplete: only {report['coverage_pct']}% coverage "
            f"({report['processed']}/{report['total']} {tracker.item_type}s). "
            f"Missing: {report['missing_sample']}"
        )

Mitigation:

Explicit task decomposition in the system prompt: “You must process EVERY item in the list. Do not stop early.”
Mandatory pagination: split the list into small batches, and require the agent to confirm each batch is done before moving to the next.
Tool injection to force coverage: add a mark_item_done(item_id) tool to the schema. The agent must call it for every item, which creates an audit trail. The runtime then verifies that all items have been marked.
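
A minimal sketch of the mark_item_done idea, assuming tool schemas use the same name / description / input_schema shape as the validator earlier. AuditTrail and the return fields are hypothetical:

```python
# Hypothetical tool schema in the same shape as the other tool definitions.
MARK_ITEM_DONE_TOOL = {
    "name": "mark_item_done",
    "description": "Call once for every item after you have fully processed it.",
    "input_schema": {
        "type": "object",
        "properties": {
            "item_id": {"type": "string", "description": "ID of the item just processed"},
        },
        "required": ["item_id"],
    },
}


class AuditTrail:
    """Hypothetical runtime-side record of which items the agent claims to have finished."""

    def __init__(self, expected_ids):
        self.expected = {str(i) for i in expected_ids}
        self.marked: list[str] = []  # ordered; duplicates are kept for auditing

    def mark_item_done(self, item_id: str) -> dict:
        """Handler for the mark_item_done tool."""
        if str(item_id) not in self.expected:
            return {"ok": False, "message": f"Unknown item_id '{item_id}'"}
        self.marked.append(str(item_id))
        return {"ok": True, "remaining": len(self.unmarked())}

    def unmarked(self) -> set[str]:
        """Expected items that were never marked; check this after end_turn."""
        return self.expected - set(self.marked)
```

Note that this only records what the agent claims. An agent can still mark an item without really processing it, so the trail is most useful when cross-checked against what the other tool handlers actually saw.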

Pitfall: detectors only catch the obvious failures

Three incidents illustrate why detection is not enough.

Incident 1: The loop detector does not catch drift-then-loop. The agent drifts away from the goal (fail mode 3) and then, inside the new goal, starts looping (fail mode 2). LoopDetector catches the loop inside the new goal. But when you fix the loop and report “fixed loop”, the agent is still running the wrong task. Fail mode 3 happened before fail mode 2 was detected.

Incident 2: Silent sandbagging through tool wrapping. The agent called process_files(file_list=[...]) instead of calling a tool once per file. The tool handler processed the list internally, but my CoverageTracker tracked tool calls, not the items inside the list. The agent called the tool once with the full list, the tracker reported 100% coverage, yet the handler only processed the first 10 items. A completely silent failure.
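
One way to close the gap in Incident 2 is to report coverage from inside the batch handler, per item. This is a sketch, not the handler from the incident: the process_one and on_item_done parameters are assumptions made for illustration.

```python
def process_files(file_list: list[str], process_one, on_item_done) -> dict:
    """Batch handler sketch: report coverage per item, not per tool call.

    process_one: function that handles a single file (hypothetical)
    on_item_done: callback fired after each success, e.g. tracker.mark_processed
    """
    processed, failed = [], []
    for path in file_list:
        try:
            process_one(path)
        except Exception as e:
            # Record the failure and keep going so one bad file does not hide the rest
            failed.append({"path": path, "error": str(e)})
            continue
        on_item_done(path)
        processed.append(path)
    # Returning both counts lets the caller spot requested != processed
    return {"requested": len(file_list), "processed": len(processed), "failed": failed}
```

With this shape, a handler that silently stops after 10 items shows up as a mismatch between requested and processed, and the tracker only ever sees items that were really handled.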

Incident 3: Hallucination through tool result caching. I added a cache for tool results to reduce cost. The agent called read_file("/config.yaml") a second time and got the cached result from the first call. The file had been modified in between. The agent acted on stale data and reached a wrong conclusion, with no exception raised. validate_tool_call does not check for a stale cache.
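
For file reads specifically, one possible guard is to make the cache key depend on the file's state. A sketch assuming a local filesystem; read_file_cached is a hypothetical stand-in for the cached handler, not the code from the incident:

```python
import os

# Cache keyed by (path, mtime in ns, size) so an edited file misses the cache
_cache: dict[tuple[str, int, int], str] = {}


def read_file_cached(path: str) -> str:
    """Return file contents, re-reading whenever mtime or size has changed."""
    st = os.stat(path)
    key = (path, st.st_mtime_ns, st.st_size)
    if key not in _cache:
        with open(path, encoding="utf-8") as f:
            _cache[key] = f.read()
    return _cache[key]
```

This still costs one stat call per read, and entries for old versions are never evicted in this sketch. An edit that keeps the size and lands within the filesystem's timestamp granularity would also be missed, so treat it as a reduction of the risk, not a guarantee.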

Lesson from the three incidents: even the best detection only catches failures you already knew could happen and wrote a detector for. Unknown failure modes slip straight through. This is why post 21 on eval matters: a golden set and trace replay help detect regressions after you add a detector, making sure the new detector does not break existing behavior.

Summary table of the 4 fail modes

Fail mode	Signal	Detection	Mitigation
Hallucination (tool)	Tool call with a nonexistent name or args outside the schema; agent reports success but output is missing	Schema validation before execute; check the tool result before the agent concludes	Verbose schema with descriptions; return validation errors to the LLM; verify the tool result
Infinite loop	Same tool call repeated; cost climbing steadily with no end_turn; A-B-A-B oscillation	`LoopDetector` by fingerprint; cost-per-iteration tracking	`max_iterations` kept small enough; loop detector aborts early; add an “exit tool” so the agent can leave a dead end on its own
Goal hijacking	Actions outside the original task's scope; tool calls unrelated to the original task	`GoalTracker` with a periodic LLM judge; scope restriction on the tool list	Minimal tool list per task; system prompt explicit about scope; GoalTracker check every N actions
Sandbagging	Agent reaches end_turn early; actual coverage lower than reported; long task finishes too quickly	`CoverageTracker` per item; verify after end_turn	Force pagination; add a `mark_done` tool to create an audit trail; explicit prompt about completeness

What I want to remember when debugging agents

These four fail modes are not theory. They are things I and many others have run into when running real agents on real users. Detectors help, but they are not enough: silent failures always live just outside what you already know to detect.

Understanding the fail modes is the necessary condition. The sufficient condition is having evals to catch regressions as the codebase changes, and tracing to reconstruct an incident when a failure falls outside what you expected.

Do not design detectors as if failures only come from accidental bugs. The next post, Security: prompt injection, tool sandboxing, secrets, covers the harder case: data or users deliberately pulling the agent out of scope.