Local models for vibe coding: Ollama, llama.cpp, tool-call models, and timeouts

Local models sound like a natural fit for controlled vibe coding: the data stays on your machine, the repo is never sent to a cloud model, token costs are lower, and you can run them whenever you want. Reality is harsher: a local model is not automatically safer, and it does not automatically make a better agent.

For vibe coding, the bottleneck is not just how well the model chats. The bottleneck is tool calling, timeouts, context, memory, loop control, and the ability to stop at the right moment.

This post ties into the Local LLM stack 2026 refresh series, and in particular the post on failure modes of local agent loops. Here we focus on one decision only: when to use a local model for vibe coding, and how to test it before giving it write access.

A local model is a privacy/control choice

Choose a local model because:

the repo never leaves your machine;
sensitive prompts do not pass through a cloud provider;
it can run offline in some situations;
you have tighter control over the model and runtime;
cost is more predictable if you already own the hardware.

Do not choose a local model just because you assume it is "safer". If a local agent has broad shell access, it can still wreck files. If a local model calls a tool incorrectly, it can still produce a junk diff. If you mount your entire home directory into the agent, the privacy benefit stops meaning much.

Running locally reduces one kind of risk: data being sent off the machine. It does not reduce the risk from permissions, file writes, destructive commands, or hallucinated changes.

Tool calling matters more than chat quality

A model can answer fluently in chat and still fail inside an agent loop. The reason is that an agent needs output that matches the tool-call schema exactly.

Common failures:

arguments are not valid JSON;
the model calls a tool that does not exist;
the model repeats the same tool call;
the model does not know when to stop;
the model returns prose instead of a structured call;
the adapter converts incorrectly between OpenAI-compatible and Anthropic-style formats.

With Ollama or llama.cpp, you need to verify that the model's tool calling is stable in the runtime you actually use, not just read a benchmark.
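Several of the failures listed above can be caught mechanically before a tool call is executed. The sketch below is one possible pre-execution check, not the behavior of any particular harness: the `KNOWN_TOOLS` registry, the tool names, and the `check_tool_call` function are all hypothetical, and it assumes the runtime hands you the tool name plus the arguments as a raw string.

```python
import json

# Hypothetical tool registry: tool name -> required argument names.
KNOWN_TOOLS = {
    "list_files": {"path"},
    "read_file": {"path"},
}


def check_tool_call(name, raw_arguments, previous_calls):
    """Return a list of problems found in one tool call (empty list = looks fine)."""
    problems = []
    if name not in KNOWN_TOOLS:
        problems.append(f"unknown tool: {name}")
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        problems.append("arguments are not valid JSON")
        return problems
    if not isinstance(args, dict):
        problems.append("arguments must be a JSON object")
        return problems
    missing = KNOWN_TOOLS.get(name, set()) - set(args)
    if missing:
        problems.append(f"missing arguments: {sorted(missing)}")
    # Sort keys so the same call is recognised as a repeat regardless of key order.
    signature = (name, json.dumps(args, sort_keys=True))
    if signature in previous_calls:
        problems.append("repeated identical tool call")
    previous_calls.add(signature)
    return problems
```

Pass the same `previous_calls` set for every call in one run so that repeats are detected. A non-empty result is a reason to stop the loop or re-prompt rather than execute the call.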

Minimal smoke test:

Read the current folder and list files. Do not edit anything.

If the read-only test already fails, do not grant write access.

Second smoke test:

Create one file called local-model-smoke-test.md with three bullet points. Do not create any other file.

Review with Git:

git status --short
git diff

If the agent created exactly one file, continue. If it generated a project scaffold or modified other files, the boundary is not holding yet.
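The "exactly one file" check can also be scripted so it is not left to eyeballing. This is a minimal sketch, assuming the short status format of two status characters, a space, then the path; the function names are illustrative. It does not handle paths that Git quotes because of unusual characters, and by default Git reports an untracked directory as a single entry ending in `/`.

```python
import subprocess


def changed_paths(status_output):
    """Extract paths from `git status --short` output."""
    paths = []
    for line in status_output.splitlines():
        if not line.strip():
            continue
        # Columns 0-1 are the status codes, column 2 is a space.
        path = line[3:]
        # Renames are reported as "old -> new"; keep the new path.
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        paths.append(path)
    return paths


def boundary_respected(status_output, allowed):
    """True only if the changed paths are exactly the allowed ones."""
    return set(changed_paths(status_output)) == set(allowed)


def current_status():
    """Run git in the current directory; assumes git is installed and this is a repo."""
    result = subprocess.run(
        ["git", "status", "--short"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

For the second smoke test, `boundary_respected(current_status(), ["local-model-smoke-test.md"])` should be true on a repo that was clean before the run.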

A timeout is a safety feature

Local models are usually slower than cloud frontier models, especially on a CPU or a weak GPU. A 10-turn agent loop can stretch to 30 minutes. Without a timeout, you cannot tell whether it is working, hung, or looping.

Practical rules:

one-shot task: timeout of 2-5 minutes;
read-only repo summary: timeout of 5 minutes;
small file edit: timeout of 10 minutes;
no long autonomous runs until a turn cap is in place.

If the harness lets you configure max_turns, set it low at first:

max_turns = 4
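To make the idea of a turn cap concrete, here is a minimal sketch of a capped loop. It is not how any specific harness implements `max_turns`; `run_agent` and the `step` callable are hypothetical, with `step` standing in for one model/tool round trip that returns True when the task is finished.

```python
def run_agent(step, max_turns=4):
    """Call step(turn) until it reports completion or the turn cap is hit."""
    for turn in range(1, max_turns + 1):
        if step(turn):
            return {"status": "done", "turns": turn}
    # The cap was reached without completion: stop instead of looping on.
    return {"status": "turn_cap_reached", "turns": max_turns}
```

The point of the design is that hitting the cap is an explicit, reportable outcome rather than a silent hang.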

If the harness has no such setting, wrap the process in a shell timeout where that makes sense:

timeout 600 your-agent-command

macOS does not ship GNU timeout by default. You can use gtimeout from coreutils, or the timeout mechanism built into your tool or harness. Do not assume this command runs unchanged on every machine.
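One portable alternative is to enforce the timeout from Python's standard library instead of the shell. The sketch below assumes the agent can be started as an ordinary child process; `run_with_timeout` is an illustrative name, and the command you pass in replaces the `your-agent-command` placeholder.

```python
import subprocess
import sys


def run_with_timeout(command, seconds):
    """Run a command; return (finished, returncode). returncode is None on timeout."""
    try:
        completed = subprocess.run(command, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before raising TimeoutExpired.
        return False, None
    return True, completed.returncode


# Stand-in for a slow agent: a child Python process that sleeps for 5 seconds.
SLOW_COMMAND = [sys.executable, "-c", "import time; time.sleep(5)"]
```

Calling `run_with_timeout(SLOW_COMMAND, 1)` returns `(False, None)` after about one second. Only the direct child is killed, so an agent that spawns its own subprocesses may need extra cleanup.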

One-shot first, agent loop later

In the early stages, a local model suits one-shot use better than an agent loop.

A good one-shot prompt:

Given this error message, explain likely cause and suggest the next command. Do not edit files.

An agent loop prompt is riskier:

Fix the bug end to end.

With one-shot, the user is still the control loop. The model only proposes. With an agent loop, the model decides, calls tools, and edits files all at once. If tool calling is unreliable, errors compound very quickly.

Raise permissions one tier at a time:

Chat read-only.
Read files.
Propose diff.
Write one file.
Write multiple files in a toy repo.
Write to the real repo in a dedicated branch.

Do not jump from the first tier straight to the last.
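The tiers above can be encoded so that skipping a step is impossible by construction. This is only an illustrative sketch: the `Access` enum and `promote` function are hypothetical names, and "smoke test passed" is whatever check you run at the current tier.

```python
from enum import IntEnum


class Access(IntEnum):
    CHAT_READ_ONLY = 1
    READ_FILES = 2
    PROPOSE_DIFF = 3
    WRITE_ONE_FILE = 4
    WRITE_TOY_REPO = 5
    WRITE_REAL_REPO_BRANCH = 6


def promote(current, smoke_test_passed):
    """Move up exactly one tier after a passing smoke test; otherwise stay put."""
    if not smoke_test_passed or current == Access.WRITE_REAL_REPO_BRANCH:
        return current
    return Access(current + 1)
```

Because `promote` only ever advances by one, reaching the last tier requires a passing check at every tier before it.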

Ollama and llama.cpp: test in the real runtime

Ollama and llama.cpp each have their own fast-moving tool-calling ecosystem. The same model can behave differently depending on the runtime, chat template, context length, quantization, and API compatibility.

Before using a model for vibe coding, record:

model name and quantization;
runner version;
context length;
tool-call mode/schema;
hardware;
average latency;
failures observed.

Example lab note:

Model: qwen-code-xx via Ollama
Runner: Ollama version X
Context: 8192
Hardware: Mac M3 Max, 36GB RAM
Smoke test: read-only pass, single-file write pass
Observed issue: repeated tool call after missing file
Verdict: OK for one-file edits, not OK for autonomous repo refactor

If you do not record this, the next time you switch models you will not know what made the result better or worse.
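Keeping the note in a structured form makes runs easier to compare later. The sketch below is one possible shape, not a standard format: the `LabNote` fields mirror the checklist and example above, the field names themselves are assumptions, and the values not shown in the example note default to None rather than being guessed.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class LabNote:
    model: str
    runner: str
    context_length: int
    hardware: str
    smoke_test: str
    observed_issue: str
    verdict: str
    # Checklist items that the example note does not fill in.
    quantization: Optional[str] = None
    tool_call_mode: Optional[str] = None
    average_latency_s: Optional[float] = None


def to_json_line(note):
    """Serialise one note as a single JSON line, suitable for appending to a log file."""
    return json.dumps(asdict(note), ensure_ascii=False)
```

Appending one line per test run to a file such as a hypothetical `lab-notes.jsonl` gives you a history you can search when a model or runner version changes.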

When a local model is good enough

A local model is good enough for:

summarizing small files;
writing a README or lab note;
explaining an error;
building a small static prototype;
simple text/copy refactors;
suggesting a test checklist;
classifying non-sensitive logs.

A local model should not yet be used on its own for:

auth;
payment;
database migration;
production deploy;
secret handling;
large multi-file refactors;
security-sensitive code;
tasks that need long, multi-turn tool calling.

This is not because local models are always worse. It is because the cost of failure on these tasks is high, while local agent loops still need more verification.

Wrapping up

A local model for vibe coding is worth trying when you need privacy and control. But treat it as a runtime that needs testing, not as a cheaper version of a cloud coding agent.

Test read-only first. Test a one-file write next. Set a timeout. Set max turns. Record the model and runtime. Only grant write access to a real repo once the smoke tests pass.