Disk full and shard imbalance: an Elasticsearch recovery procedure without data loss

3 a.m. PagerDuty reports “ES cluster red, ingest dropped”. You log in and check: two nodes at 96% disk, one at 78%. Indices have been set read-only. Filebeat is queueing logs on 200 machines. You have 30 minutes before the Filebeat buffers overflow and logs start getting dropped.

Almost every DevOps engineer running ELK hits this situation sooner or later. I wrote this post as a hands-on runbook: the order of the steps, the exact commands, and which decisions to avoid under pressure.

Goals of this post:

Understand the ES disk watermark mechanism
A step-by-step recovery procedure for a full disk
Free disk safely without corrupting data
Rebalance shards after recovery
Patterns to prevent a repeat

Part 1: ES disk watermarks

ES has three watermark thresholds:

Watermark	Default	Behavior when exceeded
`cluster.routing.allocation.disk.watermark.low`	85%	No new shards are allocated to this node
`cluster.routing.allocation.disk.watermark.high`	90%	ES tries to relocate existing shards to other nodes
`cluster.routing.allocation.disk.watermark.flood_stage`	95%	Every index with a shard on this node is set read-only

Flood stage is the last line of defense. Once it is exceeded, ES sets the index block index.blocks.read_only_allow_delete: true on the affected indices. Ingest fails, Kibana writes fail, alert rules fail.

The read-only block protects the cluster from corruption (writing to a full disk can leave truncated files). It is a safety mechanism, not a bug. But if you don't know that, it is easy to panic and start typing random chmod or rm -rf commands.
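To make the three thresholds concrete, here is a minimal sketch that classifies a disk percentage against them. The function name watermark_state is my own, the defaults are the ones from the table, and treating "exceeded" as strictly greater is an assumption about the boundary case.

```shell
# Classify a disk usage percentage against the watermarks.
# Defaults are 85 / 90 / 95; pass other values to model an override.
watermark_state() (
  pct=$1
  low=${2:-85}
  high=${3:-90}
  flood=${4:-95}
  if [ "$pct" -gt "$flood" ]; then
    echo "flood_stage"   # indices on this node become read-only
  elif [ "$pct" -gt "$high" ]; then
    echo "high"          # ES tries to move shards away
  elif [ "$pct" -gt "$low" ]; then
    echo "low"           # no new shards allocated here
  else
    echo "ok"
  fi
)

# Usage:
# watermark_state 97
# watermark_state 93 92 94 97   # with temporarily raised watermarks
```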

Check current state:

curl -sS "http://es:9200/_cat/allocation?v"

shards disk.indices disk.used disk.avail disk.total disk.percent host
   234         85gb     97gb        3gb      100gb           97 es-1
   234         84gb     92gb        8gb      100gb           92 es-2
   200         70gb     78gb       22gb      100gb           78 es-3

Node es-1 is past flood stage (97%), so every index with a shard on it is write-blocked. Node es-2 (92%) is above the high watermark.
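At 3 a.m. it helps not to read that table by eye. The sketch below filters the `_cat/allocation?v` output down to the nodes above a threshold. The function name nodes_over is hypothetical, and it assumes the header contains the disk.percent and host columns as in the sample output above; rows with missing values (such as an UNASSIGNED row) are simply skipped.

```shell
# Read `_cat/allocation?v` output on stdin and print "<host> <disk.percent>"
# for every node whose disk.percent is above the given threshold.
nodes_over() {
  awk -v limit="$1" '
    NR == 1 {
      # Locate columns by header name so column order does not matter.
      for (i = 1; i <= NF; i++) {
        if ($i == "disk.percent") pct = i
        if ($i == "host") host = i
      }
      next
    }
    pct && host && $pct + 0 > limit + 0 { print $host, $pct }
  '
}

# Usage:
# curl -sS "http://es:9200/_cat/allocation?v" | nodes_over 90
```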

Part 2: Verify the block before unblocking

Do NOT unblock before you have freed disk. Unblocking and then writing to a full disk can crash ES and may corrupt segments.

Verify:

curl -sS "http://es:9200/_settings?include_defaults=true&filter_path=*.settings.index.blocks*"

Any blocked index will show:

{
  "app-logs-2026.05.16": {
    "settings": {
      "index": {
        "blocks": {"read_only_allow_delete": "true"}
      }
    }
  }
}
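If many indices are blocked, a count is easier to track than a wall of JSON. This is a rough sketch that counts the block in a `_settings` response. The name count_blocked is my own, and it uses plain grep to stay dependency-free; a JSON-aware tool such as jq would be more robust if it is installed.

```shell
# Count indices carrying the flood-stage block in a `_settings` response
# read from stdin. Matches both compact and pretty-printed JSON.
count_blocked() {
  grep -o '"read_only_allow_delete" *: *"true"' | wc -l | tr -d ' '
}

# Usage:
# curl -sS "http://es:9200/_settings?filter_path=*.settings.index.blocks*" \
#   | count_blocked
```

Run it again after the unblock step: the count should drop to 0.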

Part 3: Phase 1, free disk

Step 1: Identify largest indices

curl -sS "http://es:9200/_cat/indices?v&s=store.size:desc&bytes=gb" | head -20

health index                pri rep docs store.size
green  app-logs-2026.04.15   5   1  10m  45gb
green  app-logs-2026.04.16   5   1  9.5m 42gb
green  app-logs-2026.04.17   5   1  9.8m 43gb
...

The oldest and largest indices are the targets.
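Picking "oldest" can be scripted when index names carry a date. A minimal sketch, assuming names follow the prefix-YYYY.MM.DD pattern used in this post, so that a lexical sort is also a chronological one; oldest_indices is a hypothetical helper name.

```shell
# Print the N oldest indices for a prefix, given index names on stdin.
# Anything that does not match prefix-YYYY.MM.DD is ignored.
oldest_indices() (
  prefix=$1
  count=$2
  sed 's/[[:space:]]*$//' \
    | grep "^${prefix}-[0-9][0-9][0-9][0-9]\.[0-9][0-9]\.[0-9][0-9]\$" \
    | LC_ALL=C sort \
    | head -n "$count"
)

# Usage:
# curl -sS "http://es:9200/_cat/indices/app-logs-*?h=index" \
#   | oldest_indices app-logs 3
```

It only prints candidates. Review the list before snapshotting and deleting anything.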

Step 2: Snapshot before deleting (MANDATORY)

Never delete an index that has no snapshot, even late at night under pressure.

# Verify the snapshot repository exists
curl -sS "http://es:9200/_snapshot/_all"

# Trigger snapshot
curl -X PUT "http://es:9200/_snapshot/s3-backup/incident-$(date +%Y%m%d-%H%M%S)" \
  -H 'Content-Type: application/json' \
  -d '{
    "indices": "app-logs-2026.04.15,app-logs-2026.04.16,app-logs-2026.04.17",
    "include_global_state": false,
    "metadata": {"reason": "pre-deletion-disk-full"}
  }'

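Wait for the snapshot to finish before moving on to the next step. GET /_snapshot/s3-backup/_status lists snapshots that are still running; the snapshot itself should end in state SUCCESS.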
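Waiting by hand invites skipping ahead, so here is a sketch of a polling loop. It assumes the snapshot name was saved when the snapshot was triggered, and that GET /_snapshot/<repo>/<name> reports a "state" field. The function names and the ES_URL variable are my own, and the sed extraction is a shortcut that a JSON-aware tool would do more safely.

```shell
ES_URL=${ES_URL:-http://es:9200}

# Fetch the snapshot's JSON. Kept separate so it can be swapped out.
fetch_snapshot_json() {
  curl -sS "$ES_URL/_snapshot/$1/$2"
}

# Pull the "state" value (SUCCESS, IN_PROGRESS, FAILED, PARTIAL, ...)
# out of a snapshot response read from stdin.
snapshot_state() {
  sed -n 's/.*"state" *: *"\([A-Z_]*\)".*/\1/p' | head -n 1
}

# Poll until the snapshot leaves IN_PROGRESS. Returns 0 only on SUCCESS.
wait_for_snapshot() (
  repo=$1
  name=$2
  interval=${3:-10}
  while :; do
    state=$(fetch_snapshot_json "$repo" "$name" | snapshot_state)
    case "$state" in
      SUCCESS) echo "snapshot $name: SUCCESS"; exit 0 ;;
      IN_PROGRESS) sleep "$interval" ;;
      *) echo "snapshot $name: ${state:-unknown}" >&2; exit 1 ;;
    esac
  done
)

# Usage:
# wait_for_snapshot s3-backup "$SNAPSHOT_NAME" && echo "safe to delete"
```

Anything other than SUCCESS, including PARTIAL, is treated as a failure on purpose: a partial snapshot is not a basis for deleting data.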

Step 3: Delete old indices

curl -X DELETE "http://es:9200/app-logs-2026.04.15"
curl -X DELETE "http://es:9200/app-logs-2026.04.16"

Each delete frees disk space right away. Wait 30 seconds, then check _cat/allocation. Once the nodes are below 90%, move on to phase 2.
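How many indices to delete follows from simple arithmetic: used space minus the space allowed at the target percentage. A small sketch, with a hypothetical gb_to_free helper and whole-GB integer math:

```shell
# GB that must be freed on a node to get down to a target percentage.
# The allowed usage is rounded down, so the result errs toward freeing more.
gb_to_free() (
  used_gb=$1
  total_gb=$2
  target_pct=${3:-90}
  allowed=$(( total_gb * target_pct / 100 ))
  need=$(( used_gb - allowed ))
  if [ "$need" -lt 0 ]; then
    need=0
  fi
  echo "$need"
)

# Usage, with the es-1 numbers from the allocation output (97gb of 100gb):
# gb_to_free 97 100 90
```

For es-1 that is 97 - 90 = 7 GB on that node. Since an index's shards are spread over several nodes, deleting a 45gb index frees only part of that on any single node, which is why re-checking _cat/allocation after each delete matters.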

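Step 4: Alternative when there is no snapshot

Worst case: snapshots were never set up. Two options:

Option A: copy the data to S3 by hand

# On the ES node, stream the index folder straight to S3 so nothing extra is
# written to the full disk. The data path depends on path.data and the ES version.
INDEX_UUID=$(curl -sS "http://es:9200/_cat/indices/app-logs-2026.04.15?h=uuid")
tar czf - "/var/lib/elasticsearch/nodes/0/indices/${INDEX_UUID}" \
  | aws s3 cp - s3://emergency-backup/backup-old.tar.gz

Then delete. It is not pretty, and a file-level copy is not a supported backup: the archive only holds the shards stored on that node, so repeat it on every node that has shards of the index. Treat it as a last-resort copy, not a guaranteed rollback.

Option B: accept the risk and delete

Only do this when the business accepts losing the data in the old indices. Communicate first.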

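Part 4: Phase 2, remove the read-only flag

Since ES 7.4 the block is released automatically once disk usage on the node falls back below the high watermark (90%). Older versions do NOT remove it on their own, and even on newer ones you can remove it by hand instead of waiting: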

curl -X PUT "http://es:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'

Or per index pattern:

curl -X PUT "http://es:9200/app-logs-*/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'

Verify ingest resumed:

# Tail the Filebeat or Vector log
journalctl -u filebeat -n 50 -f

If new logs are being indexed successfully after 1-2 minutes, the write path is restored.

Part 5: Phase 3, rebalance

Disk has been freed, but shards may still be skewed: while the nodes were above the watermarks, ES would not allocate or relocate shards onto them.

curl -sS "http://es:9200/_cat/allocation?v"

shards disk.indices disk.used disk.avail disk.total disk.percent host
   220         65gb     78gb       22gb      100gb           78 es-1
   215         63gb     76gb       24gb      100gb           76 es-2
   210         60gb     72gb       28gb      100gb           72 es-3

Three nodes at roughly the same level is fine. If they are still skewed:

# Trigger a reroute and retry allocations that failed earlier
curl -X POST "http://es:9200/_cluster/reroute?retry_failed=true"

Or move a specific shard:

curl -X POST "http://es:9200/_cluster/reroute" \
  -H 'Content-Type: application/json' \
  -d '{
    "commands": [{
      "move": {
        "index": "app-logs-2026.05.16",
        "shard": 0,
        "from_node": "es-1",
        "to_node": "es-3"
      }
    }]
  }'

Wait for _cluster/health to return green.
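To put a number on "roughly the same level", the sketch below computes the gap between the fullest and emptiest node from `_cat/allocation?v`. The name disk_spread is my own, and what counts as an acceptable gap (say, under 10 points) is a judgment call rather than an ES rule.

```shell
# Read `_cat/allocation?v` output on stdin and print the gap between the
# fullest and emptiest node, in percentage points of disk.percent.
disk_spread() {
  awk '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "disk.percent") col = i
      next
    }
    # Skip rows without a numeric value (e.g. an UNASSIGNED row).
    col && $col ~ /^[0-9]+$/ {
      v = $col + 0
      if (!seen || v > max) max = v
      if (!seen || v < min) min = v
      seen = 1
    }
    END { if (seen) print max - min; else print "n/a" }
  '
}

# Usage:
# curl -sS "http://es:9200/_cat/allocation?v" | disk_spread
```

With the post-recovery output above (78, 76, 72) the gap is 6 points; before recovery (97, 92, 78) it was 19.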

Part 6: Phase 4, postmortem and prevention

After the incident, don't go to bed yet. Set up prevention while the context is still fresh.

Action 1: ILM policy

Index Lifecycle Management rolls over and ages out old indices automatically. Pattern:

curl -X PUT "http://es:9200/_ilm/policy/app-logs-policy" \
  -H 'Content-Type: application/json' \
  -d '{
    "policy": {
      "phases": {
        "hot": {
          "actions": {
            "rollover": {"max_age": "1d", "max_size": "50gb"}
          }
        },
        "warm": {
          "min_age": "7d",
          "actions": {
            "shrink": {"number_of_shards": 1},
            "forcemerge": {"max_num_segments": 1}
          }
        },
        "cold": {
          "min_age": "30d",
          "actions": {
            "searchable_snapshot": {"snapshot_repository": "s3-backup"}
          }
        },
        "delete": {
          "min_age": "90d",
          "actions": {"delete": {}}
        }
      }
    }
  }'

Attach the policy to an index template so new indices inherit it automatically. You should never have to delete by hand again.
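A sketch of what attaching the policy can look like. The template name app-logs-template and the rollover alias app-logs are assumptions of mine, not something from the cluster described here. Note that the rollover action needs a write alias (or a data stream), so a shipper that creates date-named indices directly would have to switch to writing through the alias.

```shell
# Print the body for PUT _index_template/<name>.
# $1 = index prefix, also used as the rollover alias (assumption)
# $2 = ILM policy name
ilm_template_body() {
  cat <<EOF
{
  "index_patterns": ["$1-*"],
  "template": {
    "settings": {
      "index.lifecycle.name": "$2",
      "index.lifecycle.rollover_alias": "$1"
    }
  }
}
EOF
}

# Usage (not run here):
# ilm_template_body app-logs app-logs-policy \
#   | curl -X PUT "http://es:9200/_index_template/app-logs-template" \
#       -H 'Content-Type: application/json' -d @-
```

Printing the body first makes it easy to review the JSON before it is sent to the cluster.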

Action 2: Disk monitoring alert

Create a Kibana alert rule that fires when disk usage exceeds 80%:

curl -X POST "http://kibana:5601/api/alerting/rule" \
  -H 'kbn-xsrf: true' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: ApiKey ...' \
  -d '{
    "name": "Disk usage warning",
    "rule_type_id": ".es-query",
    "consumer": "alerts",
    "schedule": {"interval": "5m"},
    "params": {
      "index": [".monitoring-es-*"],
      "esQuery": "{...query on node disk percent...}",
      "threshold": [80],
      "thresholdComparator": ">"
    },
    "actions": [...]
  }'

Or use Prometheus + Alertmanager if that infrastructure is already in place.

Action 3: Auto-snapshot daily

curl -X PUT "http://es:9200/_slm/policy/daily-snapshot" \
  -H 'Content-Type: application/json' \
  -d '{
    "schedule": "0 30 1 * * ?",
    "name": "<daily-snap-{now/d}>",
    "repository": "s3-backup",
    "config": {
      "indices": ["app-logs-*"],
      "ignore_unavailable": true,
      "include_global_state": false
    },
    "retention": {
      "expire_after": "30d",
      "min_count": 7,
      "max_count": 30
    }
  }'

SLM (Snapshot Lifecycle Management) runs daily on its own and retains snapshots for 30 days. You should never have to take a manual snapshot in the middle of an incident again.

Action 4: Temporarily raise the watermarks

During an incident, if you need to buy time while freeing disk, you can raise the watermarks temporarily:

curl -X PUT "http://es:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{
    "transient": {
      "cluster.routing.allocation.disk.watermark.low": "92%",
      "cluster.routing.allocation.disk.watermark.high": "94%",
      "cluster.routing.allocation.disk.watermark.flood_stage": "97%"
    }
  }'

Important: use transient, not persistent. Transient settings revert on a full cluster restart, but don't rely on that: set these keys back to null as soon as the incident is over so the override does not outlive it.
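Removing the override is the step that tends to get forgotten, so it is worth having ready. Setting a key to null in _cluster/settings removes the override and restores the default. The sketch below builds that body for any list of keys; reset_transient_body is a hypothetical helper name.

```shell
# Build a _cluster/settings body that sets every given transient key to
# null, which removes the override.
reset_transient_body() (
  body=""
  for key in "$@"; do
    if [ -n "$body" ]; then
      body="$body,"
    fi
    body="$body\"$key\":null"
  done
  printf '{"transient":{%s}}\n' "$body"
)

# Usage (not run here):
# reset_transient_body \
#   cluster.routing.allocation.disk.watermark.low \
#   cluster.routing.allocation.disk.watermark.high \
#   cluster.routing.allocation.disk.watermark.flood_stage \
#   | curl -X PUT "http://es:9200/_cluster/settings" \
#       -H 'Content-Type: application/json' -d @-
```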

Part 7: Shard imbalance unrelated to disk

Sometimes disk is fine but shards are still skewed. Causes:

Cause 1: Custom routing

The index template's mapping has _routing.required: true and most data arrives with the same routing key:

{"mappings": {"_routing": {"required": true}}}

All data for customer X lands on the same shard. Fix: drop custom routing or reindex.
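The reason one routing key means one shard is that the shard is picked from a hash of the key. The sketch below illustrates the idea with a simplified formula, hash(key) mod number_of_shards. It uses cksum (a CRC) as a stand-in for the murmur3 hash ES actually uses, so the shard numbers will not match a real cluster; shard_for is a hypothetical name.

```shell
# Map a routing key to a shard number: hash(key) mod number_of_shards.
# cksum is only a stand-in hash for illustration.
shard_for() (
  key=$1
  shards=$2
  hash=$(printf '%s' "$key" | cksum | cut -d ' ' -f 1)
  echo $(( hash % shards ))
)

# Usage: every document routed by the same key maps to the same shard,
# no matter how many documents that customer sends.
# shard_for customer-x 5
# shard_for customer-x 5
```

The mapping is deterministic, so one heavy customer fills one shard while the others stay small.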

Cause 2: Allocation awareness is unbalanced

If cluster.routing.allocation.awareness.attributes is set, ES distributes shard copies across the values of that attribute (e.g. zone). If the zones are uneven, shards end up skewed.

Check:

curl -sS "http://es:9200/_cluster/settings?include_defaults=true&flat_settings=true" \
  | grep awareness

Cause 3: A new node has not received its share of shards yet

You just scaled out and shards have not finished moving. Wait 30 minutes and check again. If nothing has changed, trigger a reroute:

curl -X POST "http://es:9200/_cluster/reroute?retry_failed=true"

Part 8: A real-world story

A production ELK incident I handled in early 2026: a 6-node cluster, 4 nodes at 94%, 2 nodes at 60%. The cause of the skew: the 4 “old” nodes belonged to an older storage class, and the 2 “new” nodes that had just been scaled out belonged to a different one. Allocation awareness was configured on a data_tier tag that defaulted to hot, while the 2 new nodes were tagged warm. The app-logs-* indices were only allowed to allocate to hot, so they never rebalanced onto the new nodes.

How it was handled:

Free disk: delete the 5 oldest days after taking a snapshot (15 minutes)
Unblock read-only (1 minute)
Update the index template to remove the tier filter so new indices can allocate to both hot and warm
Manually move 30 shards from the “old” nodes to the “new” nodes via the reroute API
Postmortem: the old ILM policy had no warm phase, so it was rewritten. SLM had not been set up, so it was set up on the spot.

Total recovery time: 45 minutes. No data lost (the snapshot was good). The lessons learned went into the official runbook.

Part 9: Anti-patterns

Avoid at all costs when panicking:

Action	Why to avoid
`rm -rf /var/lib/elasticsearch/...`	Corrupts cluster state. Use the delete API.
Restarting an ES node while disk is full	It may not come back up. Free disk first.
Dropping replicas to buy disk	If a node fails while replicas are off, the primary is lost. Snapshot first.
Setting `number_of_replicas: 0` permanently	Redundancy is gone completely. Temporary only.
Force-allocating an empty primary	The data in that shard is lost permanently.
Ignoring a “warning” alert for a week	By the time the cluster is red, it is too late.

Quick checklist

Situation	Action
Disk 90%+	Snapshot + delete oldest indices
Index read-only after flood stage	Free disk -> `PUT /_settings {"index.blocks.read_only_allow_delete": null}`
Shard unassigned	`GET /_cluster/allocation/explain` -> read the reason
Hot shard	Manual reroute or reindex
Cluster yellow after scaling	Wait for recovery to finish, monitor `_cat/recovery`
Cluster red	Identify the red index, check for lost primaries, restore from snapshot
Prevention	ILM + SLM + alerts + add disk before 80%

Wrapping up

Recovering from a full disk is not hard if you have a procedure. The danger lies in panicking and skipping steps. When PagerDuty goes off at 3 a.m., you need a runbook clear enough to follow, not a command copied from the first Google result.

The final part of the Kibana from A to Z series covers in-depth performance tuning: JVM heap sizing, field caps cache, merge throttling. These are things you only touch once you have mastered the operational layers above.