SLO tracking in Kibana: SLIs, error budgets, and alert progression

The old question in incident reviews was “How long until the alert noise stops?”. The new question after adopting SLOs is “How much downtime does this service have left this month?”. The difference is a shift from reactive thinking (every error is a problem) to budget thinking (errors are a cost, and that cost has a limit).

Kibana 8.x ships an SLOs feature as of version 8.10. Before that, you usually had to build it yourself with transforms and rules. This article covers the official feature first, then describes how to build your own if the cluster cannot be upgraded yet.

After reading, you should understand:

How SLI, SLO, error budget, and burn rate differ.
How to define SLIs from Elasticsearch logs: availability, latency, success rate.
How to set up an SLO in Kibana with a time window and a target.
Alert progression using the Google SRE multi-window pattern.

Vocabulary

Four terms that are easy to confuse, defined up front:

Term	Definition	Example
SLI	Service Level Indicator. A number that measures service quality.	`successful_requests / total_requests`
SLO	Service Level Objective. The target for an SLI over a period of time.	`SLI >= 99.9% over 30 days`
Error budget	The amount by which the SLI may fall short of 100% while still meeting the target.	`100% - 99.9% = 0.1% = 43.2 minutes per 30 days`
Burn rate	How fast the error budget is being consumed.	`1x = budget lasts exactly the window. 14x = budget runs out about 14 times sooner.`
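The error budget arithmetic in the table can be checked with a short calculation. This is a minimal sketch that assumes a time-based reading of the budget (allowed "bad" minutes); the function name `error_budget` is illustrative and not part of Kibana.

```python
from datetime import timedelta


def error_budget(target: float, window: timedelta) -> timedelta:
    """Allowed 'bad' time for a target over a window: (1 - target) * window."""
    return window * (1.0 - target)


budget = error_budget(0.999, timedelta(days=30))
print(round(budget.total_seconds() / 60, 1))  # 43.2 (minutes)
```

Tightening the target from 99% to 99.9% shrinks the budget tenfold, which is why the target deserves more thought than the tooling.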

Note: an SLO is not an SLA. An SLA is a commitment to customers (with penalties for violations). An SLO is an internal target, usually stricter than the SLA to leave a buffer.

Defining SLIs from logs

Before setting up an SLO, you need data that qualifies. The two most common SLIs:

Availability SLI

SLI = good_requests / total_requests
good_requests = response codes 2xx, 3xx, 4xx (anything that is not 5xx)
total_requests = all requests

The logs need:

@timestamp
http.response.status_code (int)
service.name (keyword)

Check in Kibana (Dev Tools Console):

GET request-logs-*/_search
{
  "size": 0,
  "aggs": {
    "total": { "value_count": { "field": "@timestamp" } },
    "good": {
      "filter": { "range": { "http.response.status_code": { "lt": 500 } } }
    }
  }
}

The ratio good.doc_count / total.value is the current SLI.
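To make the ratio concrete, here is a small sketch that reads the two aggregations out of a search response. The response dict is hand-written sample data shaped like the query above (the counts are made up), and `availability_sli` is a hypothetical helper name.

```python
from typing import Optional


def availability_sli(response: dict) -> Optional[float]:
    """Compute good / total from the aggregation results; None if no traffic."""
    aggs = response["aggregations"]
    total = aggs["total"]["value"]      # value_count result
    good = aggs["good"]["doc_count"]    # filter aggregation doc count
    if total == 0:
        return None  # no requests: the SLI is undefined, not 100%
    return good / total


# Hand-written sample, not real cluster output.
sample = {"aggregations": {"total": {"value": 120000}, "good": {"doc_count": 119940}}}
print(f"{availability_sli(sample):.4%}")  # 99.9500%
```

Returning `None` for an empty window is a design choice in this sketch: it keeps "no data" distinct from "perfect availability".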

Latency SLI

SLI = fast_requests / total_requests
fast_requests = response time <= 300ms

The logs need http.response.duration (ms).

"aggs": {
  "total": { "value_count": { "field": "@timestamp" } },
  "fast": {
    "filter": { "range": { "http.response.duration": { "lte": 300 } } }
  }
}

The 300ms threshold depends on the business. P95 latency is usually a reasonable proxy for what users perceive as slow. Measure during peak hours to learn the real baseline.
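When choosing the threshold, it can help to compute the latency SLI and the P95 side by side on a sample of durations. The sketch below uses made-up durations; `latency_sli` and `p95` are illustrative helper names, and the percentile comes from Python's `statistics.quantiles`, which may interpolate differently from the Elasticsearch percentiles aggregation.

```python
import statistics


def latency_sli(durations_ms, threshold_ms=300.0):
    """Fraction of requests that finished at or under the threshold."""
    if not durations_ms:
        raise ValueError("no requests in the window")
    fast = sum(1 for d in durations_ms if d <= threshold_ms)
    return fast / len(durations_ms)


def p95(durations_ms):
    """95th percentile: quantiles(n=100) yields 99 cut points, index 94 is P95."""
    return statistics.quantiles(durations_ms, n=100, method="inclusive")[94]


# Made-up sample of response times in milliseconds.
sample = [120, 180, 250, 290, 300, 310, 450, 95, 140, 900]
print(latency_sli(sample))  # 0.7
print(p95(sample))
```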

Setting up an SLO through the UI

Kibana 8.10+ has Observability → SLOs. The procedure:

Observability → SLOs → Create SLO.
Choose the indicator type:
- Custom KQL: define good/total queries in KQL on any data view
- APM availability: automatic for services that already have APM
- APM latency: automatic for services that already have APM
- Histogram metric: for pre-aggregated metrics
Fill in:
- Data view: request-logs-*
- Good filter: http.response.status_code < 500
- Total filter: leave empty, or filter for a specific service
- Time field: @timestamp
Set the objective:
- Target: 99.9 (%)
- Time window: Rolling 30 days or Calendar aligned monthly
Describe the SLO: name, description, tags, owner.
Save.

Kibana then creates a background transform that precomputes the SLI every minute.

Setting up an SLO through the API

For IaC workflows (Terraform, Pulumi, GitHub Actions):

curl -s -u "$KB_USER:$KB_PASS" \
  -H "kbn-xsrf: true" \
  -H "Content-Type: application/json" \
  -X POST "$KIBANA_URL/api/observability/slos" \
  -d '{
    "name": "Payment service availability",
    "description": "5xx error rate for payment endpoints",
    "indicator": {
      "type": "sli.kql.custom",
      "params": {
        "index": "request-logs-*",
        "good": "http.response.status_code < 500 and service.name : \"payment\"",
        "total": "service.name : \"payment\"",
        "timestampField": "@timestamp"
      }
    },
    "timeWindow": {
      "duration": "30d",
      "type": "rolling"
    },
    "budgetingMethod": "occurrences",
    "objective": {
      "target": 0.999
    },
    "tags": ["service:payment", "team:platform", "tier:critical"]
  }'

The response returns the SLO ID. Save it so you can reference it when creating the burn rate rule.
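In an IaC pipeline you usually generate this body for many services rather than hand-writing it. The following sketch builds the same payload as the curl example from a service name; `build_slo_payload` is a hypothetical helper, and the naming and tagging conventions are assumptions for illustration.

```python
import json


def build_slo_payload(service: str, target: float = 0.999,
                      index: str = "request-logs-*") -> dict:
    """Build a custom KQL SLO body mirroring the curl example above."""
    scope = f'service.name : "{service}"'
    return {
        "name": f"{service} service availability",
        "description": f"5xx error rate for {service} endpoints",
        "indicator": {
            "type": "sli.kql.custom",
            "params": {
                "index": index,
                "good": f"http.response.status_code < 500 and {scope}",
                "total": scope,
                "timestampField": "@timestamp",
            },
        },
        "timeWindow": {"duration": "30d", "type": "rolling"},
        "budgetingMethod": "occurrences",
        "objective": {"target": target},
        "tags": [f"service:{service}"],
    }


print(json.dumps(build_slo_payload("payment"), indent=2))
```

The printed JSON can be passed to the same POST request with `-d @file.json`, which keeps SLO definitions reviewable in version control.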

Budgeting method

Method	How it is calculated	When to use
`occurrences`	Counts good/total requests	Standard for HTTP services
`timeslices`	Splits the window into slices (for example 1m); each slice is good or bad according to the SLI	When traffic is low and occurrences lacks enough data

Timeslices suits low-traffic services (< 100 req/min). It counts “good” minutes instead of “good” requests.
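The two methods can give very different numbers on the same data. The sketch below compares them on four made-up one-minute slices; the `slice_target` parameter (the per-slice SLI a minute must reach to count as good) and the choice to skip empty slices are assumptions of this illustration, not a description of Kibana's internals.

```python
def occurrences_sli(slices):
    """Good requests divided by total requests across all slices."""
    good = sum(g for g, _ in slices)
    total = sum(t for _, t in slices)
    return good / total


def timeslices_sli(slices, slice_target=0.95):
    """Good slices divided by rated slices; empty slices are skipped."""
    rated = [(g / t) >= slice_target for g, t in slices if t > 0]
    return sum(rated) / len(rated)


# (good, total) per one-minute slice, made-up numbers.
minutes = [(10, 10), (9, 10), (0, 2), (10, 10)]
print(occurrences_sli(minutes))  # 0.90625
print(timeslices_sli(minutes))   # 0.5
```

Two of the four minutes miss the per-slice target, so timeslices reports 50% even though about 91% of individual requests succeeded.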

Burn rate alert progression

The Google SRE Workbook recommends running 2-3 alert windows at the same time. Kibana 8.10+ supports this directly.

Multi-window setup

After creating the SLO, go to the SLO detail page → Alerts → Create burn rate rule.

Form:

Window	Burn rate threshold	Severity	Action
Short: 5m	14.4x	Critical	PagerDuty page
Short: 30m	6x	High	Slack page
Long: 1h	1x	Warning	Slack mention
Long: 6h	0.5x	Info	Email weekly

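A burn rate of 14.4x means: if consumption continues at this pace, the error budget runs out in 30d / 14.4 = 2.08 days instead of 30 days.

Short and long windows are combined with AND: the alert triggers only when both windows exceed the threshold. In the rule, each threshold pairs one long window with one short window, for example a 1h long window with a 5m short window at 14.4x. This avoids false positives from a brief spike.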
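The AND logic and the time-to-exhaustion arithmetic can be sketched in a few lines. The function names are illustrative, and treating "at or above the threshold" as a breach is an assumption of this sketch rather than a statement about Kibana's exact comparison.

```python
def days_to_exhaustion(burn_rate: float, slo_window_days: float = 30.0) -> float:
    """How long the full budget lasts if the burn rate stays constant."""
    return slo_window_days / burn_rate


def window_fires(threshold: float, long_burn: float, short_burn: float) -> bool:
    """Alert only when BOTH the long and the short window breach the threshold."""
    return long_burn >= threshold and short_burn >= threshold


print(round(days_to_exhaustion(14.4), 2))                    # 2.08
print(window_fires(14.4, long_burn=2.0, short_burn=50.0))    # False: brief spike
print(window_fires(14.4, long_burn=16.0, short_burn=18.0))   # True: sustained burn
print(window_fires(14.4, long_burn=16.0, short_burn=0.5))    # False: already recovered
```

The short window is what lets the alert clear quickly once the problem is fixed, while the long window filters out spikes that barely touch the budget.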

Configuring through the API

curl -s -u "$KB_USER:$KB_PASS" \
  -H "kbn-xsrf: true" \
  -H "Content-Type: application/json" \
  -X POST "$KIBANA_URL/api/alerting/rule" \
  -d '{
    "name": "Payment SLO burn rate",
    "rule_type_id": "slo.rules.burnRate",
    "consumer": "slo",
    "schedule": {"interval": "1m"},
    "params": {
      "sloId": "<SLO_ID>",
      "windows": [
        {
          "id": "fast",
          "burnRateThreshold": 14.4,
          "maxBurnRateThreshold": 720,
          "longWindow": {"value": 1, "unit": "h"},
          "shortWindow": {"value": 5, "unit": "m"},
          "actionGroup": "slo.burnRate.alert"
        },
        {
          "id": "medium",
          "burnRateThreshold": 6,
          "maxBurnRateThreshold": 120,
          "longWindow": {"value": 6, "unit": "h"},
          "shortWindow": {"value": 30, "unit": "m"},
          "actionGroup": "slo.burnRate.high"
        },
        {
          "id": "slow",
          "burnRateThreshold": 1,
          "maxBurnRateThreshold": 30,
          "longWindow": {"value": 24, "unit": "h"},
          "shortWindow": {"value": 1, "unit": "h"},
          "actionGroup": "slo.burnRate.medium"
        }
      ]
    },
    "actions": [
      {
        "group": "slo.burnRate.alert",
        "id": "<PAGERDUTY_CONNECTOR_ID>",
        "params": {
          "severity": "critical",
          "summary": "Payment SLO burning at {{context.burnRate}}x",
          "dedupKey": "slo-payment-fast"
        }
      }
    ]
  }'

Pitfall: miscalculated thresholds

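A burn rate threshold is a multiple of the error budget, so the error rate it corresponds to depends on the SLO target:

SLO 99.9% (budget 0.1%) → 14.4x = 1.44% error rate
SLO 99% (budget 1%) → 14.4x = 14.4% error rate

The 14.4x figure itself comes from spending 2% of a 30-day budget within 1 hour: 0.02 * 30d / 1h = 14.4. It changes if the SLO window is not 30 days.

Kibana's SLO UI presets burn rate thresholds following Google's standard values. Do not copy an error-rate threshold derived for a 99.9% SLO to a 99.5% SLO without recalculating.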

Formula:

burn_rate = (errors_in_window / requests_in_window) / (1 - SLO_target)

If the SLO target is 0.999 and a 1h window has 100 errors out of 10000 requests (a 1% error rate), then burn_rate = 1% / 0.1% = 10x.
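The formula translates directly into code, and inverting it shows which error rate a given burn rate implies for different targets. This is a minimal sketch; `burn_rate` and `error_rate_at` are illustrative names, and returning 0 for an empty window is an assumption.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error rate divided by the budgeted error rate (1 - target)."""
    if requests == 0:
        return 0.0  # assumption: no traffic burns no budget
    return (errors / requests) / (1.0 - slo_target)


def error_rate_at(burn: float, slo_target: float) -> float:
    """Error rate that corresponds to a burn rate for a given target."""
    return burn * (1.0 - slo_target)


print(round(burn_rate(100, 10_000, 0.999), 2))  # 10.0
print(round(error_rate_at(14.4, 0.999), 4))     # 0.0144
print(round(error_rate_at(14.4, 0.99), 4))      # 0.144
```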

Building your own SLO without the 8.10+ feature

If your cluster is below 8.10, build the SLO with a transform plus a rule.

Transform to precompute the SLI

PUT _transform/slo-payment-1m
{
  "source": {
    "index": "request-logs-*",
    "query": { "term": { "service.name": "payment" } }
  },
  "pivot": {
    "group_by": {
      "ts": {
        "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" }
      }
    },
    "aggregations": {
      "total": { "value_count": { "field": "@timestamp" } },
      "good": {
        "filter": { "range": { "http.response.status_code": { "lt": 500 } } }
      },
      "sli": {
        "bucket_script": {
          "buckets_path": { "g": "good._count", "t": "total" },
          "script": "params.g / params.t"
        }
      }
    }
  },
  "dest": { "index": "slo-payment" },
  "frequency": "30s",
  "sync": { "time": { "field": "@timestamp", "delay": "60s" } }
}

POST _transform/slo-payment-1m/_start

Rule on the precomputed index

curl -s -u "$KB_USER:$KB_PASS" \
  -H "kbn-xsrf: true" \
  -H "Content-Type: application/json" \
  -X POST "$KIBANA_URL/api/alerting/rule" \
  -d '{
    "name": "Payment SLO burn rate (custom)",
    "rule_type_id": ".index-threshold",
    "consumer": "alerts",
    "schedule": {"interval": "1m"},
    "params": {
      "index": ["slo-payment"],
      "timeField": "ts",
      "aggType": "avg",
      "aggField": "sli",
      "groupBy": "all",
      "threshold": [0.9856],
      "thresholdComparator": "<",
      "timeWindowSize": 1,
      "timeWindowUnit": "h"
    }
  }'

This uses the index threshold rule type, which can average a numeric field. The ES query rule type compares the number of matching documents against the threshold, so it cannot express a condition on the average SLI.

Threshold 0.9856 means the average SLI over 1h is below 98.56%, that is, a 1.44% error rate, which burns budget at 14.4x for a 99.9% SLO. Compute your own value with the formula above.
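Two details of the custom rule are worth checking numerically: how a burn rate converts into an SLI threshold, and how a plain average of per-minute SLI values differs from a traffic-weighted one. The sketch below uses made-up buckets; all function names are illustrative.

```python
def sli_threshold(burn: float, slo_target: float) -> float:
    """SLI below which the budget burns faster than `burn`."""
    return 1.0 - burn * (1.0 - slo_target)


def unweighted_avg(buckets):
    """Mean of per-bucket ratios, which is what avg(sli) over the index gives."""
    ratios = [g / t for g, t in buckets if t > 0]
    return sum(ratios) / len(ratios)


def weighted_avg(buckets):
    """Total good over total requests, weighted by traffic."""
    good = sum(g for g, _ in buckets)
    total = sum(t for _, t in buckets)
    return good / total


# (good, total) per minute: one busy minute, one nearly idle minute.
buckets = [(990, 1000), (1, 2)]
print(round(sli_threshold(14.4, 0.999), 4))  # 0.9856
print(round(unweighted_avg(buckets), 4))     # 0.745
print(round(weighted_avg(buckets), 4))       # 0.989
```

A single quiet minute with one failure drags the unweighted average far below the request-level SLI, so a rule on avg(sli) can fire on low-traffic noise. Storing good and total counts in the destination index and alerting on their sums is one way to avoid that.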

Tradeoff: a custom build costs effort to maintain the transform, but you control every detail. Move to the dedicated feature once the cluster is upgraded.

Common mistakes

Case 1: SLI is always 100% in dev

An engineer sets up an SLO for the dev environment and sees the SLI stuck at 100% and the burn rate at 0. The reason: dev has no real traffic. With a total of 5 requests/hour, the math is meaningless.

Fix: track SLOs only for production. Dev/staging should use separate smoke tests, not SLOs.
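A quick calculation shows why low traffic makes the numbers meaningless: with 5 requests in the window, the burn rate is either 0 or enormous. The sketch below is illustrative; `enough_traffic` and its `min_budgeted_errors` parameter are hypothetical names for a sanity check you might run before trusting an SLO.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error rate divided by the budgeted error rate."""
    return (errors / requests) / (1.0 - slo_target)


def enough_traffic(requests: int, slo_target: float,
                   min_budgeted_errors: float = 1.0) -> bool:
    """True if the budget allows at least `min_budgeted_errors` failures."""
    return requests * (1.0 - slo_target) >= min_budgeted_errors


print(round(burn_rate(1, 5, 0.999)))   # 200: one failure looks catastrophic
print(enough_traffic(5, 0.999))        # False
print(enough_traffic(10_000, 0.999))   # True
```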

Case 2: Error budget exhausted after a migration

After a database migration, the service is down for 30 minutes. That consumes about 70% of the 30-day error budget (43.2 minutes). The following week a minor error occurs → burn rate alerts pile up because the budget is nearly gone.

Fix: for large incidents, use a maintenance window in Kibana SLO (which excludes that window from the calculation). Or reset the budget manually after an incident tagged as planned.
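The 70% figure follows from dividing the outage by the budget. As a minimal sketch, assuming a time-based budget and a full outage (every request failing) for the whole 30 minutes:

```python
def budget_consumed(outage_minutes: float, slo_target: float,
                    window_days: int = 30) -> float:
    """Fraction of the window's error budget used by a full outage."""
    budget_minutes = window_days * 24 * 60 * (1.0 - slo_target)
    return outage_minutes / budget_minutes


print(f"{budget_consumed(30, 0.999):.1%}")  # 69.4%
```

With a 99.9% target, any outage longer than about 43 minutes in a 30-day window breaks the SLO on its own.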

Case 3: Burn rate alerts skewed by timezone

A calendar-aligned monthly SLO uses the window “Asia/Ho_Chi_Minh”, but the ES query does not set time_zone. The first 7 hours of the 1st in local time (still the previous day in UTC) are counted toward the previous month, so the new month's budget does not reset at the right moment.

Fix: setting up the SLO through the UI always forces a timezone; through the API you must set it explicitly in the indicator params.

SLO dashboard

Kibana has a ready-made dashboard template for SLOs. Go to Observability → SLOs → click the SLO → Dashboards.

Or build your own with Lens:

Panel 1: gauge “Current SLI” (single value, avg(sli) over 30d).
Panel 2: time series “SLI over time” (line chart).
Panel 3: gauge “Error budget remaining” (bar with thresholds at 25% warning and 10% danger).
Panel 4: history “Burn rate over time” (line chart with a reference line at 1x).
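The value behind the “Error budget remaining” panel can be derived from the current SLI and the target. The sketch below applies the 25% and 10% thresholds from the panel list; the function names and the status labels are illustrative.

```python
def budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent (negative once the SLO is missed)."""
    return 1.0 - (1.0 - sli) / (1.0 - slo_target)


def budget_status(remaining: float) -> str:
    """Map the remaining share to the gauge colors: 25% warning, 10% danger."""
    if remaining <= 0.10:
        return "danger"
    if remaining <= 0.25:
        return "warning"
    return "ok"


print(round(budget_remaining(0.9995, 0.999), 2))  # 0.5
print(budget_status(0.5))    # ok
print(budget_status(0.2))    # warning
print(budget_status(0.05))   # danger
```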

Embed these panels in the service's main dashboard. The on-call engineer opens the dashboard, sees the remaining budget at the top, and knows the service's health within a couple of seconds.

Quick reference

Concept	How to measure / set up
Availability SLI	`count(status_code < 500) / count(*)`
Latency SLI	`count(duration <= threshold) / count(*)`
Rolling SLO	Trailing 30d, recomputed continuously
Calendar SLO	Monthly/weekly, resets at the boundary
Burn rate 14.4x	Budget exhausted in about 2 days (30d SLO)
Multi-window	Short AND long window both exceed the threshold → alert
Maintenance window	Excludes planned incidents from the budget calculation
Custom build	Transform `_transform/<slo>` + rule on the precomputed index

Wrapping up

An SLO is not “alert v2”. It reframes the question: from “is there an error?” to “is there budget left?”. Same data, same infrastructure, but a different operating philosophy. Once a team is used to SLOs, post-mortems shift focus from “who caused the error” to “how much budget was consumed, and which lessons help protect the budget better”.

The next article, the last in Part 3 of the Kibana from A to Z series, covers deduplication and throttling: how to avoid alert fatigue when rules trigger repeatedly. A good SLO with spammy alerts still burns the team out.