Dashboard-as-code workflow: NDJSON, Git, and CI/CD for Kibana

A classic story: a tech lead builds a beautiful “Error Overview” dashboard and shares the link with the team. Two months later the cluster is reinstalled for a security patch and the dashboard is gone. Ask the tech lead: he has moved to another team. Ask DevOps: “whoever built it keeps it”. Result: the whole team spends two days fumbling to rebuild it from scratch.

The problem is not Kibana. The problem is that dashboards live in Elasticsearch's .kibana index, not in Git. One disk failure, one reinstall, or one mistyped DELETE /.kibana, and everything is gone. This article teaches the dashboard-as-code pattern: treat saved objects as source code that is version-controlled, CI-validated, and promoted across environments.

Goals of this article:

Understand why saved objects belong in Git
Organize a repo for NDJSON files
Write a GitHub Actions pipeline that imports into staging and production
Validate NDJSON before import, and detect drift afterwards
Apply the environment promotion pattern: dev -> staging -> prod

Part 1: Mental model

A Kibana saved object is a document in Elasticsearch's .kibana_* indices. Every dashboard, visualization, Lens, alert rule, and data view is a document with an ID and a type. When you click “Save” in the GUI, Kibana writes the document to that index.

The NDJSON export format is a dump of those documents, one saved object per line:

{"id":"f00bar","type":"dashboard","attributes":{...},"references":[...]}
{"id":"baz123","type":"lens","attributes":{...},"references":[...]}

References form a dependency graph: a dashboard references Lens objects, a Lens references a data view, an alert rule references a connector. When the objects are spread across several files, import them in dependency order (data view before Lens before dashboard).
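As a rough illustration of that dependency graph, the sketch below derives a dependencies-first order from the `references` arrays using Python's standard `graphlib` (3.9+). The function name `import_order` is hypothetical, and the sketch assumes every line is a saved object with `type`, `id`, and an optional `references` list, as in the two sample lines above; lines without `type`/`id` are skipped.

```python
import json
from graphlib import TopologicalSorter


def import_order(ndjson_text):
    """Return (type, id) pairs ordered so that dependencies come first."""
    graph = {}
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        obj = json.loads(line)
        if "type" not in obj or "id" not in obj:
            continue  # not a saved object (assumption: safe to ignore)
        key = (obj["type"], obj["id"])
        graph[key] = {(r["type"], r["id"]) for r in obj.get("references", [])}
    # Ignore references to objects that are not part of this text.
    known = set(graph)
    sorter = TopologicalSorter({k: deps & known for k, deps in graph.items()})
    return list(sorter.static_order())
```

A cycle in the references raises `graphlib.CycleError`, which is arguably the right failure mode for a CI check.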

Why treat them like code:

Versioned: you see who changed what and when, and you can roll back
Reproducible: rebuilding a cluster = git clone + import script
Promotable: dev -> staging -> prod through the CI pipeline
Reviewable: a PR diff shows exactly how the dashboard JSON changed
Backed up: git push = a backup on GitHub that does not depend on ES snapshots

Part 2: Repo layout

A layout that has run in production for 30+ dashboards:

infra-kibana/
├── README.md
├── .github/workflows/
│   ├── validate.yml          # CI lint, schema check
│   └── deploy.yml             # Deploy on merge to main
├── scripts/
│   ├── export.sh              # Helper: export 1 dashboard
│   ├── import.sh              # Helper: import 1 NDJSON file
│   └── diff.sh                # Compare local vs cluster
├── dashboards/
│   ├── application/
│   │   ├── error-overview.ndjson
│   │   ├── api-latency.ndjson
│   │   └── deployment-tracker.ndjson
│   ├── infrastructure/
│   │   ├── disk-usage.ndjson
│   │   ├── pod-restarts.ndjson
│   │   └── ingress-traffic.ndjson
│   └── security/
│       └── auth-failures.ndjson
├── data-views/
│   ├── app-logs.ndjson
│   ├── metrics.ndjson
│   └── audit-logs.ndjson
├── alert-rules/
│   ├── error-burst.ndjson
│   ├── disk-warning.ndjson
│   └── auth-anomaly.ndjson
└── connectors/
    └── slack.ndjson

Inside dashboards/, group by domain (application, infrastructure, security) rather than by object type. The reason: an “Application Error Overview” dashboard needs its data view, Lens, and alert rule close together so they are easy to navigate.

An NDJSON file should bundle one dashboard and all of its dependencies: Lens objects, data views, related saved searches. When importing into a new cluster, one file is one complete deliverable.
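One way to check that a file really is a self-contained deliverable is sketched below. The helper name `bundle_problems` is hypothetical; it assumes the "one dashboard per file" convention described above and treats any reference that is not satisfied inside the same file as a problem.

```python
import json


def bundle_problems(ndjson_text):
    """List reasons why a file is not a self-contained single-dashboard bundle."""
    objs = [json.loads(line) for line in ndjson_text.splitlines() if line.strip()]
    objs = [o for o in objs if "type" in o and "id" in o]  # skip non-object lines
    ids = {(o["type"], o["id"]) for o in objs}

    problems = []
    dashboards = [o for o in objs if o["type"] == "dashboard"]
    if len(dashboards) != 1:
        problems.append(f"expected 1 dashboard, found {len(dashboards)}")
    for o in objs:
        for ref in o.get("references", []):
            if (ref["type"], ref["id"]) not in ids:
                problems.append(
                    f"{o['type']}:{o['id']} references missing {ref['type']}:{ref['id']}"
                )
    return problems
```

An empty list means the file can be imported into an empty cluster without relying on anything that was imported earlier.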

Part 3: Export script

scripts/export.sh:

#!/usr/bin/env bash
set -euo pipefail

KIBANA_URL="${KIBANA_URL:-http://localhost:5601}"
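API_KEY="${KIBANA_API_KEY:?KIBANA_API_KEY must be set}"
# Default to empty so the usage message prints instead of a "set -u" unbound-variable error
DASHBOARD_ID="${1:-}"
OUTPUT_FILE="${2:-}"

if [ -z "$DASHBOARD_ID" ] || [ -z "$OUTPUT_FILE" ]; then
  echo "Usage: $0 <dashboard-id> <output.ndjson>" >&2
  exit 1
fi

# --fail makes curl exit non-zero on HTTP errors instead of saving the error body
curl -sS --fail \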
  -H "Authorization: ApiKey ${API_KEY}" \
  -H "kbn-xsrf: true" \
  -H "Content-Type: application/json" \
  -X POST "${KIBANA_URL}/api/saved_objects/_export" \
  -d "{
    \"objects\": [{\"type\":\"dashboard\",\"id\":\"${DASHBOARD_ID}\"}],
    \"includeReferencesDeep\": true
  }" \
  -o "$OUTPUT_FILE"

echo "Exported to: $OUTPUT_FILE"
wc -l "$OUTPUT_FILE"

The most important part is includeReferencesDeep: true. Without it, the export contains only the dashboard itself, and importing that file into a clean cluster fails on missing references.

After exporting, open the file. The NDJSON contains a few volatile fields that should be normalized before you commit:

# Strip volatile fields, one compact object per line
jq -c 'del(.updated_at, .created_at, .version)' \
  export.ndjson > dashboards/application/error-overview.ndjson

Stripping updated_at, created_at, and version keeps Git diffs free of noise. Kibana regenerates these fields on import.
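The same normalization can be sketched in Python, which also makes it easy to sort keys so the output does not depend on the order in which Kibana serialized them. The function name `normalize_ndjson` is hypothetical, and the list of volatile fields is just the three named above. If you adopt a normalizer like this, apply the same one to both the committed files and any export you later compare them against, otherwise key order alone will show up as a difference.

```python
import json

# The three volatile fields named in the text above.
VOLATILE_FIELDS = ("updated_at", "created_at", "version")


def normalize_ndjson(text):
    """Drop volatile fields and emit compact, key-sorted JSON, one object per line."""
    lines = []
    for line in text.splitlines():
        if not line.strip():
            continue
        obj = json.loads(line)
        for field in VOLATILE_FIELDS:
            obj.pop(field, None)
        lines.append(
            json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
        )
    return "\n".join(lines) + "\n"
```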

Part 4: Import script

scripts/import.sh:

#!/usr/bin/env bash
set -euo pipefail

KIBANA_URL="${KIBANA_URL:-http://localhost:5601}"
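API_KEY="${KIBANA_API_KEY:?KIBANA_API_KEY must be set}"
# Default to empty so the usage message prints instead of a "set -u" unbound-variable error
NDJSON_FILE="${1:-}"
OVERWRITE="${2:-false}"

if [ -z "$NDJSON_FILE" ]; then
  echo "Usage: $0 <file.ndjson> [overwrite]" >&2
  exit 1
fi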

ENDPOINT="${KIBANA_URL}/api/saved_objects/_import"
if [ "$OVERWRITE" = "true" ]; then
  ENDPOINT="${ENDPOINT}?overwrite=true"
fi

RESPONSE=$(curl -sS \
  -H "Authorization: ApiKey ${API_KEY}" \
  -H "kbn-xsrf: true" \
  -X POST "${ENDPOINT}" \
  --form file=@"${NDJSON_FILE}")

SUCCESS=$(echo "$RESPONSE" | jq -r '.success')
COUNT=$(echo "$RESPONSE" | jq -r '.successCount')

if [ "$SUCCESS" != "true" ]; then
  echo "Import FAILED: $RESPONSE" >&2
  exit 1
fi

echo "Imported ${COUNT} objects from ${NDJSON_FILE}"

Note: success: true is not the whole story. The response can still carry warnings, and successResults shows what happened to each object. Production CI should parse successResults (and errors when the import fails) and alert on any conflict.
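A minimal sketch of that parsing step follows. It assumes the response shape that the script above already relies on (`success`, `successCount`) plus `errors`, `successResults`, and `warnings` arrays, with each error carrying an `error.type` such as `conflict` or `missing_references` and each success result optionally carrying `overwrite`. Treat those field names as assumptions and verify them against the import API documentation for your Kibana version; `summarize_import` itself is a hypothetical helper.

```python
import json


def summarize_import(response_text):
    """Condense an import response into the facts a CI job should act on."""
    resp = json.loads(response_text)
    errors = resp.get("errors") or []
    results = resp.get("successResults") or []

    def label(item):
        return f"{item.get('type')}:{item.get('id')}"

    def of_type(kind):
        return [label(e) for e in errors if (e.get("error") or {}).get("type") == kind]

    return {
        "ok": resp.get("success") is True and not errors,
        "success_count": resp.get("successCount", 0),
        "conflicts": of_type("conflict"),
        "missing_references": of_type("missing_references"),
        "overwritten": [label(r) for r in results if r.get("overwrite")],
        "warnings": resp.get("warnings") or [],
    }
```

A CI step could fail the job when `ok` is false and post `overwritten` and `warnings` to the team channel when it is true.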

Part 5: GitHub Actions pipeline

.github/workflows/validate.yml:

name: Validate Kibana objects

on:
  pull_request:
    paths:
      - 'dashboards/**'
      - 'data-views/**'
      - 'alert-rules/**'
      - 'connectors/**'

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Check NDJSON syntax
        run: |
          for f in $(find dashboards data-views alert-rules connectors -name '*.ndjson'); do
            while IFS= read -r line; do
              echo "$line" | jq empty || { echo "Invalid JSON in $f"; exit 1; }
            done < "$f"
          done

      - name: Verify references exist
        run: |
          python3 scripts/check-references.py
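The shell loop in the syntax step reports only the file name on failure. A small Python checker can also report the offending line number; the sketch below is one possible shape, with the hypothetical name `ndjson_errors`, and it additionally rejects lines that are valid JSON but not objects.

```python
import json


def ndjson_errors(text, name="<input>"):
    """Return 'file:line: message' strings for every invalid NDJSON line."""
    errors = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            value = json.loads(line)
        except json.JSONDecodeError as exc:
            errors.append(f"{name}:{number}: {exc.msg}")
            continue
        if not isinstance(value, dict):
            errors.append(f"{name}:{number}: expected a JSON object")
    return errors
```

In CI, a wrapper would read each `*.ndjson` file, call this function, print the messages, and exit non-zero if any list is non-empty.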

.github/workflows/deploy.yml:

name: Deploy to Kibana

on:
  push:
    branches: [main]
    paths:
      - 'dashboards/**'
      - 'data-views/**'
      - 'alert-rules/**'
      - 'connectors/**'

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - uses: actions/checkout@v4

      - name: Import to staging
        env:
          KIBANA_URL: ${{ secrets.KIBANA_URL_STAGING }}
          KIBANA_API_KEY: ${{ secrets.KIBANA_API_KEY_STAGING }}
        run: |
          for f in $(find data-views dashboards connectors alert-rules -name '*.ndjson'); do
            ./scripts/import.sh "$f" true
          done

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://kibana.example.com
    steps:
      - uses: actions/checkout@v4

      - name: Import to production
        env:
          KIBANA_URL: ${{ secrets.KIBANA_URL_PROD }}
          KIBANA_API_KEY: ${{ secrets.KIBANA_API_KEY_PROD }}
        run: |
          for f in $(find data-views dashboards connectors alert-rules -name '*.ndjson'); do
            ./scripts/import.sh "$f" true
          done

The GitHub environment named production can require manual approval before the deploy job runs (configure required reviewers on the environment). This gate matters: it keeps a bug in a staging dashboard from leaking into production.

Import order matters: data-views -> dashboards -> connectors -> alert-rules. The reason: a dashboard references data view IDs, and an alert rule can reference a dashboard ID and, as noted in Part 1, a connector. The wrong order means a failed import.
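If the pipeline ever moves from a shell loop to a script, the ordering rule can be made explicit instead of depending on the argument order passed to `find`. The sketch below ranks files by their top-level directory from the repo layout in Part 2. It assumes connectors go before alert rules, since Part 1 states that alert rules reference connectors; `IMPORT_ORDER` and `sort_for_import` are hypothetical names.

```python
from pathlib import PurePosixPath

# Dependencies first: an alert rule may reference a dashboard and a connector.
IMPORT_ORDER = ["data-views", "dashboards", "connectors", "alert-rules"]


def sort_for_import(paths):
    """Sort repo-relative paths so that dependency directories come first."""
    rank = {name: i for i, name in enumerate(IMPORT_ORDER)}

    def key(path):
        top_level = PurePosixPath(path).parts[0]
        # Unknown directories go last; ties are broken alphabetically.
        return (rank.get(top_level, len(rank)), path)

    return sorted(paths, key=key)
```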

Part 6: Drift detection

A classic problem: a developer edits a dashboard in the staging UI without committing the change back to Git. The next pipeline deploy overwrites the dashboard with the old version. Arguments erupt.

The drift detection pattern: run a scheduled job that compares the local NDJSON with the cluster state:

scripts/diff.sh:

#!/usr/bin/env bash
set -euo pipefail

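KIBANA_URL="${1:?Usage: $0 <kibana-url> <api-key>}"
API_KEY="${2:?Usage: $0 <kibana-url> <api-key>}"
LOCAL_DIR="dashboards"

TEMP_DIR=$(mktemp -d)
trap 'rm -rf "$TEMP_DIR"' EXIT  # remove downloaded exports when the script exits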

for local_file in $(find "$LOCAL_DIR" -name '*.ndjson'); do
  DASHBOARD_ID=$(jq -r 'select(.type=="dashboard") | .id' "$local_file" | head -1)

  if [ -z "$DASHBOARD_ID" ]; then
    continue
  fi

  REMOTE_FILE="${TEMP_DIR}/$(basename "$local_file")"

  curl -sS \
    -H "Authorization: ApiKey ${API_KEY}" \
    -H "kbn-xsrf: true" \
    -H "Content-Type: application/json" \
    -X POST "${KIBANA_URL}/api/saved_objects/_export" \
    -d "{\"objects\":[{\"type\":\"dashboard\",\"id\":\"${DASHBOARD_ID}\"}],\"includeReferencesDeep\":true}" \
    | jq -c 'del(.updated_at, .created_at, .version)' \
    > "$REMOTE_FILE"

  if ! diff -q <(sort "$local_file") <(sort "$REMOTE_FILE") > /dev/null; then
    echo "DRIFT detected in $local_file"
    diff <(sort "$local_file") <(sort "$REMOTE_FILE") | head -20
  fi
done

Run it hourly or daily and alert Slack when drift appears. In one incident on a past internal project, a developer edited a dashboard on prod at 11 PM to fix a demo; the drift detector pinged Slack 12 minutes later, everyone knew, and the team rolled back in time.
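A line-based diff says that something changed but not which object. As an alternative sketch, the comparison can be keyed by (type, id) so the Slack alert names the affected objects. The function `drift` is hypothetical; it takes the local and remote NDJSON as strings, ignores the same three volatile fields, and compares parsed objects so key order does not matter.

```python
import json

VOLATILE_FIELDS = ("updated_at", "created_at", "version")


def _index(text):
    """Map (type, id) -> object with volatile fields removed."""
    objects = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        obj = json.loads(line)
        if "type" not in obj or "id" not in obj:
            continue  # skip lines that are not saved objects
        for field in VOLATILE_FIELDS:
            obj.pop(field, None)
        objects[(obj["type"], obj["id"])] = obj
    return objects


def drift(local_text, remote_text):
    """Report objects that exist on one side only or differ between sides."""
    local, remote = _index(local_text), _index(remote_text)
    shared = set(local) & set(remote)
    return {
        "only_local": sorted(set(local) - set(remote)),
        "only_remote": sorted(set(remote) - set(local)),
        "changed": sorted(k for k in shared if local[k] != remote[k]),
    }
```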

Part 7: Real-world pitfalls

Pitfall 1: ID conflicts between staging and prod

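A developer creates a new dashboard on staging with an auto-generated UUID, for example 8f7e6d5c-.... They export, commit, and deploy to prod. Bug: the dashboard ID matches a different production dashboard that already exists under the same UUID, and an import with overwrite=true replaces it. Two independently generated UUIDs colliding by chance is vanishingly unlikely, so this can usually be traced to one environment having been copied or seeded from the other at some point.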

Fix: before committing a new dashboard, give it an explicit, stable ID. Option 1: create the dashboard through the API instead of the GUI so you control the ID:

curl -X POST "${KIBANA_URL}/api/saved_objects/dashboard/error-overview-prod-stable" \
  -H "kbn-xsrf: true" \
  -H "Content-Type: application/json" \
  -H "Authorization: ApiKey ${API_KEY}" \
  -d '{...}'

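Option 2: edit the ID in the NDJSON after export and use a meaningful slug. Update every reference that points at the old ID in the same pass, or the file will contain dangling references.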
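For the second option, a rough sketch of the rename is shown below. The helper `rename_id` is hypothetical; it rewrites both the object's own `id` and any entry in other objects' `references` arrays that points at it. IDs embedded anywhere else, for example inside stringified JSON in `attributes`, are not touched, so review the result before committing.

```python
import json


def rename_id(ndjson_text, obj_type, old_id, new_id):
    """Rename one saved object and every reference that points at it."""
    lines = []
    for line in ndjson_text.splitlines():
        if not line.strip():
            continue
        obj = json.loads(line)
        if obj.get("type") == obj_type and obj.get("id") == old_id:
            obj["id"] = new_id
        for ref in obj.get("references", []):
            if ref.get("type") == obj_type and ref.get("id") == old_id:
                ref["id"] = new_id
        lines.append(json.dumps(obj, separators=(",", ":"), ensure_ascii=False))
    return "\n".join(lines) + "\n"
```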

Pitfall 2: Dangling references after a merge

Two developers create a Lens with the same ID on two different branches. After the merge, one Lens overwrites the other, and the dashboard that referenced the old Lens is broken.

Fix: validate references in CI. A Python script checks that every referenced ID exists somewhere in the repo:

import json
import sys
from pathlib import Path

all_ids = set()
all_refs = set()

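for ndjson in Path('.').rglob('*.ndjson'):
    with ndjson.open(encoding='utf-8') as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            obj = json.loads(line)
            if 'type' not in obj or 'id' not in obj:
                continue  # skip the export summary line Kibana appends
            all_ids.add((obj['type'], obj['id']))
            for ref in obj.get('references', []):
                all_refs.add((ref['type'], ref['id']))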

missing = all_refs - all_ids
if missing:
    print(f"Missing references: {missing}")
    sys.exit(1)
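The script above catches references to objects that do not exist, but not the scenario that opens this pitfall, where two objects share an ID and differ in content. A complementary check is sketched below. `conflicting_ids` is a hypothetical helper that takes a mapping of file name to NDJSON text; identical copies of a shared object (for example the same data view bundled into several dashboard files) are not flagged, only copies whose content differs.

```python
import json
from collections import defaultdict


def conflicting_ids(files):
    """Map (type, id) -> file names, for IDs whose content differs across files."""
    variants = defaultdict(lambda: defaultdict(list))  # key -> canonical JSON -> files
    for name, text in files.items():
        for line in text.splitlines():
            if not line.strip():
                continue
            obj = json.loads(line)
            if "type" not in obj or "id" not in obj:
                continue  # skip lines that are not saved objects
            canonical = json.dumps(obj, sort_keys=True)
            variants[(obj["type"], obj["id"])][canonical].append(name)
    return {
        key: sorted(n for names in by_content.values() for n in names)
        for key, by_content in variants.items()
        if len(by_content) > 1
    }
```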

Pitfall 3: Insufficient API key permissions

The CI deploy fails with an authorization error (Unauthorized or Forbidden) even though the API key is set. The reason: the key only has read access on .kibana instead of manage_saved_objects.

Fix: create the API key with a sufficiently broad role descriptor:

{
  "kibana_admin_ci": {
    "cluster": ["monitor"],
    "indices": [{"names": [".kibana*"], "privileges": ["all"]}],
    "applications": [{
      "application": "kibana-.kibana",
      "privileges": ["all"],
      "resources": ["*"]
    }]
  }
}

Keep the CI key separate from the keys developers use day to day, and rotate it on its own schedule.
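To make the key reproducible, the request body for Elasticsearch's create API key endpoint (`POST /_security/api_key`) can be generated rather than typed by hand. The sketch below only builds the JSON body; the helper name `api_key_request` and the key name `kibana-ci-deploy` are hypothetical, the role descriptor is the one shown above, and the 90-day default mirrors the expiry suggested in the checklist in Part 8.

```python
import json


def api_key_request(name, role_descriptors, expiration="90d"):
    """Build the JSON body for POST /_security/api_key."""
    body = {
        "name": name,
        "expiration": expiration,  # the key stops working after this period
        "role_descriptors": role_descriptors,
    }
    return json.dumps(body, indent=2)


# Role descriptor copied from the example above.
CI_ROLE = {
    "kibana_admin_ci": {
        "cluster": ["monitor"],
        "indices": [{"names": [".kibana*"], "privileges": ["all"]}],
        "applications": [
            {"application": "kibana-.kibana", "privileges": ["all"], "resources": ["*"]}
        ],
    }
}

if __name__ == "__main__":
    print(api_key_request("kibana-ci-deploy", CI_ROLE))
```

Keeping this script in the repo documents exactly which privileges the CI key was created with, which helps when the key is rotated.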

Part 8: Production checklist

Before declaring this workflow “ready”:

The repo has a README explaining the process
Export/import scripts have error handling
CI validates JSON syntax and reference integrity
Production deploys are gated by manual approval
Drift detection runs on a schedule
API keys are narrowly scoped and expire after 90 days
Secrets live in GitHub Actions Secrets, never hardcoded
Rollback plan: revert the PR and redeploy
Audit log: who merged which PR, and when

Cheatsheet

Task	Command
Export one dashboard	`./scripts/export.sh <id> file.ndjson`
Import with overwrite	`./scripts/import.sh file.ndjson true`
Strip volatile fields	`jq -c 'del(.updated_at, .created_at, .version)'`
List all dashboards	`curl -H "Authorization: ApiKey ${KEY}" "${URL}/api/saved_objects/_find?type=dashboard"`
Bulk delete objects	`curl -X POST -H "Authorization: ApiKey ${KEY}" -H "kbn-xsrf: true" -H "Content-Type: application/json" "${URL}/api/saved_objects/_bulk_delete" -d '[...]'`
Validate NDJSON syntax	`while IFS= read -r line; do echo "$line" | jq empty; done < file.ndjson`
Import order	data-views -> dashboards -> connectors -> alert-rules

Closing thoughts

Treating saved objects as source code is not over-engineering. It is the bare minimum if you do not want to lose sleep over every ES upgrade or disk failure. Every dashboard is a deliverable, and deliverables must be versioned.

The next article in the Kibana from A to Z series digs into the Kibana API: bulk-creating users, mass-updating dashboards, writing multi-step automation scripts, and wrapping the API in an internal CLI for the team. Once you have config-as-code, the next step is automating every repetitive task.