Monitoring Setup

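This guide covers production monitoring and observability for Zentinel deployments: metrics, alerting, dashboards, health checks, logging, tracing, and SLO tracking.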

Metrics Endpoint

Zentinel exposes Prometheus metrics on the configured address:

observability {
    metrics {
        enabled #true
        address "0.0.0.0:9090"
        path "/metrics"
    }
}

Verify:

curl http://localhost:9090/metrics
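
Beyond confirming that the endpoint answers, the scrape output can be checked for the metric names this guide relies on. The following is a minimal Python sketch, not part of Zentinel: the `metric_names` helper and the `SAMPLE` payload are hypothetical, and in practice the text would come from the `curl` call above.

```python
def metric_names(exposition_text: str) -> set[str]:
    """Return the metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # A sample line is "<name>{labels} <value>" or "<name> <value>".
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


# Hypothetical sample of what the endpoint might return.
SAMPLE = """\
# HELP zentinel_requests_total Total requests
# TYPE zentinel_requests_total counter
zentinel_requests_total{route="default",method="GET",status="200"} 1027
zentinel_connections_active 12
"""

expected = {"zentinel_requests_total", "zentinel_connections_active"}
missing = expected - metric_names(SAMPLE)
print("missing metrics:", sorted(missing))
```

Any name left in `missing` points at a metric that is disabled, renamed, or not yet emitted because no traffic has reached that code path.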

Prometheus Setup

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Zentinel proxy
  - job_name: 'zentinel'
    static_configs:
      - targets: ['zentinel:9090']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):\d+'
        replacement: '${1}'

  # Zentinel agents
  - job_name: 'zentinel-agents'
    static_configs:
      - targets:
          - 'zentinel-waf:9091'
          - 'zentinel-auth:9092'
          - 'zentinel-ratelimit:9093'
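
The relabel rule on the `zentinel` job rewrites the `instance` label so that it carries only the host name. The sketch below mimics that single rule in Python to show its effect. `relabel_instance` is a hypothetical helper, and it assumes Prometheus's documented behavior of anchoring relabel regexes at both ends.

```python
import re


def relabel_instance(address: str) -> str:
    """Mimic the relabel rule: keep the host part of "host:port"."""
    # Relabel regexes are fully anchored, so fullmatch is the closest analogue.
    match = re.fullmatch(r"([^:]+):\d+", address)
    if match is None:
        # No match means no replacement; instance then defaults to the address.
        return address
    # '${1}' in the config refers to the first capture group.
    return match.group(1)


for target in ["zentinel:9090", "10.0.0.5:9090", "zentinel"]:
    print(target, "->", relabel_instance(target))
```

Note that agent targets in the second job have no such rule, so their `instance` label keeps the port.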

Docker Compose

services:
  prometheus:
    image: prom/prometheus:v2.47.0
    ports:
      - "9091:9090"  # host port 9091 -> Prometheus UI on container port 9090
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'

volumes:
  prometheus-data:

Kubernetes ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: zentinel
  labels:
    app: zentinel
spec:
  selector:
    matchLabels:
      app: zentinel
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics

Key Metrics

Request Metrics

Metric	Type	Description
`zentinel_requests_total`	Counter	Total requests by route, method, status
`zentinel_request_duration_seconds`	Histogram	Request latency distribution
`zentinel_request_size_bytes`	Histogram	Request body size
`zentinel_response_size_bytes`	Histogram	Response body size

Upstream Metrics

Metric	Type	Description
`zentinel_upstream_requests_total`	Counter	Requests per upstream target
`zentinel_upstream_latency_seconds`	Histogram	Upstream response time
`zentinel_upstream_health`	Gauge	Target health (1=healthy, 0=unhealthy)
`zentinel_upstream_connections_active`	Gauge	Active connections per upstream

Agent Metrics

Metric	Type	Description
`zentinel_agent_duration_seconds`	Histogram	Agent processing time
`zentinel_agent_errors_total`	Counter	Agent errors by type
`zentinel_agent_decisions_total`	Counter	Agent decisions (allow/block)

System Metrics

Metric	Type	Description
`zentinel_connections_active`	Gauge	Active client connections
`zentinel_connections_total`	Counter	Total connections
`process_cpu_seconds_total`	Counter	Total user and system CPU time, in seconds
`process_resident_memory_bytes`	Gauge	Resident memory size, in bytes

Essential PromQL Queries

Request Rate

# Requests per second
rate(zentinel_requests_total[5m])

# By route
sum by (route) (rate(zentinel_requests_total[5m]))

# By status code
sum by (status) (rate(zentinel_requests_total[5m]))

Error Rate

# 5xx error rate, as a percentage of all requests
sum(rate(zentinel_requests_total{status=~"5.."}[5m]))
/ sum(rate(zentinel_requests_total[5m])) * 100

# 4xx rate, as a percentage of all requests
sum(rate(zentinel_requests_total{status=~"4.."}[5m]))
/ sum(rate(zentinel_requests_total[5m])) * 100
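
As a worked example of the arithmetic behind these queries, the following Python sketch computes the same percentages from per-status request counts over one window. The counts are made up for illustration, and `error_rate_percent` is a hypothetical helper. Unlike PromQL, which yields NaN when dividing by zero, the sketch returns 0.0 for an idle window.

```python
def error_rate_percent(requests_by_status: dict[str, float], status_class: str) -> float:
    """Share of requests whose status starts with status_class, in percent."""
    total = sum(requests_by_status.values())
    if total == 0:
        return 0.0  # assumption: treat "no traffic" as 0% rather than NaN
    matching = sum(
        count
        for status, count in requests_by_status.items()
        # Equivalent to the anchored regex "5.." (or "4..") in the queries.
        if len(status) == 3 and status.startswith(status_class)
    )
    return matching / total * 100


# Hypothetical request counts observed during one 5-minute window.
window = {"200": 940, "404": 40, "500": 15, "503": 5}

print(f"5xx: {error_rate_percent(window, '5'):.1f}%")
print(f"4xx: {error_rate_percent(window, '4'):.1f}%")
```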

Latency

# 50th percentile, aggregated across all series
histogram_quantile(0.50, sum by (le) (rate(zentinel_request_duration_seconds_bucket[5m])))

# 95th percentile, aggregated across all series
histogram_quantile(0.95, sum by (le) (rate(zentinel_request_duration_seconds_bucket[5m])))

# 99th percentile, aggregated across all series
histogram_quantile(0.99, sum by (le) (rate(zentinel_request_duration_seconds_bucket[5m])))
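
Percentiles from a histogram are estimates: the result depends on where the bucket boundaries sit. The Python sketch below is a simplified model of the linear interpolation that Prometheus documents for classic histograms, meant only to show why. The bucket boundaries and counts are hypothetical, and edge cases such as negative boundaries are ignored.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]  # the +Inf bucket holds the total count
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Beyond the last finite bucket: report its upper bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative bucket counts for request duration in seconds.
BUCKETS = [(0.1, 50), (0.2, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]

for q in (0.50, 0.90, 0.999):
    print(f"p{q * 100:g}: {histogram_quantile(q, BUCKETS):.3f}s")
```

With these buckets, any quantile that lands in the `+Inf` bucket is reported as 1.0 s no matter how slow the requests really were, which is why bucket boundaries should bracket the latency targets you alert on.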

Upstream Health

# Unhealthy upstreams
zentinel_upstream_health == 0

# Upstream latency p95
histogram_quantile(0.95, rate(zentinel_upstream_latency_seconds_bucket[5m]))

Alerting Rules

alerts.yml

groups:
  - name: zentinel
    rules:
      # High error rate
      - alert: ZentinelHighErrorRate
        expr: |
          sum(rate(zentinel_requests_total{status=~"5.."}[5m]))
          / sum(rate(zentinel_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on Zentinel"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # High latency
      - alert: ZentinelHighLatency
        expr: |
          histogram_quantile(0.95, rate(zentinel_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency on Zentinel"
          description: "p95 latency is {{ $value | humanizeDuration }}"

      # Upstream down
      - alert: ZentinelUpstreamDown
        expr: zentinel_upstream_health == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Upstream target is down"
          description: "{{ $labels.upstream }}/{{ $labels.target }} is unhealthy"

      # No requests
      - alert: ZentinelNoTraffic
        expr: |
          sum(rate(zentinel_requests_total[5m])) == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "No traffic to Zentinel"
          description: "Zentinel has received no requests in 5 minutes"

      # Agent errors
      - alert: ZentinelAgentErrors
        expr: rate(zentinel_agent_errors_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent errors detected"
          description: "Agent {{ $labels.agent }} has errors"

      # High memory (resident set above 1 GiB)
      - alert: ZentinelHighMemory
        expr: |
          process_resident_memory_bytes > 1024 * 1024 * 1024
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Zentinel is using {{ $value | humanize1024 }}B of resident memory"
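
The `for:` clause in each rule means a single bad evaluation does not page anyone: the alert stays pending until the expression has been true continuously for the whole duration. The Python sketch below simulates that state machine under simplifying assumptions (fixed evaluation interval, no missed evaluations). `alert_states` is a hypothetical helper, and the durations are shortened for readability.

```python
def alert_states(condition_by_eval: list[bool], eval_interval_s: int, for_s: int) -> list[str]:
    """Return inactive/pending/firing for each rule evaluation."""
    states = []
    active_since = None  # time at which the condition last became true
    for i, condition in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not condition:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Hypothetical run: evaluate every 60s with "for: 2m".
history = [True, True, False, True, True, True]
for t, state in enumerate(alert_states(history, eval_interval_s=60, for_s=120)):
    print(f"t={t * 60:>3}s {state}")
```

The dip at the third evaluation restarts the timer, so the alert fires only at the last evaluation even though the condition was true for five of the six.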

Grafana Dashboards

Dashboard JSON

{
  "title": "Zentinel Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8},
      "targets": [{
        "expr": "sum(rate(zentinel_requests_total[5m]))",
        "legendFormat": "Requests/sec"
      }]
    },
    {
      "title": "Error Rate",
      "type": "gauge",
      "gridPos": {"x": 12, "y": 0, "w": 6, "h": 8},
      "targets": [{
        "expr": "sum(rate(zentinel_requests_total{status=~\"5..\"}[5m])) / sum(rate(zentinel_requests_total[5m])) * 100"
      }],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"value": 0, "color": "green"},
              {"value": 1, "color": "yellow"},
              {"value": 5, "color": "red"}
            ]
          },
          "unit": "percent"
        }
      }
    },
    {
      "title": "Latency",
      "type": "timeseries",
      "gridPos": {"x": 0, "y": 8, "w": 12, "h": 8},
      "targets": [
        {
          "expr": "histogram_quantile(0.50, rate(zentinel_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p50"
        },
        {
          "expr": "histogram_quantile(0.95, rate(zentinel_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p95"
        },
        {
          "expr": "histogram_quantile(0.99, rate(zentinel_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p99"
        }
      ]
    },
    {
      "title": "Upstream Health",
      "type": "stat",
      "gridPos": {"x": 12, "y": 8, "w": 6, "h": 8},
      "targets": [{
        "expr": "zentinel_upstream_health",
        "legendFormat": "{{upstream}}/{{target}}"
      }]
    }
  ]
}

Health Checks

Zentinel Health Endpoint

# Simple health check
curl http://localhost:9090/health

# Response
{"status": "healthy"}

Detailed Health

curl http://localhost:9090/health/detailed

# Response
{
  "status": "healthy",
  "upstreams": {
    "backend": {
      "healthy": 2,
      "unhealthy": 0,
      "targets": [
        {"address": "10.0.0.1:3000", "healthy": true},
        {"address": "10.0.0.2:3000", "healthy": true}
      ]
    }
  },
  "agents": {
    "waf": {"status": "connected"},
    "auth": {"status": "connected"}
  }
}
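
The detailed response can also drive an external readiness decision, for example in a deploy script. The following Python sketch applies one possible policy to a payload shaped like the response above: overall status healthy, at least one healthy target per upstream, and every agent connected. The policy and the `is_ready` helper are assumptions for illustration, not Zentinel behavior.

```python
import json


def is_ready(detailed: dict) -> bool:
    """Apply a simple readiness policy to a detailed health payload."""
    if detailed.get("status") != "healthy":
        return False
    # Every upstream needs at least one healthy target.
    for upstream in detailed.get("upstreams", {}).values():
        if upstream.get("healthy", 0) < 1:
            return False
    # Every configured agent must be connected.
    return all(
        agent.get("status") == "connected"
        for agent in detailed.get("agents", {}).values()
    )


# Payload copied from the example response, trimmed to the fields used.
RESPONSE = json.loads("""
{
  "status": "healthy",
  "upstreams": {"backend": {"healthy": 2, "unhealthy": 0}},
  "agents": {"waf": {"status": "connected"}, "auth": {"status": "connected"}}
}
""")

print("ready:", is_ready(RESPONSE))
```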

Kubernetes Probes

# Excerpt: metadata and the container image are omitted
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: zentinel
      livenessProbe:
        httpGet:
          path: /health
          port: 9090
        initialDelaySeconds: 5
        periodSeconds: 10
        failureThreshold: 3
      readinessProbe:
        httpGet:
          path: /health/detailed
          port: 9090
        initialDelaySeconds: 5
        periodSeconds: 5
        failureThreshold: 2

Logging

Structured Logging

system {
    worker-threads 0
}

listeners {
    listener "http" {
        address "0.0.0.0:8080"
        protocol "http"
    }
}

routes {
    route "default" {
        matches { path-prefix "/" }
        upstream "backend"
    }
}

upstreams {
    upstream "backend" {
        targets {
            target { address "127.0.0.1:3000" }
        }
    }
}

Log Output

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "message": "request completed",
  "request_id": "abc123",
  "method": "GET",
  "path": "/api/users",
  "status": 200,
  "latency_ms": 45,
  "upstream": "backend",
  "client_ip": "192.168.1.100"
}
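
Because each entry is a JSON object, logs can be summarized with a few lines of code before a full aggregation stack is in place. The Python sketch below counts requests per status class and collects the IDs of slow requests. It assumes one JSON object per line with the field names shown above; `summarize`, the 500 ms threshold, and the sample lines are hypothetical.

```python
import json
from collections import Counter


def summarize(lines, slow_ms=500):
    """Count requests per status class and collect slow request IDs."""
    by_class = Counter()
    slow_ids = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict) or "status" not in entry:
            continue  # not a request log entry
        by_class[f"{entry['status'] // 100}xx"] += 1
        if entry.get("latency_ms", 0) > slow_ms:
            slow_ids.append(entry.get("request_id"))
    return by_class, slow_ids


# Hypothetical log lines in the format shown above.
LINES = [
    '{"level": "info", "request_id": "abc123", "status": 200, "latency_ms": 45}',
    '{"level": "error", "request_id": "def456", "status": 502, "latency_ms": 1200}',
    "plain text line that is not JSON",
]

counts, slow = summarize(LINES)
print(dict(counts), slow)
```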

Log Aggregation with Loki

# promtail.yml
server:
  http_listen_port: 9080

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: zentinel
    static_configs:
      - targets:
          - localhost
        labels:
          job: zentinel
          __path__: /var/log/zentinel/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            status: status
      - labels:
          level:
          status:

Distributed Tracing

OpenTelemetry Configuration

system {
    worker-threads 0
}

listeners {
    listener "http" {
        address "0.0.0.0:8080"
        protocol "http"
    }
}

routes {
    route "default" {
        matches { path-prefix "/" }
        upstream "backend"
    }
}

upstreams {
    upstream "backend" {
        targets {
            target { address "127.0.0.1:3000" }
        }
    }
}

Jaeger Setup

services:
  jaeger:
    image: jaegertracing/all-in-one:1.50
    ports:
      - "16686:16686"  # UI
      - "4317:4317"    # OTLP gRPC
      - "4318:4318"    # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true

SLA Monitoring

SLI/SLO Dashboard

# Availability SLI (non-5xx responses)
sum(rate(zentinel_requests_total{status!~"5.."}[5m]))
/ sum(rate(zentinel_requests_total[5m]))

# Latency SLI (fraction of requests completing within 200 ms; requires an le="0.2" bucket)
sum(rate(zentinel_request_duration_seconds_bucket{le="0.2"}[5m]))
/ sum(rate(zentinel_request_duration_seconds_count[5m]))

# Error budget remaining (99.9% SLO over a 30-day window)
# 1 = budget untouched, 0 = budget spent, negative = SLO violated
1 - (
  (1 - sum(rate(zentinel_requests_total{status!~"5.."}[30d]))
       / sum(rate(zentinel_requests_total[30d])))
  / (1 - 0.999)
)
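
The error budget expression is easier to reason about with concrete numbers. The Python sketch below reproduces the same arithmetic for a 99.9% SLO. The availability figures are invented, and both helpers are hypothetical names used only for this illustration.

```python
def error_budget_remaining(availability: float, slo: float = 0.999) -> float:
    """Fraction of the error budget still unspent (1.0 = untouched)."""
    allowed_error = 1 - slo            # e.g. 0.1% of requests may fail
    observed_error = 1 - availability  # share of requests that did fail
    return 1 - observed_error / allowed_error


def allowed_downtime_minutes(slo: float = 0.999, window_days: int = 30) -> float:
    """Full-outage minutes the SLO tolerates within the window."""
    return window_days * 24 * 60 * (1 - slo)


# Hypothetical 30-day availability measurements.
for availability in (1.0, 0.9995, 0.998):
    remaining = error_budget_remaining(availability)
    print(f"availability {availability:.2%} -> budget remaining {remaining:.0%}")

print(f"allowed downtime: {allowed_downtime_minutes():.1f} minutes")
```

An availability of 99.95% spends half of the budget, and anything below 99.9% drives the result negative, which is a useful threshold for freezing risky deployments.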

Next Steps

Rolling Updates - Zero-downtime updates
Kubernetes - Cloud-native deployment
Docker Compose - Container orchestration