Observability

Complete observability setup with Prometheus metrics, Grafana dashboards, and Jaeger distributed tracing.

Use Case

Monitor request rates, latencies, and errors
Visualize traffic patterns and health status
Trace requests across services
Alert on anomalies

Architecture

                    ┌─────────────────┐
                    │    Zentinel     │
                    │   :8080/:9090   │
                    └────────┬────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
         ▼                   ▼                   ▼
   ┌───────────┐      ┌───────────┐      ┌───────────┐
   │Prometheus │      │  Grafana  │      │  Jaeger   │
   │  :9091    │◄─────│  :3000    │      │  :16686   │
   └───────────┘      └───────────┘      └───────────┘

Configuration

Create zentinel.kdl:

// Observability Configuration
// Metrics, logging, and distributed tracing

system {
    worker-threads 0
    graceful-shutdown-timeout-secs 30
}

listeners {
    listener "http" {
        address "0.0.0.0:8080"
        protocol "http"
    }
}

routes {
    route "api" {
        matches {
            path-prefix "/api/"
        }
        upstream "backend"
    }
}

upstreams {
    upstream "backend" {
        target "127.0.0.1:3000"
        health-check {
            type "http" {
                path "/health"
            }
            interval-secs 10
        }
    }
}

observability {
    // Prometheus metrics endpoint
    metrics {
        enabled #true
        address "0.0.0.0:9090"
        path "/metrics"
    }

    // Structured JSON logging
    logging {
        level "info"
        format "json"
        access-log {
            enabled #true
            fields "method" "path" "status" "latency" "upstream" "client_ip"
        }
    }

    // OpenTelemetry tracing
    tracing {
        enabled #true
        service-name "zentinel"
        sample-rate 1.0  // Sample all requests (reduce in production)
        propagation "w3c"  // W3C Trace Context
        backend "otlp" {
            endpoint "http://jaeger:4317"
        }
    }
}

Prometheus Setup

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'zentinel'
    static_configs:
      - targets: ['zentinel:9090']
    metrics_path: /metrics

  - job_name: 'zentinel-agents'
    static_configs:
      - targets:
        - 'zentinel-waf:9091'
        - 'zentinel-auth:9092'
        - 'zentinel-ratelimit:9093'

Key Metrics

Metric	Type	Description
`zentinel_requests_total`	Counter	Total requests by route, method, status
`zentinel_request_duration_seconds`	Histogram	Request latency distribution
`zentinel_upstream_requests_total`	Counter	Requests per upstream target
`zentinel_upstream_latency_seconds`	Histogram	Upstream response times
`zentinel_upstream_health`	Gauge	Upstream health (1=healthy, 0=unhealthy)
`zentinel_connections_active`	Gauge	Active client connections
`zentinel_agent_duration_seconds`	Histogram	Agent processing time
`zentinel_agent_errors_total`	Counter	Agent errors by type

Useful PromQL Queries

# Request rate (requests per second)
rate(zentinel_requests_total[5m])

# Error rate (5xx responses)
rate(zentinel_requests_total{status=~"5.."}[5m]) / rate(zentinel_requests_total[5m])

# 95th percentile latency
histogram_quantile(0.95, rate(zentinel_request_duration_seconds_bucket[5m]))

# Upstream health status
zentinel_upstream_health

# Requests by route
sum by (route) (rate(zentinel_requests_total[5m]))

Grafana Dashboard

dashboard.json

{
  "title": "Zentinel Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [
        {
          "expr": "sum(rate(zentinel_requests_total[5m]))",
          "legendFormat": "Total"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "gauge",
      "targets": [
        {
          "expr": "sum(rate(zentinel_requests_total{status=~\"5..\"}[5m])) / sum(rate(zentinel_requests_total[5m])) * 100"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              {"value": 0, "color": "green"},
              {"value": 1, "color": "yellow"},
              {"value": 5, "color": "red"}
            ]
          },
          "unit": "percent"
        }
      }
    },
    {
      "title": "Latency (p50, p95, p99)",
      "type": "timeseries",
      "targets": [
        {
          "expr": "histogram_quantile(0.50, rate(zentinel_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p50"
        },
        {
          "expr": "histogram_quantile(0.95, rate(zentinel_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p95"
        },
        {
          "expr": "histogram_quantile(0.99, rate(zentinel_request_duration_seconds_bucket[5m]))",
          "legendFormat": "p99"
        }
      ]
    },
    {
      "title": "Upstream Health",
      "type": "stat",
      "targets": [
        {
          "expr": "zentinel_upstream_health",
          "legendFormat": "{{upstream}}/{{target}}"
        }
      ]
    }
  ]
}

Jaeger Tracing

docker-compose.yml

version: '3.8'

services:
  zentinel:
    image: ghcr.io/zentinelproxy/zentinel:latest
    ports:
      - "8080:8080"
      - "9090:9090"
    volumes:
      - ./zentinel.kdl:/etc/zentinel/zentinel.kdl
    environment:
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317

  jaeger:
    image: jaegertracing/all-in-one:1.50
    ports:
      - "16686:16686"  # UI
      - "4317:4317"    # OTLP gRPC
      - "4318:4318"    # OTLP HTTP
    environment:
      - COLLECTOR_OTLP_ENABLED=true

  prometheus:
    image: prom/prometheus:v2.47.0
    ports:
      - "9091:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:10.1.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards

Trace Context Propagation

Zentinel propagates trace context through requests:

# Incoming request with trace context
curl -H "traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01" \
     http://localhost:8080/api/users

Backend receives headers:

traceparent - W3C Trace Context
tracestate - Vendor-specific trace state
X-Request-Id - Zentinel request ID

Viewing Traces

Open Jaeger UI: http://localhost:16686
Select service: zentinel
Find traces by:
- Operation (route name)
- Tags (status, method, path)
- Duration
- Request ID

Alerting

Prometheus Alerting Rules

Create alerts.yml:

groups:
  - name: zentinel
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(zentinel_requests_total{status=~"5.."}[5m]))
          / sum(rate(zentinel_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, rate(zentinel_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is {{ $value }}s"

      - alert: UpstreamDown
        expr: zentinel_upstream_health == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Upstream target is down"
          description: "{{ $labels.upstream }}/{{ $labels.target }} is unhealthy"

      - alert: AgentErrors
        expr: rate(zentinel_agent_errors_total[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent errors detected"
          description: "Agent {{ $labels.agent }} has errors"

Log Aggregation

Structured Log Output

{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "info",
  "message": "request completed",
  "request_id": "abc123",
  "method": "GET",
  "path": "/api/users",
  "status": 200,
  "latency_ms": 45,
  "upstream": "backend",
  "client_ip": "192.168.1.100",
  "trace_id": "0af7651916cd43dd8448eb211c80319c"
}

Loki Integration

# loki configuration
scrape_configs:
  - job_name: zentinel
    static_configs:
      - targets:
          - localhost
        labels:
          job: zentinel
          __path__: /var/log/zentinel/*.log

Testing

Verify Metrics

curl http://localhost:9090/metrics | grep zentinel

Generate Test Traffic

# Install hey (HTTP load generator)
go install github.com/rakyll/hey@latest

# Generate load
hey -n 1000 -c 10 http://localhost:8080/api/users

Check Traces

# Make a traced request
curl -H "X-Request-Id: test-trace-123" http://localhost:8080/api/users

# Find in Jaeger by tag: request_id=test-trace-123

Next Steps

Security - Add WAF and auth monitoring
Microservices - Trace across services
Load Balancer - Monitor upstream distribution