Metrics Reference

Prometheus metrics exposed by Zentinel for monitoring and alerting.

Metrics Endpoint

Metrics are available at the /metrics endpoint on the admin listener:

curl http://localhost:9090/metrics

Configure the admin listener and route /metrics to the built-in metrics handler:

listeners {
    listener "admin" {
        address "127.0.0.1:9090"
        protocol "http"
    }
}

routes {
    route "metrics" {
        matches {
            path "/metrics"
        }
        service-type "builtin"
        builtin-handler "metrics"
    }
}

Request Metrics

zentinel_request_duration_seconds

Request latency histogram.

Type | Labels | Description
Histogram | route, method | Request duration in seconds

Buckets: 1ms, 5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s

Example queries:

# Average latency by route
sum(rate(zentinel_request_duration_seconds_sum[5m])) by (route)
  / sum(rate(zentinel_request_duration_seconds_count[5m])) by (route)

# P99 latency across all routes and methods
histogram_quantile(0.99,
  sum(rate(zentinel_request_duration_seconds_bucket[5m])) by (le))

# P95 latency by route
histogram_quantile(0.95,
  sum(rate(zentinel_request_duration_seconds_bucket[5m])) by (le, route))
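
Because 250ms is one of the bucket boundaries listed above, the share of requests that complete within that bound can be read directly from the bucket counters, with no quantile interpolation. The query below is a sketch: it assumes the boundary is exported as le="0.25" (worth confirming against the /metrics output), and the 250ms target is only illustrative.

```promql
# Fraction of requests served in 250ms or less, by route
sum(rate(zentinel_request_duration_seconds_bucket{le="0.25"}[5m])) by (route)
  / sum(rate(zentinel_request_duration_seconds_count[5m])) by (route)
```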

zentinel_requests_total

Total request counter.

Type | Labels | Description
Counter | route, method, status | Total requests

Example queries:

# Requests per second
rate(zentinel_requests_total[5m])

# Error rate (5xx)
sum(rate(zentinel_requests_total{status=~"5.."}[5m]))
  / sum(rate(zentinel_requests_total[5m]))

# Success rate (2xx) by route
sum(rate(zentinel_requests_total{status=~"2.."}[5m])) by (route)
  / sum(rate(zentinel_requests_total[5m])) by (route)

zentinel_active_requests

Currently active requests.

Type | Labels | Description
Gauge | - | Number of in-flight requests

Example queries:

# Current active requests
zentinel_active_requests

# Alert if too high
zentinel_active_requests > 1000

zentinel_request_body_size_bytes

Request body size histogram.

Type | Labels | Description
Histogram | route | Request body size in bytes

Buckets: 100B, 1KB, 10KB, 100KB, 1MB, 10MB, 100MB

zentinel_response_body_size_bytes

Response body size histogram.

Type | Labels | Description
Histogram | route | Response body size in bytes
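
The two body size histograms can be queried the same way as the request duration histogram. The queries below are a sketch that assumes both metrics expose the standard _bucket, _sum, and _count series; the 5m window is illustrative.

```promql
# P95 request body size by route
histogram_quantile(0.95,
  sum(rate(zentinel_request_body_size_bytes_bucket[5m])) by (le, route))

# Average response body size by route
sum(rate(zentinel_response_body_size_bytes_sum[5m])) by (route)
  / sum(rate(zentinel_response_body_size_bytes_count[5m])) by (route)

# Response body bytes per second by route (approximate egress throughput)
sum(rate(zentinel_response_body_size_bytes_sum[5m])) by (route)
```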

Upstream Metrics

zentinel_upstream_attempts_total

Upstream connection attempts.

Type | Labels | Description
Counter | upstream, route | Total connection attempts

zentinel_upstream_failures_total

Upstream connection failures.

Type | Labels | Description
Counter | upstream, route, reason | Total failures

Reason values:

  • connection_refused - TCP connection refused
  • connection_timeout - Connection timed out
  • read_timeout - Read timeout
  • write_timeout - Write timeout
  • tls_error - TLS handshake failed
  • dns_error - DNS resolution failed

Example queries:

# Failure rate by upstream
sum(rate(zentinel_upstream_failures_total[5m])) by (upstream)
  / sum(rate(zentinel_upstream_attempts_total[5m])) by (upstream)

# Connection refused errors
sum(rate(zentinel_upstream_failures_total{reason="connection_refused"}[5m])) by (upstream)

zentinel_circuit_breaker_state

Circuit breaker state.

Type | Labels | Description
Gauge | component, route | State: 0=closed, 1=open

Example queries:

# Open circuit breakers
zentinel_circuit_breaker_state == 1

# Alert on circuit breaker open
zentinel_circuit_breaker_state{component="upstream"} == 1
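
Since the gauge only takes the values 0 and 1, range functions can show how often a breaker trips, not just whether it is open right now. A sketch with illustrative window sizes:

```promql
# Number of state changes in the last 15 minutes (high values suggest flapping)
changes(zentinel_circuit_breaker_state[15m])

# Fraction of the last hour each breaker spent open
avg_over_time(zentinel_circuit_breaker_state[1h])
```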

Agent Metrics

zentinel_agent_latency_seconds

Agent call latency histogram.

Type | Labels | Description
Histogram | agent, event | Agent call duration

Event values:

  • on_request_headers
  • on_request_body
  • on_response_headers
  • on_response_body

Example queries:

# P99 latency by agent
histogram_quantile(0.99,
  sum(rate(zentinel_agent_latency_seconds_bucket[5m])) by (le, agent))

# Average latency by agent
sum(rate(zentinel_agent_latency_seconds_sum[5m])) by (agent)
  / sum(rate(zentinel_agent_latency_seconds_count[5m])) by (agent)

zentinel_agent_timeouts_total

Agent call timeouts.

Type | Labels | Description
Counter | agent, event | Total timeouts

Example queries:

# Timeout rate by agent
sum(rate(zentinel_agent_timeouts_total[5m])) by (agent)

# Alert on high timeout rate
rate(zentinel_agent_timeouts_total[5m]) > 0.1

zentinel_blocked_requests_total

Requests blocked by agents/WAF.

Type | Labels | Description
Counter | reason | Total blocked requests

Reason values:

  • waf - Blocked by WAF
  • auth - Authentication failed
  • rate_limit - Rate limited
  • policy - Policy violation
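
The reason label makes it possible to break blocked traffic down by cause. A sketch of two such queries, using only the metric and reason values listed above; the 5m window is illustrative.

```promql
# Blocked requests per second by reason
sum(rate(zentinel_blocked_requests_total[5m])) by (reason)

# Share of all blocks attributed to the WAF
sum(rate(zentinel_blocked_requests_total{reason="waf"}[5m]))
  / sum(rate(zentinel_blocked_requests_total[5m]))
```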

Connection Pool Metrics

zentinel_connection_pool_size

Total connections in pool.

Type | Labels | Description
Gauge | upstream | Total connections

zentinel_connection_pool_idle

Idle connections in pool.

Type | Labels | Description
Gauge | upstream | Idle connections

zentinel_connection_pool_acquired_total

Connections acquired from pool.

Type | Labels | Description
Counter | upstream | Total acquisitions

Example queries:

# Pool utilization
(zentinel_connection_pool_size - zentinel_connection_pool_idle)
  / zentinel_connection_pool_size

# Connection acquisition rate
rate(zentinel_connection_pool_acquired_total[5m])

TLS Metrics

zentinel_tls_handshake_duration_seconds

TLS handshake duration.

Type | Labels | Description
Histogram | version | Handshake duration

Version values: TLS1.2, TLS1.3
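
The version label allows handshake cost and protocol mix to be compared side by side. The queries below are a sketch that assumes the histogram exposes the standard _bucket and _count series; the 5m window is illustrative.

```promql
# P99 handshake duration by TLS version
histogram_quantile(0.99,
  sum(rate(zentinel_tls_handshake_duration_seconds_bucket[5m])) by (le, version))

# Handshakes per second by TLS version
sum(rate(zentinel_tls_handshake_duration_seconds_count[5m])) by (version)

# Share of handshakes still negotiated with TLS 1.2
sum(rate(zentinel_tls_handshake_duration_seconds_count{version="TLS1.2"}[5m]))
  / sum(rate(zentinel_tls_handshake_duration_seconds_count[5m]))
```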

System Metrics

zentinel_memory_usage_bytes

Process memory usage.

Type | Labels | Description
Gauge | - | Memory usage in bytes

zentinel_cpu_usage_percent

CPU usage percentage.

Type | Labels | Description
Gauge | - | CPU usage 0-100

zentinel_open_connections

Open connections count.

Type | Labels | Description
Gauge | - | Number of open connections
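
As gauges, the system metrics are best read through trend and smoothing functions instead of rate(). A sketch, with windows and thresholds chosen only for illustration:

```promql
# Memory growth in bytes per second over the last hour
deriv(zentinel_memory_usage_bytes[1h])

# Projected memory usage four hours from now, by linear extrapolation
predict_linear(zentinel_memory_usage_bytes[1h], 4 * 3600)

# Sustained CPU usage above 80%
avg_over_time(zentinel_cpu_usage_percent[10m]) > 80
```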

Prometheus Configuration

Basic Scrape Config

This example assumes Prometheus runs on the same host as Zentinel. Prometheus itself listens on port 9090 by default, so in that setup one of the two must be moved to a different port.

scrape_configs:
  - job_name: 'zentinel'
    static_configs:
      - targets: ['localhost:9090']
    scrape_interval: 15s
    metrics_path: /metrics

With Service Discovery

The relabel rules keep only pods labeled app=zentinel whose container port is named metrics. For the scrape to succeed, the admin listener must bind to an address Prometheus can reach, not 127.0.0.1.

scrape_configs:
  - job_name: 'zentinel'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: zentinel
        action: keep
      - source_labels: [__meta_kubernetes_pod_container_port_name]
        regex: metrics
        action: keep

Alerting Rules

Example Alerts

groups:
  - name: zentinel
    rules:
      # High error rate
      - alert: ZentinelHighErrorRate
        expr: |
          sum(rate(zentinel_requests_total{status=~"5.."}[5m]))
          / sum(rate(zentinel_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on Zentinel"
          description: "Error rate is {{ $value | humanizePercentage }}"

      # Circuit breaker open
      - alert: ZentinelCircuitBreakerOpen
        expr: zentinel_circuit_breaker_state == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker open"
          description: "Circuit breaker open for {{ $labels.component }}"

      # High latency
      - alert: ZentinelHighLatency
        expr: |
          histogram_quantile(0.99,
            rate(zentinel_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency"
          description: "P99 latency is {{ $value }}s"

      # Agent timeouts
      - alert: ZentinelAgentTimeouts
        expr: rate(zentinel_agent_timeouts_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent timeouts detected"
          description: "Agent {{ $labels.agent }} timing out"

      # No healthy upstreams
      - alert: ZentinelNoHealthyUpstreams
        expr: |
          sum(zentinel_circuit_breaker_state{component="upstream"})
          == count(zentinel_circuit_breaker_state{component="upstream"})
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No healthy upstreams"

Grafana Dashboard

Key panels for a Zentinel dashboard:

  1. Request Rate - rate(zentinel_requests_total[5m])
  2. Error Rate - 5xx / total
  3. Latency P50/P95/P99 - histogram_quantile
  4. Active Requests - zentinel_active_requests
  5. Upstream Health - circuit breaker states
  6. Agent Latency - agent_latency histogram
  7. Connection Pool - size vs idle
  8. Memory/CPU - system metrics

See Also