Guide to diagnosing and resolving common Zentinel issues.
Quick Diagnostics
Check Service Status
# Is Zentinel running?
ps aux | grep zentinel
systemctl status zentinel
# Check listening ports
ss -tlnp | grep zentinel
lsof -i :8080
# View recent logs
journalctl -u zentinel -n 100
tail -f /var/log/zentinel/error.log
Test Configuration
# Validate configuration
zentinel --test --config zentinel.kdl
# Test with verbose output
zentinel --test --verbose --config zentinel.kdl
Check Connectivity
# Test listener
curl -v http://localhost:8080/health
# Test upstream directly
curl -v http://backend-server:8080/health
# Check DNS resolution
dig backend.internal
Common Issues
Startup Failures
“Address already in use”
Error: Address already in use (os error 98)
Cause: Another process is using the port.
Solution:
# Find what's using the port
lsof -i :8080
# or
ss -tlnp | grep 8080
# Kill the process or change Zentinel's port
“Permission denied” on privileged ports
Error: Permission denied (os error 13)
Cause: Ports below 1024 require root or capabilities.
Solution:
# Option 1: Grant capability
sudo setcap cap_net_bind_service=+ep /usr/local/bin/zentinel
# Option 2: Use port >= 1024 and redirect
iptables -t nat -A PREROUTING -p tcp --dport 80 -j REDIRECT --to-port 8080
# Option 3: Use systemd socket activation
“Configuration file not found”
Error: Configuration error: Failed to load configuration file
Solution:
# Check file exists and permissions
ls -la /etc/zentinel/zentinel.kdl
# Verify path
zentinel --test --config /etc/zentinel/zentinel.kdl
Connection Issues
502 Bad Gateway
Symptoms: All requests return 502.
Diagnosis:
# Check upstream health
curl http://localhost:9090/admin/upstreams
# Test upstream directly
curl -v http://upstream-host:port/health
# Check logs for upstream errors
grep "upstream" /var/log/zentinel/error.log
Common causes:
- Upstream server not running
- Firewall blocking connection
- DNS resolution failure
- Wrong upstream address/port
Solutions:
# Verify upstream is accessible
nc -zv upstream-host 8080
# Check firewall
iptables -L -n | grep 8080
# Verify DNS
dig upstream.internal
503 Service Unavailable
Symptoms: Intermittent 503 errors.
Diagnosis:
# Check circuit breaker status
curl http://localhost:9090/admin/upstreams
# Check connection limits
curl http://localhost:9090/metrics | grep connections
Common causes:
- Circuit breaker open
- All upstreams unhealthy
- Connection limit reached
- Rate limit exceeded
Solutions:
// Increase connection limits
limits {
max-total-connections 20000
max-connections-per-client 200
}
// Adjust circuit breaker
routes {
route "api" {
circuit-breaker {
failure-threshold 10 // More tolerant
timeout-seconds 60 // Longer recovery
}
}
}
504 Gateway Timeout
Symptoms: Requests timeout after delay.
Diagnosis:
# Check upstream response time
time curl http://upstream-host:8080/endpoint
# Check timeout settings
grep timeout zentinel.kdl
Solutions:
// Increase timeouts for slow endpoints
routes {
route "slow-api" {
policies {
timeout-secs 120
}
}
}
upstreams {
upstream "backend" {
timeouts {
request-secs 120
read-secs 60
}
}
}
TLS/Certificate Issues
“Invalid certificate chain”
# Verify certificate
openssl x509 -in /etc/zentinel/certs/server.crt -noout -text
# Check certificate chain
openssl verify -CAfile ca.crt server.crt
# Test TLS connection
openssl s_client -connect localhost:443 -servername example.com
“Certificate expired”
# Check expiration
openssl x509 -in server.crt -noout -dates
# Check days until expiration
openssl x509 -in server.crt -noout -enddate
Key/cert mismatch
# Compare modulus
openssl x509 -noout -modulus -in server.crt | md5sum
openssl rsa -noout -modulus -in server.key | md5sum
# These should match
Performance Issues
High Latency
Diagnosis:
# Check P99 latency
curl -s http://localhost:9090/metrics | grep request_duration
# Profile request
curl -w "@curl-format.txt" http://localhost:8080/api/endpoint
curl-format.txt:
time_namelookup: %{time_namelookup}s\n
time_connect: %{time_connect}s\n
time_appconnect: %{time_appconnect}s\n
time_pretransfer: %{time_pretransfer}s\n
time_redirect: %{time_redirect}s\n
time_starttransfer: %{time_starttransfer}s\n
time_total: %{time_total}s\n
Common causes and solutions:
| Cause | Solution |
|---|---|
| DNS resolution slow | Use IP addresses or local DNS cache |
| TLS handshake slow | Enable session resumption |
| Connection establishment | Increase connection pool |
| Upstream slow | Add caching, optimize backend |
| Body too large | Stream instead of buffer |
High Memory Usage
Diagnosis:
# Check memory metrics
curl -s http://localhost:9090/metrics | grep memory
# Check process memory
ps aux | grep zentinel
cat /proc/$(pgrep zentinel)/status | grep Vm
Solutions:
// Reduce buffer sizes
limits {
max-body-buffer-bytes 524288 // 512KB
max-body-inspection-bytes 524288
}
// Reduce connection pool
upstreams {
upstream "backend" {
connection-pool {
max-connections 50
max-idle 10
}
}
}
// Set memory limit
limits {
max-memory-percent 70.0
}
High CPU Usage
Diagnosis:
# Check CPU metrics
curl -s http://localhost:9090/metrics | grep cpu
# Profile with perf (Linux)
perf top -p $(pgrep zentinel)
Solutions:
// Adjust worker threads
server {
worker-threads 4 // Match CPU cores
}
// Reduce logging
// Set RUST_LOG=warn in environment
// Disable unnecessary features
routes {
route "api" {
policies {
buffer-requests false
buffer-responses false
}
}
}
Debug Mode
Enable Debug Logging
# Via environment
RUST_LOG=debug zentinel --config zentinel.kdl
# Module-specific debug
RUST_LOG=zentinel::proxy=debug,zentinel::agents=trace zentinel --config zentinel.kdl
# Pretty format for development
ZENTINEL_LOG_FORMAT=pretty RUST_LOG=debug zentinel --config zentinel.kdl
Log Analysis
# Find errors
grep -i error /var/log/zentinel/*.log
# Find specific correlation ID
grep "2kF8xQw4BnM" /var/log/zentinel/*.log
# Count errors by type
grep "error" /var/log/zentinel/error.log | jq -r '.error_type' | sort | uniq -c
# Find slow requests (>1s)
jq 'select(.duration_ms > 1000)' /var/log/zentinel/access.log
Request Tracing
Every request has a correlation ID in X-Correlation-Id header:
# Make request and get correlation ID
curl -i http://localhost:8080/api/endpoint
# X-Correlation-Id: 2kF8xQw4BnM
# Search logs by ID
grep "2kF8xQw4BnM" /var/log/zentinel/*.log | jq .
Metrics Analysis
# Dump all metrics
curl http://localhost:9090/metrics > metrics.txt
# Check error rates
curl -s http://localhost:9090/metrics | grep -E "requests_total.*status=\"5"
# Check upstream health
curl -s http://localhost:9090/metrics | grep circuit_breaker
Health Check Failures
Zentinel Health Check
# Basic health
curl http://localhost:9090/health
# Detailed status
curl http://localhost:9090/status
Upstream Health Check Failures
Diagnosis:
# Check upstream status
curl http://localhost:9090/admin/upstreams
# Test health endpoint directly
curl -v http://upstream:8080/health
Common causes:
- Health endpoint returns non-200
- Health check timeout too short
- Health endpoint path wrong
- Upstream overloaded
Solutions:
upstreams {
upstream "backend" {
health-check {
type "http" {
path "/health" // Verify path
expected-status 200
}
timeout-secs 10 // Increase timeout
unhealthy-threshold 5 // More tolerant
}
}
}
Agent Issues
Agent Connection Failed
Agent error: auth - connection refused
Diagnosis:
# Check agent is running
ps aux | grep agent
# Check socket exists
ls -la /var/run/zentinel/*.sock
# Test socket connection
nc -U /var/run/zentinel/auth.sock
Solutions:
# Start agent
systemctl start zentinel-auth-agent
# Check socket permissions
chmod 660 /var/run/zentinel/auth.sock
chown zentinel:zentinel /var/run/zentinel/auth.sock
Agent Timeouts
Diagnosis:
# Check agent latency metrics
curl -s http://localhost:9090/metrics | grep agent_latency
# Check timeout count
curl -s http://localhost:9090/metrics | grep agent_timeout
Solutions:
agents {
agent "auth" {
timeout-ms 200 // Increase timeout
circuit-breaker {
failure-threshold 10 // More tolerant
}
}
}
Configuration Reload Issues
Reload Failed
# Check reload status
journalctl -u zentinel | grep -i reload
# Validate new config before reload
zentinel --test --config zentinel.kdl
# Manual reload
kill -HUP $(cat /var/run/zentinel.pid)
Config Validation Errors
# Get detailed validation errors
zentinel --test --verbose --config zentinel.kdl 2>&1
# Common issues:
# - Route references undefined upstream
# - Duplicate route/upstream IDs
# - Invalid regex in path-regex
# - Missing required fields
Getting Help
Collect Diagnostic Information
# System info
uname -a
cat /etc/os-release
# Zentinel version
zentinel --version
# Configuration (sanitized)
cat zentinel.kdl | grep -v -E "(key|password|secret)"
# Recent logs
journalctl -u zentinel --since "1 hour ago"
# Metrics snapshot
curl http://localhost:9090/metrics > metrics.txt
Log Locations
| Platform | Location |
|---|---|
| systemd | journalctl -u zentinel |
| Docker | docker logs zentinel |
| Kubernetes | kubectl logs -l app=zentinel |
| Custom | Check working-directory in config |
See Also
- Health Monitoring - Health checks and monitoring
- Metrics Reference - Available metrics
- Error Codes - Error types and codes