Monitoring & Observability
AxonFlow provides built-in monitoring to help you understand system health, performance, and usage patterns. This guide covers the monitoring capabilities available in the open-source deployment.
Overview
AxonFlow exposes metrics and health endpoints that integrate with standard monitoring tools:
- Health Endpoints - Check service status
- Prometheus Metrics - Detailed performance data
- Structured Logs - Request and error tracking
Health Endpoints
Agent Health
curl https://YOUR_AGENT_ENDPOINT/health
Response:
{
  "status": "healthy",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "checks": {
    "database": "ok",
    "redis": "ok"
  }
}
Orchestrator Health
curl https://YOUR_AGENT_ENDPOINT/orchestrator/health
Response:
{
  "status": "healthy",
  "components": {
    "llm_router": true,
    "planning_engine": true
  }
}
Health Check Integration
Use these endpoints for:
- Load balancer health checks - ALB/NLB target health (see the CLI sketch after the Kubernetes example)
- Kubernetes probes - Liveness and readiness
- Uptime monitoring - External monitoring services
Example: Kubernetes Probe Configuration
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
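For the load balancer case, the same /health path can back an ALB target group health check. A minimal AWS CLI sketch, assuming an existing target group with an HTTP health check; the ARN below is a placeholder:
aws elbv2 modify-target-group \
  --target-group-arn arn:aws:elasticloadbalancing:YOUR_REGION:YOUR_ACCOUNT_ID:targetgroup/axonflow-agent/abc123 \
  --health-check-path /health \
  --health-check-interval-seconds 15 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3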
Prometheus Metrics
AxonFlow exposes Prometheus-compatible metrics at the /metrics endpoint.
Enabling Metrics
Metrics are enabled by default. Access them at:
curl https://YOUR_AGENT_ENDPOINT/metrics
Key Metrics
Request Metrics
| Metric | Type | Description |
|---|---|---|
| axonflow_requests_total | Counter | Total requests processed |
| axonflow_request_duration_seconds | Histogram | Request latency distribution |
| axonflow_requests_in_flight | Gauge | Requests currently being processed |
Policy Metrics
| Metric | Type | Description |
|---|---|---|
| axonflow_policy_evaluations_total | Counter | Policy evaluations performed |
| axonflow_policy_evaluation_duration_seconds | Histogram | Policy evaluation latency |
| axonflow_policy_decisions | Counter | Decisions by result (allow/deny) |
System Metrics
| Metric | Type | Description |
|---|---|---|
| axonflow_database_connections | Gauge | Active database connections |
| axonflow_goroutines | Gauge | Active goroutines |
| axonflow_memory_bytes | Gauge | Memory usage in bytes |
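A scrape of /metrics returns standard Prometheus exposition format. The fragment below is only illustrative of the shape of the output; label sets and values will differ in your deployment:
axonflow_requests_total{status="success"} 10423
axonflow_requests_total{status="error"} 17
axonflow_request_duration_seconds_bucket{le="0.05"} 9801
axonflow_request_duration_seconds_bucket{le="0.1"} 10392
axonflow_request_duration_seconds_bucket{le="+Inf"} 10440
axonflow_request_duration_seconds_sum 212.4
axonflow_request_duration_seconds_count 10440
axonflow_requests_in_flight 3
axonflow_database_connections 12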
Prometheus Configuration
Add AxonFlow to your Prometheus scrape config:
scrape_configs:
  - job_name: 'axonflow-agent'
    static_configs:
      - targets: ['YOUR_AGENT_HOST:8080']
    metrics_path: /metrics
    scrape_interval: 15s
  - job_name: 'axonflow-orchestrator'
    static_configs:
      - targets: ['YOUR_ORCHESTRATOR_HOST:8081']
    metrics_path: /metrics
    scrape_interval: 15s
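Before reloading Prometheus, you can sanity-check the scrape configuration with promtool, which ships with Prometheus:
promtool check config /etc/prometheus/prometheus.yml
Once Prometheus is running, confirm both jobs report UP under Status > Targets in the Prometheus UI.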
Logging
AxonFlow outputs structured JSON logs for easy parsing and analysis.
Log Format
{
  "level": "info",
  "timestamp": "2025-11-26T10:30:00Z",
  "message": "Request processed",
  "request_id": "req-abc123",
  "duration_ms": 8,
  "user": "[email protected]",
  "action": "mcp:salesforce:query",
  "decision": "allow"
}
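Because the logs are JSON, they pipe cleanly into jq for ad-hoc analysis. A quick sketch for surfacing errors and slow requests from a local container; the container name is a placeholder for however you run the agent:
docker logs axonflow-agent 2>&1 | jq -c 'select(.level == "error")'
docker logs axonflow-agent 2>&1 | jq -c 'select(.duration_ms != null and .duration_ms > 100)'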
Log Levels
| Level | Description |
|---|---|
| debug | Detailed debugging information |
| info | Normal operational messages |
| warn | Warning conditions |
| error | Error conditions |
Configuring Log Level
Set via environment variable:
export LOG_LEVEL=info # debug, info, warn, error
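In containerized deployments, set the same variable on the service definition instead. For example, in Docker Compose (the service name here is an assumption):
services:
  agent:
    environment:
      - LOG_LEVEL=debug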
CloudWatch Logs (AWS)
When deployed on AWS, logs are automatically sent to CloudWatch:
Log Groups:
/ecs/{STACK_NAME}/agent
/ecs/{STACK_NAME}/orchestrator
/ecs/{STACK_NAME}/customer-portal
View logs:
aws logs tail /ecs/YOUR_STACK/agent --follow --region YOUR_REGION
Search for errors:
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "ERROR" \
--start-time $(date -d '1 hour ago' +%s000)
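For structured searches over the JSON fields, CloudWatch Logs Insights is often easier than filter patterns. A sketch of a query started from the CLI; adjust the log group name for your stack:
aws logs start-query \
  --log-group-name /ecs/YOUR_STACK/agent \
  --start-time $(date -d '1 hour ago' +%s) \
  --end-time $(date +%s) \
  --query-string 'fields @timestamp, request_id, message | filter level = "error" | sort @timestamp desc | limit 50'
The command returns a queryId; fetch the results with aws logs get-query-results --query-id <id>.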
Basic Alerting
Recommended Alerts
Set up alerts for these conditions:
| Condition | Threshold | Severity |
|---|---|---|
| Service unhealthy | Health check fails for 1 min | Critical |
| High error rate | > 1% of requests | Warning |
| High latency | P99 > 100ms | Warning |
| Database connection errors | Any | Critical |
Prometheus Alerting Rules
groups:
  - name: axonflow
    rules:
      - alert: AxonFlowDown
        expr: up{job="axonflow-agent"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "AxonFlow Agent is down"
      - alert: HighErrorRate
        expr: rate(axonflow_requests_total{status="error"}[5m]) / rate(axonflow_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(axonflow_request_duration_seconds_bucket[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency exceeds 100ms"
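These rules only evaluate inside Prometheus; delivering notifications requires Alertmanager. A minimal routing sketch that sends critical alerts to Slack, with the webhook URL, receiver names, and channel as placeholders:
route:
  receiver: default
  routes:
    - matchers:
        - severity="critical"
      receiver: slack-critical
receivers:
  - name: default
  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
        channel: '#alerts'
        title: '{{ .CommonAnnotations.summary }}'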
Docker Compose Monitoring Stack
For local development and testing, use this Docker Compose setup:
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=7d'
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana
volumes:
  grafana-data:
Create the prometheus.yml referenced in the volume mount above:
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'axonflow-agent'
    static_configs:
      - targets: ['host.docker.internal:8080']
  - job_name: 'axonflow-orchestrator'
    static_configs:
      - targets: ['host.docker.internal:8081']
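Note that host.docker.internal resolves automatically on Docker Desktop; on native Linux you typically need to map it yourself. One way is to add the following under the prometheus service in the compose file (requires Docker Engine 20.10+):
    extra_hosts:
      - "host.docker.internal:host-gateway"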
Start the stack:
docker-compose up -d
Access:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
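To skip manual datasource setup in the Grafana UI, the Prometheus datasource can be provisioned from a file. A sketch, assuming you mount it into the grafana service at /etc/grafana/provisioning/datasources/datasources.yml:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
The url works as-is inside the compose network because the Prometheus service is named prometheus.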
Useful Queries
Request Rate
rate(axonflow_requests_total[5m])
Error Rate
rate(axonflow_requests_total{status="error"}[5m]) / rate(axonflow_requests_total[5m])
P99 Latency
histogram_quantile(0.99, rate(axonflow_request_duration_seconds_bucket[5m]))
Policy Evaluation Time
histogram_quantile(0.95, rate(axonflow_policy_evaluation_duration_seconds_bucket[5m]))
Active Connections
axonflow_database_connections
Best Practices
1. Monitor Key Indicators
Focus on these metrics:
- Availability - Health check success rate
- Latency - P50, P95, P99 response times
- Error rate - Percentage of failed requests
- Throughput - Requests per second
2. Set Appropriate Thresholds
Start with conservative thresholds and tune them against a measured baseline (see the recording-rules sketch after this list):
- Measure normal operation for 1-2 weeks
- Set warning thresholds at 2x normal
- Set critical thresholds at 5x normal
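Prometheus recording rules are a convenient way to capture that baseline so thresholds can later be compared against it. A sketch; the rule names are arbitrary and the expressions mirror the queries in Useful Queries above:
groups:
  - name: axonflow-baselines
    rules:
      - record: job:axonflow_request_error_ratio:rate5m
        expr: rate(axonflow_requests_total{status="error"}[5m]) / rate(axonflow_requests_total[5m])
      - record: job:axonflow_request_duration_seconds:p99_5m
        expr: histogram_quantile(0.99, rate(axonflow_request_duration_seconds_bucket[5m]))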
3. Include Context in Alerts
Alert messages should include:
- What is happening
- What the impact is
- Link to runbook or dashboard
4. Test Your Monitoring
Periodically verify:
- Alerts fire correctly (a synthetic alert test is sketched below)
- Dashboards show accurate data
- Log aggregation is working
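One way to exercise the alerting path end to end is to post a synthetic alert directly to Alertmanager's v2 API and confirm it reaches your receiver. A sketch, assuming Alertmanager listens on localhost:9093:
curl -X POST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "MonitoringPipelineTest", "severity": "warning"}, "annotations": {"summary": "Synthetic test alert - safe to ignore"}}]'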
Next Steps
- Deployment Guide - Deploy with monitoring enabled
- Troubleshooting - Debug common issues
- Architecture Overview - Understand system components
Enterprise deployments include pre-configured Grafana dashboards, advanced alerting, LLM cost tracking, and comprehensive audit logging. Contact sales for details.