Monitoring & Observability

AxonFlow provides built-in monitoring to help you understand system health, performance, and usage patterns. This guide covers the monitoring capabilities available in the open-source deployment.

Overview

AxonFlow exposes metrics and health endpoints that integrate with standard monitoring tools:

  • Health Endpoints - Check service status
  • Prometheus Metrics - Detailed performance data
  • Structured Logs - Request and error tracking

Health Endpoints

Agent Health

curl https://YOUR_AGENT_ENDPOINT/health

Response:

{
  "status": "healthy",
  "version": "1.0.0",
  "uptime_seconds": 86400,
  "checks": {
    "database": "ok",
    "redis": "ok"
  }
}
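
For scripted checks (cron jobs, simple uptime monitors), the same endpoint can be polled and the status field inspected. A minimal sketch, assuming curl and jq are installed and with AGENT_ENDPOINT as a placeholder you supply:

#!/usr/bin/env bash
# Minimal health-probe sketch for the agent endpoint shown above.
# AGENT_ENDPOINT is a placeholder, not a fixed AxonFlow setting.
AGENT_ENDPOINT="${AGENT_ENDPOINT:-https://YOUR_AGENT_ENDPOINT}"

status=$(curl -fsS --max-time 5 "$AGENT_ENDPOINT/health" | jq -r '.status' 2>/dev/null)

if [ "$status" = "healthy" ]; then
  echo "agent healthy"
  exit 0
else
  echo "agent unhealthy or unreachable (status: ${status:-none})" >&2
  exit 1
fi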

Orchestrator Health

curl https://YOUR_AGENT_ENDPOINT/orchestrator/health

Response:

{
  "status": "healthy",
  "components": {
    "llm_router": true,
    "planning_engine": true
  }
}

Health Check Integration

Use these endpoints for:

  • Load balancer health checks - ALB/NLB target health
  • Kubernetes probes - Liveness and readiness
  • Uptime monitoring - External monitoring services

Example: Kubernetes Probe Configuration

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Prometheus Metrics

AxonFlow exposes Prometheus-compatible metrics at the /metrics endpoint.

Enabling Metrics

Metrics are enabled by default. Access them at:

curl https://YOUR_AGENT_ENDPOINT/metrics
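
The response uses the plain-text Prometheus exposition format. To spot-check a single series from the shell (metric names are listed in the tables below; the exact label set may vary between builds):

curl -s https://YOUR_AGENT_ENDPOINT/metrics | grep '^axonflow_requests_total'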

Key Metrics

Request Metrics

Metric | Type | Description
axonflow_requests_total | Counter | Total requests processed
axonflow_request_duration_seconds | Histogram | Request latency distribution
axonflow_requests_in_flight | Gauge | Currently processing requests

Policy Metrics

Metric | Type | Description
axonflow_policy_evaluations_total | Counter | Policy evaluations performed
axonflow_policy_evaluation_duration_seconds | Histogram | Policy evaluation latency
axonflow_policy_decisions | Counter | Decisions by result (allow/deny)

System Metrics

Metric | Type | Description
axonflow_database_connections | Gauge | Active database connections
axonflow_goroutines | Gauge | Active goroutines
axonflow_memory_bytes | Gauge | Memory usage

Prometheus Configuration

Add AxonFlow to your Prometheus scrape config:

scrape_configs:
  - job_name: 'axonflow-agent'
    static_configs:
      - targets: ['YOUR_AGENT_HOST:8080']
    metrics_path: /metrics
    scrape_interval: 15s

  - job_name: 'axonflow-orchestrator'
    static_configs:
      - targets: ['YOUR_ORCHESTRATOR_HOST:8081']
    metrics_path: /metrics
    scrape_interval: 15s
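
Before reloading Prometheus, the configuration can be validated with promtool, which ships alongside the Prometheus binary:

# Validate the scrape configuration before (re)starting Prometheus
promtool check config prometheus.yml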

Logging

AxonFlow outputs structured JSON logs for easy parsing and analysis.

Log Format

{
  "level": "info",
  "timestamp": "2025-11-26T10:30:00Z",
  "message": "Request processed",
  "request_id": "req-abc123",
  "duration_ms": 8,
  "user": "user@example.com",
  "action": "mcp:salesforce:query",
  "decision": "allow"
}

Log Levels

Level | Description
debug | Detailed debugging information
info | Normal operational messages
warn | Warning conditions
error | Error conditions

Configuring Log Level

Set via environment variable:

export LOG_LEVEL=info  # debug, info, warn, error
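
Because the logs are JSON, they are easy to slice with jq. A small sketch, assuming the log output has been captured to a file (axonflow.log is a placeholder name, not a fixed path):

# Show only error-level entries
jq -c 'select(.level == "error")' axonflow.log

# Show slow requests, keeping a few useful fields
jq -c 'select(.duration_ms != null and .duration_ms > 100) | {timestamp, request_id, action, duration_ms}' axonflow.log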

CloudWatch Logs (AWS)

When deployed on AWS, logs are automatically sent to CloudWatch:

Log Groups:
/ecs/{STACK_NAME}/agent
/ecs/{STACK_NAME}/orchestrator
/ecs/{STACK_NAME}/customer-portal

View logs:

aws logs tail /ecs/YOUR_STACK/agent --follow --region YOUR_REGION

Search for errors:

aws logs filter-log-events \
  --log-group-name /ecs/YOUR_STACK/agent \
  --filter-pattern "ERROR" \
  --start-time $(date -d '1 hour ago' +%s000)
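
The CLI returns a JSON document with an events array, so the results can be piped through jq for a quick summary (a sketch, assuming jq is installed):

# Count matching error events in the last hour
aws logs filter-log-events \
  --log-group-name /ecs/YOUR_STACK/agent \
  --filter-pattern "ERROR" \
  --start-time $(date -d '1 hour ago' +%s000) \
  | jq '.events | length'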

Basic Alerting

Set up alerts for these conditions:

Condition | Threshold | Severity
Service unhealthy | Health check fails for 1 min | Critical
High error rate | > 1% of requests | Warning
High latency | P99 > 100ms | Warning
Database connection errors | Any | Critical

Prometheus Alerting Rules

groups:
  - name: axonflow
    rules:
      - alert: AxonFlowDown
        expr: up{job="axonflow-agent"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "AxonFlow Agent is down"

      - alert: HighErrorRate
        expr: rate(axonflow_requests_total{status="error"}[5m]) / rate(axonflow_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"

      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(axonflow_request_duration_seconds_bucket[5m])) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency exceeds 100ms"

Docker Compose Monitoring Stack

For local development and testing, use this Docker Compose setup:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=7d'

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-data:/var/lib/grafana

volumes:
  grafana-data:

prometheus.yml:

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'axonflow-agent'
    static_configs:
      - targets: ['host.docker.internal:8080']

  - job_name: 'axonflow-orchestrator'
    static_configs:
      - targets: ['host.docker.internal:8081']

Start the stack:

docker-compose up -d
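
To confirm that Prometheus has discovered both scrape jobs, query its targets API (a quick check, assuming jq is installed):

curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'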

Access:

  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3000 (log in as admin with the password set via GF_SECURITY_ADMIN_PASSWORD, admin in the example above)

Useful Queries

Request Rate

rate(axonflow_requests_total[5m])

Error Rate

rate(axonflow_requests_total{status="error"}[5m]) / rate(axonflow_requests_total[5m])

P99 Latency

histogram_quantile(0.99, rate(axonflow_request_duration_seconds_bucket[5m]))

Policy Evaluation Time

histogram_quantile(0.95, rate(axonflow_policy_evaluation_duration_seconds_bucket[5m]))

Active Connections

axonflow_database_connections
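
The same PromQL can be run outside the Prometheus and Grafana UIs via the Prometheus HTTP API, which is useful for ad-hoc checks and scripts (shown here against the local stack from the previous section):

# Instant query for the current request rate via the Prometheus HTTP API
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(axonflow_requests_total[5m])' \
  | jq '.data.result'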

Best Practices

1. Monitor Key Indicators

Focus on these metrics:

  • Availability - Health check success rate
  • Latency - P50, P95, P99 response times
  • Error rate - Percentage of failed requests
  • Throughput - Requests per second

2. Set Appropriate Thresholds

Start with conservative thresholds and tune based on baseline:

  • Measure normal operation for 1-2 weeks
  • Set warning thresholds at 2x normal
  • Set critical thresholds at 5x normal

3. Include Context in Alerts

Alert messages should include:

  • What is happening
  • What the impact is
  • Link to runbook or dashboard

4. Test Your Monitoring

Periodically verify:

  • Alerts fire correctly
  • Dashboards show accurate data
  • Log aggregation is working


Enterprise Monitoring

Enterprise deployments include pre-configured Grafana dashboards, advanced alerting, LLM cost tracking, and comprehensive audit logging. Contact sales for details.