Monitoring & Observability

AxonFlow ships with a practical local observability stack so engineers can see policy activity, latency, token usage, and connector behavior while they build. In the Community Docker Compose setup, Prometheus and Grafana are started by default.

That observability story matters because governed AI systems are hard to trust if operators cannot answer basic runtime questions:

  • what is the request volume and latency trend?
  • are policies blocking too much or too little?
  • are connectors healthy?
  • which workloads are driving token usage and cost?
  • can we debug a bad workflow run without reconstructing everything by hand?

What the Platform Exposes

  • GET /health on the Agent (:8080)
  • GET /health on the Orchestrator (:8081)
  • GET /prometheus on both services for native Prometheus scraping
  • GET /metrics on both services for JSON/debug-style metrics output
  • Grafana on :3000
  • Prometheus on :9090
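
Before wiring up an external Prometheus, a quick shell spot check confirms that both metric surfaces respond. This is only a sketch using the ports listed above; jq is needed just for the JSON endpoint:

# Native Prometheus exposition format (what Prometheus scrapes)
curl -s http://localhost:8080/prometheus | head -n 20

# JSON/debug-style metrics output
curl -s http://localhost:8081/metrics | jq . | head -n 40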

Health Endpoints

Agent

curl -s http://localhost:8080/health | jq .

Typical fields:

{
  "status": "healthy",
  "service": "axonflow-agent",
  "version": "5.4.0",
  "capabilities": [],
  "sdk_compatibility": {}
}

Orchestrator

curl -s http://localhost:8081/health | jq .

Typical fields:

{
  "status": "healthy",
  "service": "axonflow-orchestrator",
  "version": "5.4.0",
  "components": {
    "policy_engine": true,
    "llm_router": true,
    "response_processor": true,
    "audit_logger": true,
    "workflow_engine": true
  }
}

Use these endpoints for:

  • container and load-balancer health checks
  • readiness probes
  • fast smoke tests after config changes

Health checks are useful, but they are not enough by themselves. Mature teams pair them with request, policy, token, and connector metrics so they can see both availability and behavior.
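
For the smoke-test case, a small shell loop over the two /health endpoints is usually enough. The snippet below is a minimal sketch, not a shipped tool; it relies only on the status field shown above and assumes curl and jq are available. Save it as a script and run it after config changes:

# Exit non-zero if either service does not report "healthy"
for url in http://localhost:8080/health http://localhost:8081/health; do
  status=$(curl -sf "$url" | jq -r .status)
  if [ "$status" != "healthy" ]; then
    echo "unhealthy: $url (status=$status)" >&2
    exit 1
  fi
  echo "ok: $url"
done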

Prometheus Scraping

Prometheus scraping should target /prometheus, not /metrics.

scrape_configs:
  - job_name: 'axonflow-agent'
    static_configs:
      - targets: ['YOUR_AGENT_HOST:8080']
    metrics_path: /prometheus
    scrape_interval: 15s

  - job_name: 'axonflow-orchestrator'
    static_configs:
      - targets: ['YOUR_ORCHESTRATOR_HOST:8081']
    metrics_path: /prometheus
    scrape_interval: 15s

The local Docker Compose file already provisions this for you via config/prometheus-local.yml.

For shared or longer-lived environments, teams usually keep the same scrape pattern but send the data to their existing Prometheus and Grafana estate rather than relying only on a laptop-local stack.
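
If that shared Grafana is provisioned from files, pointing it at whichever Prometheus scrapes these endpoints uses the standard Grafana datasource provisioning format. A minimal sketch; the host placeholder is yours to fill in and nothing here is AxonFlow-specific:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://YOUR_PROMETHEUS_HOST:9090
    isDefault: true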

High-Value Metrics

These are the metrics the bundled community dashboard uses today:

Metric                                                        Description
axonflow_agent_requests_total                                 Agent request volume
axonflow_agent_blocked_requests_total                         Agent-side blocks
axonflow_agent_policy_evaluations_total                       Agent policy evaluations
axonflow_agent_request_duration_milliseconds_bucket           Agent latency histogram
axonflow_gateway_precheck_requests_total                      Gateway pre-check traffic
axonflow_gateway_precheck_duration_milliseconds_bucket        Gateway pre-check latency
axonflow_gateway_llm_tokens_total                             Gateway token tracking
axonflow_gateway_llm_cost_usd_total                           Gateway cost tracking
axonflow_orchestrator_llm_calls_total                         Orchestrator provider traffic
axonflow_orchestrator_request_duration_milliseconds_bucket    Orchestrator latency
axonflow_connector_calls_total                                MCP connector traffic
axonflow_connector_duration_milliseconds_bucket               Connector latency
axonflow_connector_errors_total                               Connector error counts
axonflow_orchestrator_requests_total                          Orchestrator request volume
axonflow_orchestrator_blocked_requests_total                  Orchestrator-side blocks
axonflow_orchestrator_policy_evaluations_total                Orchestrator policy evaluations
axonflow_gateway_audit_requests_total                         Gateway audit call volume
axonflow_gateway_audit_duration_milliseconds                  Gateway audit latency
axonflow_gateway_audit_queued_total                           Queued audit entries (async mode)
axonflow_gateway_audit_fallback_total                         Audit entries written to disk fallback
axonflow_circuit_trips_total                                  Circuit breaker trip events (Enterprise)
axonflow_circuit_blocked_requests_total                       Requests blocked by circuit breaker (Enterprise)
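
Because the latency metrics are exported as _milliseconds_bucket histograms, standard PromQL functions apply directly. A few example queries as a sketch; only the metric names from the table above are assumed, plus the standard le histogram label:

# Agent request rate over the last 5 minutes
rate(axonflow_agent_requests_total[5m])

# Share of agent traffic blocked by policy
rate(axonflow_agent_blocked_requests_total[5m]) / rate(axonflow_agent_requests_total[5m])

# p95 orchestrator latency (milliseconds)
histogram_quantile(0.95, sum(rate(axonflow_orchestrator_request_duration_milliseconds_bucket[5m])) by (le))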

Logs

AxonFlow logs are useful for runtime debugging, especially when you are validating new policies, providers, or MCP connectors.

docker compose logs -f axonflow-agent
docker compose logs -f axonflow-orchestrator
docker compose logs -f prometheus
docker compose logs -f grafana
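
The log format itself is not documented on this page, so treat the following as a generic first-pass filter rather than an official pattern: pull a recent slice and grep for error-level noise.

# Last 200 lines from the orchestrator, filtered for likely problems
docker compose logs --tail 200 axonflow-orchestrator | grep -i error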

What to Watch First

When a team is adopting AxonFlow, the most useful first signals are:

  • request latency on agent and orchestrator
  • blocked vs allowed traffic
  • gateway pre-check latency
  • connector errors and latency
  • token usage and estimated LLM cost

Those signals tell you whether the control plane is trustworthy enough for real workloads.
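
The same metrics can back simple alerting once they are in Prometheus. Below is a minimal alerting-rule sketch in standard Prometheus rule-file syntax; the threshold and durations are illustrative, not values shipped with AxonFlow:

groups:
  - name: axonflow-basics
    rules:
      - alert: AxonFlowConnectorErrors
        expr: rate(axonflow_connector_errors_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "MCP connector errors observed over the last 10 minutes"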

What This Gives Senior Engineers

For senior and staff engineers, observability is one of the strongest review criteria: it shows whether AxonFlow merely enforces controls or also makes those controls operable.

If the platform can show:

  • latency
  • block rates
  • token and cost movement
  • connector health
  • audit-driven operational signals

then it is much easier to justify as the control plane for a serious AI application rather than a thin wrapper around provider APIs.

Community vs Higher Tiers

Community already gives teams enough observability to validate production behavior. Evaluation and Enterprise matter when organizations need longer retention, enterprise integrations, broader governance workflows, and procurement-friendly rollout.

Next Steps