Monitoring & Observability
AxonFlow ships with a practical local observability stack so engineers can see policy activity, latency, token usage, and connector behavior while they build. In the Community Docker Compose setup, Prometheus and Grafana are started by default.
That observability story matters because governed AI systems are hard to trust if operators cannot answer basic runtime questions:
- what is the request volume and latency trend?
- are policies blocking too much or too little?
- are connectors healthy?
- which workloads are driving token usage and cost?
- can we debug a bad workflow run without reconstructing everything by hand?
What the Platform Exposes
- GET /health on the Agent (:8080)
- GET /health on the Orchestrator (:8081)
- GET /prometheus on both services for native Prometheus scraping
- GET /metrics on both services for JSON/debug-style metrics output
- Grafana on :3000
- Prometheus on :9090
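A quick way to confirm both formats are live on a local stack (ports assume the default Compose setup):
curl -s http://localhost:8080/prometheus | head -n 5
curl -s http://localhost:8080/metrics | jq .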
Health Endpoints
Agent
curl -s http://localhost:8080/health | jq .
Typical fields:
{
"status": "healthy",
"service": "axonflow-agent",
"version": "5.4.0",
"capabilities": [],
"sdk_compatibility": {}
}
Orchestrator
curl -s http://localhost:8081/health | jq .
Typical fields:
{
"status": "healthy",
"service": "axonflow-orchestrator",
"version": "5.4.0",
"components": {
"policy_engine": true,
"llm_router": true,
"response_processor": true,
"audit_logger": true,
"workflow_engine": true
}
}
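For scripted smoke tests, jq can assert on these payloads. A minimal sketch, assuming the components map shown above:
# exits non-zero if any component reports false
curl -fsS http://localhost:8081/health | jq -e '[.components[]] | all'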
Use these endpoints for:
- container and load-balancer health checks
- readiness probes
- fast smoke tests after config changes
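As one concrete sketch, a Kubernetes readiness probe against the Orchestrator might look like this (timings are illustrative, not shipped defaults):
readinessProbe:
  httpGet:
    path: /health
    port: 8081
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3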
Health checks are useful, but they are not enough by themselves. Mature teams pair them with request, policy, token, and connector metrics so they can see both availability and behavior.
Prometheus Scraping
Prometheus scraping should target /prometheus, not /metrics: /metrics returns the JSON/debug-style output, while /prometheus serves the exposition format Prometheus expects.
scrape_configs:
- job_name: 'axonflow-agent'
static_configs:
- targets: ['YOUR_AGENT_HOST:8080']
metrics_path: /prometheus
scrape_interval: 15s
- job_name: 'axonflow-orchestrator'
static_configs:
- targets: ['YOUR_ORCHESTRATOR_HOST:8081']
metrics_path: /prometheus
scrape_interval: 15s
The local Docker Compose file already provisions this for you via config/prometheus-local.yml.
For shared or longer-lived environments, teams usually keep the same scrape pattern but send the data to their existing Prometheus and Grafana estate rather than relying only on a laptop-local stack.
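A common way to do that is remote_write from the local scrape config to the central backend. A minimal sketch, with a placeholder URL; auth and TLS settings depend on your estate:
remote_write:
  - url: https://prometheus.example.internal/api/v1/write
    # add basic_auth / tls_config here as your backend requires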
High-Value Metrics
These are the metrics the bundled community dashboard uses today:
| Metric | Description |
|---|---|
| axonflow_agent_requests_total | Agent request volume |
| axonflow_agent_blocked_requests_total | Agent-side blocks |
| axonflow_agent_policy_evaluations_total | Agent policy evaluations |
| axonflow_agent_request_duration_milliseconds_bucket | Agent latency histogram |
| axonflow_gateway_precheck_requests_total | Gateway pre-check traffic |
| axonflow_gateway_precheck_duration_milliseconds_bucket | Gateway pre-check latency |
| axonflow_gateway_llm_tokens_total | Gateway token tracking |
| axonflow_gateway_llm_cost_usd_total | Gateway cost tracking |
| axonflow_orchestrator_llm_calls_total | Orchestrator provider traffic |
| axonflow_orchestrator_request_duration_milliseconds_bucket | Orchestrator latency |
| axonflow_connector_calls_total | MCP connector traffic |
| axonflow_connector_duration_milliseconds_bucket | Connector latency |
| axonflow_connector_errors_total | Connector error counts |
| axonflow_orchestrator_requests_total | Orchestrator request volume |
| axonflow_orchestrator_blocked_requests_total | Orchestrator-side blocks |
| axonflow_orchestrator_policy_evaluations_total | Orchestrator policy evaluations |
| axonflow_gateway_audit_requests_total | Gateway audit call volume |
| axonflow_gateway_audit_duration_milliseconds | Gateway audit latency |
| axonflow_gateway_audit_queued_total | Queued audit entries (async mode) |
| axonflow_gateway_audit_fallback_total | Audit entries written to disk fallback |
| axonflow_circuit_trips_total | Circuit breaker trip events (Enterprise) |
| axonflow_circuit_blocked_requests_total | Requests blocked by circuit breaker (Enterprise) |
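Because the histograms and counters above follow standard Prometheus conventions, the usual PromQL patterns apply. A few starting-point queries (label sets may differ by version, so treat the by() clauses as adjustable):
# p95 orchestrator latency in milliseconds
histogram_quantile(0.95, sum(rate(axonflow_orchestrator_request_duration_milliseconds_bucket[5m])) by (le))
# share of agent traffic blocked by policy
sum(rate(axonflow_agent_blocked_requests_total[5m])) / sum(rate(axonflow_agent_requests_total[5m]))
# connector error ratio
sum(rate(axonflow_connector_errors_total[5m])) / sum(rate(axonflow_connector_calls_total[5m]))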
Logs
AxonFlow logs are useful for runtime debugging, especially when you are validating new policies, providers, or MCP connectors.
docker compose logs -f axonflow-agent
docker compose logs -f axonflow-orchestrator
docker compose logs -f prometheus
docker compose logs -f grafana
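docker compose logs accepts multiple services and the usual flags, so a convenient variant while debugging a single flow is:
docker compose logs -f --tail=100 axonflow-agent axonflow-orchestrator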
What to Watch First
When a team is adopting AxonFlow, the most useful first signals are:
- request latency on agent and orchestrator
- blocked vs allowed traffic
- gateway pre-check latency
- connector errors and latency
- token usage and estimated LLM cost
Those signals tell you whether the control plane is trustworthy enough for real workloads.
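For the token and cost signals specifically, two starting-point PromQL queries (metric names from the table above; exact labels may vary):
# tokens consumed per hour, across all workloads
sum(rate(axonflow_gateway_llm_tokens_total[1h])) * 3600
# estimated LLM spend over the last 24 hours, in USD
sum(increase(axonflow_gateway_llm_cost_usd_total[24h]))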
What This Gives Senior Engineers
For senior and staff engineers, observability is one of the strongest review criteria, because it shows whether AxonFlow merely enforces controls or also makes them operable.
If the platform can show:
- latency
- block rates
- token and cost movement
- connector health
- audit-driven operational signals
then it is much easier to justify as the control plane for a serious AI application rather than a thin wrapper around provider APIs.
Community vs Higher Tiers
Community already gives teams enough observability to validate production behavior. Evaluation and Enterprise matter when organizations need longer retention, enterprise integrations, broader governance workflows, and procurement-friendly rollout.
