Monitoring & Observability

AxonFlow ships with a practical local observability stack so engineers can see policy activity, latency, token usage, and connector behavior while they build. In the Community Docker Compose setup, Prometheus and Grafana are started by default.

That observability story matters because governed AI systems are hard to trust if operators cannot answer basic runtime questions:

  • what is the request volume and latency trend?
  • are policies blocking too much or too little?
  • are connectors healthy?
  • which workloads are driving token usage and cost?
  • can we debug a bad workflow run without reconstructing everything by hand?

What the Platform Exposes

  • GET /health on the Agent (:8080)
  • GET /health on the Orchestrator (:8081)
  • GET /prometheus on both services for native Prometheus scraping
  • GET /metrics on both services for JSON/debug-style metrics output
  • Grafana on :3000
  • Prometheus on :9090
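
Before wiring up an external Prometheus, a quick shell spot check confirms that both metric surfaces respond. This is only a sketch using the ports listed above; jq is needed just for the JSON endpoint:

# Native Prometheus exposition format (what Prometheus scrapes)
curl -s http://localhost:8080/prometheus | head -n 20

# JSON/debug-style metrics output
curl -s http://localhost:8081/metrics | jq . | head -n 40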

Health Endpoints

Agent

curl -s http://localhost:8080/health | jq .

Typical fields:

{
  "status": "healthy",
  "service": "axonflow-agent",
  "version": "5.4.0",
  "capabilities": [],
  "sdk_compatibility": {}
}

Orchestrator

curl -s http://localhost:8081/health | jq .

Typical fields:

{
  "status": "healthy",
  "service": "axonflow-orchestrator",
  "version": "5.4.0",
  "components": {
    "policy_engine": true,
    "llm_router": true,
    "response_processor": true,
    "audit_logger": true,
    "workflow_engine": true
  }
}

Use these endpoints for:

  • container and load-balancer health checks
  • readiness probes
  • fast smoke tests after config changes

Health checks are useful, but they are not enough by themselves. Mature teams pair them with request, policy, token, and connector metrics so they can see both availability and behavior.
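
For the smoke-test case, a small shell loop over the two /health endpoints is usually enough. The snippet below is a minimal sketch, not a shipped tool; it relies only on the status field shown above and assumes curl and jq are available. Save it as a script and run it after config changes:

# Exit non-zero if either service does not report "healthy"
for url in http://localhost:8080/health http://localhost:8081/health; do
  status=$(curl -sf "$url" | jq -r .status)
  if [ "$status" != "healthy" ]; then
    echo "unhealthy: $url (status=$status)" >&2
    exit 1
  fi
  echo "ok: $url"
done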

Prometheus Scraping

Prometheus scraping should target /prometheus, not /metrics.

scrape_configs:
  - job_name: 'axonflow-agent'
    static_configs:
      - targets: ['YOUR_AGENT_HOST:8080']
    metrics_path: /prometheus
    scrape_interval: 15s

  - job_name: 'axonflow-orchestrator'
    static_configs:
      - targets: ['YOUR_ORCHESTRATOR_HOST:8081']
    metrics_path: /prometheus
    scrape_interval: 15s

The local Docker Compose file already provisions this for you via config/prometheus-local.yml.

For shared or longer-lived environments, teams usually keep the same scrape pattern but send the data to their existing Prometheus and Grafana estate rather than relying only on a laptop-local stack.
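
If that shared Grafana is provisioned from files, pointing it at whichever Prometheus scrapes these endpoints uses the standard Grafana datasource provisioning format. A minimal sketch; the host placeholder is yours to fill in and nothing here is AxonFlow-specific:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://YOUR_PROMETHEUS_HOST:9090
    isDefault: true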

High-Value Metrics

These are the metrics the bundled community dashboard uses today:

Metric                                                        Description
axonflow_agent_requests_total                                 Agent request volume
axonflow_agent_blocked_requests_total                         Agent-side blocks
axonflow_agent_policy_evaluations_total                       Agent policy evaluations
axonflow_agent_request_duration_milliseconds_bucket           Agent latency histogram
axonflow_gateway_precheck_requests_total                      Gateway pre-check traffic
axonflow_gateway_precheck_duration_milliseconds_bucket        Gateway pre-check latency
axonflow_gateway_llm_tokens_total                             Gateway token tracking
axonflow_gateway_llm_cost_usd_total                           Gateway cost tracking
axonflow_orchestrator_llm_calls_total                         Orchestrator provider traffic
axonflow_orchestrator_request_duration_milliseconds_bucket    Orchestrator latency
axonflow_connector_calls_total                                MCP connector traffic
axonflow_connector_duration_milliseconds_bucket               Connector latency
axonflow_connector_errors_total                               Connector error counts
axonflow_orchestrator_requests_total                          Orchestrator request volume
axonflow_orchestrator_blocked_requests_total                  Orchestrator-side blocks
axonflow_orchestrator_policy_evaluations_total                Orchestrator policy evaluations
axonflow_gateway_audit_requests_total                         Gateway audit call volume
axonflow_gateway_audit_duration_milliseconds                  Gateway audit latency
axonflow_gateway_audit_queued_total                           Queued audit entries (async mode)
axonflow_gateway_audit_fallback_total                         Audit entries written to disk fallback
axonflow_circuit_trips_total                                  Circuit breaker trip events (Enterprise)
axonflow_circuit_blocked_requests_total                       Requests blocked by circuit breaker (Enterprise)
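
Because the latency metrics are exported as _milliseconds_bucket histograms, standard PromQL functions apply directly. A few example queries as a sketch; only the metric names from the table above are assumed, plus the standard le histogram label:

# Agent request rate over the last 5 minutes
rate(axonflow_agent_requests_total[5m])

# Share of agent traffic blocked by policy
rate(axonflow_agent_blocked_requests_total[5m]) / rate(axonflow_agent_requests_total[5m])

# p95 orchestrator latency (milliseconds)
histogram_quantile(0.95, sum(rate(axonflow_orchestrator_request_duration_milliseconds_bucket[5m])) by (le))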

Logs

AxonFlow logs are useful for runtime debugging, especially when you are validating new policies, providers, or MCP connectors.

docker compose logs -f axonflow-agent
docker compose logs -f axonflow-orchestrator
docker compose logs -f prometheus
docker compose logs -f grafana
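
The log format itself is not documented on this page, so treat the following as a generic first-pass filter rather than an official pattern: pull a recent slice and grep for error-level noise.

# Last 200 lines from the orchestrator, filtered for likely problems
docker compose logs --tail 200 axonflow-orchestrator | grep -i error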

What to Watch First

When a team is adopting AxonFlow, the most useful first signals are:

  • request latency on agent and orchestrator
  • blocked vs allowed traffic
  • gateway pre-check latency
  • connector errors and latency
  • token usage and estimated LLM cost

Those signals tell you whether the control plane is trustworthy enough for real workloads.
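
The same metrics can back simple alerting once they are in Prometheus. Below is a minimal alerting-rule sketch in standard Prometheus rule-file syntax; the threshold and durations are illustrative, not values shipped with AxonFlow:

groups:
  - name: axonflow-basics
    rules:
      - alert: AxonFlowConnectorErrors
        expr: rate(axonflow_connector_errors_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "MCP connector errors observed over the last 10 minutes"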

What This Gives Senior Engineers

For senior and staff engineers, observability is one of the strongest review criteria: it shows whether AxonFlow merely enforces controls or also makes those controls operable.

If the platform can show:

  • latency
  • block rates
  • token and cost movement
  • connector health
  • audit-driven operational signals

then it is much easier to justify as the control plane for a serious AI application rather than a thin wrapper around provider APIs.

Community vs Higher Tiers

Community already gives teams enough observability to validate production behavior. Evaluation and Enterprise matter when organizations need longer retention, enterprise integrations, broader governance workflows, and procurement-friendly rollout.

Next Steps