Runtime Troubleshooting
This page focuses on incidents that happen after the platform starts successfully. The services may be healthy, the containers may be running, and yet engineers still see request failures, stalled workflows, missing audit trails, or latency spikes. That is the runtime troubleshooting problem space.
For failures during deployment bootstrap, see Troubleshooting. Use this page when the runtime is already up and the problem appears only under real traffic.
Fast Runtime Triage
# Health and readiness
curl -sf http://localhost:8080/health | jq .
curl -sf http://localhost:8081/health | jq .
# Recent logs
docker compose logs --tail=200 agent
docker compose logs --tail=200 orchestrator
# Metrics endpoints
curl -sf http://localhost:8080/prometheus | head
curl -sf http://localhost:8081/prometheus | head
When you look at logs, separate the failure domain:
- Agent problems affect request acceptance, system-policy enforcement, MCP, audit, and gateway-mode flows
- Orchestrator problems affect proxy-mode execution, provider routing, tenant policies, MAP, and workflows
Log Format
AxonFlow services emit structured JSON logs:
{"level":"info","ts":"2026-03-30T12:00:00Z","caller":"agent/run.go:123","msg":"request processed","request_id":"req_abc","duration_ms":4}
Common Log Patterns
| Message | Meaning | Action |
|---|---|---|
| license validated | License check passed on startup | Normal |
| database connection failed | Cannot reach PostgreSQL | Check DB host, credentials, network |
| orchestrator unreachable | Agent cannot reach orchestrator | Check orchestrator health, network between services |
| connector health check failed | MCP connector down | Check connector config and target system |
| policy evaluation failed | Error during policy check | Check policy syntax and database |
| rate limit exceeded | Client exceeded configured rate | Normal throttling behavior |
| circuit breaker tripped | Auto-trip threshold reached | Check provider health, review error rate |
| request blocked | Policy denied the request | Normal governance behavior |
| context deadline exceeded | Request timed out | Check LLM provider latency, increase timeout |
| too many open connections | DB pool exhausted | Increase max_open_conns or reduce concurrency |
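One way to scan recent logs for the high-signal patterns above in a single pass; the exact wording can vary between builds, so treat the pattern list as a starting point:

# Scan both services for the known failure patterns from the table above
docker compose logs --tail=1000 agent orchestrator 2>&1 | \
  grep -Ei 'database connection failed|orchestrator unreachable|connector health check failed|policy evaluation failed|circuit breaker tripped|context deadline exceeded|too many open connections'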
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Clean shutdown |
| 1 | Startup error (config, database, license) |
| 137 | OOM killed (container memory limit exceeded) |
| 139 | Segfault (report as bug) |
| 143 | SIGTERM received (normal container shutdown) |
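To read the last exit code without scrolling through logs, inspect the container directly. A sketch assuming the Compose service names used elsewhere on this page:

# Show the last exit code and whether the kernel OOM killer was involved
docker inspect --format 'exit={{.State.ExitCode}} oom_killed={{.State.OOMKilled}}' "$(docker compose ps -q agent)"
docker inspect --format 'exit={{.State.ExitCode}} oom_killed={{.State.OOMKilled}}' "$(docker compose ps -q orchestrator)"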
Request Path Failures
The current client-facing request endpoint is:
POST /api/request
If traffic is failing, reproduce it with a minimal request first:
curl -s http://localhost:8080/api/request \
-H 'Content-Type: application/json' \
-d '{
"query": "Summarize the latest customer ticket",
"client_id": "ops-debug",
"request_type": "proxy"
}' | jq .
Then classify the failure:
| Failure type | What it usually means | First place to look |
|---|---|---|
| blocked response | policy decision, not platform outage | Agent logs, policy docs, tenant policy state |
| provider error | LLM credentials, model access, or routing problem | Orchestrator logs and provider configuration |
| timeout | downstream provider, connector, or database slowness | Prometheus, dependency health, retry behavior |
| malformed request | stale SDK example or incorrect payload | current SDK docs and request body |
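Timing the same minimal request helps separate the rows above: policy blocks typically return quickly, while provider or connector timeouts burn the full deadline. A sketch using standard curl timing output:

# Capture status code and total time for the minimal request
curl -s -o /tmp/axonflow-resp.json \
  -w 'http=%{http_code} total=%{time_total}s\n' \
  http://localhost:8080/api/request \
  -H 'Content-Type: application/json' \
  -d '{"query":"ping","client_id":"ops-debug","request_type":"proxy"}'
jq . /tmp/axonflow-resp.json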
Service Restart Loops
When a service keeps restarting, the exit pattern matters more than the fact of the restart.
For local and self-hosted environments:
docker compose ps
docker compose logs --tail=200 agent
docker compose logs --tail=200 orchestrator
Common signals:
| Signal | Likely root cause |
|---|---|
| OOM or exit 137 | memory sizing is too small for the current workload |
| startup panic | bad env var, missing secret, or provider bootstrap failure |
| repeated database errors | PostgreSQL connectivity, migration, or pool pressure |
| internal auth failures | mismatched AXONFLOW_INTERNAL_SERVICE_SECRET values |
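For the internal auth case, comparing the secret on both services without printing it is usually enough. A sketch, assuming the images ship a POSIX shell and sha256sum:

# Compare AXONFLOW_INTERNAL_SERVICE_SECRET across services by hash, not by value
docker compose exec agent sh -c 'printf "%s" "$AXONFLOW_INTERNAL_SERVICE_SECRET" | sha256sum'
docker compose exec orchestrator sh -c 'printf "%s" "$AXONFLOW_INTERNAL_SERVICE_SECRET" | sha256sum'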
Workflow and MAP Issues
AxonFlow multi-agent planning is not a single opaque “magic” call. The workflow path involves plan generation, execution, state tracking, and UI or API visibility.
Symptoms to watch for:
- plan generation succeeds but execution never starts
- execution starts but stalls between steps
- UI or API cannot see expected execution history
- WCP actions succeed intermittently but replay or debugging fails
What to inspect:
- Orchestrator logs first, not just Agent logs
- workflow history and execution records
- tenant policy or approval logic that may be intentionally blocking a step
- provider routing configuration for the plan-generation stage
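A first pass, starting from the Orchestrator as suggested above; message wording varies by version, so adjust the pattern:

# Look for plan-generation and execution activity, then narrow to errors
docker compose logs --tail=500 orchestrator | grep -Ei 'plan|workflow|execution' | tail -n 50
docker compose logs --tail=500 orchestrator | grep '"level":"error"'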
MCP Connector Runtime Failures
If LLM-only requests work but tool calls fail, do not debug them as generic request failures. Debug the connector path directly.
Useful checks:
curl -sf http://localhost:8080/mcp/health | jq .
curl -sf http://localhost:8080/mcp/connectors/postgresql/health | jq .
Connector failures usually come from one of these areas:
- connector is not available in the current build or tier
- database or SaaS credentials are wrong
- network path to the target system is blocked
- system or tenant policies block either input or output
- response limits such as MCP_MAX_ROWS_PER_QUERY or MCP_MAX_BYTES_PER_QUERY are being hit
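For the response-limit case, confirming what the Agent actually sees is quicker than rereading config files. A sketch, assuming the limits are set as environment variables on the agent service:

# Check the configured MCP response limits inside the running container
docker compose exec agent sh -c 'env | grep -E "MCP_MAX_ROWS_PER_QUERY|MCP_MAX_BYTES_PER_QUERY"'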
The important operational habit is to check whether the connector is unavailable, unhealthy, or intentionally blocked. Those are different problems with different fixes.
Audit, Usage, and History Problems
If engineers say “the request worked but I cannot see it in AxonFlow,” treat that as a persistence or observability issue.
Check:
- Agent logs for request acceptance on /api/request
- PostgreSQL health and write capacity
- whether the request path used AxonFlow at all, or bypassed it
- execution history and audit retention limits for the active tier
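Two quick checks cover most of these: confirm the Agent actually logged the request, and confirm PostgreSQL is reachable. A sketch, assuming a Compose service named postgres and the request_id format shown in the Log Format section:

# Did the Agent see the request at all?
docker compose logs agent | grep '"request_id":"req_abc"'
# Is the database reachable from inside the stack? (service name may differ in your deployment)
docker compose exec postgres pg_isready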
This matters for regulated and high-accountability workflows because missing audit evidence is often treated as a platform failure even when the response itself succeeded.
Provider and Routing Problems
Provider incidents commonly look like platform bugs. Confirm the basics before changing policy or workflow code.
Review:
- DEFAULT_LLM_PROVIDER
- LLM_ROUTING_STRATEGY
- PROVIDER_WEIGHTS
- provider credentials such as OPENAI_API_KEY, ANTHROPIC_API_KEY, AZURE_OPENAI_API_KEY, or Bedrock-related runtime config in enterprise deployments
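A quick way to confirm which of these the Orchestrator actually has, without echoing secret values. A sketch; adjust the service name if provider config lives on a different component in your deployment:

# Routing configuration is safe to print
docker compose exec orchestrator sh -c 'env | grep -E "DEFAULT_LLM_PROVIDER|LLM_ROUTING_STRATEGY|PROVIDER_WEIGHTS"'
# Only confirm that credentials are set; do not print them
docker compose exec orchestrator sh -c 'env | grep -E "OPENAI_API_KEY|ANTHROPIC_API_KEY|AZURE_OPENAI_API_KEY" | sed "s/=.*/=<set>/"'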
If no providers bootstrapped successfully, proxy-mode and MAP flows will degrade quickly even though health endpoints may still be green.
Performance Incidents
For performance debugging, the key question is where time is being spent:
- Agent-side policy evaluation
- Orchestrator-side routing or workflow orchestration
- provider latency
- connector latency
- database or Redis contention
Use /prometheus on both services and compare:
- request volume
- error growth
- latency buckets
- workflow or execution behavior
The goal is to determine whether AxonFlow is the bottleneck or whether it is accurately surfacing latency from a downstream provider or tool.
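One quick way to compare both services side by side from the shell; metric names differ between builds, so treat the grep pattern as a starting point:

# Pull request, error, and latency series from both services
curl -sf http://localhost:8080/prometheus | grep -Ei 'request|error|duration|latency' | head -n 30
curl -sf http://localhost:8081/prometheus | grep -Ei 'request|error|duration|latency' | head -n 30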
Recovery Priorities for Production Teams
When an incident is active, recover in this order:
- restore health on Agent and Orchestrator
- restore core request flow on /api/request
- restore audit visibility and execution history
- restore connector and workflow features
- tune routing, budgets, and policy behavior after the platform is stable
This ordering helps teams avoid spending time on tuning when the real issue is basic request-path health.
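A quick confirmation of the first two steps before moving further down the list, reusing the triage endpoints from the top of this page:

# Step 1: both services report healthy
for port in 8080 8081; do
  curl -sf "http://localhost:$port/health" > /dev/null && echo "port $port healthy" || echo "port $port UNHEALTHY"
done
# Step 2: the core request path answers
curl -s http://localhost:8080/api/request \
  -H 'Content-Type: application/json' \
  -d '{"query":"ping","client_id":"ops-debug","request_type":"proxy"}' | jq .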
