# Troubleshooting
This guide is the public operational runbook for AxonFlow community and self-hosted deployments. It focuses on the runtime surfaces engineers actually have in front of them today: the Agent on :8080, the Orchestrator on :8081, PostgreSQL, Redis, Prometheus, and Grafana.
If you are running an enterprise package behind AWS, Kubernetes, ECS, or another managed ingress layer, the same diagnostics still apply. The only thing that changes is how you reach the services and collect logs.
## Start With the Fastest Checks
Run these first before diving into policy details or connector code:
```bash
# Health checks
curl -sf http://localhost:8080/health | jq .
curl -sf http://localhost:8081/health | jq .

# Local container status
docker compose ps

# Recent logs
docker compose logs --tail=150 agent
docker compose logs --tail=150 orchestrator

# Prometheus targets
curl -sf http://localhost:8080/prometheus | head
curl -sf http://localhost:8081/prometheus | head

# Grafana health
curl -sf http://localhost:3000/api/health | jq .
```
If the platform is fronted by a reverse proxy or load balancer, you may also expose convenience paths like /orchestrator/health. The default self-hosted runtime uses direct service ports, so http://localhost:8080/health and http://localhost:8081/health are the canonical checks.
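The checks above can be folded into a one-shot sweep. A minimal sketch, assuming the default self-hosted ports and a 5-second timeout per probe; adjust the URLs if you sit behind a proxy or ingress:

```bash
#!/bin/sh
# One-shot health sweep over the default self-hosted endpoints.

status_line() {
  # $1 = service name, $2 = curl exit code (0 means reachable and healthy)
  if [ "$2" -eq 0 ]; then
    echo "$1: ok"
  else
    echo "$1: FAILED (curl exit $2)"
  fi
}

for pair in \
  "agent=http://localhost:8080/health" \
  "orchestrator=http://localhost:8081/health" \
  "grafana=http://localhost:3000/api/health"
do
  name=${pair%%=*}
  url=${pair#*=}
  curl -sf -m 5 "$url" >/dev/null 2>&1
  status_line "$name" "$?"
done
```

A non-zero curl exit distinguishes "process down" (connection refused) from "process up but degraded" (`-f` fails on non-2xx responses).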
## Troubleshooting Flow
Use this sequence to narrow down the failure domain quickly:
```
Is the health endpoint reachable?
├── NO → Check process/container status, port bindings, reverse proxy config, and firewall rules
│
└── YES → Does /health report "healthy"?
    ├── NO → Identify whether the failure is in PostgreSQL, Redis, provider bootstrap, or internal service auth
    │
    └── YES → Are requests still failing?
        ├── LLM requests → Check provider credentials, routing config, and Orchestrator logs
        ├── MCP requests → Check connector registration, auth, and policy blocks
        ├── Workflow requests → Check Orchestrator state, execution history, and WCP endpoints
        └── Audit/usage gaps → Check Agent request path and persistence health
```
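The same flow can be kept at hand as a tiny helper. This is a sketch of the tree above, not an official tool; it takes coarse observations and prints the failure domain to investigate first:

```bash
# Restate the decision tree as a shell function.
failure_domain() {
  # $1 = health endpoint reachable? (yes/no)
  # $2 = /health reports healthy? (yes/no)
  # $3 = failing surface (llm/mcp/workflow/audit/none)
  if [ "$1" = "no" ]; then
    echo "process, ports, proxy, or firewall"
  elif [ "$2" = "no" ]; then
    echo "PostgreSQL, Redis, provider bootstrap, or internal service auth"
  else
    case "$3" in
      llm)      echo "provider credentials, routing config, Orchestrator logs" ;;
      mcp)      echo "connector registration, auth, policy blocks" ;;
      workflow) echo "Orchestrator state, execution history, WCP endpoints" ;;
      audit)    echo "Agent request path and persistence health" ;;
      *)        echo "no known failure" ;;
    esac
  fi
}

# Example:
# failure_domain yes no none
```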
## Deployment Startup Problems
### Symptoms

- `docker compose up` never reaches healthy containers
- Agent or Orchestrator exits immediately
- one service keeps restarting while the other stays up
### What to check

```bash
docker compose ps
docker compose logs --tail=200 agent
docker compose logs --tail=200 orchestrator
docker compose logs --tail=100 postgres
docker compose logs --tail=100 redis
```
Common causes:
| Failure mode | What it usually means | Recovery path |
|---|---|---|
| Agent unhealthy, Orchestrator healthy | Agent cannot reach PostgreSQL, Redis, or internal config | Check database URL, Redis connectivity, policy/config startup errors |
| Orchestrator unhealthy, Agent healthy | LLM bootstrap or runtime-config initialization failed | Check provider env vars, routing config, and Orchestrator logs |
| Both unhealthy | Shared dependency problem | Start with PostgreSQL, Redis, secrets, and compose env |
| Health passes but requests fail | Request-path issue, not startup issue | Move to the request diagnostics below |
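For the "both unhealthy" row, it helps to probe the shared dependencies directly before restarting anything. A sketch, assuming compose services named `postgres` and `redis` and an `axonflow` database role (substitute the names from your compose file):

```bash
# Probe shared dependencies from the host. Service names and the "axonflow"
# role are assumptions from a typical compose file.
probe_postgres() { docker compose exec -T postgres pg_isready -U axonflow; }
probe_redis()    { docker compose exec -T redis redis-cli ping; }

# redis-cli prints PONG when the server answers; anything else means trouble.
redis_ok() { [ "$1" = "PONG" ]; }
```

If either probe fails, fixing that dependency usually clears both service health checks at once.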
## Health Endpoint Fails or Stays Degraded
The Agent and Orchestrator both expose /health. In self-hosted community deployments those are the fastest indicators of whether the process is merely alive or actually ready.
Useful checks:
```bash
curl -sf http://localhost:8080/health | jq .
curl -sf http://localhost:8081/health | jq .
curl -sf http://localhost:8080/prometheus | grep -E "http|request|policy" | head
curl -sf http://localhost:8081/prometheus | grep -E "http|workflow|llm" | head
```
If `/health` is failing:

- verify PostgreSQL is reachable and writable
- verify Redis is up if your workflow or rate-limit path depends on it
- verify `AXONFLOW_INTERNAL_SERVICE_SECRET` is identical on Agent and Orchestrator when you set it
- verify provider configuration for any proxy-mode or MAP flow that needs an LLM
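To confirm the shared secret matches without ever printing it, compare the two values indirectly. A sketch against a compose deployment:

```bash
# Compare AXONFLOW_INTERNAL_SERVICE_SECRET across services without echoing it.
same_secret() {
  # true only when both values are non-empty and identical
  [ -n "$1" ] && [ "$1" = "$2" ]
}

agent_secret=$(docker compose exec -T agent printenv AXONFLOW_INTERNAL_SERVICE_SECRET 2>/dev/null)
orch_secret=$(docker compose exec -T orchestrator printenv AXONFLOW_INTERNAL_SERVICE_SECRET 2>/dev/null)
if same_secret "$agent_secret" "$orch_secret"; then
  echo "internal service secret: matched"
else
  echo "internal service secret: missing or mismatched"
fi
```

"Missing or mismatched" covers both the unset case and the copy-paste-corruption case; either one breaks internal service auth.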
## Requests Fail Even Though Health Is Green
The canonical client request path in the current runtime is:
```
POST /api/request
```
If older documentation or scripts still point to `/api/v1/agent/execute`, treat that as stale.
### Check the request path directly
```bash
curl -s http://localhost:8080/api/request \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "hello",
    "client_id": "test-client",
    "request_type": "proxy"
  }' | jq .
```
Look for three broad classes of failure:
| Symptom | Usually points to | What to inspect |
|---|---|---|
| 4xx response | auth, policy, request-shape, or tier limit issue | Agent logs, policy config, request body |
| 5xx response | provider, connector, orchestration, or persistence failure | both service logs, Prometheus, dependency health |
| long latency / timeout | provider or connector path, not usually health path | Orchestrator logs, provider routing, connector latency |
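A quick way to put a response into one of those classes is to capture only the status code. The mapping below is a sketch of the table, not an exhaustive rule:

```bash
# Triage an /api/request response by HTTP status code.
classify_status() {
  case "$1" in
    2??) echo "success" ;;
    4??) echo "client-side: auth, policy, request shape, or tier limit" ;;
    5??) echo "server-side: provider, connector, orchestration, or persistence" ;;
    000) echo "no response: connection failure or timeout" ;;
    *)   echo "unexpected status: $1" ;;
  esac
}

# Capture just the status code from the request-path check, then classify it:
# code=$(curl -s -o /dev/null -w '%{http_code}' -m 10 \
#   http://localhost:8080/api/request \
#   -H 'Content-Type: application/json' \
#   -d '{"query":"hello","client_id":"test-client","request_type":"proxy"}')
# classify_status "$code"
```

curl's `%{http_code}` write-out prints `000` when no HTTP response arrived at all, which is why that case maps to the timeout row.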
## MCP Connector Problems
Connector issues are often confused with generic “platform” failures. Separate them early.
Useful checks:
```bash
curl -sf http://localhost:8080/mcp/health | jq .
curl -sf http://localhost:8080/mcp/connectors/postgresql/health | jq .
```
When connector traffic fails:
- verify the connector is actually registered in your current edition/build
- verify credentials and network reachability to the target system
- verify request blocking is not coming from system or tenant policies
- verify MCP response limits such as `MCP_MAX_ROWS_PER_QUERY` and `MCP_MAX_BYTES_PER_QUERY`
If you are debugging a policy block, compare the connector type and operation you are calling with the system and tenant policies currently in effect.
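When several connectors are in play, it is faster to sweep their health endpoints in one pass. A sketch; the connector names you pass are whatever your edition/build actually registers:

```bash
# Report per-connector health in one pass against the Agent's MCP surface.
sweep_connectors() {
  for connector in "$@"; do
    if curl -sf -m 5 "http://localhost:8080/mcp/connectors/${connector}/health" >/dev/null 2>&1; then
      echo "${connector}: ok"
    else
      echo "${connector}: unhealthy or not registered"
    fi
  done
}

# Usage: sweep_connectors postgresql mysql
```

An "unhealthy or not registered" line is your cue to check registration first, then credentials and network reachability.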
## License and Tier Problems
Community runs without a license key. Evaluation and Enterprise rely on `AXONFLOW_LICENSE_KEY`.
Quick checks:
```bash
echo "$AXONFLOW_LICENSE_KEY" | cut -c1-5
docker compose exec agent printenv AXONFLOW_LICENSE_KEY | cut -c1-5
docker compose exec orchestrator printenv AXONFLOW_LICENSE_KEY | cut -c1-5
```
Typical failure patterns:
- key is set in your shell but not passed into the container
- key contains whitespace or newline corruption
- expired Evaluation or Enterprise key causes graceful fallback to Community limits
- the runtime is healthy, but a higher-tier feature now returns a limit or entitlement error
For exact limits and upgrade guidance, see License Management.
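The whitespace and corruption failures above can be caught without ever printing the key. A minimal sketch; Community installs can skip it entirely:

```bash
# Check a license key for obvious corruption without revealing it.
key_looks_clean() {
  case "$1" in
    "")            echo "empty"; return 1 ;;
    *[[:space:]]*) echo "contains whitespace or newline"; return 1 ;;
    *)             echo "no obvious corruption"; return 0 ;;
  esac
}

# Usage against the running container:
# key_looks_clean "$(docker compose exec -T agent printenv AXONFLOW_LICENSE_KEY)"
```

"Empty" inside the container while the key is set in your shell is the classic symptom of an env var that was never passed through compose.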
## Latency and Throughput Problems
AxonFlow performance issues usually come from one of four places:
- LLM provider latency
- connector/database latency
- excessive policy complexity
- undersized Agent or Orchestrator capacity
Start here:
```bash
curl -sf http://localhost:8080/prometheus | grep -i request | head -20
curl -sf http://localhost:8081/prometheus | grep -Ei 'workflow|llm|provider|execution' | head -20
```
Questions to ask:
- Is the slowdown visible on the Agent, the Orchestrator, or both?
- Is the problem only on proxy-mode traffic, only on gateway-mode traffic, or only on MCP?
- Did latency spike after a policy change, a new connector rollout, or a provider switch?
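One way to answer the first question is to sample each service repeatedly and flag slow responses. A sketch; the 0.25 s threshold is illustrative, not an official SLO:

```bash
# Flag latency samples above a threshold (seconds). Reads one value per line.
flag_slow() {
  awk -v t="$1" '{ if ($1 + 0 > t + 0) print $1, "SLOW"; else print $1, "ok" }'
}

# Example: sample the Agent health endpoint ten times and flag outliers.
# for _ in 1 2 3 4 5 6 7 8 9 10; do
#   curl -sf -o /dev/null -w '%{time_total}\n' http://localhost:8080/health
# done | flag_slow 0.25
```

Run the same sampling against both :8080 and :8081; whichever side shows the outliers tells you whether to look at the Agent or the Orchestrator first.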
Use Monitoring Overview and Load Testing to confirm whether the bottleneck is in AxonFlow or in a downstream dependency.
## AWS and Managed-Infrastructure Notes
If you run AxonFlow behind ECS, Kubernetes, an ALB, or another platform layer, map the same checks onto your environment:
- service or container health instead of `docker compose ps`
- platform logs instead of local compose logs
- the same `/health` and `/prometheus` endpoints behind service discovery or ingress
- the same `AXONFLOW_LICENSE_KEY` and provider env vars through your secret manager
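That mapping can live as a small runbook snippet. The namespace and workload names below (`axonflow`, `axonflow-agent`) are assumptions; adjust them to match your manifests:

```bash
# Print compose-to-Kubernetes equivalents for the checks in this guide.
k8s_mapping() {
  cat <<'EOF'
docker compose ps          -> kubectl -n axonflow get pods
docker compose logs agent  -> kubectl -n axonflow logs deploy/axonflow-agent --tail=150
curl :8080/health          -> kubectl -n axonflow port-forward svc/axonflow-agent 8080:8080, then curl http://localhost:8080/health
EOF
}
k8s_mapping
```

An ECS or ALB deployment follows the same shape: substitute `aws ecs describe-services` and your log driver for the pod and log commands.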
The public troubleshooting logic should still work even if the deployment topology changes.
## When Troubleshooting Becomes an Upgrade Signal
Community is enough for serious engineering review and for many production use cases. But repeated operational pain in any of these areas is usually a signal that teams should consider Evaluation or Enterprise:
- tighter governance workflows
- higher limits for policies, providers, and workflow history
- richer approval and evidence workflows
- broader connector coverage
- protected deployment and operations guidance
