Troubleshooting

This guide is the public operational runbook for AxonFlow community and self-hosted deployments. It focuses on the runtime surfaces engineers actually have in front of them today: the Agent on :8080, the Orchestrator on :8081, PostgreSQL, Redis, Prometheus, and Grafana.

If you are running an enterprise package behind AWS, Kubernetes, ECS, or another managed ingress layer, the same diagnostics still apply. The only thing that changes is how you reach the services and collect logs.

Start With the Fastest Checks

Run these first before diving into policy details or connector code:

# Health checks
curl -sf http://localhost:8080/health | jq .
curl -sf http://localhost:8081/health | jq .

# Local container status
docker compose ps

# Recent logs
docker compose logs --tail=150 agent
docker compose logs --tail=150 orchestrator

# Prometheus metrics endpoints
curl -sf http://localhost:8080/prometheus | head
curl -sf http://localhost:8081/prometheus | head

# Grafana health
curl -sf http://localhost:3000/api/health | jq .

If the platform is fronted by a reverse proxy or load balancer, you may also expose convenience paths like /orchestrator/health. The default self-hosted runtime uses direct service ports, so http://localhost:8080/health and http://localhost:8081/health are the canonical checks.
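
For example, if your ingress maps the Orchestrator under a path prefix, the same check runs through the proxy (the hostname and /orchestrator prefix below are placeholders for your own configuration):

curl -sf https://axonflow.example.com/orchestrator/health | jq .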

Troubleshooting Flow

Use this sequence to narrow down the failure domain quickly:

Is the health endpoint reachable?
├── NO → Check process/container status, port bindings, reverse proxy config, and firewall rules
└── YES → Does /health report "healthy"?
    ├── NO → Identify whether the failure is in PostgreSQL, Redis, provider bootstrap, or internal service auth
    └── YES → Are requests still failing?
        ├── LLM requests → Check provider credentials, routing config, and Orchestrator logs
        ├── MCP requests → Check connector registration, auth, and policy blocks
        ├── Workflow requests → Check Orchestrator state, execution history, and WCP endpoints
        └── Audit/usage gaps → Check Agent request path and persistence health
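
The first two questions in this tree are easy to script. A minimal sketch, assuming the health JSON carries a top-level status field (field names may differ in your build):

for port in 8080 8081; do
  if ! curl -sf "http://localhost:$port/health" >/dev/null; then
    echo "port $port: unreachable, check containers, port bindings, proxy, firewall"
  else
    # assumes a top-level "status" field in the health JSON
    echo "port $port: $(curl -s "http://localhost:$port/health" | jq -r '.status // "unknown"')"
  fi
done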

Deployment Startup Problems

Symptoms

  • docker compose up never reaches healthy containers
  • Agent or Orchestrator exits immediately
  • one service keeps restarting while the other stays up

What to check

docker compose ps
docker compose logs --tail=200 agent
docker compose logs --tail=200 orchestrator
docker compose logs --tail=100 postgres
docker compose logs --tail=100 redis

Common causes:

| Failure mode | What it usually means | Recovery path |
| --- | --- | --- |
| Agent unhealthy, Orchestrator healthy | Agent cannot reach PostgreSQL, Redis, or internal config | Check database URL, Redis connectivity, policy/config startup errors |
| Orchestrator unhealthy, Agent healthy | LLM bootstrap or runtime-config initialization failed | Check provider env vars, routing config, and Orchestrator logs |
| Both unhealthy | Shared dependency problem | Start with PostgreSQL, Redis, secrets, and compose env |
| Health passes but requests fail | Request-path issue, not a startup issue | Move to the request diagnostics below |
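
When both services look unhealthy, test the shared dependencies directly. A quick sketch, assuming the compose services are named postgres and redis as in the log commands above (adjust the database user to match your setup):

# expect "accepting connections" from pg_isready and "PONG" from redis-cli
docker compose exec postgres pg_isready -U postgres
docker compose exec redis redis-cli ping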

Health Endpoint Fails or Stays Degraded

The Agent and Orchestrator both expose /health. In self-hosted community deployments those are the fastest indicators of whether the process is merely alive or actually ready.

Useful checks:

curl -sf http://localhost:8080/health | jq .
curl -sf http://localhost:8081/health | jq .
curl -sf http://localhost:8080/prometheus | grep -E "http|request|policy" | head
curl -sf http://localhost:8081/prometheus | grep -E "http|workflow|llm" | head

If /health is failing:

  • verify PostgreSQL is reachable and writable
  • verify Redis is up if your workflow or rate-limit path depends on it
  • verify AXONFLOW_INTERNAL_SERVICE_SECRET is identical on Agent and Orchestrator when you set it (see the comparison below)
  • verify provider configuration for any proxy-mode or MAP flow that needs an LLM
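
To compare the internal service secret on both sides, one option is a direct diff (a sketch; the service names follow the compose commands earlier in this guide):

# -T disables the pseudo-TTY so no carriage returns sneak into the output;
# no diff output means the values are identical
diff \
  <(docker compose exec -T agent printenv AXONFLOW_INTERNAL_SERVICE_SECRET) \
  <(docker compose exec -T orchestrator printenv AXONFLOW_INTERNAL_SERVICE_SECRET) \
  && echo "secrets match"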

Requests Fail Even Though Health Is Green

The canonical client request path in the current runtime is:

POST /api/request

If older documentation or scripts still point to /api/v1/agent/execute, treat that as stale.
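
To find lingering references in your own automation, a plain search is enough (the paths below are placeholders for wherever your client scripts live):

grep -rn 'api/v1/agent/execute' ./scripts ./docs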

Check the request path directly

curl -s http://localhost:8080/api/request \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "hello",
    "client_id": "test-client",
    "request_type": "proxy"
  }' | jq .

Look for three broad classes of failure:

| Symptom | Usually points to | What to inspect |
| --- | --- | --- |
| 4xx response | Auth, policy, request-shape, or tier limit issue | Agent logs, policy config, request body |
| 5xx response | Provider, connector, orchestration, or persistence failure | Both service logs, Prometheus, dependency health |
| Long latency / timeout | Provider or connector path, not usually the health path | Orchestrator logs, provider routing, connector latency |
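
Whichever class you hit, correlate the failing request with recent logs on both services. For example:

docker compose logs --since=5m agent | grep -iE 'error|warn'
docker compose logs --since=5m orchestrator | grep -iE 'error|warn'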

MCP Connector Problems

Connector issues are often confused with generic “platform” failures. Separate them early.

Useful checks:

curl -sf http://localhost:8080/mcp/health | jq .
curl -sf http://localhost:8080/mcp/connectors/postgresql/health | jq .

When connector traffic fails:

  • verify the connector is actually registered in your current edition/build
  • verify credentials and network reachability to the target system
  • verify request blocking is not coming from system or tenant policies
  • verify MCP response limits such as MCP_MAX_ROWS_PER_QUERY and MCP_MAX_BYTES_PER_QUERY
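
To see which MCP limits are actually in effect, one option is to read them from the Agent environment (a sketch, assuming the limits are set as environment variables on the agent service):

docker compose exec agent printenv | grep -E '^MCP_'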

If you are debugging a policy block, compare the connector type and operation you are calling with the system and tenant policies configured for your deployment.

License and Tier Problems

Community runs without a license key. Evaluation and Enterprise rely on AXONFLOW_LICENSE_KEY.

Quick checks:

echo "$AXONFLOW_LICENSE_KEY" | cut -c1-5
docker compose exec agent printenv AXONFLOW_LICENSE_KEY | cut -c1-5
docker compose exec orchestrator printenv AXONFLOW_LICENSE_KEY | cut -c1-5

Typical failure patterns:

  • key is set in your shell but not passed into the container
  • key contains whitespace or newline corruption (see the byte-level check below)
  • expired Evaluation or Enterprise key causes graceful fallback to Community limits
  • the runtime is healthy, but a higher-tier feature now returns a limit or entitlement error
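
To detect whitespace or newline corruption at the byte level, dump the value inside the container (a sketch; -T keeps the TTY from adding carriage returns of its own):

# printenv appends exactly one trailing \n; any \r, extra \n, or spaces
# before it indicate a corrupted key
docker compose exec -T agent printenv AXONFLOW_LICENSE_KEY | od -c | tail -3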

For exact limits and upgrade guidance, see License Management.

Latency and Throughput Problems

AxonFlow performance issues usually come from one of four places:

  • LLM provider latency
  • connector/database latency
  • excessive policy complexity
  • undersized Agent or Orchestrator capacity

Start here:

curl -sf http://localhost:8080/prometheus | grep -i request | head -20
curl -sf http://localhost:8081/prometheus | grep -Ei 'workflow|llm|provider|execution' | head -20

Questions to ask:

  • Is the slowdown visible on the Agent, the Orchestrator, or both?
  • Is the problem only on proxy-mode traffic, only on gateway-mode traffic, or only on MCP?
  • Did latency spike after a policy change, a new connector rollout, or a provider switch?
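
A crude end-to-end latency probe against the Agent request path, reusing the proxy-mode body from earlier and curl's built-in timers:

curl -s -o /dev/null \
  -w 'total: %{time_total}s  connect: %{time_connect}s\n' \
  -H 'Content-Type: application/json' \
  -d '{"query": "hello", "client_id": "test-client", "request_type": "proxy"}' \
  http://localhost:8080/api/request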

Use Monitoring Overview and Load Testing to confirm whether the bottleneck is in AxonFlow or in a downstream dependency.

AWS and Managed-Infrastructure Notes

If you run AxonFlow behind ECS, Kubernetes, an ALB, or another platform layer, map the same checks onto your environment:

  • service or container health instead of docker compose ps
  • platform logs instead of local compose logs
  • the same /health and /prometheus endpoints behind service discovery or ingress
  • the same AXONFLOW_LICENSE_KEY and provider env vars through your secret manager

The troubleshooting logic in this public runbook still applies even when the deployment topology changes.
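
On Kubernetes, for example, a port-forward lets you run the same local checks (the namespace and service name below are placeholders for your own manifests):

kubectl -n axonflow port-forward svc/axonflow-agent 8080:8080 &
curl -sf http://localhost:8080/health | jq .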

When Troubleshooting Becomes an Upgrade Signal

Community is enough for serious engineering review and for many production use cases. But repeated operational pain in any of these areas is usually a signal that teams should consider Evaluation or Enterprise:

  • tighter governance workflows
  • higher limits for policies, providers, and workflow history
  • richer approval and evidence workflows
  • broader connector coverage
  • protected deployment and operations guidance