Troubleshooting

This guide is the public operational runbook for AxonFlow community and self-hosted deployments. It focuses on the runtime surfaces engineers actually have in front of them today: the Agent on :8080, the Orchestrator on :8081, PostgreSQL, Redis, Prometheus, and Grafana.

If you are running an enterprise package behind AWS, Kubernetes, ECS, or another managed ingress layer, the same diagnostics still apply. The only thing that changes is how you reach the services and collect logs.

Start With the Fastest Checks

Run these first before diving into policy details or connector code:

# Health checks
curl -sf http://localhost:8080/health | jq .
curl -sf http://localhost:8081/health | jq .

# Local container status
docker compose ps

# Recent logs
docker compose logs --tail=150 agent
docker compose logs --tail=150 orchestrator

# Prometheus metrics endpoints
curl -sf http://localhost:8080/prometheus | head
curl -sf http://localhost:8081/prometheus | head

# Grafana health
curl -sf http://localhost:3000/api/health | jq .

If the platform is fronted by a reverse proxy or load balancer, you may also expose convenience paths like /orchestrator/health. The default self-hosted runtime uses direct service ports, so http://localhost:8080/health and http://localhost:8081/health are the canonical checks.
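
For example, if your ingress maps the Orchestrator under a path prefix, the same check runs through the proxy (the hostname and /orchestrator prefix below are placeholders for your own configuration):

curl -sf https://axonflow.example.com/orchestrator/health | jq .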

Troubleshooting Flow

Use this sequence to narrow down the failure domain quickly:

Is the health endpoint reachable?
├── NO → Check process/container status, port bindings, reverse proxy config, and firewall rules
└── YES → Does /health report "healthy"?
    ├── NO → Identify whether the failure is in PostgreSQL, Redis, provider bootstrap, or internal service auth
    └── YES → Are requests still failing?
        ├── LLM requests → Check provider credentials, routing config, and Orchestrator logs
        ├── MCP requests → Check connector registration, auth, and policy blocks
        ├── Workflow requests → Check Orchestrator state, execution history, and WCP endpoints
        └── Audit/usage gaps → Check Agent request path and persistence health
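
The first two questions in this tree are easy to script. A minimal sketch, assuming the health JSON carries a top-level status field (field names may differ in your build):

for port in 8080 8081; do
  if ! curl -sf "http://localhost:$port/health" >/dev/null; then
    echo "port $port: unreachable, check containers, port bindings, proxy, firewall"
  else
    # assumes a top-level "status" field in the health JSON
    echo "port $port: $(curl -s "http://localhost:$port/health" | jq -r '.status // "unknown"')"
  fi
done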

Deployment Startup Problems

Symptoms

  • docker compose up never reaches healthy containers
  • Agent or Orchestrator exits immediately
  • one service keeps restarting while the other stays up

What to check

docker compose ps
docker compose logs --tail=200 agent
docker compose logs --tail=200 orchestrator
docker compose logs --tail=100 postgres
docker compose logs --tail=100 redis

Common causes:

| Failure mode | What it usually means | Recovery path |
| --- | --- | --- |
| Agent unhealthy, Orchestrator healthy | Agent cannot reach PostgreSQL, Redis, or internal config | Check database URL, Redis connectivity, policy/config startup errors |
| Orchestrator unhealthy, Agent healthy | LLM bootstrap or runtime-config initialization failed | Check provider env vars, routing config, and Orchestrator logs |
| Both unhealthy | Shared dependency problem | Start with PostgreSQL, Redis, secrets, and compose env |
| Health passes but requests fail | Request-path issue, not a startup issue | Move to the request diagnostics below |
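
When both services look unhealthy, test the shared dependencies directly. A quick sketch, assuming the compose services are named postgres and redis as in the log commands above (adjust the database user to match your setup):

# expect "accepting connections" from pg_isready and "PONG" from redis-cli
docker compose exec postgres pg_isready -U postgres
docker compose exec redis redis-cli ping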

Health Endpoint Fails or Stays Degraded

The Agent and Orchestrator both expose /health. In self-hosted community deployments those are the fastest indicators of whether the process is merely alive or actually ready.

Useful checks:

curl -sf http://localhost:8080/health | jq .
curl -sf http://localhost:8081/health | jq .
curl -sf http://localhost:8080/prometheus | grep -E "http|request|policy" | head
curl -sf http://localhost:8081/prometheus | grep -E "http|workflow|llm" | head

If /health is failing:

  • verify PostgreSQL is reachable and writable
  • verify Redis is up if your workflow or rate-limit path depends on it
  • verify AXONFLOW_INTERNAL_SERVICE_SECRET is identical on Agent and Orchestrator when you set it (see the comparison below)
  • verify provider configuration for any proxy-mode or MAP flow that needs an LLM
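
To compare the internal service secret on both sides, one option is a direct diff (a sketch; the service names follow the compose commands earlier in this guide):

# -T disables the pseudo-TTY so no carriage returns sneak into the output;
# no diff output means the values are identical
diff \
  <(docker compose exec -T agent printenv AXONFLOW_INTERNAL_SERVICE_SECRET) \
  <(docker compose exec -T orchestrator printenv AXONFLOW_INTERNAL_SERVICE_SECRET) \
  && echo "secrets match"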

Requests Fail Even Though Health Is Green

The canonical client request path in the current runtime is:

POST /api/request

If older documentation or scripts still point to /api/v1/agent/execute, treat that as stale.
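
To find lingering references in your own automation, a plain search is enough (the paths below are placeholders for wherever your client scripts live):

grep -rn 'api/v1/agent/execute' ./scripts ./docs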

Check the request path directly

curl -s http://localhost:8080/api/request \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "hello",
    "client_id": "test-client",
    "request_type": "proxy"
  }' | jq .

Look for three broad classes of failure:

| Symptom | Usually points to | What to inspect |
| --- | --- | --- |
| 4xx response | Auth, policy, request-shape, or tier limit issue | Agent logs, policy config, request body |
| 5xx response | Provider, connector, orchestration, or persistence failure | Both service logs, Prometheus, dependency health |
| Long latency / timeout | Provider or connector path, not usually the health path | Orchestrator logs, provider routing, connector latency |
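
Whichever class you hit, correlate the failing request with recent logs on both services. For example:

docker compose logs --since=5m agent | grep -iE 'error|warn'
docker compose logs --since=5m orchestrator | grep -iE 'error|warn'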

MCP Connector Problems

Connector issues are often confused with generic “platform” failures. Separate them early.

Useful checks:

curl -sf http://localhost:8080/mcp/health | jq .
curl -sf http://localhost:8080/mcp/connectors/postgresql/health | jq .

When connector traffic fails:

  • verify the connector is actually registered in your current edition/build
  • verify credentials and network reachability to the target system
  • verify request blocking is not coming from system or tenant policies
  • verify MCP response limits such as MCP_MAX_ROWS_PER_QUERY and MCP_MAX_BYTES_PER_QUERY
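
To see which MCP limits are actually in effect, one option is to read them from the Agent environment (a sketch, assuming the limits are set as environment variables on the agent service):

docker compose exec agent printenv | grep -E '^MCP_'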

If you are debugging a policy block, compare the connector type and operation you are calling with the system and tenant policies configured for your deployment.

License and Tier Problems

Community runs without a license key. Evaluation and Enterprise rely on AXONFLOW_LICENSE_KEY.

Quick checks:

echo "$AXONFLOW_LICENSE_KEY" | cut -c1-5
docker compose exec agent printenv AXONFLOW_LICENSE_KEY | cut -c1-5
docker compose exec orchestrator printenv AXONFLOW_LICENSE_KEY | cut -c1-5

Typical failure patterns:

  • key is set in your shell but not passed into the container
  • key contains whitespace or newline corruption (see the byte-level check below)
  • expired Evaluation or Enterprise key causes graceful fallback to Community limits
  • the runtime is healthy, but a higher-tier feature now returns a limit or entitlement error
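
To detect whitespace or newline corruption at the byte level, dump the value inside the container (a sketch; -T keeps the TTY from adding carriage returns of its own):

# printenv appends exactly one trailing \n; any \r, extra \n, or spaces
# before it indicate a corrupted key
docker compose exec -T agent printenv AXONFLOW_LICENSE_KEY | od -c | tail -3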

For exact limits and upgrade guidance, see License Management.

Latency and Throughput Problems

AxonFlow performance issues usually come from one of four places:

  • LLM provider latency
  • connector/database latency
  • excessive policy complexity
  • undersized Agent or Orchestrator capacity

Start here:

curl -sf http://localhost:8080/prometheus | grep -i request | head -20
curl -sf http://localhost:8081/prometheus | grep -Ei 'workflow|llm|provider|execution' | head -20

Questions to ask:

  • Is the slowdown visible on the Agent, the Orchestrator, or both?
  • Is the problem only on proxy-mode traffic, only on gateway-mode traffic, or only on MCP?
  • Did latency spike after a policy change, a new connector rollout, or a provider switch?
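
A crude end-to-end latency probe against the Agent request path, reusing the proxy-mode body from earlier and curl's built-in timers:

curl -s -o /dev/null \
  -w 'total: %{time_total}s  connect: %{time_connect}s\n' \
  -H 'Content-Type: application/json' \
  -d '{"query": "hello", "client_id": "test-client", "request_type": "proxy"}' \
  http://localhost:8080/api/request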

Use Monitoring Overview and Load Testing to confirm whether the bottleneck is in AxonFlow or in a downstream dependency.

AWS and Managed-Infrastructure Notes

If you run AxonFlow behind ECS, Kubernetes, an ALB, or another platform layer, map the same checks onto your environment:

  • service or container health instead of docker compose ps
  • platform logs instead of local compose logs
  • the same /health and /prometheus endpoints behind service discovery or ingress
  • the same AXONFLOW_LICENSE_KEY and provider env vars through your secret manager

The troubleshooting logic in this public runbook still applies even when the deployment topology changes.
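
On Kubernetes, for example, a port-forward lets you run the same local checks (the namespace and service name below are placeholders for your own manifests):

kubectl -n axonflow port-forward svc/axonflow-agent 8080:8080 &
curl -sf http://localhost:8080/health | jq .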

When Troubleshooting Becomes an Upgrade Signal

Community is enough for serious engineering review and for many production use cases. But repeated operational pain in any of these areas is usually a signal that teams should consider Evaluation or Enterprise:

  • tighter governance workflows
  • higher limits for policies, providers, and workflow history
  • richer approval and evidence workflows
  • broader connector coverage
  • protected deployment and operations guidance