Runtime Troubleshooting

This page focuses on incidents that happen after the platform starts successfully. The services may be healthy, the containers may be running, and yet engineers still see request failures, stalled workflows, missing audit trails, or latency spikes. That is the runtime troubleshooting problem space.

For deployment and bootstrap failures, see Troubleshooting. Use this page when the runtime is already up and the problem only appears under real traffic.

Fast Runtime Triage

# Health and readiness
curl -sf http://localhost:8080/health | jq .   # agent
curl -sf http://localhost:8081/health | jq .   # orchestrator

# Recent logs
docker compose logs --tail=200 agent
docker compose logs --tail=200 orchestrator

# Metrics endpoints
curl -sf http://localhost:8080/prometheus | head
curl -sf http://localhost:8081/prometheus | head

When you look at logs, separate the failure domain:

  • Agent problems affect request acceptance, system-policy enforcement, MCP, audit, and gateway-mode flows
  • Orchestrator problems affect proxy-mode execution, provider routing, tenant policies, MAP, and workflows
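
A quick way to localize the domain is to filter each service's structured logs for error-level entries. A minimal sketch, assuming the JSON log format shown in the next section and that jq is installed:

# Errors from each service; fromjson? skips any non-JSON lines
docker compose logs --no-log-prefix --tail=500 agent | jq -cR 'fromjson? | select(.level=="error")'
docker compose logs --no-log-prefix --tail=500 orchestrator | jq -cR 'fromjson? | select(.level=="error")'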

Log Format

AxonFlow services emit structured JSON logs:

{"level":"info","ts":"2026-03-30T12:00:00Z","caller":"agent/run.go:123","msg":"request processed","request_id":"req_abc","duration_ms":4}

Common Log Patterns

Message | Meaning | Action
license validated | License check passed on startup | Normal
database connection failed | Cannot reach PostgreSQL | Check DB host, credentials, network
orchestrator unreachable | Agent cannot reach orchestrator | Check orchestrator health, network between services
connector health check failed | MCP connector down | Check connector config and target system
policy evaluation failed | Error during policy check | Check policy syntax and database
rate limit exceeded | Client exceeded configured rate | Normal throttling behavior
circuit breaker tripped | Auto-trip threshold reached | Check provider health, review error rate
request blocked | Policy denied the request | Normal governance behavior
context deadline exceeded | Request timed out | Check LLM provider latency, increase timeout
too many open connections | DB pool exhausted | Increase max_open_conns or reduce concurrency
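
When one of these patterns recurs, a rough frequency count helps separate a one-off from a trend. A sketch, assuming the msg field from the log format above:

# Most frequent log messages in the recent window
docker compose logs --no-log-prefix --tail=2000 agent | jq -rR 'fromjson? | .msg // empty' | sort | uniq -c | sort -rn | head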

Exit Codes

Code | Meaning
0 | Clean shutdown
1 | Startup error (config, database, license)
137 | OOM killed (container memory limit exceeded)
139 | Segfault (report as a bug)
143 | SIGTERM received (normal container shutdown)
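
To read these codes off a stopped container, docker inspect exposes both the exit code and whether the kernel OOM-killed the process. A sketch; find the container name with docker compose ps -a first:

# Exit code and OOM flag for a stopped container
docker inspect --format '{{.State.ExitCode}} oom={{.State.OOMKilled}}' <container-name>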

Request Path Failures

The current client-facing request endpoint is:

POST /api/request

If traffic is failing, reproduce it with a minimal request first:

curl -s http://localhost:8080/api/request \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "Summarize the latest customer ticket",
    "client_id": "ops-debug",
    "request_type": "proxy"
  }' | jq .

Then classify the failure:

Failure type | What it usually means | First place to look
blocked response | policy decision, not platform outage | Agent logs, policy docs, tenant policy state
provider error | LLM credentials, model access, or routing problem | Orchestrator logs and provider configuration
timeout | downstream provider, connector, or database slowness | Prometheus, dependency health, retry behavior
malformed request | stale SDK example or incorrect payload | current SDK docs and request body
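
When classifying, it helps to capture the HTTP status separately from the body, since a blocked response and a provider error can look similar in a terminal. A minimal sketch against the same endpoint:

# Print the status code on its own line, then inspect the body
curl -s -o /tmp/axonflow-resp.json -w '%{http_code}\n' http://localhost:8080/api/request \
  -H 'Content-Type: application/json' \
  -d '{"query":"ping","client_id":"ops-debug","request_type":"proxy"}'
jq . /tmp/axonflow-resp.json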

Service Restart Loops

When a service keeps restarting, the exit pattern matters more than the fact of the restart.

For local and self-hosted environments:

docker compose ps
docker compose logs --tail=200 agent
docker compose logs --tail=200 orchestrator

Common signals:

Signal | Likely root cause
OOM or exit 137 | memory sizing is too small for the current workload
startup panic | bad env var, missing secret, or provider bootstrap failure
repeated database errors | PostgreSQL connectivity, migration, or pool pressure
internal auth failures | mismatched AXONFLOW_INTERNAL_SERVICE_SECRET values
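
For the internal auth case, you can confirm both services hold the same secret without printing it, by comparing hashes. A sketch, assuming the compose services are named agent and orchestrator:

# The two hashes should match; the secret itself is never echoed
docker compose exec agent sh -c 'printenv AXONFLOW_INTERNAL_SERVICE_SECRET | sha256sum'
docker compose exec orchestrator sh -c 'printenv AXONFLOW_INTERNAL_SERVICE_SECRET | sha256sum'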

Workflow and MAP Issues

AxonFlow multi-agent planning is not a single opaque “magic” call. The workflow path involves plan generation, execution, state tracking, and UI or API visibility.

Symptoms to watch for:

  • plan generation succeeds but execution never starts
  • execution starts but stalls between steps
  • UI or API cannot see expected execution history
  • WCP actions succeed intermittently but replay or debugging fails

What to inspect:

  • Orchestrator logs first, not just Agent logs
  • workflow history and execution records
  • tenant policy or approval logic that may be intentionally blocking a step
  • provider routing configuration for the plan-generation stage
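
One low-effort starting point is to pull only workflow-related entries from the Orchestrator logs. A sketch; the message pattern is an assumption, so adjust it to what your build actually logs:

# Workflow-related log lines from the orchestrator
docker compose logs --no-log-prefix orchestrator | jq -cR 'fromjson? | select((.msg // "") | test("plan|workflow|execution"; "i"))'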

MCP Connector Runtime Failures

If LLM-only requests work but tool calls fail, do not debug them as generic request failures. Debug the connector path directly.

Useful checks:

curl -sf http://localhost:8080/mcp/health | jq .
curl -sf http://localhost:8080/mcp/connectors/postgresql/health | jq .

Connector failures usually come from one of these areas:

  • connector is not available in the current build or tier
  • database or SaaS credentials are wrong
  • network path to the target system is blocked
  • system or tenant policies block either input or output
  • response limits such as MCP_MAX_ROWS_PER_QUERY or MCP_MAX_BYTES_PER_QUERY are being hit
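
For the response-limit case, confirm what the running Agent actually has configured. A sketch, assuming the limits are set as environment variables on the agent service:

# Current MCP response limits, if set
docker compose exec agent sh -c 'printenv | grep -E "^MCP_MAX_(ROWS|BYTES)_PER_QUERY" || echo "limits not set"'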

The important operational habit is to check whether the connector is unavailable, unhealthy, or intentionally blocked. Those are different problems with different fixes.

Audit, Usage, and History Problems

If engineers say “the request worked but I cannot see it in AxonFlow,” treat that as a persistence or observability issue.

Check:

  • Agent logs for request acceptance on /api/request
  • PostgreSQL health and write capacity
  • whether the request path used AxonFlow at all, or bypassed it
  • execution history and audit retention limits for the active tier
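
To rule out the database quickly, check that PostgreSQL is accepting connections. A sketch, assuming the compose service and role are both named postgres (adjust to your deployment):

# Is PostgreSQL accepting connections?
docker compose exec postgres pg_isready -U postgres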

This matters for regulated and high-accountability workflows because missing audit evidence is often treated as a platform failure even when the response itself succeeded.

Provider and Routing Problems

Provider incidents commonly look like platform bugs. Confirm the basics before changing policy or workflow code.

Review:

  • DEFAULT_LLM_PROVIDER
  • LLM_ROUTING_STRATEGY
  • PROVIDER_WEIGHTS
  • provider credentials such as OPENAI_API_KEY, ANTHROPIC_API_KEY, AZURE_OPENAI_API_KEY, or Bedrock-related runtime config in enterprise deployments
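
A quick way to review all of these at once is to read them off the running Orchestrator. A sketch, assuming the settings are environment variables on the orchestrator service; key values are reported as present or missing rather than printed:

# Routing settings, plus whether each provider key is present
docker compose exec orchestrator sh -c '
  printenv | grep -E "^(DEFAULT_LLM_PROVIDER|LLM_ROUTING_STRATEGY|PROVIDER_WEIGHTS)="
  for k in OPENAI_API_KEY ANTHROPIC_API_KEY AZURE_OPENAI_API_KEY; do
    if [ -n "$(printenv $k)" ]; then echo "$k: set"; else echo "$k: missing"; fi
  done'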

If no providers bootstrapped successfully, proxy-mode and MAP flows will degrade quickly even though health endpoints may still be green.

Performance Incidents

For performance debugging, the key question is where time is being spent:

  • Agent-side policy evaluation
  • Orchestrator-side routing or workflow orchestration
  • provider latency
  • connector latency
  • database or Redis contention

Use /prometheus on both services and compare:

  • request volume
  • error growth
  • latency buckets
  • workflow or execution behavior
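
Histogram series are usually the most useful comparison point. A sketch that pulls just the histogram samples from each service, relying on the standard Prometheus _bucket/_sum/_count naming convention (exact metric names vary by build):

# Latency histogram samples from both services
curl -sf http://localhost:8080/prometheus | grep -E '_(bucket|sum|count)([{ ])' | head -20
curl -sf http://localhost:8081/prometheus | grep -E '_(bucket|sum|count)([{ ])' | head -20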

The goal is to determine whether AxonFlow is the bottleneck or whether it is accurately surfacing latency from a downstream provider or tool.

Recovery Priorities for Production Teams

When an incident is active, recover in this order:

  1. restore health on Agent and Orchestrator
  2. restore core request flow on /api/request
  3. restore audit visibility and execution history
  4. restore connector and workflow features
  5. tune routing, budgets, and policy behavior after the platform is stable

This ordering helps teams avoid spending time on tuning when the real issue is basic request-path health.