Runtime Troubleshooting

This page focuses on incidents that happen after the platform starts successfully. The services may be healthy, the containers may be running, and yet engineers still see request failures, stalled workflows, missing audit trails, or latency spikes. That is the runtime troubleshooting problem space.

For deployment and bootstrap failures, see Troubleshooting. Use this page when the runtime is already up and the problem only appears under real traffic.

Fast Runtime Triage

# Health and readiness
curl -sf http://localhost:8080/health | jq .   # agent
curl -sf http://localhost:8081/health | jq .   # orchestrator

# Recent logs
docker compose logs --tail=200 agent
docker compose logs --tail=200 orchestrator

# Metrics endpoints
curl -sf http://localhost:8080/prometheus | head
curl -sf http://localhost:8081/prometheus | head

When you look at logs, separate the failure domain:

  • Agent problems affect request acceptance, system-policy enforcement, MCP, audit, and gateway-mode flows
  • Orchestrator problems affect proxy-mode execution, provider routing, tenant policies, MAP, and workflows
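
A quick way to localize the domain is to filter each service's structured logs for error-level entries. A minimal sketch, assuming the JSON log format shown in the next section and that jq is installed:

# Errors from each service; fromjson? skips any non-JSON lines
docker compose logs --no-log-prefix --tail=500 agent | jq -cR 'fromjson? | select(.level=="error")'
docker compose logs --no-log-prefix --tail=500 orchestrator | jq -cR 'fromjson? | select(.level=="error")'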

Log Format

AxonFlow services emit structured JSON logs:

{"level":"info","ts":"2026-03-30T12:00:00Z","caller":"agent/run.go:123","msg":"request processed","request_id":"req_abc","duration_ms":4}

Common Log Patterns

Message | Meaning | Action
license validated | License check passed on startup | Normal
database connection failed | Cannot reach PostgreSQL | Check DB host, credentials, network
orchestrator unreachable | Agent cannot reach orchestrator | Check orchestrator health, network between services
connector health check failed | MCP connector down | Check connector config and target system
policy evaluation failed | Error during policy check | Check policy syntax and database
rate limit exceeded | Client exceeded configured rate | Normal throttling behavior
circuit breaker tripped | Auto-trip threshold reached | Check provider health, review error rate
request blocked | Policy denied the request | Normal governance behavior
context deadline exceeded | Request timed out | Check LLM provider latency, increase timeout
too many open connections | DB pool exhausted | Increase max_open_conns or reduce concurrency
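
When one of these patterns recurs, a rough frequency count helps separate a one-off from a trend. A sketch, assuming the msg field from the log format above:

# Most frequent log messages in the recent window
docker compose logs --no-log-prefix --tail=2000 agent | jq -rR 'fromjson? | .msg // empty' | sort | uniq -c | sort -rn | head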

Exit Codes

Code | Meaning
0 | Clean shutdown
1 | Startup error (config, database, license)
137 | OOM killed (container memory limit exceeded)
139 | Segfault (report as a bug)
143 | SIGTERM received (normal container shutdown)
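
To read these codes off a stopped container, docker inspect exposes both the exit code and whether the kernel OOM-killed the process. A sketch; find the container name with docker compose ps -a first:

# Exit code and OOM flag for a stopped container
docker inspect --format '{{.State.ExitCode}} oom={{.State.OOMKilled}}' <container-name>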

Request Path Failures

The current client-facing request endpoint is:

POST /api/request

If traffic is failing, reproduce it with a minimal request first:

curl -s http://localhost:8080/api/request \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "Summarize the latest customer ticket",
    "client_id": "ops-debug",
    "request_type": "proxy"
  }' | jq .

Then classify the failure:

Failure type | What it usually means | First place to look
blocked response | policy decision, not platform outage | Agent logs, policy docs, tenant policy state
provider error | LLM credentials, model access, or routing problem | Orchestrator logs and provider configuration
timeout | downstream provider, connector, or database slowness | Prometheus, dependency health, retry behavior
malformed request | stale SDK example or incorrect payload | current SDK docs and request body
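
When classifying, it helps to capture the HTTP status separately from the body, since a blocked response and a provider error can look similar in a terminal. A minimal sketch against the same endpoint:

# Print the status code on its own line, then inspect the body
curl -s -o /tmp/axonflow-resp.json -w '%{http_code}\n' http://localhost:8080/api/request \
  -H 'Content-Type: application/json' \
  -d '{"query":"ping","client_id":"ops-debug","request_type":"proxy"}'
jq . /tmp/axonflow-resp.json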

Service Restart Loops

When a service keeps restarting, the exit pattern matters more than the fact of the restart.

For local and self-hosted environments:

docker compose ps
docker compose logs --tail=200 agent
docker compose logs --tail=200 orchestrator

Common signals:

Signal | Likely root cause
OOM or exit 137 | memory sizing is too small for the current workload
startup panic | bad env var, missing secret, or provider bootstrap failure
repeated database errors | PostgreSQL connectivity, migration, or pool pressure
internal auth failures | mismatched AXONFLOW_INTERNAL_SERVICE_SECRET values
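
For the internal auth case, you can confirm both services hold the same secret without printing it, by comparing hashes. A sketch, assuming the compose services are named agent and orchestrator:

# The two hashes should match; the secret itself is never echoed
docker compose exec agent sh -c 'printenv AXONFLOW_INTERNAL_SERVICE_SECRET | sha256sum'
docker compose exec orchestrator sh -c 'printenv AXONFLOW_INTERNAL_SERVICE_SECRET | sha256sum'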

Workflow and MAP Issues

AxonFlow multi-agent planning is not a single opaque “magic” call. The workflow path involves plan generation, execution, state tracking, and UI or API visibility.

Symptoms to watch for:

  • plan generation succeeds but execution never starts
  • execution starts but stalls between steps
  • UI or API cannot see expected execution history
  • WCP actions succeed intermittently but replay or debugging fails

What to inspect:

  • Orchestrator logs first, not just Agent logs
  • workflow history and execution records
  • tenant policy or approval logic that may be intentionally blocking a step
  • provider routing configuration for the plan-generation stage
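
One low-effort starting point is to pull only workflow-related entries from the Orchestrator logs. A sketch; the message pattern is an assumption, so adjust it to what your build actually logs:

# Workflow-related log lines from the orchestrator
docker compose logs --no-log-prefix orchestrator | jq -cR 'fromjson? | select((.msg // "") | test("plan|workflow|execution"; "i"))'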

MCP Connector Runtime Failures

If LLM-only requests work but tool calls fail, do not debug them as generic request failures. Debug the connector path directly.

Useful checks:

curl -sf http://localhost:8080/mcp/health | jq .
curl -sf http://localhost:8080/mcp/connectors/postgresql/health | jq .

Connector failures usually come from one of these areas:

  • connector is not available in the current build or tier
  • database or SaaS credentials are wrong
  • network path to the target system is blocked
  • system or tenant policies block either input or output
  • response limits such as MCP_MAX_ROWS_PER_QUERY or MCP_MAX_BYTES_PER_QUERY are being hit
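
For the response-limit case, confirm what the running Agent actually has configured. A sketch, assuming the limits are set as environment variables on the agent service:

# Current MCP response limits, if set
docker compose exec agent sh -c 'printenv | grep -E "^MCP_MAX_(ROWS|BYTES)_PER_QUERY" || echo "limits not set"'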

The important operational habit is to check whether the connector is unavailable, unhealthy, or intentionally blocked. Those are different problems with different fixes.

Audit, Usage, and History Problems

If engineers say “the request worked but I cannot see it in AxonFlow,” treat that as a persistence or observability issue.

Check:

  • Agent logs for request acceptance on /api/request
  • PostgreSQL health and write capacity
  • whether the request path used AxonFlow at all, or bypassed it
  • execution history and audit retention limits for the active tier
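
To rule out the database quickly, check that PostgreSQL is accepting connections. A sketch, assuming the compose service and role are both named postgres (adjust to your deployment):

# Is PostgreSQL accepting connections?
docker compose exec postgres pg_isready -U postgres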

This matters for regulated and high-accountability workflows because missing audit evidence is often treated as a platform failure even when the response itself succeeded.

Provider and Routing Problems

Provider incidents commonly look like platform bugs. Confirm the basics before changing policy or workflow code.

Review:

  • DEFAULT_LLM_PROVIDER
  • LLM_ROUTING_STRATEGY
  • PROVIDER_WEIGHTS
  • provider credentials such as OPENAI_API_KEY, ANTHROPIC_API_KEY, AZURE_OPENAI_API_KEY, or Bedrock-related runtime config in enterprise deployments
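
A quick way to review all of these at once is to read them off the running Orchestrator. A sketch, assuming the settings are environment variables on the orchestrator service; key values are reported as present or missing rather than printed:

# Routing settings, plus whether each provider key is present
docker compose exec orchestrator sh -c '
  printenv | grep -E "^(DEFAULT_LLM_PROVIDER|LLM_ROUTING_STRATEGY|PROVIDER_WEIGHTS)="
  for k in OPENAI_API_KEY ANTHROPIC_API_KEY AZURE_OPENAI_API_KEY; do
    if [ -n "$(printenv $k)" ]; then echo "$k: set"; else echo "$k: missing"; fi
  done'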

If no providers bootstrapped successfully, proxy-mode and MAP flows will degrade quickly even though health endpoints may still be green.

Performance Incidents

For performance debugging, the key question is where time is being spent:

  • Agent-side policy evaluation
  • Orchestrator-side routing or workflow orchestration
  • provider latency
  • connector latency
  • database or Redis contention

Use /prometheus on both services and compare:

  • request volume
  • error growth
  • latency buckets
  • workflow or execution behavior
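
Histogram series are usually the most useful comparison point. A sketch that pulls just the histogram samples from each service, relying on the standard Prometheus _bucket/_sum/_count naming convention (exact metric names vary by build):

# Latency histogram samples from both services
curl -sf http://localhost:8080/prometheus | grep -E '_(bucket|sum|count)([{ ])' | head -20
curl -sf http://localhost:8081/prometheus | grep -E '_(bucket|sum|count)([{ ])' | head -20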

The goal is to determine whether AxonFlow is the bottleneck or whether it is accurately surfacing latency from a downstream provider or tool.

Recovery Priorities for Production Teams

When an incident is active, recover in this order:

  1. restore health on Agent and Orchestrator
  2. restore core request flow on /api/request
  3. restore audit visibility and execution history
  4. restore connector and workflow features
  5. tune routing, budgets, and policy behavior after the platform is stable

This ordering helps teams avoid spending time on tuning when the real issue is basic request-path health.