Runtime Troubleshooting

This guide covers runtime issues that may occur after successful deployment. For deployment-specific issues, see the Deployment Troubleshooting guide.

Quick Diagnostics

Run these commands first to assess system state:

# Check all services health
curl -sf https://YOUR_ENDPOINT/health | jq .
curl -sf https://YOUR_ENDPOINT/orchestrator/health | jq .

# Check recent errors (last 30 min)
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "ERROR" \
--start-time $(date -d '30 minutes ago' +%s000) \
--region YOUR_REGION

Service Issues

Service Becomes Unhealthy

Symptoms:

  • Health endpoint returns non-200
  • ALB showing unhealthy targets
  • Intermittent request failures

Diagnosis:

# Check current health
curl -sf https://YOUR_ENDPOINT/health

# Check ALB target health
aws elbv2 describe-target-health \
--target-group-arn YOUR_TG_ARN \
--region YOUR_REGION

# Check ECS task status
aws ecs describe-services \
--cluster YOUR_CLUSTER \
--services agent \
--query 'services[0].{Running:runningCount,Desired:desiredCount,Events:events[0:3]}'

Common Causes & Solutions:

| Cause | Indicators | Solution |
|---|---|---|
| Memory pressure | OOMKilled in logs | Increase task memory |
| Database connection pool exhausted | "too many connections" | Reduce pool size or add replicas |
| Downstream dependency down | Timeout errors | Check dependent services |
| Resource starvation | High CPU in metrics | Scale up or out |
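
When triaging, the HTTP status returned by /health usually narrows the cause before you dig into logs. A minimal sketch; the `triage_health` helper and its status-to-action mapping are assumptions based on common ALB setups, not part of AxonFlow:

```shell
# Hypothetical helper: map the HTTP status from /health to a next step.
# The mapping below is an assumption, not an AxonFlow-defined contract.
triage_health() {
  case "$1" in
    200) echo "healthy" ;;
    503) echo "unhealthy: check dependencies and connection pools" ;;
    504) echo "timeout: check downstream services and ALB idle timeout" ;;
    000) echo "unreachable: check DNS, security groups, task status" ;;
    *)   echo "unexpected status $1: inspect service logs" ;;
  esac
}

# Usage: capture only the status code (curl prints 000 on connection failure):
#   status=$(curl -s -o /dev/null -w '%{http_code}' https://YOUR_ENDPOINT/health)
#   triage_health "$status"
```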

Service Restart Loop

Symptoms:

  • Tasks keep restarting
  • ECS events show repeated task stops
  • Circuit breaker triggered

Diagnosis:

# Check stopped tasks
aws ecs list-tasks \
--cluster YOUR_CLUSTER \
--service-name agent \
--desired-status STOPPED \
--region YOUR_REGION

# Get stop reason
TASK_ARN=$(aws ecs list-tasks --cluster YOUR_CLUSTER --service-name agent --desired-status STOPPED --query 'taskArns[0]' --output text)
aws ecs describe-tasks \
--cluster YOUR_CLUSTER \
--tasks $TASK_ARN \
--query 'tasks[0].{Status:lastStatus,Reason:stoppedReason,Code:containers[0].exitCode}'

Common Exit Codes:

| Code | Meaning | Action |
|---|---|---|
| 0 | Normal exit | Check if intended (scaling down?) |
| 1 | Application error | Check logs for startup errors |
| 137 | OOMKilled | Increase memory allocation |
| 139 | Segfault | Check for memory corruption, update image |
| 143 | SIGTERM | Normal shutdown (deployment/scaling) |
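
Exit codes above 128 encode 128 plus the fatal signal number (137 = 128+SIGKILL, 139 = 128+SIGSEGV, 143 = 128+SIGTERM). A sketch that turns a stopped container's exit code into the action above; `explain_exit` is a hypothetical helper, not an AxonFlow tool:

```shell
# Hypothetical helper: map a container exit code to its likely cause.
explain_exit() {
  case "$1" in
    0)   echo "normal exit (intended? scaling down?)" ;;
    1)   echo "application error: check logs for startup errors" ;;
    137) echo "OOMKilled (128+9, SIGKILL): increase memory allocation" ;;
    139) echo "segfault (128+11): check for memory corruption, update image" ;;
    143) echo "SIGTERM (128+15): normal shutdown during deployment or scaling" ;;
    *)   echo "exit code $1: see container logs" ;;
  esac
}

# Usage, feeding in the exit code from the describe-tasks query above:
#   explain_exit "$(aws ecs describe-tasks ... --query 'tasks[0].containers[0].exitCode' --output text)"
```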

Request Handling Issues

Requests Timing Out

Symptoms:

  • 504 Gateway Timeout errors
  • Requests taking > 30 seconds
  • Clients receiving timeout errors

Diagnosis:

# Check latency metrics
curl -s https://YOUR_ENDPOINT/metrics | grep request_duration

# Check for slow queries in logs
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "duration_ms" \
--region YOUR_REGION | jq -r '.events[].message' | sort -t: -k2 -n | tail -20

Solutions:

  1. Identify slow component:

    • Database queries → Add indexes, optimize queries
    • External connectors → Increase timeout, add retry
    • Policy evaluation → Simplify policies
  2. Adjust timeouts:

    # ALB idle timeout (default 60s)
    aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn YOUR_ALB_ARN \
    --attributes Key=idle_timeout.timeout_seconds,Value=120
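
Before raising any timeout, quantify where latency actually sits. A sketch computing a nearest-rank p95 from duration values extracted from logs, using only sort and awk; `p95` is a hypothetical helper:

```shell
# Hypothetical helper: nearest-rank 95th percentile of numbers on stdin.
p95() {
  sort -n | awk '{ a[NR] = $1 } END {
    if (NR == 0) { print 0; exit }
    idx = int(NR * 0.95)
    if (idx < NR * 0.95) idx = idx + 1   # ceil for nearest-rank
    if (idx < 1) idx = 1
    print a[idx]
  }'
}

# Usage: pipe in duration_ms values pulled from the filtered log events, e.g.
#   ... | jq -r '.events[].message' | grep -o 'duration_ms=[0-9.]*' | cut -d= -f2 | p95
```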

High Error Rate

Symptoms:

  • 5xx errors spiking
  • Error rate > 1%
  • Alerts firing

Diagnosis:

# Get error breakdown from logs
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "level=error" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION \
| jq -r '.events[].message' | sort | uniq -c | sort -rn | head -20

# Check error rate in metrics
curl -s https://YOUR_ENDPOINT/metrics | grep -E "requests_total|errors_total"

Common Error Patterns:

| Error | Cause | Solution |
|---|---|---|
| "connection refused" | Downstream service down | Check dependent services |
| "context deadline exceeded" | Timeout | Increase timeout or optimize |
| "permission denied" | IAM/policy issue | Check permissions |
| "rate limit exceeded" | Too many requests | Implement backoff |

Database Issues

Connection Pool Exhausted

Symptoms:

  • "too many connections" errors
  • Intermittent database failures
  • Slow queries

Diagnosis:

# Check current connections (requires DB access)
psql -h YOUR_DB_ENDPOINT -U axonflow -c "SELECT count(*) FROM pg_stat_activity;"

# Check RDS metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=YOUR_DB \
--start-time $(date -u -d '1 hour ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--period 300 \
--statistics Maximum

Solutions:

  1. Reduce connection pool size per service:

    • Each Agent task uses ~10 connections
    • Scale down replicas or reduce pool size
  2. Upgrade database instance:

    • db.t3.micro: ~80 connections
    • db.t3.small: ~150 connections
    • db.t3.medium: ~400 connections
    • db.t3.large: ~800 connections
  3. Add connection pooler (PgBouncer):

    • For high-replica deployments
    • Reduces total connections to database
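
The sizing guidance above is simple arithmetic: total connections ≈ tasks × pool size per task, which must stay below the instance limit. A sketch of that check; `pool_fits` is a hypothetical helper, and the ~10 connections per Agent task figure comes from this guide:

```shell
# Hypothetical helper: does the fleet's connection demand fit the DB limit?
# Usage: pool_fits <tasks> <pool_per_task> <db_max_connections>
pool_fits() {
  total=$(( $1 * $2 ))
  if [ "$total" -lt "$3" ]; then
    echo "ok: $total of $3 connections"
  else
    echo "over: $total of $3 connections; reduce pool or upgrade instance"
  fi
}

# Example: 4 Agent tasks at ~10 connections each against db.t3.micro (~80):
#   pool_fits 4 10 80
```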

Slow Queries

Symptoms:

  • High database latency
  • P99 response times elevated
  • RDS CPU spikes

Diagnosis:

# Check RDS performance insights (if enabled)
aws pi get-resource-metrics \
--service-type RDS \
--identifier db-XXXXX \
--metric-queries file://pi-query.json

# Check slow query log
aws logs filter-log-events \
--log-group-name /aws/rds/instance/YOUR_DB/postgresql \
--filter-pattern "duration:" \
--region YOUR_REGION

Solutions:

  1. Add missing indexes
  2. Optimize queries - Check EXPLAIN plans
  3. Increase RDS instance size
  4. Enable query caching (if applicable)

Memory and CPU Issues

High Memory Usage

Symptoms:

  • OOMKilled containers
  • Memory metrics near limit
  • Performance degradation

Diagnosis:

# Check container memory
aws ecs describe-tasks \
--cluster YOUR_CLUSTER \
--tasks TASK_ARN \
--query 'tasks[0].containers[0].{Memory:memory,MemoryReservation:memoryReservation}'

# Check CloudWatch Container Insights (if enabled)
aws cloudwatch get-metric-statistics \
--namespace ECS/ContainerInsights \
--metric-name MemoryUtilized \
--dimensions Name=ClusterName,Value=YOUR_CLUSTER Name=ServiceName,Value=agent \
--start-time $(date -u -d '1 hour ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--period 300 \
--statistics Maximum

Solutions:

  1. Increase task memory:

    • Update task definition
    • Redeploy service
  2. Check for memory leaks:

    • Monitor memory growth over time
    • Check for goroutine leaks
  3. Optimize memory usage:

    • Reduce connection pool sizes
    • Limit concurrent requests

High CPU Usage

Symptoms:

  • CPU throttling
  • Increased latency
  • Slow request processing

Diagnosis:

# Check ECS CPU reservation vs utilization
aws cloudwatch get-metric-statistics \
--namespace AWS/ECS \
--metric-name CPUUtilization \
--dimensions Name=ClusterName,Value=YOUR_CLUSTER Name=ServiceName,Value=agent \
--start-time $(date -u -d '1 hour ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--period 300 \
--statistics Average,Maximum

Solutions:

  1. Scale horizontally:

    aws ecs update-service \
    --cluster YOUR_CLUSTER \
    --service agent \
    --desired-count 4
  2. Increase CPU allocation in task definition

  3. Profile application for CPU hotspots


Log Interpretation

AxonFlow uses structured JSON logging. Understanding common log patterns helps you quickly identify and resolve issues.

Log Format

Each log line is a JSON object with these standard fields:

{
"level": "error",
"ts": "2026-02-06T10:30:45.123Z",
"caller": "agent/handler.go:142",
"msg": "policy evaluation failed",
"request_id": "req-abc123",
"duration_ms": 12.5,
"error": "context deadline exceeded"
}
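
When jq is not available, a single string field can be pulled out of these flat log lines with sed. A crude sketch, not a JSON parser; `json_field` is a hypothetical helper and assumes the value contains no escaped quotes:

```shell
# Hypothetical helper: extract one string field from a flat JSON log line.
# Not a real JSON parser; breaks on nested objects or escaped quotes.
json_field() {
  sed -n "s/.*\"$1\":\"\([^\"]*\)\".*/\1/p"
}

# Usage:
#   echo "$log_line" | json_field msg
#   echo "$log_line" | json_field request_id
```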

Common Log Patterns and What They Mean

| Log Pattern | Severity | Meaning | Action |
|---|---|---|---|
| "msg":"health check passed" | Info | Normal operation | None |
| "msg":"license validated","tier":"enterprise" | Info | License accepted on startup | None |
| "msg":"license validation failed" | Error | Invalid or expired license key | Check AXONFLOW_LICENSE_KEY env var; verify key format and expiry |
| "msg":"database connection established" | Info | DB connected successfully | None |
| "msg":"database connection failed","error":"..." | Error | Cannot reach PostgreSQL | Check DB endpoint, security groups, credentials |
| "msg":"too many connections" | Error | Connection pool exhausted | Reduce pool size or upgrade DB instance |
| "msg":"context deadline exceeded" | Warn | Request or query timed out | Identify slow component (DB, LLM, connector) |
| "msg":"policy evaluation failed" | Error | Policy engine error | Check policy syntax; simplify complex regex rules |
| "msg":"llm provider timeout","provider":"bedrock" | Warn | LLM provider did not respond in time | Increase AXONFLOW_LLM_TIMEOUT or check provider status |
| "msg":"connector health check failed","connector":"..." | Warn | MCP connector unreachable | Verify connector credentials and network path |
| "msg":"OOMKilled" | Fatal | Container exceeded memory limit | Increase task memory allocation |
| "msg":"goroutine leak detected","count":N | Warn | Goroutine count growing | Profile with pprof; check for unclosed connections |

Filtering Logs by Severity

# Errors only (last 30 min)
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern '{ $.level = "error" }' \
--start-time $(date -d '30 minutes ago' +%s000) \
--region YOUR_REGION

# Warnings and errors
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern '{ $.level = "error" || $.level = "warn" }' \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION

Recovery Procedures

Service Crash Recovery

When an Agent or Orchestrator service crashes and does not self-heal via ECS:

  1. Check the stop reason:

    TASK_ARN=$(aws ecs list-tasks --cluster YOUR_CLUSTER --service-name agent --desired-status STOPPED --query 'taskArns[0]' --output text)
    aws ecs describe-tasks --cluster YOUR_CLUSTER --tasks $TASK_ARN --query 'tasks[0].stoppedReason'
  2. If OOMKilled (exit code 137): Increase memory in the task definition and redeploy.

  3. If application error (exit code 1): Check CloudWatch logs for the startup error (often a missing env var or bad config).

  4. Force a fresh deployment:

    aws ecs update-service --cluster YOUR_CLUSTER --service agent --force-new-deployment
    aws ecs wait services-stable --cluster YOUR_CLUSTER --services agent
  5. Verify recovery:

    curl -sf http://YOUR_ENDPOINT:8080/health | jq .
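
The verify step is more reliable as a polling loop, so the runbook does not declare success or failure on a single probe. A sketch; `wait_for_healthy` is a hypothetical helper that takes a retry budget and any check command:

```shell
# Hypothetical helper: run a check command until it succeeds or retries run out.
# Usage: wait_for_healthy <attempts> <check command...>
wait_for_healthy() {
  attempts=$1; shift
  for i in $(seq 1 "$attempts"); do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep 1
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}

# Usage:
#   wait_for_healthy 30 curl -sf https://YOUR_ENDPOINT/health
```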

Database Connection Loss Recovery

When the Agent or Orchestrator loses its database connection:

  1. Verify the database is reachable:

    # Check RDS status
    aws rds describe-db-instances \
    --db-instance-identifier YOUR_DB \
    --query 'DBInstances[0].DBInstanceStatus'

    # Test direct connectivity (from a bastion or ECS exec)
    psql -h YOUR_DB_ENDPOINT -U axonflow -c "SELECT 1;"
  2. If RDS is available but connections are exhausted:

    # Check active connections
    psql -h YOUR_DB_ENDPOINT -U axonflow -c "SELECT count(*) FROM pg_stat_activity;"

    # Terminate idle connections if needed
    psql -h YOUR_DB_ENDPOINT -U axonflow -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';"
  3. If RDS is in a failed state: Restore from the latest automated backup or manual snapshot.

  4. Restart services to re-establish connection pools:

    aws ecs update-service --cluster YOUR_CLUSTER --service agent --force-new-deployment
    aws ecs update-service --cluster YOUR_CLUSTER --service orchestrator-service --force-new-deployment

LLM Provider Timeout Recovery

When the LLM provider (Bedrock, OpenAI, Ollama) stops responding:

  1. Check provider status:

  2. Verify credentials are still valid:

    # For Bedrock, test IAM permissions
    aws bedrock list-foundation-models --region us-east-1 --query 'modelSummaries[0].modelId'

    # For OpenAI, check the secret value
    aws secretsmanager get-secret-value --secret-id axonflow-openai-key --query 'SecretString' --output text | head -c 10
  3. Increase timeout if the provider is slow but responsive: Set AXONFLOW_LLM_TIMEOUT=120s in the Orchestrator environment.

  4. Switch to a fallback provider if configured in multi-model routing (Enterprise feature).

  5. Restart the Orchestrator to clear any stale connections:

    aws ecs update-service --cluster YOUR_CLUSTER --service orchestrator-service --force-new-deployment

Log Analysis

Finding Errors

# All errors in last hour
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "ERROR" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION

# Specific error type
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "database connection" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION

Tracing Requests

# Find request by ID
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern '{ $.request_id = "req-abc123" }' \
--region YOUR_REGION

# Find all requests for a user
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "[email protected]" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION

Performance Analysis

# Find slow requests (>100ms)
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern '{ $.duration_ms > 100 }' \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION

Operational Procedures

Restart a Service

# Force new deployment (rolling restart)
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--force-new-deployment

# Wait for stability
aws ecs wait services-stable \
--cluster YOUR_CLUSTER \
--services agent

Scale Service

# Scale up
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--desired-count 4

# Scale down
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--desired-count 2

Rollback Deployment

# Get previous task definition
PREV_TD=$(aws ecs describe-services \
--cluster YOUR_CLUSTER \
--services agent \
--query 'services[0].deployments[1].taskDefinition' \
--output text)

# Rollback
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--task-definition $PREV_TD \
--force-new-deployment
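
With --output text, the AWS CLI prints the literal string None when deployments[1] does not exist (for example, when there is no prior deployment in progress), so $PREV_TD is worth validating before rolling back. A sketch; `safe_rollback_target` is a hypothetical helper:

```shell
# Hypothetical helper: validate a task definition ARN before rolling back.
# Rejects empty values and the CLI's literal "None" for missing fields.
safe_rollback_target() {
  case "$1" in
    ""|None) echo "no previous deployment found; aborting rollback" >&2; return 1 ;;
    arn:aws:ecs:*) echo "$1" ;;
    *) echo "unexpected value: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   TD=$(safe_rollback_target "$PREV_TD") || exit 1
#   aws ecs update-service ... --task-definition "$TD" --force-new-deployment
```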

Getting Help

If you can't resolve an issue:

  1. Collect diagnostics:

    # Service status
aws ecs describe-services --cluster YOUR_CLUSTER --services agent orchestrator-service > diagnostics.json

    # Recent logs
    aws logs filter-log-events --log-group-name /ecs/YOUR_STACK/agent --start-time $(date -d '1 hour ago' +%s000) > logs.txt

    # Health endpoints
    curl -sf https://YOUR_ENDPOINT/health > health.json
  2. Contact support with:

    • Issue description
    • Timeline (when started, any changes)
    • Diagnostic files

Support: [email protected]