Runtime Troubleshooting

This guide covers runtime issues that may occur after successful deployment. For deployment-specific issues, see the Deployment Troubleshooting guide.

Quick Diagnostics

Run these commands first to assess system state:

# Check all services health
curl -sf https://YOUR_ENDPOINT/health | jq .
curl -sf https://YOUR_ENDPOINT/orchestrator/health | jq .

# Check recent errors (last 30 min)
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "ERROR" \
--start-time $(date -d '30 minutes ago' +%s000) \
--region YOUR_REGION

Service Issues

Service Becomes Unhealthy

Symptoms:

  • Health endpoint returns non-200
  • ALB showing unhealthy targets
  • Intermittent request failures

Diagnosis:

# Check current health
curl -sf https://YOUR_ENDPOINT/health

# Check ALB target health
aws elbv2 describe-target-health \
--target-group-arn YOUR_TG_ARN \
--region YOUR_REGION

# Check ECS task status
aws ecs describe-services \
--cluster YOUR_CLUSTER \
--services agent \
--query 'services[0].{Running:runningCount,Desired:desiredCount,Events:events[0:3]}'

Common Causes & Solutions:

| Cause | Indicators | Solution |
|---|---|---|
| Memory pressure | OOMKilled in logs | Increase task memory |
| Database connection pool exhausted | "too many connections" | Reduce pool size or add replicas |
| Downstream dependency down | Timeout errors | Check dependent services |
| Resource starvation | High CPU in metrics | Scale up or out |
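
When triaging, the HTTP status returned by /health usually narrows the cause before you dig into logs. A minimal sketch; the `triage_health` helper and its status-to-action mapping are assumptions based on common ALB setups, not part of AxonFlow:

```shell
# Hypothetical helper: map the HTTP status from /health to a next step.
# The mapping below is an assumption, not an AxonFlow-defined contract.
triage_health() {
  case "$1" in
    200) echo "healthy" ;;
    503) echo "unhealthy: check dependencies and connection pools" ;;
    504) echo "timeout: check downstream services and ALB idle timeout" ;;
    000) echo "unreachable: check DNS, security groups, task status" ;;
    *)   echo "unexpected status $1: inspect service logs" ;;
  esac
}

# Usage: capture only the status code (curl prints 000 on connection failure):
#   status=$(curl -s -o /dev/null -w '%{http_code}' https://YOUR_ENDPOINT/health)
#   triage_health "$status"
```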

Service Restart Loop

Symptoms:

  • Tasks keep restarting
  • ECS events show repeated task stops
  • Circuit breaker triggered

Diagnosis:

# Check stopped tasks
aws ecs list-tasks \
--cluster YOUR_CLUSTER \
--service-name agent \
--desired-status STOPPED \
--region YOUR_REGION

# Get stop reason
TASK_ARN=$(aws ecs list-tasks --cluster YOUR_CLUSTER --service-name agent --desired-status STOPPED --query 'taskArns[0]' --output text)
aws ecs describe-tasks \
--cluster YOUR_CLUSTER \
--tasks $TASK_ARN \
--query 'tasks[0].{Status:lastStatus,Reason:stoppedReason,Code:containers[0].exitCode}'

Common Exit Codes:

| Code | Meaning | Action |
|---|---|---|
| 0 | Normal exit | Check if intended (scaling down?) |
| 1 | Application error | Check logs for startup errors |
| 137 | OOMKilled | Increase memory allocation |
| 139 | Segfault | Check for memory corruption, update image |
| 143 | SIGTERM | Normal shutdown (deployment/scaling) |
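
Exit codes above 128 encode 128 plus the fatal signal number (137 = 128+SIGKILL, 139 = 128+SIGSEGV, 143 = 128+SIGTERM). A sketch that turns a stopped container's exit code into the action above; `explain_exit` is a hypothetical helper, not an AxonFlow tool:

```shell
# Hypothetical helper: map a container exit code to its likely cause.
explain_exit() {
  case "$1" in
    0)   echo "normal exit (intended? scaling down?)" ;;
    1)   echo "application error: check logs for startup errors" ;;
    137) echo "OOMKilled (128+9, SIGKILL): increase memory allocation" ;;
    139) echo "segfault (128+11): check for memory corruption, update image" ;;
    143) echo "SIGTERM (128+15): normal shutdown during deployment or scaling" ;;
    *)   echo "exit code $1: see container logs" ;;
  esac
}

# Usage, feeding in the exit code from the describe-tasks query above:
#   explain_exit "$(aws ecs describe-tasks ... --query 'tasks[0].containers[0].exitCode' --output text)"
```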

Request Handling Issues

Requests Timing Out

Symptoms:

  • 504 Gateway Timeout errors
  • Requests taking > 30 seconds
  • Clients receiving timeout errors

Diagnosis:

# Check latency metrics
curl -s https://YOUR_ENDPOINT/metrics | grep request_duration

# Check for slow queries in logs
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "duration_ms" \
--region YOUR_REGION | jq -r '.events[].message' | sort -t: -k2 -n | tail -20

Solutions:

  1. Identify slow component:

    • Database queries → Add indexes, optimize queries
    • External connectors → Increase timeout, add retry
    • Policy evaluation → Simplify policies
  2. Adjust timeouts:

    # ALB idle timeout (default 60s)
    aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn YOUR_ALB_ARN \
    --attributes Key=idle_timeout.timeout_seconds,Value=120
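
Before raising any timeout, quantify where latency actually sits. A sketch computing a nearest-rank p95 from duration values extracted from logs, using only sort and awk; `p95` is a hypothetical helper:

```shell
# Hypothetical helper: nearest-rank 95th percentile of numbers on stdin.
p95() {
  sort -n | awk '{ a[NR] = $1 } END {
    if (NR == 0) { print 0; exit }
    idx = int(NR * 0.95)
    if (idx < NR * 0.95) idx = idx + 1   # ceil for nearest-rank
    if (idx < 1) idx = 1
    print a[idx]
  }'
}

# Usage: pipe in duration_ms values pulled from the filtered log events, e.g.
#   ... | jq -r '.events[].message' | grep -o 'duration_ms=[0-9.]*' | cut -d= -f2 | p95
```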

High Error Rate

Symptoms:

  • 5xx errors spiking
  • Error rate > 1%
  • Alerts firing

Diagnosis:

# Get error breakdown from logs
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "level=error" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION \
| jq -r '.events[].message' | sort | uniq -c | sort -rn | head -20

# Check error rate in metrics
curl -s https://YOUR_ENDPOINT/metrics | grep -E "requests_total|errors_total"

Common Error Patterns:

| Error | Cause | Solution |
|---|---|---|
| "connection refused" | Downstream service down | Check dependent services |
| "context deadline exceeded" | Timeout | Increase timeout or optimize |
| "permission denied" | IAM/policy issue | Check permissions |
| "rate limit exceeded" | Too many requests | Implement backoff |

Database Issues

Connection Pool Exhausted

Symptoms:

  • "too many connections" errors
  • Intermittent database failures
  • Slow queries

Diagnosis:

# Check current connections (requires DB access)
psql -h YOUR_DB_ENDPOINT -U axonflow -c "SELECT count(*) FROM pg_stat_activity;"

# Check RDS metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=YOUR_DB \
--start-time $(date -u -d '1 hour ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--period 300 \
--statistics Maximum

Solutions:

  1. Reduce connection pool size per service:

    • Each Agent task uses ~10 connections
    • Scale down replicas or reduce pool size
  2. Upgrade database instance:

    • db.t3.micro: ~80 connections
    • db.t3.small: ~150 connections
    • db.t3.medium: ~400 connections
    • db.t3.large: ~800 connections
  3. Add connection pooler (PgBouncer):

    • For high-replica deployments
    • Reduces total connections to database
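
The sizing guidance above is simple arithmetic: total connections ≈ tasks × pool size per task, which must stay below the instance limit. A sketch of that check; `pool_fits` is a hypothetical helper, and the ~10 connections per Agent task figure comes from this guide:

```shell
# Hypothetical helper: does the fleet's connection demand fit the DB limit?
# Usage: pool_fits <tasks> <pool_per_task> <db_max_connections>
pool_fits() {
  total=$(( $1 * $2 ))
  if [ "$total" -lt "$3" ]; then
    echo "ok: $total of $3 connections"
  else
    echo "over: $total of $3 connections; reduce pool or upgrade instance"
  fi
}

# Example: 4 Agent tasks at ~10 connections each against db.t3.micro (~80):
#   pool_fits 4 10 80
```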

Slow Queries

Symptoms:

  • High database latency
  • P99 response times elevated
  • RDS CPU spikes

Diagnosis:

# Check RDS performance insights (if enabled)
aws pi get-resource-metrics \
--service-type RDS \
--identifier db-XXXXX \
--metric-queries file://pi-query.json

# Check slow query log
aws logs filter-log-events \
--log-group-name /aws/rds/instance/YOUR_DB/postgresql \
--filter-pattern "duration:" \
--region YOUR_REGION

Solutions:

  1. Add missing indexes
  2. Optimize queries - Check EXPLAIN plans
  3. Increase RDS instance size
  4. Enable query caching (if applicable)

Memory and CPU Issues

High Memory Usage

Symptoms:

  • OOMKilled containers
  • Memory metrics near limit
  • Performance degradation

Diagnosis:

# Check container memory
aws ecs describe-tasks \
--cluster YOUR_CLUSTER \
--tasks TASK_ARN \
--query 'tasks[0].containers[0].{Memory:memory,MemoryReservation:memoryReservation}'

# Check CloudWatch Container Insights (if enabled)
aws cloudwatch get-metric-statistics \
--namespace ECS/ContainerInsights \
--metric-name MemoryUtilized \
--dimensions Name=ClusterName,Value=YOUR_CLUSTER Name=ServiceName,Value=agent \
--start-time $(date -u -d '1 hour ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--period 300 \
--statistics Maximum

Solutions:

  1. Increase task memory:

    • Update task definition
    • Redeploy service
  2. Check for memory leaks:

    • Monitor memory growth over time
    • Check for goroutine leaks
  3. Optimize memory usage:

    • Reduce connection pool sizes
    • Limit concurrent requests

High CPU Usage

Symptoms:

  • CPU throttling
  • Increased latency
  • Slow request processing

Diagnosis:

# Check ECS CPU reservation vs utilization
aws cloudwatch get-metric-statistics \
--namespace AWS/ECS \
--metric-name CPUUtilization \
--dimensions Name=ClusterName,Value=YOUR_CLUSTER Name=ServiceName,Value=agent \
--start-time $(date -u -d '1 hour ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--period 300 \
--statistics Average,Maximum

Solutions:

  1. Scale horizontally:

    aws ecs update-service \
    --cluster YOUR_CLUSTER \
    --service agent \
    --desired-count 4
  2. Increase CPU allocation in task definition

  3. Profile application for CPU hotspots


Log Interpretation

AxonFlow uses structured JSON logging. Understanding common log patterns helps you quickly identify and resolve issues.

Log Format

Each log line is a JSON object with these standard fields:

{
"level": "error",
"ts": "2026-02-06T10:30:45.123Z",
"caller": "agent/handler.go:142",
"msg": "policy evaluation failed",
"request_id": "req-abc123",
"duration_ms": 12.5,
"error": "context deadline exceeded"
}
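
When jq is not available, a single string field can be pulled out of these flat log lines with sed. A crude sketch, not a JSON parser; `json_field` is a hypothetical helper and assumes the value contains no escaped quotes:

```shell
# Hypothetical helper: extract one string field from a flat JSON log line.
# Not a real JSON parser; breaks on nested objects or escaped quotes.
json_field() {
  sed -n "s/.*\"$1\":\"\([^\"]*\)\".*/\1/p"
}

# Usage:
#   echo "$log_line" | json_field msg
#   echo "$log_line" | json_field request_id
```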

Common Log Patterns and What They Mean

| Log Pattern | Severity | Meaning | Action |
|---|---|---|---|
| "msg":"health check passed" | Info | Normal operation | None |
| "msg":"license validated","tier":"enterprise" | Info | License accepted on startup | None |
| "msg":"license validation failed" | Error | Invalid or expired license key | Check AXONFLOW_LICENSE_KEY env var; verify key format and expiry |
| "msg":"database connection established" | Info | DB connected successfully | None |
| "msg":"database connection failed","error":"..." | Error | Cannot reach PostgreSQL | Check DB endpoint, security groups, credentials |
| "msg":"too many connections" | Error | Connection pool exhausted | Reduce pool size or upgrade DB instance |
| "msg":"context deadline exceeded" | Warn | Request or query timed out | Identify slow component (DB, LLM, connector) |
| "msg":"policy evaluation failed" | Error | Policy engine error | Check policy syntax; simplify complex regex rules |
| "msg":"llm provider timeout","provider":"bedrock" | Warn | LLM provider did not respond in time | Increase AXONFLOW_LLM_TIMEOUT or check provider status |
| "msg":"connector health check failed","connector":"..." | Warn | MCP connector unreachable | Verify connector credentials and network path |
| "msg":"OOMKilled" | Fatal | Container exceeded memory limit | Increase task memory allocation |
| "msg":"goroutine leak detected","count":N | Warn | Goroutine count growing | Profile with pprof; check for unclosed connections |

Filtering Logs by Severity

# Errors only (last 30 min)
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern '{ $.level = "error" }' \
--start-time $(date -d '30 minutes ago' +%s000) \
--region YOUR_REGION

# Warnings and errors
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern '{ $.level = "error" || $.level = "warn" }' \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION

Recovery Procedures

Service Crash Recovery

When an Agent or Orchestrator service crashes and does not self-heal via ECS:

  1. Check the stop reason:

    TASK_ARN=$(aws ecs list-tasks --cluster YOUR_CLUSTER --service-name agent --desired-status STOPPED --query 'taskArns[0]' --output text)
    aws ecs describe-tasks --cluster YOUR_CLUSTER --tasks $TASK_ARN --query 'tasks[0].stoppedReason'
  2. If OOMKilled (exit code 137): Increase memory in the task definition and redeploy.

  3. If application error (exit code 1): Check CloudWatch logs for the startup error (often a missing env var or bad config).

  4. Force a fresh deployment:

    aws ecs update-service --cluster YOUR_CLUSTER --service agent --force-new-deployment
    aws ecs wait services-stable --cluster YOUR_CLUSTER --services agent
  5. Verify recovery:

    curl -sf http://YOUR_ENDPOINT:8080/health | jq .
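
The verify step is more reliable as a polling loop, so the runbook does not declare success or failure on a single probe. A sketch; `wait_for_healthy` is a hypothetical helper that takes a retry budget and any check command:

```shell
# Hypothetical helper: run a check command until it succeeds or retries run out.
# Usage: wait_for_healthy <attempts> <check command...>
wait_for_healthy() {
  attempts=$1; shift
  for i in $(seq 1 "$attempts"); do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep 1
  done
  echo "still unhealthy after $attempts attempts" >&2
  return 1
}

# Usage:
#   wait_for_healthy 30 curl -sf https://YOUR_ENDPOINT/health
```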

Database Connection Loss Recovery

When the Agent or Orchestrator loses its database connection:

  1. Verify the database is reachable:

    # Check RDS status
    aws rds describe-db-instances \
    --db-instance-identifier YOUR_DB \
    --query 'DBInstances[0].DBInstanceStatus'

    # Test direct connectivity (from a bastion or ECS exec)
    psql -h YOUR_DB_ENDPOINT -U axonflow -c "SELECT 1;"
  2. If RDS is available but connections are exhausted:

    # Check active connections
    psql -h YOUR_DB_ENDPOINT -U axonflow -c "SELECT count(*) FROM pg_stat_activity;"

    # Terminate idle connections if needed
    psql -h YOUR_DB_ENDPOINT -U axonflow -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';"
  3. If RDS is in a failed state: Restore from the latest automated backup or manual snapshot.

  4. Restart services to re-establish connection pools:

    aws ecs update-service --cluster YOUR_CLUSTER --service agent --force-new-deployment
    aws ecs update-service --cluster YOUR_CLUSTER --service orchestrator-service --force-new-deployment

LLM Provider Timeout Recovery

When the LLM provider (Bedrock, OpenAI, Ollama) stops responding:

  1. Check provider status:

  2. Verify credentials are still valid:

    # For Bedrock, test IAM permissions
    aws bedrock list-foundation-models --region us-east-1 --query 'modelSummaries[0].modelId'

    # For OpenAI, check the secret value
    aws secretsmanager get-secret-value --secret-id axonflow-openai-key --query 'SecretString' --output text | head -c 10
  3. Increase timeout if the provider is slow but responsive: Set AXONFLOW_LLM_TIMEOUT=120s in the Orchestrator environment.

  4. Switch to a fallback provider if configured in multi-model routing (Enterprise feature).

  5. Restart the Orchestrator to clear any stale connections:

    aws ecs update-service --cluster YOUR_CLUSTER --service orchestrator-service --force-new-deployment

Log Analysis

Finding Errors

# All errors in last hour
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "ERROR" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION

# Specific error type
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "database connection" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION

Tracing Requests

# Find request by ID
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern '{ $.request_id = "req-abc123" }' \
--region YOUR_REGION

# Find all requests for a user
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "[email protected]" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION

Performance Analysis

# Find slow requests (>100ms)
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern '{ $.duration_ms > 100 }' \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION

Operational Procedures

Restart a Service

# Force new deployment (rolling restart)
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--force-new-deployment

# Wait for stability
aws ecs wait services-stable \
--cluster YOUR_CLUSTER \
--services agent

Scale Service

# Scale up
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--desired-count 4

# Scale down
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--desired-count 2

Rollback Deployment

# Get previous task definition
PREV_TD=$(aws ecs describe-services \
--cluster YOUR_CLUSTER \
--services agent \
--query 'services[0].deployments[1].taskDefinition' \
--output text)

# Rollback
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--task-definition $PREV_TD \
--force-new-deployment
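
With --output text, the AWS CLI prints the literal string None when deployments[1] does not exist (for example, when there is no prior deployment in progress), so $PREV_TD is worth validating before rolling back. A sketch; `safe_rollback_target` is a hypothetical helper:

```shell
# Hypothetical helper: validate a task definition ARN before rolling back.
# Rejects empty values and the CLI's literal "None" for missing fields.
safe_rollback_target() {
  case "$1" in
    ""|None) echo "no previous deployment found; aborting rollback" >&2; return 1 ;;
    arn:aws:ecs:*) echo "$1" ;;
    *) echo "unexpected value: $1" >&2; return 1 ;;
  esac
}

# Usage:
#   TD=$(safe_rollback_target "$PREV_TD") || exit 1
#   aws ecs update-service ... --task-definition "$TD" --force-new-deployment
```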

Getting Help

If you can't resolve an issue:

  1. Collect diagnostics:

    # Service status
aws ecs describe-services --cluster YOUR_CLUSTER --services agent orchestrator-service > diagnostics.json

    # Recent logs
    aws logs filter-log-events --log-group-name /ecs/YOUR_STACK/agent --start-time $(date -d '1 hour ago' +%s000) > logs.txt

    # Health endpoints
    curl -sf https://YOUR_ENDPOINT/health > health.json
  2. Contact support with:

    • Issue description
    • Timeline (when started, any changes)
    • Diagnostic files

Support: [email protected]