Runtime Troubleshooting

This guide covers runtime issues that may occur after successful deployment. For deployment-specific issues, see the Deployment Troubleshooting guide.

Quick Diagnostics

Run these commands first to assess system state:

# Check all services health
curl -sf https://YOUR_ENDPOINT/health | jq .
curl -sf https://YOUR_ENDPOINT/orchestrator/health | jq .

# Check recent errors (last 30 min)
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "ERROR" \
--start-time $(date -d '30 minutes ago' +%s000) \
--region YOUR_REGION

Service Issues

Service Becomes Unhealthy

Symptoms:

  • Health endpoint returns non-200
  • ALB showing unhealthy targets
  • Intermittent request failures

Diagnosis:

# Check current health
curl -sf https://YOUR_ENDPOINT/health

# Check ALB target health
aws elbv2 describe-target-health \
--target-group-arn YOUR_TG_ARN \
--region YOUR_REGION

# Check ECS task status
aws ecs describe-services \
--cluster YOUR_CLUSTER \
--services agent \
--query 'services[0].{Running:runningCount,Desired:desiredCount,Events:events[0:3]}'

Common Causes & Solutions:

| Cause | Indicators | Solution |
|---|---|---|
| Memory pressure | OOMKilled in logs | Increase task memory |
| Database connection pool exhausted | "too many connections" | Reduce pool size or add replicas |
| Downstream dependency down | Timeout errors | Check dependent services |
| Resource starvation | High CPU in metrics | Scale up or out |

Service Restart Loop

Symptoms:

  • Tasks keep restarting
  • ECS events show repeated task stops
  • Circuit breaker triggered

Diagnosis:

# Check stopped tasks
aws ecs list-tasks \
--cluster YOUR_CLUSTER \
--service-name agent \
--desired-status STOPPED \
--region YOUR_REGION

# Get stop reason
TASK_ARN=$(aws ecs list-tasks --cluster YOUR_CLUSTER --service-name agent --desired-status STOPPED --query 'taskArns[0]' --output text)
aws ecs describe-tasks \
--cluster YOUR_CLUSTER \
--tasks $TASK_ARN \
--query 'tasks[0].{Status:lastStatus,Reason:stoppedReason,Code:containers[0].exitCode}'

Common Exit Codes:

| Code | Meaning | Action |
|---|---|---|
| 0 | Normal exit | Check if intended (scaling down?) |
| 1 | Application error | Check logs for startup errors |
| 137 | OOMKilled | Increase memory allocation |
| 139 | Segfault | Check for memory corruption, update image |
| 143 | SIGTERM | Normal shutdown (deployment/scaling) |
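
For exit code 1 or an unclear stop reason, the last log lines of the stopped task usually name the failing step. The log stream name below is an assumption based on the common awslogs naming (ecs/<container>/<task-id>); adjust it to match your task definition.

# Pull the final log lines of the stopped task (stream name is an assumption)
TASK_ID=${TASK_ARN##*/}
aws logs get-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--log-stream-name "ecs/agent/${TASK_ID}" \
--limit 50 \
--region YOUR_REGION \
--query 'events[].message' \
--output text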

Request Handling Issues

Requests Timing Out

Symptoms:

  • 504 Gateway Timeout errors
  • Requests taking > 30 seconds
  • Clients receiving timeout errors

Diagnosis:

# Check latency metrics
curl -s https://YOUR_ENDPOINT/metrics | grep request_duration

# Check for slow queries in logs
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "duration_ms" \
--region YOUR_REGION | jq -r '.events[].message' | sort -t: -k2 -n | tail -20

Solutions:

  1. Identify slow component:

    • Database queries → Add indexes, optimize queries
    • External connectors → Increase timeout, add retry
    • Policy evaluation → Simplify policies
  2. Adjust timeouts:

    # ALB idle timeout (default 60s)
    aws elbv2 modify-load-balancer-attributes \
    --load-balancer-arn YOUR_ALB_ARN \
    --attributes Key=idle_timeout.timeout_seconds,Value=120
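
    To confirm the new idle timeout is in place:

    # Verify the ALB attribute
    aws elbv2 describe-load-balancer-attributes \
    --load-balancer-arn YOUR_ALB_ARN \
    --query "Attributes[?Key=='idle_timeout.timeout_seconds']"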

High Error Rate

Symptoms:

  • 5xx errors spiking
  • Error rate > 1%
  • Alerts firing

Diagnosis:

# Get error breakdown from logs
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "level=error" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION \
| jq -r '.events[].message' | sort | uniq -c | sort -rn | head -20

# Check error rate in metrics
curl -s https://YOUR_ENDPOINT/metrics | grep -E "requests_total|errors_total"

Common Error Patterns:

| Error | Cause | Solution |
|---|---|---|
| "connection refused" | Downstream service down | Check dependent services |
| "context deadline exceeded" | Timeout | Increase timeout or optimize |
| "permission denied" | IAM/policy issue | Check permissions |
| "rate limit exceeded" | Too many requests | Implement backoff |

Database Issues

Connection Pool Exhausted

Symptoms:

  • "too many connections" errors
  • Intermittent database failures
  • Slow queries

Diagnosis:

# Check current connections (requires DB access)
psql -h YOUR_DB_ENDPOINT -U axonflow -c "SELECT count(*) FROM pg_stat_activity;"

# Check RDS metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=YOUR_DB \
--start-time $(date -u -d '1 hour ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--period 300 \
--statistics Maximum

Solutions:

  1. Reduce connection pool size per service (see the query sketch after this list):

    • Each Agent task uses ~10 connections
    • Scale down replicas or reduce pool size
  2. Upgrade database instance:

    • db.t3.micro: ~80 connections
    • db.t3.small: ~150 connections
    • db.t3.medium: ~400 connections
    • db.t3.large: ~800 connections
  3. Add connection pooler (PgBouncer):

    • For high-replica deployments
    • Reduces total connections to database
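
To compare current usage with the instance limit and see which clients hold the connections (standard PostgreSQL views, same credentials as the diagnosis step above):

# Instance-wide connection limit
psql -h YOUR_DB_ENDPOINT -U axonflow -c "SHOW max_connections;"

# Connections grouped by client application and state
psql -h YOUR_DB_ENDPOINT -U axonflow -c "SELECT application_name, state, count(*) FROM pg_stat_activity GROUP BY 1, 2 ORDER BY 3 DESC;"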

Slow Queries

Symptoms:

  • High database latency
  • P99 response times elevated
  • RDS CPU spikes

Diagnosis:

# Check RDS performance insights (if enabled)
aws pi get-resource-metrics \
--service-type RDS \
--identifier db-XXXXX \
--metric-queries file://pi-query.json

# Check slow query log
aws logs filter-log-events \
--log-group-name /aws/rds/instance/YOUR_DB/postgresql \
--filter-pattern "duration:" \
--region YOUR_REGION

Solutions:

  1. Add missing indexes
  2. Optimize queries - check EXPLAIN plans (see the sketch after this list)
  3. Increase RDS instance size
  4. Enable query caching (if applicable)
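
A sketch for step 2, assuming you can connect with the same credentials as above. The SELECT is only a stand-in for whatever statement the slow query log flags, and the second query requires the pg_stat_statements extension to be enabled on the instance:

# Inspect the plan of a suspect query (replace the statement with the real one)
psql -h YOUR_DB_ENDPOINT -U axonflow -c "EXPLAIN ANALYZE SELECT 1;"

# Top statements by total execution time (column is total_time on PostgreSQL 12 and earlier)
psql -h YOUR_DB_ENDPOINT -U axonflow -c "SELECT calls, round(total_exec_time) AS total_ms, left(query, 80) AS query FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;"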

Memory and CPU Issues

High Memory Usage

Symptoms:

  • OOMKilled containers
  • Memory metrics near limit
  • Performance degradation

Diagnosis:

# Check container memory
aws ecs describe-tasks \
--cluster YOUR_CLUSTER \
--tasks TASK_ARN \
--query 'tasks[0].containers[0].{Memory:memory,MemoryReservation:memoryReservation}'

# Check CloudWatch Container Insights (if enabled)
aws cloudwatch get-metric-statistics \
--namespace ECS/ContainerInsights \
--metric-name MemoryUtilized \
--dimensions Name=ClusterName,Value=YOUR_CLUSTER Name=ServiceName,Value=agent \
--start-time $(date -u -d '1 hour ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--period 300 \
--statistics Maximum

Solutions:

  1. Increase task memory (see the sketch after this list):

    • Update task definition
    • Redeploy service
  2. Check for memory leaks:

    • Monitor memory growth over time
    • Check for goroutine leaks
  3. Optimize memory usage:

    • Reduce connection pool sizes
    • Limit concurrent requests
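
A hedged sketch of step 1, assuming the task definition family is named agent and the service runs on Fargate (pick a memory value that is valid for the task's CPU setting):

# Fetch the active task definition and raise its memory
aws ecs describe-task-definition --task-definition agent \
--query 'taskDefinition' > td.json

# Strip read-only fields and register a new revision
jq '.memory = "2048" | del(.taskDefinitionArn, .revision, .status, .requiresAttributes, .compatibilities, .registeredAt, .registeredBy)' td.json > td-new.json
aws ecs register-task-definition --cli-input-json file://td-new.json

# Point the service at the latest revision
aws ecs update-service --cluster YOUR_CLUSTER --service agent --task-definition agent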

High CPU Usage

Symptoms:

  • CPU throttling
  • Increased latency
  • Slow request processing

Diagnosis:

# Check ECS CPU reservation vs utilization
aws cloudwatch get-metric-statistics \
--namespace AWS/ECS \
--metric-name CPUUtilization \
--dimensions Name=ClusterName,Value=YOUR_CLUSTER Name=ServiceName,Value=agent \
--start-time $(date -u -d '1 hour ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--period 300 \
--statistics Average,Maximum

Solutions:

  1. Scale horizontally (an auto scaling alternative is sketched after this list):

    aws ecs update-service \
    --cluster YOUR_CLUSTER \
    --service agent \
    --desired-count 4
  2. Increase CPU allocation in task definition

  3. Profile application for CPU hotspots
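
If high CPU recurs rather than being a one-off, target-tracking auto scaling can handle step 1 automatically. A minimal sketch; the capacity bounds, the 60% target, and the policy name are placeholders, not values from this deployment:

# Register the service as a scalable target
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--resource-id service/YOUR_CLUSTER/agent \
--scalable-dimension ecs:service:DesiredCount \
--min-capacity 2 \
--max-capacity 6

# Scale on average service CPU utilization
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--resource-id service/YOUR_CLUSTER/agent \
--scalable-dimension ecs:service:DesiredCount \
--policy-name agent-cpu-target \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{"TargetValue": 60.0, "PredefinedMetricSpecification": {"PredefinedMetricType": "ECSServiceAverageCPUUtilization"}}'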


Log Analysis

Finding Errors

# All errors in last hour
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "ERROR" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION

# Specific error type
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "database connection" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION
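
With AWS CLI v2, aws logs tail gives a quicker live view of the same log group while you reproduce an issue:

# Live-tail errors from the agent log group (requires AWS CLI v2)
aws logs tail /ecs/YOUR_STACK/agent --follow --since 1h --filter-pattern "ERROR" --region YOUR_REGION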

Tracing Requests

# Find request by ID
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "request_id=req-abc123" \
--region YOUR_REGION

# Find all requests for a user
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "[email protected]" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION

Performance Analysis

# Find slow requests (>100ms)
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "[..., duration_ms > 100, ...]" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION
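
The bracketed pattern above only matches space-delimited log lines. If the services emit structured JSON logs instead (not confirmed here), use the CloudWatch Logs JSON selector syntax:

# Same query against JSON-formatted logs
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern '{ $.duration_ms > 100 }' \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION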

Recovery Procedures

Restart a Service

# Force new deployment (rolling restart)
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--force-new-deployment

# Wait for stability
aws ecs wait services-stable \
--cluster YOUR_CLUSTER \
--services agent

Scale Service

# Scale up
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--desired-count 4

# Scale down
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--desired-count 2

Rollback Deployment

# Get previous task definition (deployments[1] exists only while a rollout is in progress)
PREV_TD=$(aws ecs describe-services \
--cluster YOUR_CLUSTER \
--services agent \
--query 'services[0].deployments[1].taskDefinition' \
--output text)

# Rollback
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--task-definition $PREV_TD \
--force-new-deployment
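
The deployments[1] lookup only resolves while the new deployment is still rolling out; once it completes there is a single deployment left. In that case, find the previous revision from the task definition family instead (the family name agent is an assumption):

# List the two most recent revisions; the second entry is the previous one
aws ecs list-task-definitions \
--family-prefix agent \
--sort DESC \
--max-items 2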

Getting Help

If you can't resolve an issue:

  1. Collect diagnostics:

    # Service status
    aws ecs describe-services --cluster YOUR_CLUSTER --services agent orchestrator > diagnostics.json

    # Recent logs
    aws logs filter-log-events --log-group-name /ecs/YOUR_STACK/agent --start-time $(date -d '1 hour ago' +%s000) > logs.txt

    # Health endpoints
    curl -sf https://YOUR_ENDPOINT/health > health.json
  2. Contact support with:

    • Issue description
    • Timeline (when started, any changes)
    • Diagnostic files

Support: [email protected]