Runtime Troubleshooting
This guide covers runtime issues that may occur after successful deployment. For deployment-specific issues, see the Deployment Troubleshooting guide.
Quick Diagnostics
Run these commands first to assess system state:
# Check all services health
curl -sf https://YOUR_ENDPOINT/health | jq .
curl -sf https://YOUR_ENDPOINT/orchestrator/health | jq .
# Check recent errors (last 30 min)
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "ERROR" \
--start-time $(date -d '30 minutes ago' +%s000) \
--region YOUR_REGION
Service Issues
Service Becomes Unhealthy
Symptoms:
- Health endpoint returns non-200
- ALB showing unhealthy targets
- Intermittent request failures
Diagnosis:
# Check current health
curl -sf https://YOUR_ENDPOINT/health
# Check ALB target health
aws elbv2 describe-target-health \
--target-group-arn YOUR_TG_ARN \
--region YOUR_REGION
# Check ECS task status
aws ecs describe-services \
--cluster YOUR_CLUSTER \
--services agent \
--query 'services[0].{Running:runningCount,Desired:desiredCount,Events:events[0:3]}'
Common Causes & Solutions:
| Cause | Indicators | Solution |
|---|---|---|
| Memory pressure | OOMKilled in logs | Increase task memory |
| Database connection pool exhausted | "too many connections" | Reduce pool size or add replicas |
| Downstream dependency down | Timeout errors | Check dependent services |
| Resource starvation | High CPU in metrics | Scale up or out |
Service Restart Loop
Symptoms:
- Tasks keep restarting
- ECS events show repeated task stops
- Circuit breaker triggered
Diagnosis:
# Check stopped tasks
aws ecs list-tasks \
--cluster YOUR_CLUSTER \
--service-name agent \
--desired-status STOPPED \
--region YOUR_REGION
# Get stop reason
TASK_ARN=$(aws ecs list-tasks --cluster YOUR_CLUSTER --service-name agent --desired-status STOPPED --query 'taskArns[0]' --output text)
aws ecs describe-tasks \
--cluster YOUR_CLUSTER \
--tasks $TASK_ARN \
--query 'tasks[0].{Status:lastStatus,Reason:stoppedReason,Code:containers[0].exitCode}'
Common Exit Codes:
| Code | Meaning | Action |
|---|---|---|
| 0 | Normal exit | Check if intended (scaling down?) |
| 1 | Application error | Check logs for startup errors |
| 137 | OOMKilled | Increase memory allocation |
| 139 | Segfault | Check for memory corruption, update image |
| 143 | SIGTERM | Normal shutdown (deployment/scaling) |
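To triage several crashes at once, the stop reasons and exit codes of all recent stopped tasks can be pulled in a single call; a sketch assuming the same cluster and service names used above:
# Summarize stop reasons and exit codes across recent stopped tasks
aws ecs describe-tasks \
--cluster YOUR_CLUSTER \
--tasks $(aws ecs list-tasks --cluster YOUR_CLUSTER --service-name agent --desired-status STOPPED --query 'taskArns[]' --output text) \
--query 'tasks[].[stoppedAt,stoppedReason,containers[0].exitCode]' \
--output table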
Request Handling Issues
Requests Timing Out
Symptoms:
- 504 Gateway Timeout errors
- Requests taking > 30 seconds
- Clients receiving timeout errors
Diagnosis:
# Check latency metrics
curl -s https://YOUR_ENDPOINT/metrics | grep request_duration
# Check for slow queries in logs
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "duration_ms" \
--region YOUR_REGION | jq -r '.events[].message' | sort -t: -k2 -n | tail -20
Solutions:
- Identify the slow component:
  - Database queries → Add indexes, optimize queries
  - External connectors → Increase timeout, add retry
  - Policy evaluation → Simplify policies
- Adjust timeouts:
# ALB idle timeout (default 60s)
aws elbv2 modify-load-balancer-attributes \
--load-balancer-arn YOUR_ALB_ARN \
--attributes Key=idle_timeout.timeout_seconds,Value=120
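Before raising the timeout, it may help to confirm the current value; same ALB ARN as above:
# Check the current idle timeout
aws elbv2 describe-load-balancer-attributes \
--load-balancer-arn YOUR_ALB_ARN \
--query 'Attributes[?Key==`idle_timeout.timeout_seconds`]'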
High Error Rate
Symptoms:
- 5xx errors spiking
- Error rate > 1%
- Alerts firing
Diagnosis:
# Get error breakdown from logs
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "level=error" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION \
| jq -r '.events[].message' | sort | uniq -c | sort -rn | head -20
# Check error rate in metrics
curl -s https://YOUR_ENDPOINT/metrics | grep -E "requests_total|errors_total"
Common Error Patterns:
| Error | Cause | Solution |
|---|---|---|
| "connection refused" | Downstream service down | Check dependent services |
| "context deadline exceeded" | Timeout | Increase timeout or optimize |
| "permission denied" | IAM/policy issue | Check permissions |
| "rate limit exceeded" | Too many requests | Implement backoff |
Database Issues
Connection Pool Exhausted
Symptoms:
- "too many connections" errors
- Intermittent database failures
- Slow queries
Diagnosis:
# Check current connections (requires DB access)
psql -h YOUR_DB_ENDPOINT -U axonflow -c "SELECT count(*) FROM pg_stat_activity;"
# Check RDS metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name DatabaseConnections \
--dimensions Name=DBInstanceIdentifier,Value=YOUR_DB \
--start-time $(date -u -d '1 hour ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--period 300 \
--statistics Maximum
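To see which clients hold the connections, pg_stat_activity can be grouped by state and application; this uses standard PostgreSQL catalog columns with the same credentials as the check above:
# Break down connections by state and client application
psql -h YOUR_DB_ENDPOINT -U axonflow -c \
"SELECT state, application_name, count(*) FROM pg_stat_activity GROUP BY 1, 2 ORDER BY 3 DESC;"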
Solutions:
- Reduce the connection pool size per service:
  - Each Agent task uses ~10 connections
  - Scale down replicas or reduce the pool size
- Upgrade the database instance:
  - db.t3.micro: ~80 connections
  - db.t3.small: ~150 connections
  - db.t3.medium: ~400 connections
  - db.t3.large: ~800 connections
- Add a connection pooler (PgBouncer), as sketched after this list:
  - For high-replica deployments
  - Reduces total connections to the database
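A minimal PgBouncer sketch, written as a heredoc for convenience; the pool sizes, paths, and auth settings are illustrative assumptions, not shipped defaults:
# Minimal pgbouncer.ini in transaction-pooling mode (illustrative values)
cat > /etc/pgbouncer/pgbouncer.ini <<'EOF'
[databases]
axonflow = host=YOUR_DB_ENDPOINT port=5432 dbname=axonflow

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 500
default_pool_size = 20
EOF
Services then connect to PgBouncer on port 6432 instead of the database directly.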
Slow Queries
Symptoms:
- High database latency
- P99 response times elevated
- RDS CPU spikes
Diagnosis:
# Check RDS Performance Insights (if enabled)
aws pi get-resource-metrics \
--service-type RDS \
--identifier db-XXXXX \
--metric-queries file://pi-query.json \
--start-time $(date -u -d '1 hour ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ)
# Check slow query log
aws logs filter-log-events \
--log-group-name /aws/rds/instance/YOUR_DB/postgresql \
--filter-pattern "duration:" \
--region YOUR_REGION
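Entries only appear in the slow query log if log_min_duration_statement is set on the instance's parameter group; a hedged example that logs anything over 500 ms (the parameter group name is an assumption):
# Enable slow query logging on the instance's parameter group
aws rds modify-db-parameter-group \
--db-parameter-group-name YOUR_PARAM_GROUP \
--parameters "ParameterName=log_min_duration_statement,ParameterValue=500,ApplyMethod=immediate"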
Solutions:
- Add missing indexes
- Optimize queries: check EXPLAIN plans
- Increase RDS instance size
- Enable query caching (if applicable)
Memory and CPU Issues
High Memory Usage
Symptoms:
- OOMKilled containers
- Memory metrics near limit
- Performance degradation
Diagnosis:
# Check container memory
aws ecs describe-tasks \
--cluster YOUR_CLUSTER \
--tasks TASK_ARN \
--query 'tasks[0].containers[0].{Memory:memory,MemoryReservation:memoryReservation}'
# Check CloudWatch Container Insights (if enabled)
aws cloudwatch get-metric-statistics \
--namespace ECS/ContainerInsights \
--metric-name MemoryUtilized \
--dimensions Name=ClusterName,Value=YOUR_CLUSTER Name=ServiceName,Value=agent \
--start-time $(date -u -d '1 hour ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--period 300 \
--statistics Maximum
Solutions:
- Increase task memory (a sketch follows this list):
  - Update the task definition
  - Redeploy the service
- Check for memory leaks:
  - Monitor memory growth over time
  - Check for goroutine leaks
- Optimize memory usage:
  - Reduce connection pool sizes
  - Limit concurrent requests
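A sketch of the memory bump, assuming the task definition family is named agent and jq is available; the 2048 MB value is illustrative:
# Fetch the active task definition and strip read-only fields
aws ecs describe-task-definition \
--task-definition agent \
--query 'taskDefinition' > td.json
jq 'del(.taskDefinitionArn, .revision, .status, .requiresAttributes, .compatibilities, .registeredAt, .registeredBy) | .memory = "2048"' td.json > td-new.json
# Register the new revision and roll the service onto it
aws ecs register-task-definition --cli-input-json file://td-new.json
aws ecs update-service --cluster YOUR_CLUSTER --service agent --task-definition agent --force-new-deployment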
High CPU Usage
Symptoms:
- CPU throttling
- Increased latency
- Slow request processing
Diagnosis:
# Check ECS CPU reservation vs utilization
aws cloudwatch get-metric-statistics \
--namespace AWS/ECS \
--metric-name CPUUtilization \
--dimensions Name=ClusterName,Value=YOUR_CLUSTER Name=ServiceName,Value=agent \
--start-time $(date -u -d '1 hour ago' +%FT%TZ) \
--end-time $(date -u +%FT%TZ) \
--period 300 \
--statistics Average,Maximum
Solutions:
- Scale horizontally (an autoscaling sketch follows this list):
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--desired-count 4
- Increase CPU allocation in the task definition
- Profile the application for CPU hotspots
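If CPU load is sustained rather than spiky, target tracking can automate the scaling; a sketch assuming 2-8 tasks and a 60% average CPU target:
# Register the service as a scalable target
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/YOUR_CLUSTER/agent \
--min-capacity 2 \
--max-capacity 8
# Scale to hold average CPU near 60%
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/YOUR_CLUSTER/agent \
--policy-name agent-cpu-target \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{"TargetValue":60.0,"PredefinedMetricSpecification":{"PredefinedMetricType":"ECSServiceAverageCPUUtilization"}}'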
Log Analysis
Finding Errors
# All errors in last hour
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "ERROR" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION
# Specific error type
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "database connection" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION
Tracing Requests
# Find request by ID
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "request_id=req-abc123" \
--region YOUR_REGION
# Find all requests for a user (substitute the user's email or ID)
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "YOUR_USER_EMAIL" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION
Performance Analysis
# Find slow requests (>100ms)
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--filter-pattern "[..., duration_ms > 100, ...]" \
--start-time $(date -d '1 hour ago' +%s000) \
--region YOUR_REGION
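CloudWatch Logs Insights can run the same analysis server-side, which scales better than filtering client-side; a sketch assuming the logs emit a numeric duration_ms field:
# Top 20 slowest requests in the last hour (start/end times are in seconds, not ms)
QUERY_ID=$(aws logs start-query \
--log-group-name /ecs/YOUR_STACK/agent \
--start-time $(date -d '1 hour ago' +%s) \
--end-time $(date +%s) \
--query-string 'fields @timestamp, @message | filter duration_ms > 100 | sort duration_ms desc | limit 20' \
--query 'queryId' --output text)
# Results become available after a few seconds
aws logs get-query-results --query-id $QUERY_ID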
Recovery Procedures
Restart a Service
# Force new deployment (rolling restart)
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--force-new-deployment
# Wait for stability
aws ecs wait services-stable \
--cluster YOUR_CLUSTER \
--services agent
Scale Service
# Scale up
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--desired-count 4
# Scale down
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--desired-count 2
Rollback Deployment
# Get previous task definition
PREV_TD=$(aws ecs describe-services \
--cluster YOUR_CLUSTER \
--services agent \
--query 'services[0].deployments[1].taskDefinition' \
--output text)
# Rollback
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent \
--task-definition $PREV_TD \
--force-new-deployment
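The query above finds deployments[1] only while the old deployment is still listed on the service; once it has been pruned, the task-definition revision history is an alternative source for the previous revision:
# List recent revisions of the agent task definition family, newest first
aws ecs list-task-definitions \
--family-prefix agent \
--sort DESC \
--max-items 5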
Getting Help
If you can't resolve an issue:
- Collect diagnostics:
# Service status
aws ecs describe-services --cluster YOUR_CLUSTER --services agent orchestrator > diagnostics.json
# Recent logs
aws logs filter-log-events --log-group-name /ecs/YOUR_STACK/agent --start-time $(date -d '1 hour ago' +%s000) > logs.txt
# Health endpoints
curl -sf https://YOUR_ENDPOINT/health > health.json
- Contact support with:
  - Issue description
  - Timeline (when it started, any recent changes)
  - Diagnostic files
Support: [email protected]
Related Documentation
- Deployment Troubleshooting - Deployment issues
- Monitoring Overview - Set up monitoring
- Architecture Overview - System components