Troubleshooting
Common issues and solutions for AxonFlow deployment and operation.
Deployment Issues
Issue: Tasks Not Starting
Symptoms:
- ECS tasks stuck in PENDING state
- Services showing 0 running tasks
- CloudFormation stack creation hangs
Possible Causes:
-
Insufficient subnet capacity
- Private subnets too small for RDS + ECS tasks
- Solution: Use /24 subnets (256 IPs minimum)
-
No NAT Gateway
- Tasks can't pull images from ECR
- Check: Do private subnets have route to NAT Gateway?
- Solution: Add NAT Gateway or use VPC endpoints for ECR
-
Security group misconfiguration
- Tasks can't communicate
- Check: Security group rules allow required ports
- Solution: Verify Agent→Orchestrator (8081), Agent→RDS (5432)
Debug Steps:
# Check ECS service events
aws ecs describe-services \
--cluster YOUR_CLUSTER \
--services agent-service \
--query 'services[0].events[0:5]'
# Check task stopped reason
aws ecs describe-tasks \
--cluster YOUR_CLUSTER \
--tasks TASK_ARN \
--query 'tasks[0].stoppedReason'
Issue: Health Check Failing
Symptoms:
/healthendpoint returns 500- CloudFormation stack rollback
- ALB shows 0 healthy targets
Possible Causes:
-
Database not accessible
- Check: Can Agent reach RDS endpoint?
- Solution: Verify security group allows inbound 5432 from Agent SG
-
Database credentials invalid
- Check: Is DATABASE_URL environment variable correct?
- Solution: Verify password in Secrets Manager matches RDS
-
Orchestrator not reachable
- Check: Can Agent reach Orchestrator on port 8081?
- Solution: Verify Orchestrator service is running
Debug Steps:
# Test database connectivity
curl https://YOUR_AGENT_ENDPOINT/health/db
# Check CloudWatch logs
aws logs tail /ecs/YOUR_STACK/agent --follow
# Test from within VPC
aws ecs execute-command \
--cluster YOUR_CLUSTER \
--task TASK_ARN \
--container agent \
--interactive \
--command "/bin/sh"
Performance Issues
Issue: High Latency (>10ms P95)
Symptoms:
- Policy evaluation taking >10ms P95
- Slow agent responses
- Timeouts under load
Possible Causes:
-
Insufficient Agent capacity
- Too few Agent tasks for load
- Solution: Increase
AgentDesiredCountparameter
-
Database performance
- RDS instance too small
- High database latency
- Check: Monitor RDS CloudWatch metrics
- Solution: Upgrade to larger instance class
-
Network latency
- Cross-AZ latency
- VPC routing issues
- Check: Measure network latency
- Solution: Review VPC configuration
Debug Steps:
# Check Agent CPU/memory
aws ecs describe-services \
--cluster YOUR_CLUSTER \
--services agent-service \
--query 'services[0].deployments[0].runningCount'
# Check database performance
aws cloudwatch get-metric-statistics \
--namespace AWS/RDS \
--metric-name ReadLatency \
--dimensions Name=DBInstanceIdentifier,Value=YOUR_DB \
--start-time 2025-10-23T00:00:00Z \
--end-time 2025-10-23T23:59:59Z \
--period 300 \
--statistics Average
# Load test
curl -X POST https://YOUR_AGENT_ENDPOINT/api/v1/agent/execute \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{"task": "test"}' \
-w "\nTime: %{time_total}s\n"
Solutions:
-
Scale up Agents:
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent-service \
--desired-count 10 -
Upgrade database:
aws rds modify-db-instance \
--db-instance-identifier YOUR_DB \
--db-instance-class db.t3.large \
--apply-immediately -
Enable HTTP/2 (if not already):
- Verify ALB target group uses
ProtocolVersion: HTTP2
- Verify ALB target group uses
Issue: Auto-Scaling Not Working
Symptoms:
- Service doesn't scale despite high CPU
- Desired count not changing
Possible Causes:
-
No auto-scaling target
- Check: Is auto-scaling policy attached?
- Solution: Verify CloudFormation created scaling policy
-
CPU not reaching threshold
- Target is 70%, CPU at 65%
- Solution: Lower threshold or increase load
Debug Steps:
# Check scaling policy
aws application-autoscaling describe-scaling-policies \
--service-namespace ecs \
--resource-id service/YOUR_CLUSTER/agent-service
# Check scaling activities
aws application-autoscaling describe-scaling-activities \
--service-namespace ecs \
--resource-id service/YOUR_CLUSTER/agent-service
Database Issues
Issue: Database Connection Errors
Symptoms:
- "Unable to connect to database"
- Agent health check shows "database": "disconnected"
Possible Causes:
-
Incorrect connection string
- Check: DATABASE_URL environment variable
- Solution: Verify format:
postgresql://user:pass@host:5432/dbname?sslmode=require
-
Security group blocking
- Check: RDS security group inbound rules
- Solution: Allow port 5432 from Agent security group
-
RDS not accessible from private subnet
- Check: RDS subnet group includes private subnets
- Solution: Verify RDS deployed in correct subnets
Debug Steps:
# Test direct connection
psql -h YOUR_DB_ENDPOINT -U axonflow -d axonflow -c "SELECT version();"
# Check security groups
aws ec2 describe-security-groups \
--group-ids sg-xxxxx \
--query 'SecurityGroups[0].IpPermissions'
# Check RDS status
aws rds describe-db-instances \
--db-instance-identifier YOUR_DB \
--query 'DBInstances[0].{Status:DBInstanceStatus,Endpoint:Endpoint.Address}'
Issue: Database Out of Connections
Symptoms:
- "too many connections" errors
- Intermittent database failures
Possible Causes:
- Connection pool exhausted
- Too many Agent tasks for database capacity
Solutions:
-
Increase max_connections:
ALTER SYSTEM SET max_connections = 400;
SELECT pg_reload_conf(); -
Reduce Agent count:
aws ecs update-service \
--cluster YOUR_CLUSTER \
--service agent-service \
--desired-count 5 -
Upgrade database instance:
- Larger instances support more connections
- db.t3.medium = ~400 connections
- db.t3.large = ~800 connections
Policy Issues
Issue: Policy Not Enforcing
Symptoms:
- Requests not being blocked
- Policy rules not applying
Possible Causes:
-
Policy disabled
- Check:
spec.enabledis true - Solution: Enable policy
- Check:
-
Priority too low
- Earlier policy allowing request
- Solution: Increase priority or check other policies
-
Conditions not matching
- Request doesn't match policy conditions
- Solution: Use policy simulation to debug
Debug Steps:
# Simulate policy evaluation
curl -X POST https://YOUR_AGENT_ENDPOINT/api/v1/policies/simulate \
-H "Authorization: Bearer YOUR_API_KEY" \
-d '{
"user_id": "[email protected]",
"resource": "database:customers:read",
"action": "read"
}'
# Check policy is active
curl https://YOUR_AGENT_ENDPOINT/api/v1/policies/pol_abc123 \
-H "Authorization: Bearer YOUR_API_KEY"
Issue: Policy Evaluation Too Slow
Symptoms:
- Policy latency >10ms
- Timeout errors
Possible Causes:
- Too many policies (>50)
- Complex regex patterns
- Large policy files
Solutions:
-
Optimize policies:
- Use simple string matching instead of regex
- Combine related rules
- Remove unused policies
-
Increase priority of frequently-matched policies:
- Higher priority = evaluated earlier
-
Use caching:
- Enable policy result caching (if available)
Connector Issues
Issue: Connector Health Check Failing
Symptoms:
- Connector status: "unhealthy"
- Agent can't access data source
Possible Causes:
-
Credentials invalid
- Check: Secrets Manager contains correct credentials
- Solution: Update secret and restart Agent
-
Network connectivity
- Data source not reachable
- Solution: Check security groups and routing
-
Permissions insufficient
- Connector can't perform required operations
- Solution: Grant necessary permissions
Debug Steps:
# Check connector health
curl https://YOUR_AGENT_ENDPOINT/api/v1/connectors/conn_abc123/health \
-H "Authorization: Bearer YOUR_API_KEY"
# Test connector
curl -X POST https://YOUR_AGENT_ENDPOINT/api/v1/connectors/conn_abc123/test \
-H "Authorization: Bearer YOUR_API_KEY"
Monitoring Issues
Issue: Metrics Not Appearing in CloudWatch
Symptoms:
- CloudWatch dashboard empty
- No metrics in namespace
Possible Causes:
-
IAM permissions
- Task role missing CloudWatch permissions
- Solution: Add
cloudwatch:PutMetricDatapermission
-
Metrics not enabled
- Agent not configured to send metrics
- Solution: Check Agent environment variables
Debug Steps:
# Check IAM role permissions
aws iam get-role-policy \
--role-name YOUR_TASK_ROLE \
--policy-name AxonFlowTaskPolicy
# Manually put metric to test
aws cloudwatch put-metric-data \
--namespace AxonFlow \
--metric-name test_metric \
--value 1
Network Issues
Issue: Cannot Access Agent Endpoint
Symptoms:
- Connection timeout
- No route to host
Possible Causes:
-
Internal ALB
- ALB is internal (in-VPC only)
- Solution: Access from within VPC or use VPN/bastion
-
Security group blocking
- ALB security group doesn't allow your IP
- Solution: Add inbound rule for your IP
-
Wrong endpoint
- Using incorrect URL
- Solution: Get endpoint from CloudFormation outputs
Debug Steps:
# Check from EC2 instance in same VPC
ssh -i key.pem ubuntu@ec2-in-vpc
curl https://AGENT_ENDPOINT/health
# Check ALB status
aws elbv2 describe-load-balancers \
--names YOUR_ALB_NAME
# Check target health
aws elbv2 describe-target-health \
--target-group-arn YOUR_TARGET_GROUP_ARN
Getting Help
Collect Debug Information
Before contacting support, collect:
-
CloudFormation stack events:
aws cloudformation describe-stack-events \
--stack-name YOUR_STACK \
--max-items 50 > stack-events.json -
ECS service description:
aws ecs describe-services \
--cluster YOUR_CLUSTER \
--services agent-service orchestrator-service \
> ecs-services.json -
CloudWatch logs (last 1 hour):
aws logs filter-log-events \
--log-group-name /ecs/YOUR_STACK/agent \
--start-time $(date -u -d '1 hour ago' +%s)000 \
> agent-logs.txt -
Recent metrics:
curl https://YOUR_AGENT_ENDPOINT/api/v1/metrics \
-H "Authorization: Bearer YOUR_API_KEY" \
> metrics.json
Contact Support
Email: [email protected]
Include:
- Stack name and region
- Issue description and symptoms
- Debug information collected above
- CloudFormation template parameters (redact sensitive values)
Response Times:
- Pilot: 24 hours
- Growth: 12 hours
- Enterprise: 4 hours (24/7)
Common Error Messages
| Error | Meaning | Solution |
|---|---|---|
ResourceInitializationError: unable to pull secrets | Can't access Secrets Manager | Check task execution role permissions |
CannotPullContainerError | Can't pull ECR image | Add NAT Gateway or ECR VPC endpoint |
ECS service AGENT_SERVICE did not stabilize | Tasks keep failing health checks | Check Agent logs and database connectivity |
The specified subnet does not have enough IP addresses | Subnet full | Use larger subnet (/24) or reduce task count |
Policy evaluation timeout | Policy taking too long | Reduce policy complexity or increase Agent count |