CloudWatch Integration
CloudWatch is relevant for AxonFlow teams running on AWS, especially when they need centralized log retention, alarms, and handoff into the wider AWS operations toolchain. The core community observability story is still Prometheus plus Grafana. CloudWatch is the AWS-facing layer you add on top when your deployment, security, or platform operations already standardize on CloudWatch.
Overview
Use CloudWatch for:
- Logs shipped from Docker, ECS, or surrounding AWS infrastructure
- Alarms that route into SNS, PagerDuty, or internal AWS operations flows
- Dashboards that combine AxonFlow health with load balancers, databases, and compute
- AWS-native visibility when platform teams want one control surface across multiple services
Use Prometheus and Grafana for:
- the built-in runtime metrics exposed by Agent and Orchestrator
- detailed policy, request, and latency dashboards bundled with the local and staging stack
- most day-to-day debugging and regression analysis
When CloudWatch Is The Right Choice
CloudWatch becomes the better front door when:
- your AxonFlow deployment already sits inside a broader AWS operations estate
- incident response routes through SNS, CloudWatch alarms, or internal AWS workflows
- platform teams want one place to correlate AxonFlow health with ALB, ECS, RDS, or VPC signals
- security and compliance teams expect standardized AWS-native log retention and access controls
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ AxonFlow CloudWatch Integration │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Agent │───▶│ CloudWatch │───▶│ Alarms & │ │
│ │ │ │ Agent │ │ Dashboard │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │Orchestrator │───▶│ CloudWatch │ │
│ │ │ │ Logs │ │
│ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Deployment Pattern
The practical AWS pattern looks like this:
AxonFlow Agent / Orchestrator
├─ Prometheus scrape targets for runtime metrics
├─ Grafana dashboards for platform analysis
└─ CloudWatch Logs shipping for centralized AWS log retention and alerting
This page intentionally does not document a "turn on CloudWatch metrics with one env var" switch, because the platform does not expose a generic community AXONFLOW_METRICS_PROVIDER=cloudwatch setting. If you are on AWS, wire CloudWatch in at the deployment layer and keep Prometheus/Grafana as the runtime source of truth.
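On ECS, the usual deployment-layer wiring is the awslogs log driver in the task definition. A sketch with illustrative group, region, and prefix values; note that awslogs-create-group requires the logs:CreateLogGroup permission covered in the IAM section:

```json
{
  "containerDefinitions": [
    {
      "name": "axonflow-agent",
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/axonflow/production/agent",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "agent",
          "awslogs-create-group": "true"
        }
      }
    }
  ]
}
```

Plain Docker hosts can use the same driver via --log-driver=awslogs and matching --log-opt flags.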
IAM Permissions
Required IAM permissions depend on how you ship logs and whether your AWS deployment also publishes custom application metrics. The common minimum for log shipping is:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:DescribeLogGroups",
"logs:DescribeLogStreams"
],
"Resource": [
"arn:aws:logs:*:*:log-group:/axonflow/*"
]
}
]
}
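If roles are managed in Terraform, the same statement can be attached inline. A sketch; aws_iam_role.axonflow_task is a hypothetical task role reference, not something AxonFlow ships:

```hcl
# Attach the log-shipping policy to an existing task role
# (aws_iam_role.axonflow_task is assumed to be defined elsewhere).
resource "aws_iam_role_policy" "axonflow_logs" {
  name = "axonflow-cloudwatch-logs"
  role = aws_iam_role.axonflow_task.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ]
      Resource = ["arn:aws:logs:*:*:log-group:/axonflow/*"]
    }]
  })
}
```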
What To Send To CloudWatch
The highest-value data to mirror into CloudWatch is usually:
- structured application logs from Agent and Orchestrator
- ALB / API Gateway / ECS / EC2 health signals
- deployment and infrastructure alarms
- selected business or operations alerts derived from Prometheus metrics
If you need detailed runtime metrics, scrape Prometheus first and then mirror or summarize only the signals your AWS operations team actually pages on.
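One way to do that summarizing step is a small publisher that turns Prometheus-derived values into PutMetricData payloads. A sketch: the namespace, metric name, and dimensions are illustrative, and the actual boto3 call is left commented out so the helper stays side-effect free:

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None):
    """Build one CloudWatch MetricDatum dict for put_metric_data.

    `dimensions` is a plain {name: value} mapping. Keep cardinality low:
    every distinct dimension combination is billed as a separate metric.
    """
    return {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": datetime.now(timezone.utc),
        "Dimensions": [
            {"Name": k, "Value": v} for k, v in (dimensions or {}).items()
        ],
    }


# Example: a policy-block counter summarized from Prometheus.
datum = build_metric_datum(
    "PolicyBlocks", 7, dimensions={"Service": "orchestrator"}
)

# To publish (requires the cloudwatch:PutMetricData IAM permission):
# import boto3
# boto3.client("cloudwatch").put_metric_data(
#     Namespace="AxonFlow/Derived", MetricData=[datum]
# )
```

Batching up to 1,000 data points per call keeps API costs down for high-frequency signals.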
CloudWatch Alarms
Recommended Alarms
Application Load Balancer 5xx Errors
AWSTemplateFormatVersion: '2010-09-09'
Resources:
HighErrorRateAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: axonflow-high-error-rate
AlarmDescription: Target group is returning elevated 5xx responses
Namespace: AWS/ApplicationELB
MetricName: HTTPCode_Target_5XX_Count
Dimensions:
- Name: LoadBalancer
Value: app/axonflow-prod/1234567890abcdef
Statistic: Sum
Period: 300
EvaluationPeriods: 2
Threshold: 5
ComparisonOperator: GreaterThanThreshold
AlarmActions:
- !Ref AlertSNSTopic
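The AlertSNSTopic the alarm references must be defined in the same template. A minimal sketch, with a placeholder subscription endpoint:

```yaml
AlertSNSTopic:
  Type: AWS::SNS::Topic
  Properties:
    TopicName: axonflow-alerts

AlertEmailSubscription:
  Type: AWS::SNS::Subscription
  Properties:
    TopicArn: !Ref AlertSNSTopic
    Protocol: email
    Endpoint: ops@example.com  # placeholder address
```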
High Target Latency
HighLatencyAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: axonflow-high-latency
    AlarmDescription: ALB p95 target latency exceeds 5 seconds
    Namespace: AWS/ApplicationELB
    MetricName: TargetResponseTime
    Dimensions:
      - Name: LoadBalancer
        Value: app/axonflow-prod/1234567890abcdef
    ExtendedStatistic: p95
    Period: 300
    EvaluationPeriods: 3
    Threshold: 5  # TargetResponseTime is reported in seconds, not milliseconds
    ComparisonOperator: GreaterThanThreshold
ECS Task Restart Spike
RestartSpikeAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: axonflow-ecs-restarts
AlarmDescription: ECS service is churning tasks unexpectedly
Namespace: ECS/ContainerInsights
MetricName: RestartCount
Dimensions:
- Name: ServiceName
Value: axonflow-agent
Statistic: Sum
Period: 300
EvaluationPeriods: 1
Threshold: 3
ComparisonOperator: GreaterThanThreshold
Log Ingestion Failure Or Silence
MissingLogsAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    AlarmName: axonflow-log-silence
    AlarmDescription: Expected application logs are missing
    Namespace: AWS/Logs
    MetricName: IncomingLogEvents
    Dimensions:
      - Name: LogGroupName
        Value: /axonflow/production/agent
    Statistic: Sum
    Period: 300
    EvaluationPeriods: 1
    Threshold: 1
    ComparisonOperator: LessThanThreshold
    TreatMissingData: breaching  # total silence produces no datapoints at all
CloudWatch Logs
Log Groups
/axonflow/
├── production/
│ ├── agent/
│ ├── orchestrator/
│ └── customer-portal/
└── staging/
├── agent/
├── orchestrator/
└── customer-portal/
Log Format
AxonFlow logs are most useful in CloudWatch when they remain structured JSON and preserve fields like request ID, service name, latency, model, and policy context:
{
"timestamp": "2025-12-08T10:30:00.123Z",
"level": "info",
"service": "orchestrator",
"request_id": "req-abc123",
"organization_id": "org-xyz",
"user_id": "user-123",
"message": "Request processed successfully",
"latency_ms": 245,
"model": "claude-sonnet-4",
"tokens_used": 1250,
"policy_evaluations": 3,
"blocked": false
}
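If a service does not already emit JSON, a formatter on Python's standard logging module is one way to get this shape. A sketch; the fields passed via extra are illustrative:

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON line, CloudWatch-friendly."""

    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Merge structured fields passed via extra={"fields": {...}}.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)


logger = logging.getLogger("orchestrator")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "Request processed successfully",
    extra={"fields": {
        "service": "orchestrator",
        "request_id": "req-abc123",
        "latency_ms": 245,
    }},
)
```

One JSON object per line is what the Logs Insights field auto-discovery expects; avoid multi-line pretty-printing.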
Log Insights Queries
Error Analysis
fields @timestamp, @message, level, service, error_message
| filter level = "error"
| sort @timestamp desc
| limit 100
Latency Distribution
stats avg(latency_ms) as avg_latency,
percentile(latency_ms, 50) as p50,
percentile(latency_ms, 95) as p95,
percentile(latency_ms, 99) as p99
by bin(5m)
| sort @timestamp desc
Policy Blocks by Type
fields @timestamp, policy_id, blocked_reason
| filter blocked = true
| stats count(*) as block_count by policy_id
| sort block_count desc
Token Usage by Organization
fields @timestamp, organization_id, tokens_used
| stats sum(tokens_used) as total_tokens by organization_id
| sort total_tokens desc
Request Volume
stats count(*) as request_count by bin(1h)
| sort @timestamp desc
Error Rate Over Time
filter level = "error" or level = "info"
| stats count(*) as total,
        sum(strcontains(level, "error")) as errors,
        sum(strcontains(level, "error")) * 100.0 / count(*) as error_rate_pct
  by bin(5m)
| sort @timestamp desc
Latency Percentiles by Model
filter ispresent(latency_ms) and ispresent(model)
| stats percentile(latency_ms, 50) as p50,
percentile(latency_ms, 90) as p90,
percentile(latency_ms, 95) as p95,
percentile(latency_ms, 99) as p99,
count(*) as requests
by model
| sort p95 desc
Policy Violations by Type and Severity
filter blocked = true
| stats count(*) as violations by policy_id, blocked_reason
| sort violations desc
| limit 25
Slow Requests (Above P95 Threshold)
filter latency_ms > 5000
| fields @timestamp, service, model, latency_ms, tokens_used, organization_id
| sort latency_ms desc
| limit 50
Log Retention
Configure log retention to manage costs:
LogGroup:
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: /axonflow/production
RetentionInDays: 30 # Adjust based on requirements
Retention should follow your operating model. Community teams often keep shorter-lived logs for debugging, while evaluation and enterprise buyers usually align retention to internal policy, incident response, or regulator expectations.
CloudWatch Dashboards
Creating a Dashboard
AxonFlowDashboard:
Type: AWS::CloudWatch::Dashboard
Properties:
DashboardName: AxonFlow-Production
DashboardBody: !Sub |
{
"widgets": [
{
"type": "metric",
"x": 0, "y": 0,
"width": 12, "height": 6,
"properties": {
"title": "ALB Request Volume",
"region": "${AWS::Region}",
"metrics": [
["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/axonflow-prod/1234567890abcdef"]
],
"period": 300,
"stat": "Sum"
}
},
{
"type": "metric",
"x": 12, "y": 0,
"width": 12, "height": 6,
"properties": {
"title": "Target Latency",
"region": "${AWS::Region}",
"metrics": [
["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/axonflow-prod/1234567890abcdef", {"stat": "Average"}],
["...", {"stat": "p95"}],
["...", {"stat": "p99"}]
],
"period": 300
}
}
]
}
Dashboard Widgets
Request Overview
{
"type": "metric",
"properties": {
"title": "Request Overview",
"metrics": [
["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/axonflow-prod/1234567890abcdef", {"label": "Total"}],
[".", "HTTPCode_Target_5XX_Count", ".", ".", {"label": "Target 5xx"}],
[".", "HTTPCode_ELB_5XX_Count", ".", ".", {"label": "ELB 5xx"}]
],
"view": "timeSeries",
"period": 300
}
}
Target Latency
{
"type": "metric",
"properties": {
"title": "Target Latency",
"metrics": [
["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/axonflow-prod/1234567890abcdef"]
],
"view": "timeSeries",
"period": 300,
"stat": "p95"
}
}
Error Rate
{
"type": "metric",
"properties": {
"title": "Error Rate (%)",
"metrics": [
[{
"expression": "(m2/m1)*100",
"label": "Error Rate",
"id": "e1"
}],
["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/axonflow-prod/1234567890abcdef", {"id": "m1", "visible": false}],
[".", "HTTPCode_Target_5XX_Count", ".", ".", {"id": "m2", "visible": false}]
],
"view": "timeSeries",
"period": 300
}
}
What To Keep In Grafana Versus CloudWatch
As a practical split:
- keep runtime counters, latency histograms, policy blocks, connector metrics, and token metrics in Prometheus and Grafana
- mirror logs, deployment alarms, ALB health, ECS churn, and AWS infrastructure alerts into CloudWatch
- page only on the derived signals your operations team genuinely needs
That split keeps AxonFlow observability rich without turning CloudWatch into an expensive copy of every runtime dashboard.
Terraform Configuration
Complete CloudWatch Setup
# cloudwatch.tf
# Log Group
resource "aws_cloudwatch_log_group" "axonflow" {
name = "/axonflow/${var.environment}"
retention_in_days = var.log_retention_days
tags = {
Environment = var.environment
Application = "AxonFlow"
}
}
# Metric Alarm - High Error Rate
resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
  alarm_name          = "axonflow-${var.environment}-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 300
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "High upstream error rate detected"

  # ALB metrics are dimensioned by load balancer, not by arbitrary tags.
  dimensions = {
    LoadBalancer = var.alb_arn_suffix # e.g. app/axonflow-prod/1234567890abcdef
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}
# Metric Alarm - High Latency
resource "aws_cloudwatch_metric_alarm" "high_latency" {
  alarm_name          = "axonflow-${var.environment}-high-latency"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  period              = 300
  extended_statistic  = "p95"
  threshold           = 5 # TargetResponseTime is reported in seconds
  alarm_description   = "ALB p95 target latency exceeds 5 seconds"

  dimensions = {
    LoadBalancer = var.alb_arn_suffix # e.g. app/axonflow-prod/1234567890abcdef
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}
# SNS Topic for Alerts
resource "aws_sns_topic" "alerts" {
name = "axonflow-${var.environment}-alerts"
}
resource "aws_sns_topic_subscription" "email" {
topic_arn = aws_sns_topic.alerts.arn
protocol = "email"
endpoint = var.alert_email
}
# Dashboard
resource "aws_cloudwatch_dashboard" "main" {
dashboard_name = "AxonFlow-${var.environment}"
dashboard_body = templatefile("${path.module}/dashboard.json.tpl", {
environment = var.environment
region = var.aws_region
})
}
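The snippets above reference variables they do not define. A matching variables file might look like this (defaults are illustrative):

```hcl
variable "environment" {
  type    = string
  default = "production"
}

variable "log_retention_days" {
  type    = number
  default = 30
}

variable "alert_email" {
  type = string
}

variable "aws_region" {
  type = string
}

variable "alb_arn_suffix" {
  type        = string
  description = "ALB dimension value, e.g. app/axonflow-prod/1234567890abcdef"
}
```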
Best Practices
Metrics
- Use meaningful dimensions - Add context without high cardinality
- Set appropriate periods - Balance granularity and cost
- Create composite metrics - Calculate error rates, success rates
- Monitor trends - Use anomaly detection for baseline deviations
Logging
- Structured logging - Use JSON format consistently
- Include correlation IDs - Track requests across services
- Log levels appropriately - DEBUG in dev, INFO/WARN in prod
- Manage retention - Balance debugging needs and costs
Alarms
- Avoid alarm fatigue - Set meaningful thresholds
- Use multiple periods - Reduce false positives
- Configure actions - Automated responses where possible
- Document runbooks - Link to resolution steps
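One way to act on several of these points at once is a composite alarm, so a single combined condition pages instead of each alarm individually. A sketch reusing the alarm names from the examples above:

```yaml
AxonFlowServiceDegraded:
  Type: AWS::CloudWatch::CompositeAlarm
  Properties:
    AlarmName: axonflow-service-degraded
    AlarmRule: >-
      ALARM("axonflow-high-error-rate") OR
      (ALARM("axonflow-high-latency") AND ALARM("axonflow-ecs-restarts"))
    AlarmActions:
      - !Ref AlertSNSTopic
```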
Cost Optimization
- Filter metrics - Only publish what you need
- Aggregate logs - Use Log Insights instead of streaming
- Set retention policies - Don't store logs indefinitely
- Use metric math - Calculate derived metrics in CloudWatch
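For deriving metrics from logs without streaming them to a third system, a metric filter is the cheap option. A sketch that counts error-level JSON log lines; the log group and namespace names are illustrative:

```yaml
ErrorCountFilter:
  Type: AWS::Logs::MetricFilter
  Properties:
    LogGroupName: /axonflow/production/agent
    FilterPattern: '{ $.level = "error" }'
    MetricTransformations:
      - MetricName: AgentErrorCount
        MetricNamespace: AxonFlow/Logs
        MetricValue: "1"
        DefaultValue: 0
```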
Troubleshooting
Metrics Not Appearing
- Verify IAM permissions
- Check namespace spelling
- Confirm region configuration
- Wait for metric aggregation (up to 5 minutes)
Logs Not Streaming
- Check log group exists
- Verify IAM permissions
- Confirm log agent is running
- Check log format configuration
Alarms Not Triggering
- Verify metric data exists
- Check dimension spelling
- Review threshold values
- Confirm alarm state
