CloudWatch Integration

CloudWatch is relevant for AxonFlow teams running on AWS, especially when they need centralized log retention, alarms, and handoff into the wider AWS operations toolchain. The core community observability story is still Prometheus plus Grafana. CloudWatch is the AWS-facing layer you add on top when your deployment, security, or platform operations already standardize on CloudWatch.

Overview

Use CloudWatch for:

  • Logs shipped from Docker, ECS, or surrounding AWS infrastructure
  • Alarms that route into SNS, PagerDuty, or internal AWS operations flows
  • Dashboards that combine AxonFlow health with load balancers, databases, and compute
  • AWS-native visibility when platform teams want one control surface across multiple services

Use Prometheus and Grafana for:

  • the built-in runtime metrics exposed by Agent and Orchestrator
  • detailed policy, request, and latency dashboards bundled with the local and staging stack
  • most day-to-day debugging and regression analysis

When CloudWatch Is The Right Choice

CloudWatch becomes the better front door when:

  • your AxonFlow deployment already sits inside a broader AWS operations estate
  • incident response routes through SNS, CloudWatch alarms, or internal AWS workflows
  • platform teams want one place to correlate AxonFlow health with ALB, ECS, RDS, or VPC signals
  • security and compliance teams expect standardized AWS-native log retention and access controls

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                 AxonFlow CloudWatch Integration                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │    Agent    │───▶│ CloudWatch  │───▶│  Alarms &   │          │
│  │             │    │    Agent    │    │  Dashboard  │          │
│  └─────────────┘    └─────────────┘    └─────────────┘          │
│                                                                 │
│  ┌─────────────┐    ┌─────────────┐                             │
│  │Orchestrator │───▶│ CloudWatch  │                             │
│  │             │    │    Logs     │                             │
│  └─────────────┘    └─────────────┘                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Deployment Pattern

The practical AWS pattern looks like this:

AxonFlow Agent / Orchestrator
├─ Prometheus scrape targets for runtime metrics
├─ Grafana dashboards for platform analysis
└─ CloudWatch Logs shipping for centralized AWS log retention and alerting

This page intentionally does not document a "turn on CloudWatch metrics with one env var" switch, because the community platform does not expose a generic AXONFLOW_METRICS_PROVIDER=cloudwatch setting. If you are on AWS, wire CloudWatch in at the deployment layer and keep Prometheus/Grafana as the runtime source of truth.
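
For ECS-based deployments, that deployment-layer wiring is usually just the log driver on the task definition. A minimal sketch, assuming Fargate; the task family, image, and execution role names are placeholders rather than anything mandated by AxonFlow:

  AgentTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: axonflow-agent
      RequiresCompatibilities: [FARGATE]
      NetworkMode: awsvpc
      Cpu: "512"
      Memory: "1024"
      ExecutionRoleArn: !GetAtt TaskExecutionRole.Arn  # role needs the log permissions below
      ContainerDefinitions:
        - Name: agent
          Image: <your-registry>/axonflow-agent:latest  # placeholder image
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: /axonflow/production/agent
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: agent

Plain Docker hosts can use the same awslogs logging driver, set either per container or in the daemon configuration.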

IAM Permissions

Required IAM permissions depend on how you ship logs and whether your AWS deployment also publishes custom application metrics. The common minimum for log shipping is:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource": [
        "arn:aws:logs:*:*:log-group:/axonflow/*"
      ]
    }
  ]
}
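
If you run on ECS with the awslogs driver sketched above, this policy typically lands on the task execution role. A sketch, again with placeholder names:

  TaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: axonflow-task-execution
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
      Policies:
        - PolicyName: axonflow-cloudwatch-logs
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                  - logs:DescribeLogGroups
                  - logs:DescribeLogStreams
                Resource: arn:aws:logs:*:*:log-group:/axonflow/*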

What To Send To CloudWatch

The highest-value data to mirror into CloudWatch is usually:

  • structured application logs from Agent and Orchestrator
  • ALB / API Gateway / ECS / EC2 health signals
  • deployment and infrastructure alarms
  • selected business or operations alerts derived from Prometheus metrics

If you need detailed runtime metrics, scrape Prometheus first and then mirror or summarize only the signals your AWS operations team actually pages on.

CloudWatch Alarms

Application Load Balancer 5xx Errors

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  HighErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: axonflow-high-error-rate
      AlarmDescription: Target group is returning elevated 5xx responses
      Namespace: AWS/ApplicationELB
      MetricName: HTTPCode_Target_5XX_Count
      Dimensions:
        - Name: LoadBalancer
          Value: app/axonflow-prod/1234567890abcdef
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlertSNSTopic

High Target Latency

  HighLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: axonflow-high-latency
      AlarmDescription: ALB target latency exceeds 5 seconds
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: LoadBalancer
          Value: app/axonflow-prod/1234567890abcdef
      ExtendedStatistic: p95
      Period: 300
      EvaluationPeriods: 3
      Threshold: 5  # TargetResponseTime is reported in seconds
      ComparisonOperator: GreaterThanThreshold

ECS Task Restart Spike

  RestartSpikeAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: axonflow-ecs-restarts
      AlarmDescription: ECS service is churning tasks unexpectedly
      Namespace: ECS/ContainerInsights  # requires Container Insights enabled on the cluster
      MetricName: RestartCount
      Dimensions:
        - Name: ClusterName
          Value: axonflow-prod
        - Name: ServiceName
          Value: axonflow-agent
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 3
      ComparisonOperator: GreaterThanThreshold

Log Ingestion Failure Or Silence

  MissingLogsAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: axonflow-log-silence
      AlarmDescription: Expected application logs are missing
      Namespace: AWS/Logs
      MetricName: IncomingLogEvents
      Dimensions:
        - Name: LogGroupName
          Value: /axonflow/production/agent
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: LessThanThreshold
      TreatMissingData: breaching  # a silent log group publishes no datapoints at all

CloudWatch Logs

Log Groups

/axonflow/
├── production/
│   ├── agent/
│   ├── orchestrator/
│   └── customer-portal/
└── staging/
    ├── agent/
    ├── orchestrator/
    └── customer-portal/

Log Format

AxonFlow logs are most useful in CloudWatch when they remain structured JSON and preserve fields like request ID, service name, latency, model, and policy context:

{
  "timestamp": "2025-12-08T10:30:00.123Z",
  "level": "info",
  "service": "orchestrator",
  "request_id": "req-abc123",
  "organization_id": "org-xyz",
  "user_id": "user-123",
  "message": "Request processed successfully",
  "latency_ms": 245,
  "model": "claude-sonnet-4",
  "tokens_used": 1250,
  "policy_evaluations": 3,
  "blocked": false
}
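
Keeping the format structured also means CloudWatch metric filters can turn individual fields into alertable metrics without any application changes. A minimal sketch, assuming the log group layout above (the AxonFlow/Logs namespace is an arbitrary choice):

  OrchestratorErrorMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: /axonflow/production/orchestrator
      FilterPattern: '{ $.level = "error" }'
      MetricTransformations:
        - MetricNamespace: AxonFlow/Logs
          MetricName: OrchestratorErrors
          MetricValue: "1"
          DefaultValue: 0

MetricValue can also point at a numeric field such as $.latency_ms, and the resulting metrics plug into the same alarm patterns shown earlier.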

Log Insights Queries

Error Analysis

fields @timestamp, @message, level, service, error_message
| filter level = "error"
| sort @timestamp desc
| limit 100

Latency Distribution

stats avg(latency_ms) as avg_latency,
percentile(latency_ms, 50) as p50,
percentile(latency_ms, 95) as p95,
percentile(latency_ms, 99) as p99
by bin(5m)
| sort @timestamp desc

Policy Blocks by Type

fields @timestamp, policy_id, blocked_reason
| filter blocked = true
| stats count(*) as block_count by policy_id
| sort block_count desc

Token Usage by Organization

fields @timestamp, organization_id, tokens_used
| stats sum(tokens_used) as total_tokens by organization_id
| sort total_tokens desc

Request Volume

stats count(*) as request_count by bin(1h)
| sort @timestamp desc

Error Rate Over Time

filter level = "error" or level = "info"
| stats count(*) as total,
        sum(strcontains(level, "error")) as errors,
        sum(strcontains(level, "error")) * 100.0 / count(*) as error_rate_pct
        by bin(5m)
| sort @timestamp desc

Latency Percentiles by Model

filter ispresent(latency_ms) and ispresent(model)
| stats percentile(latency_ms, 50) as p50,
percentile(latency_ms, 90) as p90,
percentile(latency_ms, 95) as p95,
percentile(latency_ms, 99) as p99,
count(*) as requests
by model
| sort p95 desc

Policy Violations by Type and Severity

filter blocked = true
| stats count(*) as violations by policy_id, blocked_reason
| sort violations desc
| limit 25

Slow Requests (Above P95 Threshold)

filter latency_ms > 5000
| fields @timestamp, service, model, latency_ms, tokens_used, organization_id
| sort latency_ms desc
| limit 50

Log Retention

Configure log retention to manage costs:

  LogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: /axonflow/production
      RetentionInDays: 30  # Adjust based on requirements

Retention should follow your operating model. Community teams often keep shorter-lived logs for debugging, while evaluation and enterprise buyers usually align retention to internal policy, incident response, or regulator expectations.

CloudWatch Dashboards

Creating a Dashboard

  AxonFlowDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: AxonFlow-Production
      DashboardBody: !Sub |
        {
          "widgets": [
            {
              "type": "metric",
              "x": 0, "y": 0,
              "width": 12, "height": 6,
              "properties": {
                "title": "ALB Request Volume",
                "region": "${AWS::Region}",
                "metrics": [
                  ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/axonflow-prod/1234567890abcdef"]
                ],
                "period": 300,
                "stat": "Sum"
              }
            },
            {
              "type": "metric",
              "x": 12, "y": 0,
              "width": 12, "height": 6,
              "properties": {
                "title": "Target Latency",
                "region": "${AWS::Region}",
                "metrics": [
                  ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/axonflow-prod/1234567890abcdef", {"stat": "Average"}],
                  ["...", {"stat": "p95"}],
                  ["...", {"stat": "p99"}]
                ],
                "period": 300
              }
            }
          ]
        }

Dashboard Widgets

Request Overview

{
  "type": "metric",
  "properties": {
    "title": "Request Overview",
    "metrics": [
      ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/axonflow-prod/1234567890abcdef", {"label": "Total"}],
      [".", "HTTPCode_Target_5XX_Count", ".", ".", {"label": "Target 5xx"}],
      [".", "HTTPCode_ELB_5XX_Count", ".", ".", {"label": "ELB 5xx"}]
    ],
    "view": "timeSeries",
    "period": 300
  }
}

Target Latency

{
  "type": "metric",
  "properties": {
    "title": "Target Latency",
    "metrics": [
      ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/axonflow-prod/1234567890abcdef"]
    ],
    "view": "timeSeries",
    "period": 300,
    "stat": "p95"
  }
}

Error Rate

{
  "type": "metric",
  "properties": {
    "title": "Error Rate (%)",
    "metrics": [
      [{
        "expression": "(m2/m1)*100",
        "label": "Error Rate",
        "id": "e1"
      }],
      ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/axonflow-prod/1234567890abcdef", {"id": "m1", "visible": false}],
      [".", "HTTPCode_Target_5XX_Count", ".", ".", {"id": "m2", "visible": false}]
    ],
    "view": "timeSeries",
    "period": 300
  }
}

What To Keep In Grafana Versus CloudWatch

As a practical split:

  • keep runtime counters, latency histograms, policy blocks, connector metrics, and token metrics in Prometheus and Grafana
  • mirror logs, deployment alarms, ALB health, ECS churn, and AWS infrastructure alerts into CloudWatch
  • page only on the derived signals your operations team genuinely needs

That split keeps AxonFlow observability rich without turning CloudWatch into an expensive copy of every runtime dashboard.
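
One way to express that last point, assuming the alarm names from the CloudFormation examples above, is a composite alarm so the pager only fires when errors and latency degrade together:

  AxonFlowDegradedAlarm:
    Type: AWS::CloudWatch::CompositeAlarm
    Properties:
      AlarmName: axonflow-degraded
      AlarmDescription: Page only when 5xx errors and latency degrade together
      AlarmRule: ALARM("axonflow-high-error-rate") AND ALARM("axonflow-high-latency")
      AlarmActions:
        - !Ref AlertSNSTopic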

Terraform Configuration

Complete CloudWatch Setup

# cloudwatch.tf

# Log Group
resource "aws_cloudwatch_log_group" "axonflow" {
  name              = "/axonflow/${var.environment}"
  retention_in_days = var.log_retention_days

  tags = {
    Environment = var.environment
    Application = "AxonFlow"
  }
}

# Metric Alarm - High Error Rate
resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
  alarm_name          = "axonflow-${var.environment}-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 300
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "High upstream error rate detected"

  dimensions = {
    # AWS/ApplicationELB metrics are dimensioned by load balancer, not environment
    LoadBalancer = var.alb_arn_suffix # e.g. "app/axonflow-prod/1234567890abcdef"
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}

# Metric Alarm - High Latency
resource "aws_cloudwatch_metric_alarm" "high_latency" {
  alarm_name          = "axonflow-${var.environment}-high-latency"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  period              = 300
  extended_statistic  = "p95"
  threshold           = 5 # TargetResponseTime is reported in seconds
  alarm_description   = "ALB target latency exceeds threshold"

  dimensions = {
    LoadBalancer = var.alb_arn_suffix
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

# SNS Topic for Alerts
resource "aws_sns_topic" "alerts" {
  name = "axonflow-${var.environment}-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

# Dashboard
resource "aws_cloudwatch_dashboard" "main" {
  dashboard_name = "AxonFlow-${var.environment}"
  dashboard_body = templatefile("${path.module}/dashboard.json.tpl", {
    environment = var.environment
    region      = var.aws_region
  })
}

Best Practices

Metrics

  1. Use meaningful dimensions - Add context without high cardinality
  2. Set appropriate periods - Balance granularity and cost
  3. Create composite metrics - Calculate error rates, success rates
  4. Monitor trends - Use anomaly detection for baseline deviations
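
For the anomaly detection point above, a minimal sketch of a band-based alarm on ALB request volume, reusing the placeholder load balancer from earlier examples:

  RequestVolumeAnomalyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: axonflow-request-volume-anomaly
      AlarmDescription: Request volume is outside its expected band
      ComparisonOperator: LessThanLowerOrGreaterThanUpperThreshold
      EvaluationPeriods: 3
      ThresholdMetricId: band
      Metrics:
        - Id: band
          Expression: ANOMALY_DETECTION_BAND(m1, 2)
        - Id: m1
          MetricStat:
            Metric:
              Namespace: AWS/ApplicationELB
              MetricName: RequestCount
              Dimensions:
                - Name: LoadBalancer
                  Value: app/axonflow-prod/1234567890abcdef
            Period: 300
            Stat: Sum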

Logging

  1. Structured logging - Use JSON format consistently
  2. Include correlation IDs - Track requests across services
  3. Set log levels appropriately - DEBUG in dev, INFO/WARN in prod
  4. Manage retention - Balance debugging needs and costs

Alarms

  1. Avoid alarm fatigue - Set meaningful thresholds
  2. Use multiple periods - Reduce false positives
  3. Configure actions - Automated responses where possible
  4. Document runbooks - Link to resolution steps

Cost Optimization

  1. Filter metrics - Only publish what you need
  2. Aggregate logs - Use Log Insights instead of streaming
  3. Set retention policies - Don't store logs indefinitely
  4. Use metric math - Calculate derived metrics in CloudWatch

Troubleshooting

Metrics Not Appearing

  1. Verify IAM permissions
  2. Check namespace spelling
  3. Confirm region configuration
  4. Wait for metric aggregation (up to 5 minutes)

Logs Not Streaming

  1. Check log group exists
  2. Verify IAM permissions
  3. Confirm log agent is running
  4. Check log format configuration

Alarms Not Triggering

  1. Verify metric data exists
  2. Check dimension spelling
  3. Review threshold values
  4. Confirm alarm state