CloudWatch Integration

CloudWatch is relevant for AxonFlow teams running on AWS, especially when they need centralized log retention, alarms, and handoff into the wider AWS operations toolchain. The core community observability story is still Prometheus plus Grafana. CloudWatch is the AWS-facing layer you add on top when your deployment, security, or platform operations already standardize on CloudWatch.

Overview

Use CloudWatch for:

  • Logs shipped from Docker, ECS, or surrounding AWS infrastructure
  • Alarms that route into SNS, PagerDuty, or internal AWS operations flows
  • Dashboards that combine AxonFlow health with load balancers, databases, and compute
  • AWS-native visibility when platform teams want one control surface across multiple services

Use Prometheus and Grafana for:

  • the built-in runtime metrics exposed by Agent and Orchestrator
  • detailed policy, request, and latency dashboards bundled with the local and staging stack
  • most day-to-day debugging and regression analysis

When CloudWatch Is The Right Choice

CloudWatch becomes the better front door when:

  • your AxonFlow deployment already sits inside a broader AWS operations estate
  • incident response routes through SNS, CloudWatch alarms, or internal AWS workflows
  • platform teams want one place to correlate AxonFlow health with ALB, ECS, RDS, or VPC signals
  • security and compliance teams expect standardized AWS-native log retention and access controls

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                 AxonFlow CloudWatch Integration                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐          │
│  │    Agent    │───▶│ CloudWatch  │───▶│  Alarms &   │          │
│  │             │    │    Agent    │    │  Dashboard  │          │
│  └─────────────┘    └─────────────┘    └─────────────┘          │
│                                                                 │
│  ┌─────────────┐    ┌─────────────┐                             │
│  │Orchestrator │───▶│ CloudWatch  │                             │
│  │             │    │    Logs     │                             │
│  └─────────────┘    └─────────────┘                             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Deployment Pattern

The practical AWS pattern looks like this:

AxonFlow Agent / Orchestrator
├─ Prometheus scrape targets for runtime metrics
├─ Grafana dashboards for platform analysis
└─ CloudWatch Logs shipping for centralized AWS log retention and alerting

This page intentionally does not document a "turn on CloudWatch metrics with one env var" switch, because the community platform does not expose a generic AXONFLOW_METRICS_PROVIDER=cloudwatch setting. If you are on AWS, wire CloudWatch in at the deployment layer and keep Prometheus/Grafana as the runtime source of truth.
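
For ECS-based deployments, that deployment-layer wiring is usually just the log driver on the task definition. A minimal sketch, assuming Fargate; the task family, image, and execution role names are placeholders rather than anything mandated by AxonFlow:

  AgentTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: axonflow-agent
      RequiresCompatibilities: [FARGATE]
      NetworkMode: awsvpc
      Cpu: "512"
      Memory: "1024"
      ExecutionRoleArn: !GetAtt TaskExecutionRole.Arn  # role needs the log permissions below
      ContainerDefinitions:
        - Name: agent
          Image: <your-registry>/axonflow-agent:latest  # placeholder image
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group: /axonflow/production/agent
              awslogs-region: !Ref AWS::Region
              awslogs-stream-prefix: agent

Plain Docker hosts can use the same awslogs logging driver, set either per container or in the daemon configuration.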

IAM Permissions

Required IAM permissions depend on how you ship logs and whether your AWS deployment also publishes custom application metrics. The common minimum for log shipping is:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource": [
        "arn:aws:logs:*:*:log-group:/axonflow/*"
      ]
    }
  ]
}
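
If you run on ECS with the awslogs driver sketched above, this policy typically lands on the task execution role. A sketch, again with placeholder names:

  TaskExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: axonflow-task-execution
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
      Policies:
        - PolicyName: axonflow-cloudwatch-logs
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                  - logs:DescribeLogGroups
                  - logs:DescribeLogStreams
                Resource: arn:aws:logs:*:*:log-group:/axonflow/*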

What To Send To CloudWatch

The highest-value data to mirror into CloudWatch is usually:

  • structured application logs from Agent and Orchestrator
  • ALB / API Gateway / ECS / EC2 health signals
  • deployment and infrastructure alarms
  • selected business or operations alerts derived from Prometheus metrics

If you need detailed runtime metrics, scrape Prometheus first and then mirror or summarize only the signals your AWS operations team actually pages on.

CloudWatch Alarms

Application Load Balancer 5xx Errors

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  HighErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: axonflow-high-error-rate
      AlarmDescription: Target group is returning elevated 5xx responses
      Namespace: AWS/ApplicationELB
      MetricName: HTTPCode_Target_5XX_Count
      Dimensions:
        - Name: LoadBalancer
          Value: app/axonflow-prod/1234567890abcdef
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlertSNSTopic

High Target Latency

  HighLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: axonflow-high-latency
      AlarmDescription: ALB target latency exceeds 5 seconds
      Namespace: AWS/ApplicationELB
      MetricName: TargetResponseTime
      Dimensions:
        - Name: LoadBalancer
          Value: app/axonflow-prod/1234567890abcdef
      ExtendedStatistic: p95
      Period: 300
      EvaluationPeriods: 3
      Threshold: 5  # TargetResponseTime is reported in seconds
      ComparisonOperator: GreaterThanThreshold

ECS Task Restart Spike

  RestartSpikeAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: axonflow-ecs-restarts
      AlarmDescription: ECS service is churning tasks unexpectedly
      Namespace: ECS/ContainerInsights  # requires Container Insights enabled on the cluster
      MetricName: RestartCount
      Dimensions:
        - Name: ClusterName
          Value: axonflow-prod
        - Name: ServiceName
          Value: axonflow-agent
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 3
      ComparisonOperator: GreaterThanThreshold

Log Ingestion Failure Or Silence

  MissingLogsAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: axonflow-log-silence
      AlarmDescription: Expected application logs are missing
      Namespace: AWS/Logs
      MetricName: IncomingLogEvents
      Dimensions:
        - Name: LogGroupName
          Value: /axonflow/production/agent
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 1
      ComparisonOperator: LessThanThreshold
      TreatMissingData: breaching  # a silent log group publishes no datapoints at all

CloudWatch Logs

Log Groups

/axonflow/
├── production/
│   ├── agent/
│   ├── orchestrator/
│   └── customer-portal/
└── staging/
    ├── agent/
    ├── orchestrator/
    └── customer-portal/

Log Format

AxonFlow logs are most useful in CloudWatch when they remain structured JSON and preserve fields like request ID, service name, latency, model, and policy context:

{
  "timestamp": "2025-12-08T10:30:00.123Z",
  "level": "info",
  "service": "orchestrator",
  "request_id": "req-abc123",
  "organization_id": "org-xyz",
  "user_id": "user-123",
  "message": "Request processed successfully",
  "latency_ms": 245,
  "model": "claude-sonnet-4",
  "tokens_used": 1250,
  "policy_evaluations": 3,
  "blocked": false
}
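
Keeping the format structured also means CloudWatch metric filters can turn individual fields into alertable metrics without any application changes. A minimal sketch, assuming the log group layout above (the AxonFlow/Logs namespace is an arbitrary choice):

  OrchestratorErrorMetricFilter:
    Type: AWS::Logs::MetricFilter
    Properties:
      LogGroupName: /axonflow/production/orchestrator
      FilterPattern: '{ $.level = "error" }'
      MetricTransformations:
        - MetricNamespace: AxonFlow/Logs
          MetricName: OrchestratorErrors
          MetricValue: "1"
          DefaultValue: 0

MetricValue can also point at a numeric field such as $.latency_ms, and the resulting metrics plug into the same alarm patterns shown earlier.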

Log Insights Queries

Error Analysis

fields @timestamp, @message, level, service, error_message
| filter level = "error"
| sort @timestamp desc
| limit 100

Latency Distribution

stats avg(latency_ms) as avg_latency,
percentile(latency_ms, 50) as p50,
percentile(latency_ms, 95) as p95,
percentile(latency_ms, 99) as p99
by bin(5m)
| sort @timestamp desc

Policy Blocks by Type

fields @timestamp, policy_id, blocked_reason
| filter blocked = true
| stats count(*) as block_count by policy_id
| sort block_count desc

Token Usage by Organization

fields @timestamp, organization_id, tokens_used
| stats sum(tokens_used) as total_tokens by organization_id
| sort total_tokens desc

Request Volume

stats count(*) as request_count by bin(1h)
| sort @timestamp desc

Error Rate Over Time

filter level = "error" or level = "info"
| stats count(*) as total,
        sum(strcontains(level, "error")) as errors,
        sum(strcontains(level, "error")) * 100.0 / count(*) as error_rate_pct
        by bin(5m)
| sort @timestamp desc

Latency Percentiles by Model

filter ispresent(latency_ms) and ispresent(model)
| stats percentile(latency_ms, 50) as p50,
percentile(latency_ms, 90) as p90,
percentile(latency_ms, 95) as p95,
percentile(latency_ms, 99) as p99,
count(*) as requests
by model
| sort p95 desc

Policy Violations by Type and Severity

filter blocked = true
| stats count(*) as violations by policy_id, blocked_reason
| sort violations desc
| limit 25

Slow Requests (Above P95 Threshold)

filter latency_ms > 5000
| fields @timestamp, service, model, latency_ms, tokens_used, organization_id
| sort latency_ms desc
| limit 50

Log Retention

Configure log retention to manage costs:

  LogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: /axonflow/production
      RetentionInDays: 30  # Adjust based on requirements

Retention should follow your operating model. Community teams often keep shorter-lived logs for debugging, while evaluation and enterprise buyers usually align retention to internal policy, incident response, or regulator expectations.

CloudWatch Dashboards

Creating a Dashboard

  AxonFlowDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: AxonFlow-Production
      DashboardBody: !Sub |
        {
          "widgets": [
            {
              "type": "metric",
              "x": 0, "y": 0,
              "width": 12, "height": 6,
              "properties": {
                "title": "ALB Request Volume",
                "region": "${AWS::Region}",
                "metrics": [
                  ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/axonflow-prod/1234567890abcdef"]
                ],
                "period": 300,
                "stat": "Sum"
              }
            },
            {
              "type": "metric",
              "x": 12, "y": 0,
              "width": 12, "height": 6,
              "properties": {
                "title": "Target Latency",
                "region": "${AWS::Region}",
                "metrics": [
                  ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/axonflow-prod/1234567890abcdef", {"stat": "Average"}],
                  ["...", {"stat": "p95"}],
                  ["...", {"stat": "p99"}]
                ],
                "period": 300
              }
            }
          ]
        }

Dashboard Widgets

Request Overview

{
  "type": "metric",
  "properties": {
    "title": "Request Overview",
    "metrics": [
      ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/axonflow-prod/1234567890abcdef", {"label": "Total"}],
      [".", "HTTPCode_Target_5XX_Count", ".", ".", {"label": "Target 5xx"}],
      [".", "HTTPCode_ELB_5XX_Count", ".", ".", {"label": "ELB 5xx"}]
    ],
    "view": "timeSeries",
    "period": 300
  }
}

Target Latency

{
  "type": "metric",
  "properties": {
    "title": "Target Latency",
    "metrics": [
      ["AWS/ApplicationELB", "TargetResponseTime", "LoadBalancer", "app/axonflow-prod/1234567890abcdef"]
    ],
    "view": "timeSeries",
    "period": 300,
    "stat": "p95"
  }
}

Error Rate

{
  "type": "metric",
  "properties": {
    "title": "Error Rate (%)",
    "metrics": [
      [{
        "expression": "(m2/m1)*100",
        "label": "Error Rate",
        "id": "e1"
      }],
      ["AWS/ApplicationELB", "RequestCount", "LoadBalancer", "app/axonflow-prod/1234567890abcdef", {"id": "m1", "visible": false}],
      [".", "HTTPCode_Target_5XX_Count", ".", ".", {"id": "m2", "visible": false}]
    ],
    "view": "timeSeries",
    "period": 300
  }
}

What To Keep In Grafana Versus CloudWatch

As a practical split:

  • keep runtime counters, latency histograms, policy blocks, connector metrics, and token metrics in Prometheus and Grafana
  • mirror logs, deployment alarms, ALB health, ECS churn, and AWS infrastructure alerts into CloudWatch
  • page only on the derived signals your operations team genuinely needs

That split keeps AxonFlow observability rich without turning CloudWatch into an expensive copy of every runtime dashboard.
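
One way to express that last point, assuming the alarm names from the CloudFormation examples above, is a composite alarm so the pager only fires when errors and latency degrade together:

  AxonFlowDegradedAlarm:
    Type: AWS::CloudWatch::CompositeAlarm
    Properties:
      AlarmName: axonflow-degraded
      AlarmDescription: Page only when 5xx errors and latency degrade together
      AlarmRule: ALARM("axonflow-high-error-rate") AND ALARM("axonflow-high-latency")
      AlarmActions:
        - !Ref AlertSNSTopic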

Terraform Configuration

Complete CloudWatch Setup

# cloudwatch.tf

# Log Group
resource "aws_cloudwatch_log_group" "axonflow" {
  name              = "/axonflow/${var.environment}"
  retention_in_days = var.log_retention_days

  tags = {
    Environment = var.environment
    Application = "AxonFlow"
  }
}

# Metric Alarm - High Error Rate
resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
  alarm_name          = "axonflow-${var.environment}-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "HTTPCode_Target_5XX_Count"
  namespace           = "AWS/ApplicationELB"
  period              = 300
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "High upstream error rate detected"

  dimensions = {
    # AWS/ApplicationELB metrics are dimensioned by load balancer, not environment
    LoadBalancer = var.alb_arn_suffix # e.g. "app/axonflow-prod/1234567890abcdef"
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}

# Metric Alarm - High Latency
resource "aws_cloudwatch_metric_alarm" "high_latency" {
  alarm_name          = "axonflow-${var.environment}-high-latency"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "TargetResponseTime"
  namespace           = "AWS/ApplicationELB"
  period              = 300
  extended_statistic  = "p95"
  threshold           = 5 # TargetResponseTime is reported in seconds
  alarm_description   = "ALB target latency exceeds threshold"

  dimensions = {
    LoadBalancer = var.alb_arn_suffix
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

# SNS Topic for Alerts
resource "aws_sns_topic" "alerts" {
  name = "axonflow-${var.environment}-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

# Dashboard
resource "aws_cloudwatch_dashboard" "main" {
  dashboard_name = "AxonFlow-${var.environment}"
  dashboard_body = templatefile("${path.module}/dashboard.json.tpl", {
    environment = var.environment
    region      = var.aws_region
  })
}

Best Practices

Metrics

  1. Use meaningful dimensions - Add context without high cardinality
  2. Set appropriate periods - Balance granularity and cost
  3. Create composite metrics - Calculate error rates, success rates
  4. Monitor trends - Use anomaly detection for baseline deviations
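
For the anomaly detection point above, a minimal sketch of a band-based alarm on ALB request volume, reusing the placeholder load balancer from earlier examples:

  RequestVolumeAnomalyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: axonflow-request-volume-anomaly
      AlarmDescription: Request volume is outside its expected band
      ComparisonOperator: LessThanLowerOrGreaterThanUpperThreshold
      EvaluationPeriods: 3
      ThresholdMetricId: band
      Metrics:
        - Id: band
          Expression: ANOMALY_DETECTION_BAND(m1, 2)
        - Id: m1
          MetricStat:
            Metric:
              Namespace: AWS/ApplicationELB
              MetricName: RequestCount
              Dimensions:
                - Name: LoadBalancer
                  Value: app/axonflow-prod/1234567890abcdef
            Period: 300
            Stat: Sum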

Logging

  1. Structured logging - Use JSON format consistently
  2. Include correlation IDs - Track requests across services
  3. Set log levels appropriately - DEBUG in dev, INFO/WARN in prod
  4. Manage retention - Balance debugging needs and costs

Alarms

  1. Avoid alarm fatigue - Set meaningful thresholds
  2. Use multiple periods - Reduce false positives
  3. Configure actions - Automated responses where possible
  4. Document runbooks - Link to resolution steps

Cost Optimization

  1. Filter metrics - Only publish what you need
  2. Aggregate logs - Use Log Insights instead of streaming
  3. Set retention policies - Don't store logs indefinitely
  4. Use metric math - Calculate derived metrics in CloudWatch

Troubleshooting

Metrics Not Appearing

  1. Verify IAM permissions
  2. Check namespace spelling
  3. Confirm region configuration
  4. Wait for metric aggregation (up to 5 minutes)

Logs Not Streaming

  1. Check log group exists
  2. Verify IAM permissions
  3. Confirm log agent is running
  4. Check log format configuration

Alarms Not Triggering

  1. Verify metric data exists
  2. Check dimension spelling
  3. Review threshold values
  4. Confirm alarm state