CloudWatch Integration
This guide explains how to integrate AxonFlow with Amazon CloudWatch for comprehensive monitoring, alerting, and log management.
Overview
AxonFlow integrates with CloudWatch to provide:
- Metrics - Request counts, latencies, error rates
- Logs - Structured application logs
- Alarms - Automated alerting
- Dashboards - Visual monitoring
Architecture
┌─────────────────────────────────────────────────────────┐
│             AxonFlow CloudWatch Integration             │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
│  │    Agent    │───▶│ CloudWatch  │───▶│  Alarms &   │  │
│  │             │    │    Agent    │    │  Dashboard  │  │
│  └─────────────┘    └─────────────┘    └─────────────┘  │
│                                                         │
│  ┌─────────────┐    ┌─────────────┐                     │
│  │Orchestrator │───▶│ CloudWatch  │                     │
│  │             │    │    Logs     │                     │
│  └─────────────┘    └─────────────┘                     │
│                                                         │
└─────────────────────────────────────────────────────────┘
Configuration
Environment Variables
# Enable CloudWatch integration
AWS_REGION="us-east-1"
AXONFLOW_METRICS_ENABLED="true"
AXONFLOW_METRICS_PROVIDER="cloudwatch"
AXONFLOW_METRICS_NAMESPACE="AxonFlow"
# Log configuration
AXONFLOW_LOG_LEVEL="info"
AXONFLOW_LOG_FORMAT="json"
AXONFLOW_CLOUDWATCH_LOG_GROUP="/axonflow/production"
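AxonFlow's `cloudwatch` provider picks these variables up at startup. As a rough, hypothetical sketch of the equivalent client setup (this is the boto3 analogue, not AxonFlow's internal code):

import os
import boto3

# Mirror the environment variables above; the defaults are illustrative.
region = os.environ.get("AWS_REGION", "us-east-1")
namespace = os.environ.get("AXONFLOW_METRICS_NAMESPACE", "AxonFlow")
log_group = os.environ.get("AXONFLOW_CLOUDWATCH_LOG_GROUP", "/axonflow/production")

# One client for metrics, one for logs; both scoped to the configured region.
cloudwatch = boto3.client("cloudwatch", region_name=region)
logs = boto3.client("logs", region_name=region)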
IAM Permissions
Required IAM permissions for the AxonFlow task role:
{
"Version": "2012-10-17",
"Statement": [
    {
      "Effect": "Allow",
      "Action": "cloudwatch:PutMetricData",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "cloudwatch:namespace": "AxonFlow"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricData",
        "cloudwatch:ListMetrics"
      ],
      "Resource": "*"
    },
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:DescribeLogGroups",
"logs:DescribeLogStreams"
],
"Resource": [
"arn:aws:logs:*:*:log-group:/axonflow/*"
]
}
]
}
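Note that the `cloudwatch:namespace` condition key only applies to `PutMetricData`, which is why the read-only actions sit in a separate statement. A quick way to confirm the role works is to publish a throwaway metric from a task running under it; this sketch assumes credentials come from the task role, and the `PermissionCheck` metric name is made up for the test:

import boto3
from botocore.exceptions import ClientError

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

try:
    # Succeeds only if PutMetricData is allowed for the AxonFlow namespace.
    cloudwatch.put_metric_data(
        Namespace="AxonFlow",
        MetricData=[{"MetricName": "PermissionCheck", "Value": 1.0, "Unit": "Count"}],
    )
    print("cloudwatch:PutMetricData OK")
except ClientError as err:
    print(f"Permission check failed: {err.response['Error']['Code']}")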
Metrics
Available Metrics
| Metric | Unit | Description |
|---|---|---|
| RequestCount | Count | Total requests processed |
| SuccessCount | Count | Successful requests |
| ErrorCount | Count | Failed requests |
| BlockedCount | Count | Requests blocked by policy |
| Latency | Milliseconds | Request processing time |
| TokensUsed | Count | LLM tokens consumed |
| PolicyEvaluations | Count | Policy checks performed |
| ConnectorCalls | Count | MCP connector invocations |
Metric Dimensions
Metrics are published with the following dimensions:
| Dimension | Description |
|---|---|
| Environment | staging, production |
| OrganizationId | Customer organization |
| Service | agent, orchestrator |
| Model | LLM model used |
| PolicyId | Policy that was evaluated |
Example Metric Data
{
"Namespace": "AxonFlow",
"MetricName": "Latency",
"Dimensions": [
{"Name": "Environment", "Value": "production"},
{"Name": "Service", "Value": "orchestrator"},
{"Name": "Model", "Value": "claude-3-sonnet"}
],
"Value": 245.5,
"Unit": "Milliseconds",
"Timestamp": "2025-12-08T10:30:00Z"
}
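To publish the same datum outside AxonFlow (for example when backfilling or testing a dashboard), the boto3 call maps one-to-one onto the JSON above; the region is an assumption:

import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Each element of MetricData mirrors one metric datum as shown above.
cloudwatch.put_metric_data(
    Namespace="AxonFlow",
    MetricData=[
        {
            "MetricName": "Latency",
            "Dimensions": [
                {"Name": "Environment", "Value": "production"},
                {"Name": "Service", "Value": "orchestrator"},
                {"Name": "Model", "Value": "claude-3-sonnet"},
            ],
            "Value": 245.5,
            "Unit": "Milliseconds",
            "Timestamp": datetime(2025, 12, 8, 10, 30, tzinfo=timezone.utc),
        }
    ],
)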
CloudWatch Alarms
Recommended Alarms
High Error Rate
AWSTemplateFormatVersion: '2010-09-09'
Resources:
HighErrorRateAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: axonflow-high-error-rate
      AlarmDescription: More than 5 errors per 5-minute period for 2 consecutive periods
Namespace: AxonFlow
MetricName: ErrorCount
Dimensions:
- Name: Environment
Value: production
Statistic: Sum
Period: 300
EvaluationPeriods: 2
Threshold: 5
ComparisonOperator: GreaterThanThreshold
AlarmActions:
- !Ref AlertSNSTopic
High Latency
HighLatencyAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: axonflow-high-latency
AlarmDescription: P95 latency exceeds 5 seconds
Namespace: AxonFlow
MetricName: Latency
Dimensions:
- Name: Environment
Value: production
ExtendedStatistic: p95
Period: 300
EvaluationPeriods: 3
Threshold: 5000
ComparisonOperator: GreaterThanThreshold
Policy Blocks Spike
PolicyBlocksSpikeAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: axonflow-policy-blocks-spike
AlarmDescription: Unusual number of blocked requests
Namespace: AxonFlow
MetricName: BlockedCount
Dimensions:
- Name: Environment
Value: production
Statistic: Sum
Period: 300
EvaluationPeriods: 1
Threshold: 100
ComparisonOperator: GreaterThanThreshold
Token Budget Alert
TokenBudgetAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: axonflow-token-budget-alert
AlarmDescription: Daily token usage approaching limit
Namespace: AxonFlow
MetricName: TokensUsed
Dimensions:
- Name: Environment
Value: production
Statistic: Sum
Period: 86400 # 24 hours
EvaluationPeriods: 1
Threshold: 900000 # 90% of 1M limit
ComparisonOperator: GreaterThanThreshold
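If you manage alarms outside CloudFormation, the same definitions translate directly to the PutMetricAlarm API. A sketch of the high-error-rate alarm via boto3; the SNS topic ARN is a placeholder:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="axonflow-high-error-rate",
    AlarmDescription="More than 5 errors per 5-minute period, twice in a row",
    Namespace="AxonFlow",
    MetricName="ErrorCount",
    Dimensions=[{"Name": "Environment", "Value": "production"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=5.0,
    ComparisonOperator="GreaterThanThreshold",
    # Placeholder ARN; substitute your alerting topic.
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:axonflow-alerts"],
)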
CloudWatch Logs
Log Groups
/axonflow/
├── production/
│ ├── agent/
│ ├── orchestrator/
│ └── customer-portal/
└── staging/
├── agent/
├── orchestrator/
└── customer-portal/
Log Format
Logs are emitted in JSON format:
{
"timestamp": "2025-12-08T10:30:00.123Z",
"level": "info",
"service": "orchestrator",
"request_id": "req-abc123",
"organization_id": "org-xyz",
"user_id": "user-123",
"message": "Request processed successfully",
"latency_ms": 245,
"model": "claude-3-sonnet",
"tokens_used": 1250,
"policy_evaluations": 3,
"blocked": false
}
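AxonFlow emits this format when AXONFLOW_LOG_FORMAT="json" is set. If a service you write alongside it (such as the customer-portal) should match the schema, here is a minimal sketch using Python's standard logging module; the field selection is illustrative, not AxonFlow's own logger:

import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    # One JSON object per line, matching the schema above.
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc)
                .isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": "customer-portal",
            "message": record.getMessage(),
        }
        # Fields passed via logging's `extra=` land on the record object.
        for key in ("request_id", "organization_id", "latency_ms", "tokens_used"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("customer-portal")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("Request processed successfully",
            extra={"request_id": "req-abc123", "latency_ms": 245})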
Log Insights Queries
Error Analysis
fields @timestamp, @message, level, service, error_message
| filter level = "error"
| sort @timestamp desc
| limit 100
Latency Distribution
stats avg(latency_ms) as avg_latency,
      percentile(latency_ms, 50) as p50,
      percentile(latency_ms, 95) as p95,
      percentile(latency_ms, 99) as p99
      by bin(5m) as time_bin
| sort time_bin desc
Policy Blocks by Type
fields @timestamp, policy_id, blocked_reason
| filter blocked = true
| stats count(*) as block_count by policy_id
| sort block_count desc
Token Usage by Organization
fields @timestamp, organization_id, tokens_used
| stats sum(tokens_used) as total_tokens by organization_id
| sort total_tokens desc
Request Volume
stats count(*) as request_count by bin(1h) as hour
| sort hour desc
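These queries can also be run programmatically. A sketch using the Logs Insights API via boto3, running the error-analysis query over the last hour:

import time
import boto3

logs = boto3.client("logs", region_name="us-east-1")

now = int(time.time())
query = logs.start_query(
    logGroupName="/axonflow/production",
    startTime=now - 3600,
    endTime=now,
    queryString=(
        "fields @timestamp, @message, level, service, error_message "
        '| filter level = "error" | sort @timestamp desc | limit 100'
    ),
)

# start_query is asynchronous; poll until the query finishes.
while True:
    results = logs.get_query_results(queryId=query["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in results["results"]:
    print({field["field"]: field["value"] for field in row})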
Log Retention
Configure log retention to manage costs:
LogGroup:
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: /axonflow/production
RetentionInDays: 30 # Adjust based on requirements
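The same policy can be applied to an existing log group with one API call, which is handy for groups created implicitly via logs:CreateLogGroup:

import boto3

logs = boto3.client("logs", region_name="us-east-1")

# retentionInDays accepts only fixed values (1, 3, 5, 7, 14, 30, 60, 90, ...).
logs.put_retention_policy(
    logGroupName="/axonflow/production",
    retentionInDays=30,
)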
CloudWatch Dashboards
Creating a Dashboard
AxonFlowDashboard:
Type: AWS::CloudWatch::Dashboard
Properties:
DashboardName: AxonFlow-Production
DashboardBody: !Sub |
{
"widgets": [
{
"type": "metric",
"x": 0, "y": 0,
"width": 12, "height": 6,
"properties": {
"title": "Request Volume",
"region": "${AWS::Region}",
"metrics": [
["AxonFlow", "RequestCount", "Environment", "production"],
[".", "SuccessCount", ".", "."],
[".", "ErrorCount", ".", "."]
],
"period": 300,
"stat": "Sum"
}
},
{
"type": "metric",
"x": 12, "y": 0,
"width": 12, "height": 6,
"properties": {
"title": "Latency",
"region": "${AWS::Region}",
"metrics": [
["AxonFlow", "Latency", "Environment", "production", {"stat": "Average"}],
["...", {"stat": "p95"}],
["...", {"stat": "p99"}]
],
"period": 300
}
}
]
}
Dashboard Widgets
Request Overview
{
"type": "metric",
"properties": {
"title": "Request Overview",
"metrics": [
["AxonFlow", "RequestCount", "Environment", "production", {"label": "Total"}],
[".", "SuccessCount", ".", ".", {"label": "Success"}],
[".", "ErrorCount", ".", ".", {"label": "Errors"}],
[".", "BlockedCount", ".", ".", {"label": "Blocked"}]
],
"view": "timeSeries",
"period": 300
}
}
Token Usage
{
"type": "metric",
"properties": {
"title": "Token Usage",
"metrics": [
["AxonFlow", "TokensUsed", "Environment", "production"]
],
"view": "timeSeries",
"period": 3600,
"stat": "Sum"
}
}
Error Rate
{
"type": "metric",
"properties": {
"title": "Error Rate (%)",
"metrics": [
[{
"expression": "(m2/m1)*100",
"label": "Error Rate",
"id": "e1"
}],
["AxonFlow", "RequestCount", "Environment", "production", {"id": "m1", "visible": false}],
[".", "ErrorCount", ".", ".", {"id": "m2", "visible": false}]
],
"view": "timeSeries",
"period": 300
}
}
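Individual widgets like these can be assembled into a dashboard body and pushed with the PutDashboard API, which is convenient for iterating on layout before committing the JSON to CloudFormation. A sketch using the Token Usage widget above:

import json
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Token Usage",
                "region": "us-east-1",
                "metrics": [["AxonFlow", "TokensUsed", "Environment", "production"]],
                "view": "timeSeries",
                "period": 3600,
                "stat": "Sum",
            },
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName="AxonFlow-Production",
    DashboardBody=json.dumps(body),
)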
Terraform Configuration
Complete CloudWatch Setup
# cloudwatch.tf
# Log Group
resource "aws_cloudwatch_log_group" "axonflow" {
name = "/axonflow/${var.environment}"
retention_in_days = var.log_retention_days
tags = {
Environment = var.environment
Application = "AxonFlow"
}
}
# Metric Alarm - High Error Rate
resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
alarm_name = "axonflow-${var.environment}-high-error-rate"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "ErrorCount"
namespace = "AxonFlow"
period = 300
statistic = "Sum"
threshold = 10
alarm_description = "High error rate detected"
dimensions = {
Environment = var.environment
}
alarm_actions = [aws_sns_topic.alerts.arn]
ok_actions = [aws_sns_topic.alerts.arn]
}
# Metric Alarm - High Latency
resource "aws_cloudwatch_metric_alarm" "high_latency" {
alarm_name = "axonflow-${var.environment}-high-latency"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "Latency"
namespace = "AxonFlow"
period = 300
extended_statistic = "p95"
threshold = 5000
alarm_description = "P95 latency exceeds 5 seconds"
dimensions = {
Environment = var.environment
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
# SNS Topic for Alerts
resource "aws_sns_topic" "alerts" {
name = "axonflow-${var.environment}-alerts"
}
resource "aws_sns_topic_subscription" "email" {
topic_arn = aws_sns_topic.alerts.arn
protocol = "email"
endpoint = var.alert_email
}
# Dashboard
resource "aws_cloudwatch_dashboard" "main" {
dashboard_name = "AxonFlow-${var.environment}"
dashboard_body = templatefile("${path.module}/dashboard.json.tpl", {
environment = var.environment
region = var.aws_region
})
}
Best Practices
Metrics
- Use meaningful dimensions - Add context without high cardinality
- Set appropriate periods - Balance granularity and cost
- Create composite metrics - Calculate error rates, success rates
- Monitor trends - Use anomaly detection for baseline deviations
Logging
- Structured logging - Use JSON format consistently
- Include correlation IDs - Track requests across services
- Set log levels appropriately - DEBUG in dev, INFO/WARN in prod
- Manage retention - Balance debugging needs and costs
Alarms
- Avoid alarm fatigue - Set meaningful thresholds
- Use multiple periods - Reduce false positives
- Configure actions - Automated responses where possible
- Document runbooks - Link to resolution steps
Cost Optimization
- Filter metrics - Only publish what you need
- Aggregate logs - Use Log Insights instead of streaming
- Set retention policies - Don't store logs indefinitely
- Use metric math - Calculate derived metrics in CloudWatch
Troubleshooting
Metrics Not Appearing
- Verify IAM permissions
- Check namespace spelling
- Confirm region configuration
- Wait for metric aggregation (up to 5 minutes)
Logs Not Streaming
- Check log group exists
- Verify IAM permissions
- Confirm log agent is running
- Check log format configuration
Alarms Not Triggering
- Verify metric data exists (see the sketch after this list)
- Check dimension spelling
- Review threshold values
- Confirm alarm state
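A sketch covering the first and last checks: list the metric with the dimensions the alarm expects, then read the alarm's current state and reason:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Does the metric exist with the exact dimensions the alarm references?
response = cloudwatch.list_metrics(Namespace="AxonFlow", MetricName="ErrorCount")
for metric in response["Metrics"]:
    print(metric["Dimensions"])

# What state is the alarm in, and why?
response = cloudwatch.describe_alarms(AlarmNames=["axonflow-high-error-rate"])
for alarm in response["MetricAlarms"]:
    print(alarm["StateValue"], "-", alarm["StateReason"])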