CloudWatch Integration

This guide explains how to integrate AxonFlow with Amazon CloudWatch for comprehensive monitoring, alerting, and log management.

Overview

AxonFlow integrates with CloudWatch to provide:

  • Metrics - Request counts, latencies, error rates
  • Logs - Structured application logs
  • Alarms - Automated alerting
  • Dashboards - Visual monitoring

Architecture

┌─────────────────────────────────────────────────────────────┐
│               AxonFlow CloudWatch Integration               │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌─────────────┐      ┌─────────────┐      ┌─────────────┐  │
│  │    Agent    │─────▶│ CloudWatch  │─────▶│  Alarms &   │  │
│  │             │      │   Agent     │      │  Dashboard  │  │
│  └─────────────┘      └─────────────┘      └─────────────┘  │
│                                                             │
│  ┌─────────────┐      ┌─────────────┐                       │
│  │Orchestrator │─────▶│ CloudWatch  │                       │
│  │             │      │    Logs     │                       │
│  └─────────────┘      └─────────────┘                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Configuration

Environment Variables

# Enable CloudWatch integration
AWS_REGION="us-east-1"
AXONFLOW_METRICS_ENABLED="true"
AXONFLOW_METRICS_PROVIDER="cloudwatch"
AXONFLOW_METRICS_NAMESPACE="AxonFlow"

# Log configuration
AXONFLOW_LOG_LEVEL="info"
AXONFLOW_LOG_FORMAT="json"
AXONFLOW_CLOUDWATCH_LOG_GROUP="/axonflow/production"
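
Before deploying, it can be worth confirming that all of these variables are actually set in the task environment. A minimal sanity-check sketch (ours, not part of AxonFlow), using the variable names from the block above:

import os

# The settings AxonFlow reads for CloudWatch integration (names taken from
# the block above). This is a quick pre-deploy check, not AxonFlow code.
REQUIRED = [
    "AWS_REGION",
    "AXONFLOW_METRICS_ENABLED",
    "AXONFLOW_METRICS_PROVIDER",
    "AXONFLOW_METRICS_NAMESPACE",
    "AXONFLOW_CLOUDWATCH_LOG_GROUP",
]

missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing CloudWatch settings: {', '.join(missing)}")

if os.environ["AXONFLOW_METRICS_PROVIDER"] != "cloudwatch":
    raise SystemExit("AXONFLOW_METRICS_PROVIDER must be 'cloudwatch'")

print("CloudWatch configuration looks complete.")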

IAM Permissions

Required IAM permissions for the AxonFlow task role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:PutMetricData",
        "cloudwatch:GetMetricData",
        "cloudwatch:ListMetrics"
      ],
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "cloudwatch:namespace": "AxonFlow"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
      ],
      "Resource": [
        "arn:aws:logs:*:*:log-group:/axonflow/*"
      ]
    }
  ]
}
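
To confirm the role actually grants these permissions from inside a running task, a short smoke test with boto3 can help. This is a hedged sketch: the PermissionProbe metric name is just a probe value we made up, not an AxonFlow metric.

import boto3
from botocore.exceptions import ClientError

# Smoke-test the task role's permissions from inside the task.
cloudwatch = boto3.client("cloudwatch")
logs = boto3.client("logs")

try:
    cloudwatch.put_metric_data(
        Namespace="AxonFlow",  # must match the policy's namespace condition
        MetricData=[{"MetricName": "PermissionProbe", "Value": 1.0, "Unit": "Count"}],
    )
    print("cloudwatch:PutMetricData OK")
except ClientError as err:
    print(f"PutMetricData denied: {err.response['Error']['Code']}")

try:
    logs.describe_log_groups(logGroupNamePrefix="/axonflow/")
    print("logs:DescribeLogGroups OK")
except ClientError as err:
    print(f"DescribeLogGroups denied: {err.response['Error']['Code']}")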

Metrics

Available Metrics

Metric              Unit           Description
RequestCount        Count          Total requests processed
SuccessCount        Count          Successful requests
ErrorCount          Count          Failed requests
BlockedCount        Count          Requests blocked by policy
Latency             Milliseconds   Request processing time
TokensUsed          Count          LLM tokens consumed
PolicyEvaluations   Count          Policy checks performed
ConnectorCalls      Count          MCP connector invocations

Metric Dimensions

Metrics are published with the following dimensions:

Dimension        Description
Environment      staging, production
OrganizationId   Customer organization
Service          agent, orchestrator
Model            LLM model used
PolicyId         Policy that was evaluated

Example Metric Data

{
  "Namespace": "AxonFlow",
  "MetricName": "Latency",
  "Dimensions": [
    {"Name": "Environment", "Value": "production"},
    {"Name": "Service", "Value": "orchestrator"},
    {"Name": "Model", "Value": "claude-3-sonnet"}
  ],
  "Value": 245.5,
  "Unit": "Milliseconds",
  "Timestamp": "2025-12-08T10:30:00Z"
}
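
For testing, a data point with this exact shape can be published through boto3's put_metric_data. AxonFlow emits these metrics itself in normal operation, so treat this as a sketch for experiments only:

from datetime import datetime, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

# Mirrors the example payload above: one Latency sample with its dimensions.
cloudwatch.put_metric_data(
    Namespace="AxonFlow",
    MetricData=[
        {
            "MetricName": "Latency",
            "Dimensions": [
                {"Name": "Environment", "Value": "production"},
                {"Name": "Service", "Value": "orchestrator"},
                {"Name": "Model", "Value": "claude-3-sonnet"},
            ],
            "Value": 245.5,
            "Unit": "Milliseconds",
            "Timestamp": datetime.now(timezone.utc),
        }
    ],
)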

CloudWatch Alarms

High Error Rate

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  HighErrorRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: axonflow-high-error-rate
      AlarmDescription: More than 5 errors per 5-minute period
      Namespace: AxonFlow
      MetricName: ErrorCount
      Dimensions:
        - Name: Environment
          Value: production
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 2
      Threshold: 5
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlertSNSTopic

High Latency

  HighLatencyAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: axonflow-high-latency
      AlarmDescription: P95 latency exceeds 5 seconds
      Namespace: AxonFlow
      MetricName: Latency
      Dimensions:
        - Name: Environment
          Value: production
      ExtendedStatistic: p95
      Period: 300
      EvaluationPeriods: 3
      Threshold: 5000
      ComparisonOperator: GreaterThanThreshold

Policy Blocks Spike

  PolicyBlocksSpikeAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: axonflow-policy-blocks-spike
      AlarmDescription: Unusual number of blocked requests
      Namespace: AxonFlow
      MetricName: BlockedCount
      Dimensions:
        - Name: Environment
          Value: production
      Statistic: Sum
      Period: 300
      EvaluationPeriods: 1
      Threshold: 100
      ComparisonOperator: GreaterThanThreshold

Token Budget Alert

  TokenBudgetAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmName: axonflow-token-budget-alert
      AlarmDescription: Daily token usage approaching limit
      Namespace: AxonFlow
      MetricName: TokensUsed
      Dimensions:
        - Name: Environment
          Value: production
      Statistic: Sum
      Period: 86400  # 24 hours
      EvaluationPeriods: 1
      Threshold: 900000  # 90% of 1M limit
      ComparisonOperator: GreaterThanThreshold
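
The same alarms can also be managed through the SDK instead of CloudFormation. A sketch of the token-budget alarm via boto3's put_metric_alarm; the SNS topic ARN is a placeholder to replace with your own:

import boto3

cloudwatch = boto3.client("cloudwatch")

# SDK equivalent of the TokenBudgetAlarm template above.
cloudwatch.put_metric_alarm(
    AlarmName="axonflow-token-budget-alert",
    AlarmDescription="Daily token usage approaching limit",
    Namespace="AxonFlow",
    MetricName="TokensUsed",
    Dimensions=[{"Name": "Environment", "Value": "production"}],
    Statistic="Sum",
    Period=86400,  # 24 hours
    EvaluationPeriods=1,
    Threshold=900000,  # 90% of 1M limit
    ComparisonOperator="GreaterThanThreshold",
    # Placeholder topic ARN -- replace with your alert topic.
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:axonflow-alerts"],
)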

CloudWatch Logs

Log Groups

/axonflow/
├── production/
│   ├── agent/
│   ├── orchestrator/
│   └── customer-portal/
└── staging/
    ├── agent/
    ├── orchestrator/
    └── customer-portal/

Log Format

Logs are emitted in JSON format:

{
  "timestamp": "2025-12-08T10:30:00.123Z",
  "level": "info",
  "service": "orchestrator",
  "request_id": "req-abc123",
  "organization_id": "org-xyz",
  "user_id": "user-123",
  "message": "Request processed successfully",
  "latency_ms": 245,
  "model": "claude-3-sonnet",
  "tokens_used": 1250,
  "policy_evaluations": 3,
  "blocked": false
}
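
If you have sidecar services that should log in the same shape, a minimal emitter using only the standard library might look like the following. The field names follow the example above; the helper itself is ours, not an AxonFlow API:

import json
import sys
from datetime import datetime, timezone

def log_event(level: str, message: str, **fields) -> None:
    """Emit one JSON log line matching the format above."""
    timestamp = (
        datetime.now(timezone.utc)
        .isoformat(timespec="milliseconds")
        .replace("+00:00", "Z")
    )
    record = {"timestamp": timestamp, "level": level, **fields, "message": message}
    sys.stdout.write(json.dumps(record) + "\n")

# Example usage mirroring the sample record above.
log_event(
    "info",
    "Request processed successfully",
    service="orchestrator",
    request_id="req-abc123",
    latency_ms=245,
    tokens_used=1250,
    blocked=False,
)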

Log Insights Queries

Error Analysis

fields @timestamp, @message, level, service, error_message
| filter level = "error"
| sort @timestamp desc
| limit 100

Latency Distribution

stats avg(latency_ms) as avg_latency,
      percentile(latency_ms, 50) as p50,
      percentile(latency_ms, 95) as p95,
      percentile(latency_ms, 99) as p99
  by bin(5m)
| sort @timestamp desc

Policy Blocks by Type

fields @timestamp, policy_id, blocked_reason
| filter blocked = true
| stats count(*) as block_count by policy_id
| sort block_count desc

Token Usage by Organization

fields @timestamp, organization_id, tokens_used
| stats sum(tokens_used) as total_tokens by organization_id
| sort total_tokens desc

Request Volume

stats count(*) as request_count by bin(1h)
| sort @timestamp desc
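
These queries can also be run programmatically with the CloudWatch Logs API. A sketch that runs the error-analysis query from above over the last hour, using boto3's start_query and get_query_results:

import time

import boto3

logs = boto3.client("logs")

# The error-analysis query from above, as a single string.
query = (
    'fields @timestamp, @message, level, service, error_message '
    '| filter level = "error" '
    '| sort @timestamp desc '
    '| limit 100'
)

now = int(time.time())
start = logs.start_query(
    logGroupName="/axonflow/production",
    startTime=now - 3600,  # last hour
    endTime=now,
    queryString=query,
)

# Poll until the query finishes, then print the rows.
while True:
    result = logs.get_query_results(queryId=start["queryId"])
    if result["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in result.get("results", []):
    print({field["field"]: field["value"] for field in row})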

Log Retention

Configure log retention to manage costs:

  LogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: /axonflow/production
      RetentionInDays: 30  # Adjust based on requirements
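
The equivalent change can be applied to an existing log group with boto3's put_retention_policy:

import boto3

logs = boto3.client("logs")

# Same effect as the RetentionInDays property above, applied via the SDK.
logs.put_retention_policy(
    logGroupName="/axonflow/production",
    retentionInDays=30,  # adjust based on requirements
)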

CloudWatch Dashboards

Creating a Dashboard

  AxonFlowDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: AxonFlow-Production
      DashboardBody: !Sub |
        {
          "widgets": [
            {
              "type": "metric",
              "x": 0, "y": 0,
              "width": 12, "height": 6,
              "properties": {
                "title": "Request Volume",
                "region": "${AWS::Region}",
                "metrics": [
                  ["AxonFlow", "RequestCount", "Environment", "production"],
                  [".", "SuccessCount", ".", "."],
                  [".", "ErrorCount", ".", "."]
                ],
                "period": 300,
                "stat": "Sum"
              }
            },
            {
              "type": "metric",
              "x": 12, "y": 0,
              "width": 12, "height": 6,
              "properties": {
                "title": "Latency",
                "region": "${AWS::Region}",
                "metrics": [
                  ["AxonFlow", "Latency", "Environment", "production", {"stat": "Average"}],
                  ["...", {"stat": "p95"}],
                  ["...", {"stat": "p99"}]
                ],
                "period": 300
              }
            }
          ]
        }

Dashboard Widgets

Request Overview

{
  "type": "metric",
  "properties": {
    "title": "Request Overview",
    "metrics": [
      ["AxonFlow", "RequestCount", "Environment", "production", {"label": "Total"}],
      [".", "SuccessCount", ".", ".", {"label": "Success"}],
      [".", "ErrorCount", ".", ".", {"label": "Errors"}],
      [".", "BlockedCount", ".", ".", {"label": "Blocked"}]
    ],
    "view": "timeSeries",
    "period": 300
  }
}

Token Usage

{
  "type": "metric",
  "properties": {
    "title": "Token Usage",
    "metrics": [
      ["AxonFlow", "TokensUsed", "Environment", "production"]
    ],
    "view": "timeSeries",
    "period": 3600,
    "stat": "Sum"
  }
}

Error Rate

{
  "type": "metric",
  "properties": {
    "title": "Error Rate (%)",
    "metrics": [
      [{
        "expression": "(m2/m1)*100",
        "label": "Error Rate",
        "id": "e1"
      }],
      ["AxonFlow", "RequestCount", "Environment", "production", {"id": "m1", "visible": false}],
      [".", "ErrorCount", ".", ".", {"id": "m2", "visible": false}]
    ],
    "view": "timeSeries",
    "period": 300
  }
}
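
Widgets like these can also be published outside CloudFormation with boto3's put_dashboard, which takes the dashboard body as a JSON string. A sketch using the Token Usage widget from above (region hardcoded for brevity):

import json

import boto3

cloudwatch = boto3.client("cloudwatch")

# Assemble the widget definition above into a dashboard body and publish it.
token_usage_widget = {
    "type": "metric",
    "x": 0, "y": 0, "width": 12, "height": 6,
    "properties": {
        "title": "Token Usage",
        "region": "us-east-1",
        "metrics": [["AxonFlow", "TokensUsed", "Environment", "production"]],
        "view": "timeSeries",
        "period": 3600,
        "stat": "Sum",
    },
}

cloudwatch.put_dashboard(
    DashboardName="AxonFlow-Production",
    DashboardBody=json.dumps({"widgets": [token_usage_widget]}),
)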

Terraform Configuration

Complete CloudWatch Setup

# cloudwatch.tf

# Log Group
resource "aws_cloudwatch_log_group" "axonflow" {
  name              = "/axonflow/${var.environment}"
  retention_in_days = var.log_retention_days

  tags = {
    Environment = var.environment
    Application = "AxonFlow"
  }
}

# Metric Alarm - High Error Rate
resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
  alarm_name          = "axonflow-${var.environment}-high-error-rate"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "ErrorCount"
  namespace           = "AxonFlow"
  period              = 300
  statistic           = "Sum"
  threshold           = 10
  alarm_description   = "High error rate detected"

  dimensions = {
    Environment = var.environment
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
  ok_actions    = [aws_sns_topic.alerts.arn]
}

# Metric Alarm - High Latency
resource "aws_cloudwatch_metric_alarm" "high_latency" {
  alarm_name          = "axonflow-${var.environment}-high-latency"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 3
  metric_name         = "Latency"
  namespace           = "AxonFlow"
  period              = 300
  extended_statistic  = "p95"
  threshold           = 5000
  alarm_description   = "P95 latency exceeds 5 seconds"

  dimensions = {
    Environment = var.environment
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

# SNS Topic for Alerts
resource "aws_sns_topic" "alerts" {
  name = "axonflow-${var.environment}-alerts"
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = var.alert_email
}

# Dashboard
resource "aws_cloudwatch_dashboard" "main" {
  dashboard_name = "AxonFlow-${var.environment}"
  dashboard_body = templatefile("${path.module}/dashboard.json.tpl", {
    environment = var.environment
    region      = var.aws_region
  })
}

Best Practices

Metrics

  1. Use meaningful dimensions - Add context without high cardinality
  2. Set appropriate periods - Balance granularity and cost
  3. Create composite metrics - Calculate error rates, success rates
  4. Monitor trends - Use anomaly detection for baseline deviations

Logging

  1. Structured logging - Use JSON format consistently
  2. Include correlation IDs - Track requests across services
  3. Use log levels appropriately - DEBUG in dev, INFO/WARN in prod
  4. Manage retention - Balance debugging needs and costs

Alarms

  1. Avoid alarm fatigue - Set meaningful thresholds
  2. Use multiple periods - Reduce false positives
  3. Configure actions - Automated responses where possible
  4. Document runbooks - Link to resolution steps

Cost Optimization

  1. Filter metrics - Only publish what you need
  2. Aggregate logs - Use Log Insights instead of streaming
  3. Set retention policies - Don't store logs indefinitely
  4. Use metric math - Calculate derived metrics in CloudWatch

Troubleshooting

Metrics Not Appearing

  1. Verify IAM permissions
  2. Check namespace spelling
  3. Confirm region configuration
  4. Wait for metric aggregation (up to 5 minutes); a verification sketch follows
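
A quick way to run checks 1-3 in one shot is to list what CloudWatch actually has in the namespace (a boto3 sketch; adjust the region to match your deployment):

import boto3

# If this returns metrics, permissions, namespace, and region are all wired up.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
metrics = cloudwatch.list_metrics(Namespace="AxonFlow")["Metrics"]
if metrics:
    for metric in metrics[:10]:
        print(metric["MetricName"], metric["Dimensions"])
else:
    print("No metrics in namespace 'AxonFlow' -- check permissions and region.")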

Logs Not Streaming

  1. Check log group exists
  2. Verify IAM permissions
  3. Confirm log agent is running
  4. Check log format configuration

Alarms Not Triggering

  1. Verify metric data exists
  2. Check dimension spelling
  3. Review threshold values
  4. Confirm alarm state (see the sketch below)
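
To inspect alarm state and the reason CloudWatch gives for it (check 4), a short boto3 sketch:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Print current state and state reason for every axonflow-* alarm.
response = cloudwatch.describe_alarms(AlarmNamePrefix="axonflow-")
for alarm in response["MetricAlarms"]:
    print(alarm["AlarmName"], alarm["StateValue"], "-", alarm["StateReason"])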