CloudWatch Integration
This guide explains how to integrate AxonFlow with Amazon CloudWatch for comprehensive monitoring, alerting, and log management.
Overview
AxonFlow integrates with CloudWatch to provide:
- Metrics - Request counts, latencies, error rates
- Logs - Structured application logs
- Alarms - Automated alerting
- Dashboards - Visual monitoring
Architecture
┌─────────────────────────────────────────────────────────┐
│             AxonFlow CloudWatch Integration             │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  ┌─────────────┐    ┌─────────────┐    ┌─────────────┐  │
│  │    Agent    │───▶│ CloudWatch  │───▶│  Alarms &   │  │
│  │             │    │    Agent    │    │  Dashboard  │  │
│  └─────────────┘    └─────────────┘    └─────────────┘  │
│                                                         │
│  ┌─────────────┐    ┌─────────────┐                     │
│  │Orchestrator │───▶│ CloudWatch  │                     │
│  │             │    │    Logs     │                     │
│  └─────────────┘    └─────────────┘                     │
│                                                         │
└─────────────────────────────────────────────────────────┘
Configuration
Environment Variables
# Enable CloudWatch integration
AWS_REGION="us-east-1"
AXONFLOW_METRICS_ENABLED="true"
AXONFLOW_METRICS_PROVIDER="cloudwatch"
AXONFLOW_METRICS_NAMESPACE="AxonFlow"
# Log configuration
AXONFLOW_LOG_LEVEL="info"
AXONFLOW_LOG_FORMAT="json"
AXONFLOW_CLOUDWATCH_LOG_GROUP="/axonflow/production"
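AxonFlow's `cloudwatch` provider picks these variables up at startup. As a rough, hypothetical sketch of the equivalent client setup (this is the boto3 analogue, not AxonFlow's internal code):

import os
import boto3

# Mirror the environment variables above; the defaults are illustrative.
region = os.environ.get("AWS_REGION", "us-east-1")
namespace = os.environ.get("AXONFLOW_METRICS_NAMESPACE", "AxonFlow")
log_group = os.environ.get("AXONFLOW_CLOUDWATCH_LOG_GROUP", "/axonflow/production")

# One client for metrics, one for logs; both scoped to the configured region.
cloudwatch = boto3.client("cloudwatch", region_name=region)
logs = boto3.client("logs", region_name=region)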
IAM Permissions
Required IAM permissions for the AxonFlow task role:
{
"Version": "2012-10-17",
"Statement": [
    {
      "Effect": "Allow",
      "Action": "cloudwatch:PutMetricData",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "cloudwatch:namespace": "AxonFlow"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": [
        "cloudwatch:GetMetricData",
        "cloudwatch:ListMetrics"
      ],
      "Resource": "*"
    },
{
"Effect": "Allow",
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:DescribeLogGroups",
"logs:DescribeLogStreams"
],
"Resource": [
"arn:aws:logs:*:*:log-group:/axonflow/*"
]
}
]
}
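Note that the `cloudwatch:namespace` condition key only applies to `PutMetricData`, which is why the read-only actions sit in a separate statement. A quick way to confirm the role works is to publish a throwaway metric from a task running under it; this sketch assumes credentials come from the task role, and the `PermissionCheck` metric name is made up for the test:

import boto3
from botocore.exceptions import ClientError

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

try:
    # Succeeds only if PutMetricData is allowed for the AxonFlow namespace.
    cloudwatch.put_metric_data(
        Namespace="AxonFlow",
        MetricData=[{"MetricName": "PermissionCheck", "Value": 1.0, "Unit": "Count"}],
    )
    print("cloudwatch:PutMetricData OK")
except ClientError as err:
    print(f"Permission check failed: {err.response['Error']['Code']}")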
Metrics
Available Metrics
| Metric | Unit | Description |
|---|---|---|
| RequestCount | Count | Total requests processed |
| SuccessCount | Count | Successful requests |
| ErrorCount | Count | Failed requests |
| BlockedCount | Count | Requests blocked by policy |
| Latency | Milliseconds | Request processing time |
| TokensUsed | Count | LLM tokens consumed |
| PolicyEvaluations | Count | Policy checks performed |
| ConnectorCalls | Count | MCP connector invocations |
Metric Dimensions
Metrics are published with the following dimensions:
| Dimension | Description |
|---|---|
| Environment | staging, production |
| OrganizationId | Customer organization |
| Service | agent, orchestrator |
| Model | LLM model used |
| PolicyId | Policy that was evaluated |
Example Metric Data
{
"Namespace": "AxonFlow",
"MetricName": "Latency",
"Dimensions": [
{"Name": "Environment", "Value": "production"},
{"Name": "Service", "Value": "orchestrator"},
{"Name": "Model", "Value": "claude-3-sonnet"}
],
"Value": 245.5,
"Unit": "Milliseconds",
"Timestamp": "2025-12-08T10:30:00Z"
}
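To publish the same datum outside AxonFlow (for example when backfilling or testing a dashboard), the boto3 call maps one-to-one onto the JSON above; the region is an assumption:

import boto3
from datetime import datetime, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Each element of MetricData mirrors one metric datum as shown above.
cloudwatch.put_metric_data(
    Namespace="AxonFlow",
    MetricData=[
        {
            "MetricName": "Latency",
            "Dimensions": [
                {"Name": "Environment", "Value": "production"},
                {"Name": "Service", "Value": "orchestrator"},
                {"Name": "Model", "Value": "claude-3-sonnet"},
            ],
            "Value": 245.5,
            "Unit": "Milliseconds",
            "Timestamp": datetime(2025, 12, 8, 10, 30, tzinfo=timezone.utc),
        }
    ],
)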
CloudWatch Alarms
Recommended Alarms
High Error Rate
AWSTemplateFormatVersion: '2010-09-09'
Resources:
HighErrorRateAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: axonflow-high-error-rate
      AlarmDescription: More than 5 errors per 5-minute period for 2 consecutive periods
Namespace: AxonFlow
MetricName: ErrorCount
Dimensions:
- Name: Environment
Value: production
Statistic: Sum
Period: 300
EvaluationPeriods: 2
Threshold: 5
ComparisonOperator: GreaterThanThreshold
AlarmActions:
- !Ref AlertSNSTopic
High Latency
HighLatencyAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: axonflow-high-latency
AlarmDescription: P95 latency exceeds 5 seconds
Namespace: AxonFlow
MetricName: Latency
Dimensions:
- Name: Environment
Value: production
ExtendedStatistic: p95
Period: 300
EvaluationPeriods: 3
Threshold: 5000
ComparisonOperator: GreaterThanThreshold
Policy Blocks Spike
PolicyBlocksSpikeAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: axonflow-policy-blocks-spike
AlarmDescription: Unusual number of blocked requests
Namespace: AxonFlow
MetricName: BlockedCount
Dimensions:
- Name: Environment
Value: production
Statistic: Sum
Period: 300
EvaluationPeriods: 1
Threshold: 100
ComparisonOperator: GreaterThanThreshold
Token Budget Alert
TokenBudgetAlarm:
Type: AWS::CloudWatch::Alarm
Properties:
AlarmName: axonflow-token-budget-alert
AlarmDescription: Daily token usage approaching limit
Namespace: AxonFlow
MetricName: TokensUsed
Dimensions:
- Name: Environment
Value: production
Statistic: Sum
Period: 86400 # 24 hours
EvaluationPeriods: 1
Threshold: 900000 # 90% of 1M limit
ComparisonOperator: GreaterThanThreshold
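If you manage alarms outside CloudFormation, the same definitions translate directly to the PutMetricAlarm API. A sketch of the high-error-rate alarm via boto3; the SNS topic ARN is a placeholder:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="axonflow-high-error-rate",
    AlarmDescription="More than 5 errors per 5-minute period, twice in a row",
    Namespace="AxonFlow",
    MetricName="ErrorCount",
    Dimensions=[{"Name": "Environment", "Value": "production"}],
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=2,
    Threshold=5.0,
    ComparisonOperator="GreaterThanThreshold",
    # Placeholder ARN; substitute your alerting topic.
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:axonflow-alerts"],
)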
CloudWatch Logs
Log Groups
/axonflow/
├── production/
│ ├── agent/
│ ├── orchestrator/
│ └── customer-portal/
└── staging/
├── agent/
├── orchestrator/
└── customer-portal/
Log Format
Logs are emitted in JSON format:
{
"timestamp": "2025-12-08T10:30:00.123Z",
"level": "info",
"service": "orchestrator",
"request_id": "req-abc123",
"organization_id": "org-xyz",
"user_id": "user-123",
"message": "Request processed successfully",
"latency_ms": 245,
"model": "claude-3-sonnet",
"tokens_used": 1250,
"policy_evaluations": 3,
"blocked": false
}
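AxonFlow emits this format when AXONFLOW_LOG_FORMAT="json" is set. If a service you write alongside it (such as the customer-portal) should match the schema, here is a minimal sketch using Python's standard logging module; the field selection is illustrative, not AxonFlow's own logger:

import json
import logging
import sys
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    # One JSON object per line, matching the schema above.
    def format(self, record):
        entry = {
            "timestamp": datetime.now(timezone.utc)
                .isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "service": "customer-portal",
            "message": record.getMessage(),
        }
        # Fields passed via logging's `extra=` land on the record object.
        for key in ("request_id", "organization_id", "latency_ms", "tokens_used"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("customer-portal")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("Request processed successfully",
            extra={"request_id": "req-abc123", "latency_ms": 245})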
Log Insights Queries
Error Analysis
fields @timestamp, @message, level, service, error_message
| filter level = "error"
| sort @timestamp desc
| limit 100
Latency Distribution
stats avg(latency_ms) as avg_latency,
      percentile(latency_ms, 50) as p50,
      percentile(latency_ms, 95) as p95,
      percentile(latency_ms, 99) as p99
      by bin(5m) as time_bin
| sort time_bin desc
Policy Blocks by Type
fields @timestamp, policy_id, blocked_reason
| filter blocked = true
| stats count(*) as block_count by policy_id
| sort block_count desc
Token Usage by Organization
fields @timestamp, organization_id, tokens_used
| stats sum(tokens_used) as total_tokens by organization_id
| sort total_tokens desc
Request Volume
stats count(*) as request_count by bin(1h) as hour
| sort hour desc
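These queries can also be run programmatically. A sketch using the Logs Insights API via boto3, running the error-analysis query over the last hour:

import time
import boto3

logs = boto3.client("logs", region_name="us-east-1")

now = int(time.time())
query = logs.start_query(
    logGroupName="/axonflow/production",
    startTime=now - 3600,
    endTime=now,
    queryString=(
        "fields @timestamp, @message, level, service, error_message "
        '| filter level = "error" | sort @timestamp desc | limit 100'
    ),
)

# start_query is asynchronous; poll until the query finishes.
while True:
    results = logs.get_query_results(queryId=query["queryId"])
    if results["status"] in ("Complete", "Failed", "Cancelled"):
        break
    time.sleep(1)

for row in results["results"]:
    print({field["field"]: field["value"] for field in row})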
Log Retention
Configure log retention to manage costs:
LogGroup:
Type: AWS::Logs::LogGroup
Properties:
LogGroupName: /axonflow/production
RetentionInDays: 30 # Adjust based on requirements
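The same policy can be applied to an existing log group with one API call, which is handy for groups created implicitly via logs:CreateLogGroup:

import boto3

logs = boto3.client("logs", region_name="us-east-1")

# retentionInDays accepts only fixed values (1, 3, 5, 7, 14, 30, 60, 90, ...).
logs.put_retention_policy(
    logGroupName="/axonflow/production",
    retentionInDays=30,
)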
CloudWatch Dashboards
Creating a Dashboard
AxonFlowDashboard:
Type: AWS::CloudWatch::Dashboard
Properties:
DashboardName: AxonFlow-Production
DashboardBody: !Sub |
{
"widgets": [
{
"type": "metric",
"x": 0, "y": 0,
"width": 12, "height": 6,
"properties": {
"title": "Request Volume",
"region": "${AWS::Region}",
"metrics": [
["AxonFlow", "RequestCount", "Environment", "production"],
[".", "SuccessCount", ".", "."],
[".", "ErrorCount", ".", "."]
],
"period": 300,
"stat": "Sum"
}
},
{
"type": "metric",
"x": 12, "y": 0,
"width": 12, "height": 6,
"properties": {
"title": "Latency",
"region": "${AWS::Region}",
"metrics": [
["AxonFlow", "Latency", "Environment", "production", {"stat": "Average"}],
["...", {"stat": "p95"}],
["...", {"stat": "p99"}]
],
"period": 300
}
}
]
}
Dashboard Widgets
Request Overview
{
"type": "metric",
"properties": {
"title": "Request Overview",
"metrics": [
["AxonFlow", "RequestCount", "Environment", "production", {"label": "Total"}],
[".", "SuccessCount", ".", ".", {"label": "Success"}],
[".", "ErrorCount", ".", ".", {"label": "Errors"}],
[".", "BlockedCount", ".", ".", {"label": "Blocked"}]
],
"view": "timeSeries",
"period": 300
}
}
Token Usage
{
"type": "metric",
"properties": {
"title": "Token Usage",
"metrics": [
["AxonFlow", "TokensUsed", "Environment", "production"]
],
"view": "timeSeries",
"period": 3600,
"stat": "Sum"
}
}
Error Rate
{
"type": "metric",
"properties": {
"title": "Error Rate (%)",
"metrics": [
[{
"expression": "(m2/m1)*100",
"label": "Error Rate",
"id": "e1"
}],
["AxonFlow", "RequestCount", "Environment", "production", {"id": "m1", "visible": false}],
[".", "ErrorCount", ".", ".", {"id": "m2", "visible": false}]
],
"view": "timeSeries",
"period": 300
}
}
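Individual widgets like these can be assembled into a dashboard body and pushed with the PutDashboard API, which is convenient for iterating on layout before committing the JSON to CloudFormation. A sketch using the Token Usage widget above:

import json
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

body = {
    "widgets": [
        {
            "type": "metric",
            "x": 0, "y": 0, "width": 12, "height": 6,
            "properties": {
                "title": "Token Usage",
                "region": "us-east-1",
                "metrics": [["AxonFlow", "TokensUsed", "Environment", "production"]],
                "view": "timeSeries",
                "period": 3600,
                "stat": "Sum",
            },
        }
    ]
}

cloudwatch.put_dashboard(
    DashboardName="AxonFlow-Production",
    DashboardBody=json.dumps(body),
)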
Terraform Configuration
Complete CloudWatch Setup
# cloudwatch.tf
# Log Group
resource "aws_cloudwatch_log_group" "axonflow" {
name = "/axonflow/${var.environment}"
retention_in_days = var.log_retention_days
tags = {
Environment = var.environment
Application = "AxonFlow"
}
}
# Metric Alarm - High Error Rate
resource "aws_cloudwatch_metric_alarm" "high_error_rate" {
alarm_name = "axonflow-${var.environment}-high-error-rate"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "ErrorCount"
namespace = "AxonFlow"
period = 300
statistic = "Sum"
threshold = 10
alarm_description = "High error rate detected"
dimensions = {
Environment = var.environment
}
alarm_actions = [aws_sns_topic.alerts.arn]
ok_actions = [aws_sns_topic.alerts.arn]
}
# Metric Alarm - High Latency
resource "aws_cloudwatch_metric_alarm" "high_latency" {
alarm_name = "axonflow-${var.environment}-high-latency"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 3
metric_name = "Latency"
namespace = "AxonFlow"
period = 300
extended_statistic = "p95"
threshold = 5000
alarm_description = "P95 latency exceeds 5 seconds"
dimensions = {
Environment = var.environment
}
alarm_actions = [aws_sns_topic.alerts.arn]
}
# SNS Topic for Alerts
resource "aws_sns_topic" "alerts" {
name = "axonflow-${var.environment}-alerts"
}
resource "aws_sns_topic_subscription" "email" {
topic_arn = aws_sns_topic.alerts.arn
protocol = "email"
endpoint = var.alert_email
}
# Dashboard
resource "aws_cloudwatch_dashboard" "main" {
dashboard_name = "AxonFlow-${var.environment}"
dashboard_body = templatefile("${path.module}/dashboard.json.tpl", {
environment = var.environment
region = var.aws_region
})
}
Best Practices
Metrics
- Use meaningful dimensions - Add context without high cardinality
- Set appropriate periods - Balance granularity and cost
- Create composite metrics - Calculate error rates, success rates
- Monitor trends - Use anomaly detection for baseline deviations
Logging
- Structured logging - Use JSON format consistently
- Include correlation IDs - Track requests across services
- Set log levels appropriately - DEBUG in dev, INFO/WARN in prod
- Manage retention - Balance debugging needs and costs
Alarms
- Avoid alarm fatigue - Set meaningful thresholds
- Use multiple periods - Reduce false positives
- Configure actions - Automated responses where possible
- Document runbooks - Link to resolution steps
Cost Optimization
- Filter metrics - Only publish what you need
- Aggregate logs - Use Log Insights instead of streaming
- Set retention policies - Don't store logs indefinitely
- Use metric math - Calculate derived metrics in CloudWatch
Troubleshooting
Metrics Not Appearing
- Verify IAM permissions
- Check namespace spelling
- Confirm region configuration
- Wait for metric aggregation (up to 5 minutes)
Logs Not Streaming
- Check log group exists
- Verify IAM permissions
- Confirm log agent is running
- Check log format configuration
Alarms Not Triggering
- Verify metric data exists (see the sketch after this list)
- Check dimension spelling
- Review threshold values
- Confirm alarm state
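A sketch covering the first and last checks: list the metric with the dimensions the alarm expects, then read the alarm's current state and reason:

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Does the metric exist with the exact dimensions the alarm references?
response = cloudwatch.list_metrics(Namespace="AxonFlow", MetricName="ErrorCount")
for metric in response["Metrics"]:
    print(metric["Dimensions"])

# What state is the alarm in, and why?
response = cloudwatch.describe_alarms(AlarmNames=["axonflow-high-error-rate"])
for alarm in response["MetricAlarms"]:
    print(alarm["StateValue"], "-", alarm["StateReason"])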