Performance Testing Architecture

AxonFlow's performance testing infrastructure uses a three-layer design that balances continuous monitoring with cost efficiency. This guide explains the architecture and how each layer contributes to system reliability.

Three-Layer Design

┌───────────────────────────────────────────┐
│ LAYER 1: HEARTBEAT                        │
│ ───────────────────────────────────────── │
│ • Continuous health monitoring            │
│ • Keeps dashboards active                 │
│ • Minimal resource usage                  │
│ • Alternates between test modes           │
└───────────────────────────────────────────┘

┌───────────────────────────────────────────┐
│ LAYER 2: SCHEDULED                        │
│ ───────────────────────────────────────── │
│ • Hourly: Baseline verification           │
│ • Daily: Extended stress testing          │
│ • Weekly: Comprehensive benchmark suite   │
└───────────────────────────────────────────┘

┌───────────────────────────────────────────┐
│ LAYER 3: MANUAL                           │
│ ───────────────────────────────────────── │
│ • Ad-hoc testing for specific scenarios   │
│ • Custom RPS and duration                 │
│ • Pre-deployment validation               │
│ • Incident investigation                  │
└───────────────────────────────────────────┘

Layer 1: Heartbeat

The heartbeat layer provides continuous, lightweight monitoring.

Purpose

  • Keep observability dashboards populated with data
  • Detect system availability issues quickly
  • Keep cost minimal (a single request every few seconds)
  • Confirm basic connectivity and authentication

Behavior

Mode A: Policy-only requests (no LLM costs)

[Wait interval]

Mode B: Health check requests

[Wait interval]

(repeat)
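
As a concrete illustration, the alternating cycle above could be driven by a loop like the one below. This is a minimal sketch: the flag combination, the wait interval, and the /health endpoint path are assumptions rather than documented defaults.

# Minimal heartbeat loop sketch. The flag combination, wait interval,
# and /health path are assumptions; adjust to your perf-test build and agent URL.
while true; do
  # Mode A: policy-only request (no LLM costs)
  ./perf-test --mode manual --category policy --rps 1 --duration 5s
  sleep 15

  # Mode B: basic health check against the agent
  curl -fsS "${AGENT_URL}/health" > /dev/null || echo "heartbeat: health check failed" >&2
  sleep 15
done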

Characteristics

Aspect            Value
Frequency         Every few seconds
Request Volume    Minimal
Cost Impact       Negligible
Data Retention    Real-time only

Layer 2: Scheduled Tests

Scheduled tests run at fixed intervals to establish performance baselines.

Hourly Tests

Purpose: Verify system maintains baseline performance throughout the day.

Characteristics:

  • Short duration
  • Moderate request volume
  • Captures hourly trends
  • Alerts on regression

Daily Tests

Purpose: Extended stress testing during low-traffic windows.

Characteristics:

  • Longer duration
  • Higher request volume
  • Tests sustained load capability
  • Generates daily benchmark data

Weekly Tests

Purpose: Comprehensive benchmark suite for trend analysis.

Characteristics:

  • Full test coverage
  • All test categories
  • Historical comparison
  • Capacity planning data

Scheduling Strategy

Hour      00 01 02 03 04 05 06 07 08 09 10 11 12 ...
           │  │  │  │  │  │  │  │  │  │  │  │  │
Hourly     ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ...
Daily      ●  (once per day, in the low-traffic window)
Weekly     ●  (once per week)
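
One way to wire up this cadence is with cron on the staging test runner. The sketch below reuses the perf-test flags shown later in this guide; the install path, schedule times, RPS values, and durations are illustrative assumptions, not shipped defaults.

# Illustrative crontab for the scheduled layer (staging only).
# The path, times, RPS values, and durations are assumptions.
0 * * * *  /opt/axonflow/perf-test --mode manual --category policy --rps 50 --duration 2m   # hourly baseline
0 3 * * *  /opt/axonflow/perf-test --mode stress --rps 200 --duration 30m                   # daily stress test (low-traffic window)
0 4 * * 0  /opt/axonflow/perf-test --mode manual --rps 100 --duration 1h                    # weekly benchmark suite (full category set)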

Layer 3: Manual Tests

Manual tests are triggered on-demand for specific purposes.

Use Cases

  1. Pre-Deployment Validation

    • Run before production deployments
    • Verify new code doesn't regress performance
    • Gate for release approval
  2. Incident Investigation

    • Reproduce reported performance issues
    • Validate fixes before deployment
    • Compare before/after behavior
  3. Capacity Planning

    • Test higher-than-normal loads
    • Plan for growth scenarios
    • Validate scaling configurations
  4. Custom Scenarios

    • Test specific request patterns
    • Validate new features under load
    • Security testing validation

Execution

# Pre-deployment validation
./perf-test --mode pre-deploy --environment staging

# Capacity test
./perf-test --mode stress --rps 500 --duration 10m

# Custom scenario
./perf-test --mode manual --category policy --rps 100
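
For the release-gate use case, the exit status of the pre-deployment run can block a deploy. This is a sketch that assumes perf-test exits non-zero when it detects a regression; verify that behavior for your build.

#!/usr/bin/env bash
# CI release gate sketch: abort the pipeline if pre-deployment validation fails.
# Assumes perf-test exits non-zero on regression.
set -euo pipefail

if ./perf-test --mode pre-deploy --environment staging; then
  echo "Performance gate passed; continuing with deployment"
else
  echo "Performance regression detected; aborting deployment" >&2
  exit 1
fi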

Cost Optimization

Performance testing can generate significant costs if not managed carefully.

Policy-Only Mode

Most tests run in "policy-only" mode:

Standard Request:
Client → Agent → Orchestrator → LLM Provider → Response
Cost: $0.001-0.01 per request (LLM tokens)

Policy-Only Request:
Client → Agent → Policy Engine → Response
Cost: $0 (no LLM calls)

When to use policy-only:

  • Heartbeat monitoring
  • Policy performance validation
  • Most scheduled tests

When to use full requests:

  • LLM integration testing
  • End-to-end validation
  • Weekly benchmarks (limited scope)
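
To put numbers on the difference, here is a rough back-of-the-envelope estimate. The request rate, duration, and the $0.005 per-request figure (the midpoint of the range above) are illustrative assumptions.

# Rough cost estimate for a single scheduled run (illustrative numbers only)
RPS=50; DURATION_S=300; COST_PER_REQ=0.005        # assumed test shape and per-request cost
REQUESTS=$((RPS * DURATION_S))                    # 15,000 requests
echo "Full-request cost per run: \$$(echo "$REQUESTS * $COST_PER_REQ" | bc)"   # ≈ $75
# 24 hourly runs/day at full-request cost ≈ $1,800/day; policy-only runs cost $0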

Environment Strategy

Development:
└── Local Docker (free)

Staging:
└── All scheduled tests run here
└── Pre-deployment validation
└── Cost: Included in staging infrastructure

Production:
└── Heartbeat only (minimal)
└── Post-deployment smoke tests
└── No scheduled load tests
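
A thin wrapper can encode this policy so the wrong test profile is never pointed at the wrong environment. The script name, variable names, and test parameters below are assumptions for illustration.

#!/usr/bin/env bash
# run-perf.sh (hypothetical wrapper): map each environment to an allowed test profile.
ENVIRONMENT="${1:?usage: run-perf.sh <development|staging|production>}"

case "$ENVIRONMENT" in
  development) ./perf-test --mode manual --category policy --rps 10 ;;    # local Docker, free
  staging)     ./perf-test --mode pre-deploy --environment staging ;;     # scheduled + pre-deployment tests
  production)  echo "production allows heartbeat and smoke tests only" >&2; exit 1 ;;
  *)           echo "unknown environment: $ENVIRONMENT" >&2; exit 1 ;;
esac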

Metrics and Observability

Prometheus Metrics

Performance tests export comprehensive metrics:

# Latency percentiles
perf_test_latency_p50_ms
perf_test_latency_p95_ms
perf_test_latency_p99_ms

# Request rates
perf_test_requests_total{status="success"}
perf_test_requests_total{status="blocked"}
perf_test_requests_total{status="error"}

# Test metadata
perf_test_info{layer="scheduled", interval="hourly"}
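
These metrics can also be queried directly for spot checks or ad-hoc regression analysis. The example below computes the overall error-rate ratio through the Prometheus HTTP API; PROM_URL is a placeholder for your Prometheus endpoint.

# Spot-check the error-rate ratio over the last 5 minutes via the Prometheus HTTP API
# (PROM_URL is a placeholder for your Prometheus base URL)
curl -fsS "${PROM_URL}/api/v1/query" --data-urlencode \
  'query=sum(rate(perf_test_requests_total{status="error"}[5m])) / sum(rate(perf_test_requests_total[5m]))'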

Grafana Dashboards

Standard dashboards include:

  1. Real-Time Performance

    • Live latency percentiles
    • Request rate graphs
    • Error rate monitoring
  2. Historical Trends

    • Daily/weekly comparisons
    • Regression detection
    • Capacity utilization
  3. Test Execution

    • Scheduled test status
    • Manual test results
    • Alert history

Alerting

Automated Alerts

Condition             Severity    Action
P99 > threshold       Warning     Investigate
Error rate > 0.1%     Critical    Page on-call
Heartbeat failure     Critical    Auto-escalate
Daily test failure    Warning     Review before deploy

Alert Routing

Layer 1 (Heartbeat) failures → Immediate page
Layer 2 (Scheduled) failures → Slack notification
Layer 3 (Manual) failures → Test runner notified

Best Practices

1. Start with Staging

Never run load tests against production without explicit approval:

✅ Staging: Full test suite
✅ Production: Heartbeat + smoke tests only
❌ Production: Scheduled load tests
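
One way to enforce this is a small guard in the test entrypoint that refuses production targets unless an explicit approval flag is set. The variable names here are hypothetical.

# Guard sketch: require explicit approval before any production load test.
# TARGET_ENV and PERF_ALLOW_PRODUCTION are hypothetical variable names.
if [ "${TARGET_ENV:-}" = "production" ] && [ "${PERF_ALLOW_PRODUCTION:-}" != "yes" ]; then
  echo "Refusing to run a load test against production without explicit approval" >&2
  exit 1
fi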

2. Review Before Scaling

Before increasing test intensity:

  • Verify system can handle current load
  • Check infrastructure costs
  • Confirm no external rate limits

3. Maintain Historical Data

Keep test results for trend analysis:

  • Minimum 90 days of daily data
  • Minimum 1 year of weekly data
  • Archive before deleting
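
The retention targets above can be enforced with a periodic archival job. The results directory and bucket name below are assumptions; substitute your own storage locations.

# Archival sketch: move daily results older than 90 days to long-term storage.
# The local path and S3 bucket are assumptions, not AxonFlow defaults.
RESULTS_DIR=/var/lib/perf-results
find "${RESULTS_DIR}/daily" -name '*.json' -mtime +90 -print0 |
  while IFS= read -r -d '' f; do
    aws s3 cp "$f" "s3://perf-archive/daily/" && rm "$f"
  done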

4. Document Anomalies

When tests show unexpected results:

  1. Capture full metrics
  2. Note environmental factors
  3. Document investigation findings
  4. Update runbooks if needed

Next Steps