Performance Testing Architecture

AxonFlow's performance testing infrastructure uses a three-layer design that balances continuous monitoring with cost efficiency. This guide explains the architecture and how each layer contributes to system reliability.

Methodology

Performance testing follows a four-phase cycle:

1. Baseline

Establish current performance numbers under known conditions. Run a standard workload and record P50/P95/P99 latency, throughput, and error rate. This is your reference point for comparison.

2. Benchmark

Run the same workload after a code change, infrastructure change, or scaling event. Compare against the baseline to detect regressions or improvements.

3. Profile

When a regression is detected, use focused benchmarks plus runtime metrics to identify the root cause.

4. Optimize

Make targeted changes based on profiling data. Re-run benchmarks to confirm the improvement and update the baseline.

Baseline → Benchmark → Profile → Optimize → (new Baseline)

Benchmarking And Runtime Analysis

AxonFlow's current public/community workflow centers on Go benchmarks plus Prometheus and Grafana rather than a documented pprof endpoint. For most regressions, the fastest path is:

  1. reproduce with a focused benchmark (see the sketch after this list)
  2. compare before/after with benchstat
  3. inspect runtime metrics in Prometheus/Grafana while the workload runs
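
A focused benchmark here is a standard Go benchmark. The sketch below is illustrative only: the evaluate and sampleRequest helpers are hypothetical stand-ins for the real policy engine API, so swap in the actual evaluation call.

package policy_test

import "testing"

// evaluate and sampleRequest are hypothetical stand-ins for the real
// policy engine entry point; replace them with the actual API.
func sampleRequest() string    { return `{"action":"query","resource":"orders"}` }
func evaluate(req string) bool { return len(req) > 0 }

func BenchmarkPolicyEvaluation(b *testing.B) {
    req := sampleRequest() // fixed input keeps runs comparable
    b.ReportAllocs()       // surface allocs/op alongside ns/op
    b.ResetTimer()         // exclude setup from the measurement
    for i := 0; i < b.N; i++ {
        evaluate(req)
    }
}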

Benchmark Comparison

Use benchstat to compare benchmark results before and after a change:

# Before change
go test -bench=BenchmarkPolicyEvaluation -count=10 ./platform/agent/... > before.txt

# After change
go test -bench=BenchmarkPolicyEvaluation -count=10 ./platform/agent/... > after.txt

# Compare
benchstat before.txt after.txt
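
The output shape looks roughly like this (the numbers are invented for illustration, and newer benchstat releases format the table slightly differently):

$ benchstat before.txt after.txt
name                  old time/op  new time/op  delta
PolicyEvaluation-8    2.41µs ± 2%  1.93µs ± 3%  -19.9%  (p=0.000 n=10+10)

When the difference is not statistically significant, classic benchstat prints ~ in the delta column; that significance test is the bar the optimization checklist below refers to.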

What To Watch While Benchmarking

  • request latency percentiles in Prometheus and Grafana
  • policy engine time versus end-to-end request time
  • connector error rates and saturation
  • database latency if the benchmark exercises policy storage or execution persistence

Key Metrics

Metric                 Target                 How to Measure
P50 latency            < 3ms                  Prometheus / Grafana latency panels
P95 latency            < 10ms                 Prometheus / Grafana latency panels
P99 latency            < 25ms                 Prometheus / Grafana latency panels
Allocs per request     Track trend            go test -benchmem
GC pressure            Track trend            benchmark output plus runtime metrics
Connector saturation   0 sustained backlog    Prometheus plus dashboard panels
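
One way to read these percentiles outside Grafana is to query Prometheus's HTTP API directly. A sketch, assuming the deployment exposes a latency histogram; the metric name below is an assumption, so check your /metrics output for the real one:

# P95 request latency over the last 5 minutes; substitute the histogram
# your deployment actually exposes for the assumed metric name.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(axonflow_request_duration_seconds_bucket[5m])))'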

Optimization Checklist

Before and after any performance optimization, verify the following:

  • Baseline benchmark recorded with go test -bench -count=10
  • Request and connector metrics captured during the test window
  • Allocation trend checked with -benchmem
  • Change made and new benchmark recorded
  • benchstat comparison shows a statistically significant improvement
  • No regression in other benchmarks
  • Unit tests still pass
  • Coverage thresholds still met (orchestrator 76%, agent 76%, connectors 76%; see the command below)
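
A quick way to check the last item is go test's built-in coverage flag. The connectors path below is an assumption modeled on the other package paths:

# Confirm coverage for each component (thresholds: 76% each);
# ./platform/connectors/... is an assumed path, adjust to the real layout.
go test -cover ./platform/orchestrator/... ./platform/agent/... ./platform/connectors/...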

Three-Layer Design

┌────────────────────────────────────────────┐
│ LAYER 1: HEARTBEAT                         │
│ ────────────────────────────────────────── │
│ • Continuous health monitoring             │
│ • Keeps dashboards active                  │
│ • Minimal resource usage                   │
│ • Alternates between test modes            │
└────────────────────────────────────────────┘

┌────────────────────────────────────────────┐
│ LAYER 2: SCHEDULED                         │
│ ────────────────────────────────────────── │
│ • Hourly: Baseline verification            │
│ • Daily: Extended stress testing           │
│ • Weekly: Comprehensive benchmark suite    │
└────────────────────────────────────────────┘

┌────────────────────────────────────────────┐
│ LAYER 3: MANUAL                            │
│ ────────────────────────────────────────── │
│ • Ad-hoc testing for specific scenarios    │
│ • Custom RPS and duration                  │
│ • Pre-deployment validation                │
│ • Incident investigation                   │
└────────────────────────────────────────────┘

Layer 1: Heartbeat

The heartbeat layer provides continuous, lightweight monitoring.

Purpose

  • Keep observability dashboards populated with data
  • Detect system availability issues quickly
  • Keep cost minimal (a single request every few seconds)
  • Confirm basic connectivity and authentication

Behavior

Mode A: Policy-only requests (no LLM costs)

[Wait interval]

Mode B: Health check requests

[Wait interval]

(repeat)
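
A minimal sketch of that alternation loop, assuming hypothetical sendPolicyOnly and sendHealthCheck helpers in place of the real client calls:

package main

import (
    "log"
    "time"
)

// sendPolicyOnly and sendHealthCheck are hypothetical stand-ins for the
// real client calls; replace them with requests against your Agent endpoint.
func sendPolicyOnly() error  { return nil }
func sendHealthCheck() error { return nil }

func heartbeat(interval time.Duration) {
    policyOnly := true
    for {
        var err error
        if policyOnly {
            err = sendPolicyOnly() // Mode A: exercises the policy engine, no LLM cost
        } else {
            err = sendHealthCheck() // Mode B: confirms connectivity and auth
        }
        if err != nil {
            log.Printf("heartbeat failed: %v", err) // a real deployment would alert here
        }
        policyOnly = !policyOnly // alternate between the two modes
        time.Sleep(interval)
    }
}

func main() { heartbeat(5 * time.Second) }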

Characteristics

Aspect            Value
Frequency         Every few seconds
Request Volume    Minimal
Cost Impact       Negligible
Data Retention    Real-time only

Layer 2: Scheduled Tests

Scheduled tests run at fixed intervals to establish performance baselines.

Hourly Tests

Purpose: Verify system maintains baseline performance throughout the day.

Characteristics:

  • Short duration
  • Moderate request volume
  • Captures hourly trends
  • Alerts on regression

Daily Tests

Purpose: Extended stress testing during low-traffic windows.

Characteristics:

  • Longer duration
  • Higher request volume
  • Tests sustained load capability
  • Generates daily benchmark data

Weekly Tests

Purpose: Comprehensive benchmark suite for trend analysis.

Characteristics:

  • Full test coverage
  • All test categories
  • Historical comparison
  • Capacity planning data

Scheduling Strategy

Hour     00 01 02 03 04 05 06 07 08 09 10 11 12 ...
         │  │  │  │  │  │  │  │  │  │  │  │  │
Hourly   ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ...
Daily             ●   (low-traffic window)
Weekly   ●            (once per week)
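
In cron terms the cadence might look like the entries below. The perf-test command and its flags are hypothetical placeholders, not a real AxonFlow CLI:

# Hypothetical cron entries for the scheduled layer
0 * * * *   perf-test --suite=hourly    # baseline verification, every hour
0 3 * * *   perf-test --suite=daily     # extended stress in a low-traffic window
0 4 * * 0   perf-test --suite=weekly    # comprehensive benchmark suite, Sundays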

Layer 3: Manual Tests

Manual tests are triggered on-demand for specific purposes.

Use Cases

  1. Pre-Deployment Validation

    • Run before production deployments
    • Verify new code doesn't regress performance
    • Gate for release approval
  2. Incident Investigation

    • Reproduce reported performance issues
    • Validate fixes before deployment
    • Compare before/after behavior
  3. Capacity Planning

    • Test higher-than-normal loads
    • Plan for growth scenarios
    • Validate scaling configurations
  4. Custom Scenarios

    • Test specific request patterns
    • Validate new features under load
    • Security testing validation

Execution

# Focused benchmark in the Agent
go test -bench=BenchmarkPolicyEvaluation -benchmem ./platform/agent/...

# Focused benchmark in the Orchestrator
go test -bench=. -benchmem ./platform/orchestrator/...

# Compare two benchmark snapshots
benchstat before.txt after.txt

Cost Optimization

Performance testing can generate significant costs if not managed carefully.

Policy-Only Mode

Most tests run in "policy-only" mode:

Standard Request:
Client → Agent → Orchestrator → LLM Provider → Response
Cost: $0.001-0.01 per request (LLM tokens)

Policy-Only Request:
Client → Agent → Policy Engine → Response
Cost: $0 (no LLM calls)

When to use policy-only:

  • Heartbeat monitoring
  • Policy performance validation
  • Most scheduled tests

When to use full requests:

  • LLM integration testing
  • End-to-end validation
  • Weekly benchmarks (limited scope)

Environment Strategy

Development:
└── Local Docker (free)

Staging:
├── All scheduled tests run here
├── Pre-deployment validation
└── Cost: included in staging infrastructure

Production:
├── Heartbeat only (minimal)
├── Post-deployment smoke tests
└── No scheduled load tests

Metrics and Observability

Prometheus Metrics

For local and staging performance work, rely on the Prometheus endpoints scraped by the bundled Prometheus and Grafana stack. The exact metric inventory evolves with the platform, so treat the dashboards and /metrics output as the source of truth rather than memorizing a frozen list in prose.
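
To see that inventory directly, list the metric names from a running service. The port below is an assumption; use whatever address your Agent or Orchestrator binds for /metrics:

# List unique metric names from a live /metrics endpoint (port is an assumption)
curl -s http://localhost:8080/metrics | grep -v '^#' | cut -d '{' -f1 | cut -d ' ' -f1 | sort -u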

Grafana Dashboards

Standard dashboards include:

  1. Real-Time Performance

    • Live latency percentiles
    • Request rate graphs
    • Error rate monitoring
  2. Historical Trends

    • Daily/weekly comparisons
    • Regression detection
    • Capacity utilization
  3. Test Execution

    • Scheduled test status
    • Manual test results
    • Alert history

Alerting

Automated Alerts

Use performance alerts to catch regressions in the environment where you actually run governed traffic:

Condition                      Severity    Action
Sustained latency regression   Warning     Investigate before release
Elevated error rate            Critical    Page the owning team
Blocked-rate anomaly           Warning     Check policy rollout or connector behavior
Heartbeat failure              Critical    Treat as a service health issue
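
As a sketch, the sustained-latency condition could be expressed as a Prometheus alert expression like the one below; the metric name is an assumption, and the 10ms threshold is taken from the P95 target above:

# Fires when P95 latency stays above the 10ms target (metric name assumed)
histogram_quantile(0.95, sum by (le) (rate(axonflow_request_duration_seconds_bucket[5m]))) > 0.010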

Alert Routing

Layer 1 (Heartbeat) failures → Immediate page
Layer 2 (Scheduled) failures → Slack notification
Layer 3 (Manual) failures → Test runner notified

Best Practices

1. Start with Staging

Never run load tests against production without explicit approval:

✅ Staging: Full test suite
✅ Production: Heartbeat + smoke tests only
❌ Production: Scheduled load tests

2. Review Before Scaling

Before increasing test intensity:

  • Verify system can handle current load
  • Check infrastructure costs
  • Confirm no external rate limits

3. Maintain Historical Data

Keep test results for trend analysis:

  • Minimum 90 days of daily data
  • Minimum 1 year of weekly data
  • Archive before deleting

4. Document Anomalies

When tests show unexpected results:

  1. Capture full metrics
  2. Note environmental factors
  3. Document investigation findings
  4. Update runbooks if needed

Next Steps