Performance Testing Architecture

AxonFlow's performance testing infrastructure uses a three-layer design that balances continuous monitoring with cost efficiency. This guide explains the architecture and how each layer contributes to system reliability.

Methodology

Performance testing follows a four-phase cycle:

1. Baseline

Establish current performance numbers under known conditions. Run a standard workload and record P50/P95/P99 latency, throughput, and error rate. This is your reference point for comparison.

2. Benchmark

Run the same workload after a code change, infrastructure change, or scaling event. Compare against the baseline to detect regressions or improvements.

3. Profile

When a regression is detected, use focused benchmarks plus runtime metrics to identify the root cause.

4. Optimize

Make targeted changes based on profiling data. Re-run benchmarks to confirm the improvement and update the baseline.

Baseline → Benchmark → Profile → Optimize → (new Baseline)

Benchmarking And Runtime Analysis

AxonFlow's current public/community workflow centers on Go benchmarks plus Prometheus and Grafana rather than a documented pprof endpoint. For most regressions, the fastest path is:

  1. reproduce with a focused benchmark (see the sketch after this list)
  2. compare before/after with benchstat
  3. inspect runtime metrics in Prometheus/Grafana while the workload runs
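
A focused benchmark here is a standard Go benchmark. The sketch below is illustrative only: the evaluate and sampleRequest helpers are hypothetical stand-ins for the real policy engine API, so swap in the actual evaluation call.

package policy_test

import "testing"

// evaluate and sampleRequest are hypothetical stand-ins for the real
// policy engine entry point; replace them with the actual API.
func sampleRequest() string    { return `{"action":"query","resource":"orders"}` }
func evaluate(req string) bool { return len(req) > 0 }

func BenchmarkPolicyEvaluation(b *testing.B) {
    req := sampleRequest() // fixed input keeps runs comparable
    b.ReportAllocs()       // surface allocs/op alongside ns/op
    b.ResetTimer()         // exclude setup from the measurement
    for i := 0; i < b.N; i++ {
        evaluate(req)
    }
}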

Benchmark Comparison

Use benchstat to compare benchmark results before and after a change:

# Before change
go test -bench=BenchmarkPolicyEvaluation -count=10 ./platform/agent/... > before.txt

# After change
go test -bench=BenchmarkPolicyEvaluation -count=10 ./platform/agent/... > after.txt

# Compare
benchstat before.txt after.txt
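
The output shape looks roughly like this (the numbers are invented for illustration, and newer benchstat releases format the table slightly differently):

$ benchstat before.txt after.txt
name                  old time/op  new time/op  delta
PolicyEvaluation-8    2.41µs ± 2%  1.93µs ± 3%  -19.9%  (p=0.000 n=10+10)

When the difference is not statistically significant, classic benchstat prints ~ in the delta column; that significance test is the bar the optimization checklist below refers to.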

What To Watch While Benchmarking

  • request latency percentiles in Prometheus and Grafana
  • policy engine time versus end-to-end request time
  • connector error rates and saturation
  • database latency if the benchmark exercises policy storage or execution persistence

Key Metrics

Metric                 Target                 How to Measure
P50 latency            < 3ms                  Prometheus / Grafana latency panels
P95 latency            < 10ms                 Prometheus / Grafana latency panels
P99 latency            < 25ms                 Prometheus / Grafana latency panels
Allocs per request     Track trend            go test -benchmem
GC pressure            Track trend            benchmark output plus runtime metrics
Connector saturation   0 sustained backlog    Prometheus plus dashboard panels
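
One way to read these percentiles outside Grafana is to query Prometheus's HTTP API directly. A sketch, assuming the deployment exposes a latency histogram; the metric name below is an assumption, so check your /metrics output for the real one:

# P95 request latency over the last 5 minutes; substitute the histogram
# your deployment actually exposes for the assumed metric name.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum by (le) (rate(axonflow_request_duration_seconds_bucket[5m])))'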

Optimization Checklist

Before and after any performance optimization, verify the following:

  • Baseline benchmark recorded with go test -bench -count=10
  • Request and connector metrics captured during the test window
  • Allocation trend checked with -benchmem
  • Change made and new benchmark recorded
  • benchstat comparison shows a statistically significant improvement
  • No regression in other benchmarks
  • Unit tests still pass
  • Coverage thresholds still met (orchestrator 76%, agent 76%, connectors 76%; see the command below)
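
A quick way to check the last item is go test's built-in coverage flag. The connectors path below is an assumption modeled on the other package paths:

# Confirm coverage for each component (thresholds: 76% each);
# ./platform/connectors/... is an assumed path, adjust to the real layout.
go test -cover ./platform/orchestrator/... ./platform/agent/... ./platform/connectors/...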

Three-Layer Design

┌────────────────────────────────────────────┐
│ LAYER 1: HEARTBEAT                         │
│ ────────────────────────────────────────── │
│ • Continuous health monitoring             │
│ • Keeps dashboards active                  │
│ • Minimal resource usage                   │
│ • Alternates between test modes            │
└────────────────────────────────────────────┘

┌────────────────────────────────────────────┐
│ LAYER 2: SCHEDULED                         │
│ ────────────────────────────────────────── │
│ • Hourly: Baseline verification            │
│ • Daily: Extended stress testing           │
│ • Weekly: Comprehensive benchmark suite    │
└────────────────────────────────────────────┘

┌────────────────────────────────────────────┐
│ LAYER 3: MANUAL                            │
│ ────────────────────────────────────────── │
│ • Ad-hoc testing for specific scenarios    │
│ • Custom RPS and duration                  │
│ • Pre-deployment validation                │
│ • Incident investigation                   │
└────────────────────────────────────────────┘

Layer 1: Heartbeat

The heartbeat layer provides continuous, lightweight monitoring.

Purpose

  • Keep observability dashboards populated with data
  • Detect system availability issues quickly
  • Keep cost minimal (a single request every few seconds)
  • Confirm basic connectivity and authentication

Behavior

Mode A: Policy-only requests (no LLM costs)

[Wait interval]

Mode B: Health check requests

[Wait interval]

(repeat)
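
A minimal sketch of that alternation loop, assuming hypothetical sendPolicyOnly and sendHealthCheck helpers in place of the real client calls:

package main

import (
    "log"
    "time"
)

// sendPolicyOnly and sendHealthCheck are hypothetical stand-ins for the
// real client calls; replace them with requests against your Agent endpoint.
func sendPolicyOnly() error  { return nil }
func sendHealthCheck() error { return nil }

func heartbeat(interval time.Duration) {
    policyOnly := true
    for {
        var err error
        if policyOnly {
            err = sendPolicyOnly() // Mode A: exercises the policy engine, no LLM cost
        } else {
            err = sendHealthCheck() // Mode B: confirms connectivity and auth
        }
        if err != nil {
            log.Printf("heartbeat failed: %v", err) // a real deployment would alert here
        }
        policyOnly = !policyOnly // alternate between the two modes
        time.Sleep(interval)
    }
}

func main() { heartbeat(5 * time.Second) }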

Characteristics

Aspect            Value
Frequency         Every few seconds
Request Volume    Minimal
Cost Impact       Negligible
Data Retention    Real-time only

Layer 2: Scheduled Tests

Scheduled tests run at fixed intervals to establish performance baselines.

Hourly Tests

Purpose: Verify system maintains baseline performance throughout the day.

Characteristics:

  • Short duration
  • Moderate request volume
  • Captures hourly trends
  • Alerts on regression

Daily Tests

Purpose: Extended stress testing during low-traffic windows.

Characteristics:

  • Longer duration
  • Higher request volume
  • Tests sustained load capability
  • Generates daily benchmark data

Weekly Tests

Purpose: Comprehensive benchmark suite for trend analysis.

Characteristics:

  • Full test coverage
  • All test categories
  • Historical comparison
  • Capacity planning data

Scheduling Strategy

Hour     00 01 02 03 04 05 06 07 08 09 10 11 12 ...
         │  │  │  │  │  │  │  │  │  │  │  │  │
Hourly   ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ...
Daily             ●   (low-traffic window)
Weekly   ●            (once per week)
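
In cron terms the cadence might look like the entries below. The perf-test command and its flags are hypothetical placeholders, not a real AxonFlow CLI:

# Hypothetical cron entries for the scheduled layer
0 * * * *   perf-test --suite=hourly    # baseline verification, every hour
0 3 * * *   perf-test --suite=daily     # extended stress in a low-traffic window
0 4 * * 0   perf-test --suite=weekly    # comprehensive benchmark suite, Sundays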

Layer 3: Manual Tests

Manual tests are triggered on-demand for specific purposes.

Use Cases

  1. Pre-Deployment Validation

    • Run before production deployments
    • Verify new code doesn't regress performance
    • Gate for release approval
  2. Incident Investigation

    • Reproduce reported performance issues
    • Validate fixes before deployment
    • Compare before/after behavior
  3. Capacity Planning

    • Test higher-than-normal loads
    • Plan for growth scenarios
    • Validate scaling configurations
  4. Custom Scenarios

    • Test specific request patterns
    • Validate new features under load
    • Security testing validation

Execution

# Focused benchmark in the Agent
go test -bench=BenchmarkPolicyEvaluation -benchmem ./platform/agent/...

# Focused benchmark in the Orchestrator
go test -bench=. -benchmem ./platform/orchestrator/...

# Compare two benchmark snapshots
benchstat before.txt after.txt

Cost Optimization

Performance testing can generate significant costs if not managed carefully.

Policy-Only Mode

Most tests run in "policy-only" mode:

Standard Request:
Client → Agent → Orchestrator → LLM Provider → Response
Cost: $0.001-0.01 per request (LLM tokens)

Policy-Only Request:
Client → Agent → Policy Engine → Response
Cost: $0 (no LLM calls)

When to use policy-only:

  • Heartbeat monitoring
  • Policy performance validation
  • Most scheduled tests

When to use full requests:

  • LLM integration testing
  • End-to-end validation
  • Weekly benchmarks (limited scope)

Environment Strategy

Development:
└── Local Docker (free)

Staging:
├── All scheduled tests run here
├── Pre-deployment validation
└── Cost: included in staging infrastructure

Production:
├── Heartbeat only (minimal)
├── Post-deployment smoke tests
└── No scheduled load tests

Metrics and Observability

Prometheus Metrics

For local and staging performance work, rely on the Prometheus endpoints scraped by the bundled Prometheus and Grafana stack. The exact metric inventory evolves with the platform, so treat the dashboards and /metrics output as the source of truth rather than memorizing a frozen list in prose.
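
To see that inventory directly, list the metric names from a running service. The port below is an assumption; use whatever address your Agent or Orchestrator binds for /metrics:

# List unique metric names from a live /metrics endpoint (port is an assumption)
curl -s http://localhost:8080/metrics | grep -v '^#' | cut -d '{' -f1 | cut -d ' ' -f1 | sort -u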

Grafana Dashboards

Standard dashboards include:

  1. Real-Time Performance

    • Live latency percentiles
    • Request rate graphs
    • Error rate monitoring
  2. Historical Trends

    • Daily/weekly comparisons
    • Regression detection
    • Capacity utilization
  3. Test Execution

    • Scheduled test status
    • Manual test results
    • Alert history

Alerting

Automated Alerts

Use performance alerts to catch regressions in the environment where you actually run governed traffic:

Condition                      Severity    Action
Sustained latency regression   Warning     Investigate before release
Elevated error rate            Critical    Page the owning team
Blocked-rate anomaly           Warning     Check policy rollout or connector behavior
Heartbeat failure              Critical    Treat as a service health issue
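
As a sketch, the sustained-latency condition could be expressed as a Prometheus alert expression like the one below; the metric name is an assumption, and the 10ms threshold is taken from the P95 target above:

# Fires when P95 latency stays above the 10ms target (metric name assumed)
histogram_quantile(0.95, sum by (le) (rate(axonflow_request_duration_seconds_bucket[5m]))) > 0.010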

Alert Routing

Layer 1 (Heartbeat) failures → Immediate page
Layer 2 (Scheduled) failures → Slack notification
Layer 3 (Manual) failures → Test runner notified

Best Practices

1. Start with Staging

Never run load tests against production without explicit approval:

✅ Staging: Full test suite
✅ Production: Heartbeat + smoke tests only
❌ Production: Scheduled load tests

2. Review Before Scaling

Before increasing test intensity:

  • Verify system can handle current load
  • Check infrastructure costs
  • Confirm no external rate limits

3. Maintain Historical Data

Keep test results for trend analysis:

  • Minimum 90 days of daily data
  • Minimum 1 year of weekly data
  • Archive before deleting

4. Document Anomalies

When tests show unexpected results:

  1. Capture full metrics
  2. Note environmental factors
  3. Document investigation findings
  4. Update runbooks if needed

Next Steps