Performance Testing Architecture
AxonFlow's performance testing infrastructure uses a three-layer design that balances continuous monitoring with cost efficiency. This guide explains the architecture and how each layer contributes to system reliability.
Methodology
Performance testing follows a four-phase cycle:
1. Baseline
Establish current performance numbers under known conditions. Run a standard workload and record P50/P95/P99 latency, throughput, and error rate. This is your reference point for comparison.
2. Benchmark
Run the same workload after a code change, infrastructure change, or scaling event. Compare against the baseline to detect regressions or improvements.
3. Profile
When a regression is detected, use profiling tools to identify the root cause. See the Go Profiling section below.
4. Optimize
Make targeted changes based on profiling data. Re-run benchmarks to confirm the improvement and update the baseline.
Baseline → Benchmark → Profile → Optimize → (new Baseline)
Go Profiling
AxonFlow exposes Go pprof endpoints for CPU and memory profiling on both the Agent (port 8080) and Orchestrator (port 8081).
Enabling pprof Endpoints
Set the environment variable AXONFLOW_PPROF_ENABLED=true to expose profiling endpoints. These are disabled by default in production.
CPU Profiling
# Capture a 30-second CPU profile from the Agent
go tool pprof 'http://localhost:8080/debug/pprof/profile?seconds=30'
# From the Orchestrator
go tool pprof 'http://localhost:8081/debug/pprof/profile?seconds=30'
# Interactive commands inside pprof:
# top20 — show top 20 functions by CPU time
# web — open call graph in browser (requires graphviz)
# list <func> — show annotated source for a function
Memory Profiling
# Capture a heap profile
go tool pprof http://localhost:8080/debug/pprof/heap
# Check for goroutine leaks
go tool pprof http://localhost:8080/debug/pprof/goroutine
Trace Analysis
# Capture a 5-second execution trace
curl -o trace.out 'http://localhost:8080/debug/pprof/trace?seconds=5'
go tool trace trace.out
Comparing Profiles
Use benchstat to compare benchmark results before and after a change:
# Before change
go test -bench=BenchmarkPolicyEvaluation -count=10 ./platform/agent/... > before.txt
# After change
go test -bench=BenchmarkPolicyEvaluation -count=10 ./platform/agent/... > after.txt
# Compare
benchstat before.txt after.txt
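For readers unfamiliar with Go benchmarks, the shape of a benchmark like BenchmarkPolicyEvaluation can be sketched with the standard testing package. The policy logic below is a hypothetical stand-in, not the real engine; testing.Benchmark runs the same auto-tuned loop that `go test -bench` drives inside a test binary:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// evaluatePolicy is an illustrative stand-in for the real policy engine:
// it checks a request against a small blocklist.
func evaluatePolicy(request string, blocklist []string) bool {
	for _, term := range blocklist {
		if strings.Contains(request, term) {
			return false // blocked
		}
	}
	return true // allowed
}

func main() {
	blocklist := []string{"drop table", "exfiltrate", "rm -rf"}

	// testing.Benchmark auto-tunes b.N, just like `go test -bench`.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			evaluatePolicy("summarize the quarterly report", blocklist)
		}
	})
	fmt.Printf("%d iterations, %d ns/op\n", res.N, res.NsPerOp())
}
```

Running with -count=10, as above, gives benchstat enough samples to report statistical significance.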
Key Metrics
| Metric | Target | How to Measure |
|---|---|---|
| P50 latency | < 3ms | Prometheus perf_test_latency_p50_ms |
| P95 latency | < 10ms | Prometheus perf_test_latency_p95_ms |
| P99 latency | < 25ms | Prometheus perf_test_latency_p99_ms |
| CPU per request | < 0.5ms | pprof CPU profile |
| Allocs per request | < 50 | go test -benchmem |
| Goroutine count | < 1000 idle | pprof goroutine profile |
| GC pause | < 1ms | GODEBUG=gctrace=1 or runtime/metrics |
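The "Allocs per request" target in the table can also be checked programmatically with testing.AllocsPerRun, which reports the same figure `go test -benchmem` prints as allocs/op. The handler below is hypothetical, used only to demonstrate the accounting:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// buildResponse is a hypothetical request-handling step, not AxonFlow code.
func buildResponse(parts []string) string {
	var b strings.Builder
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

func main() {
	parts := []string{"status=ok;", "latency=2ms;", "policy=allow"}

	// Average heap allocations per call over 1000 runs.
	allocs := testing.AllocsPerRun(1000, func() {
		_ = buildResponse(parts)
	})
	fmt.Printf("allocs/op: %.0f\n", allocs)

	const budget = 50 // per-request target from the metrics table
	if allocs > budget {
		fmt.Println("over allocation budget")
	} else {
		fmt.Println("within allocation budget")
	}
}
```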
Optimization Checklist
Before and after any performance optimization, verify the following:
- Baseline benchmark recorded with go test -bench=. -count=10
- CPU profile captured and hotspot identified
- Memory profile checked for excessive allocations
- Goroutine profile checked for leaks
- Change made and new benchmark recorded
- benchstat comparison shows statistically significant improvement
- No regression in other benchmarks
- Unit tests still pass
- Coverage thresholds still met (orchestrator 76%, agent 76%, connectors 76%)
Three-Layer Design
┌─────────────────────────────────────────────────────────────┐
│ LAYER 1: HEARTBEAT │
│ ───────────────────────────────────────────────────────── │
│ • Continuous health monitoring │
│ • Keeps dashboards active │
│ • Minimal resource usage │
│ • Alternates between test modes │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ LAYER 2: SCHEDULED │
│ ───────────────────────────────────────────────────────── │
│ • Hourly: Baseline verification │
│ • Daily: Extended stress testing │
│ • Weekly: Comprehensive benchmark suite │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ LAYER 3: MANUAL │
│ ───────────────────────────────────────────────────────── │
│ • Ad-hoc testing for specific scenarios │
│ • Custom RPS and duration │
│ • Pre-deployment validation │
│ • Incident investigation │
└─────────────────────────────────────────────────────────────┘
Layer 1: Heartbeat
The heartbeat layer provides continuous, lightweight monitoring.
Purpose
- Keep observability dashboards populated with data
- Detect system availability issues quickly
- Minimal cost (single request every few seconds)
- Confirm basic connectivity and authentication
Behavior
Mode A: Policy-only requests (no LLM costs)
↓
[Wait interval]
↓
Mode B: Health check requests
↓
[Wait interval]
↓
(repeat)
Characteristics
| Aspect | Value |
|---|---|
| Frequency | Every few seconds |
| Request Volume | Minimal |
| Cost Impact | Negligible |
| Data Retention | Real-time only |
Layer 2: Scheduled Tests
Scheduled tests run at fixed intervals to establish performance baselines.
Hourly Tests
Purpose: Verify system maintains baseline performance throughout the day.
Characteristics:
- Short duration
- Moderate request volume
- Captures hourly trends
- Alerts on regression
Daily Tests
Purpose: Extended stress testing during low-traffic windows.
Characteristics:
- Longer duration
- Higher request volume
- Tests sustained load capability
- Generates daily benchmark data
Weekly Tests
Purpose: Comprehensive benchmark suite for trend analysis.
Characteristics:
- Full test coverage
- All test categories
- Historical comparison
- Capacity planning data
Scheduling Strategy
Hour 00 01 02 03 04 05 06 07 08 09 10 11 12 ...
│ │ │ │ │ │ │ │ │ │ │ │ │
Hourly ● ● ● ● ● ● ● ● ● ● ● ● ● ...
Daily ● (low traffic window)
Weekly ● (once per week)
Layer 3: Manual Tests
Manual tests are triggered on-demand for specific purposes.
Use Cases
- Pre-Deployment Validation
  - Run before production deployments
  - Verify new code doesn't regress performance
  - Gate for release approval
- Incident Investigation
  - Reproduce reported performance issues
  - Validate fixes before deployment
  - Compare before/after behavior
- Capacity Planning
  - Test higher-than-normal loads
  - Plan for growth scenarios
  - Validate scaling configurations
- Custom Scenarios
  - Test specific request patterns
  - Validate new features under load
  - Security testing validation
Execution
# Pre-deployment validation
./perf-test --mode pre-deploy --environment staging
# Capacity test
./perf-test --mode stress --rps 500 --duration 10m
# Custom scenario
./perf-test --mode manual --category policy --rps 100
Cost Optimization
Performance testing can generate significant costs if not managed carefully.
Policy-Only Mode
Most tests run in "policy-only" mode:
Standard Request:
Client → Agent → Orchestrator → LLM Provider → Response
Cost: $0.001-0.01 per request (LLM tokens)
Policy-Only Request:
Client → Agent → Policy Engine → Response
Cost: $0 (no LLM calls)
When to use policy-only:
- Heartbeat monitoring
- Policy performance validation
- Most scheduled tests
When to use full requests:
- LLM integration testing
- End-to-end validation
- Weekly benchmarks (limited scope)
Environment Strategy
Development:
└── Local Docker (free)
Staging:
└── All scheduled tests run here
└── Pre-deployment validation
└── Cost: Included in staging infrastructure
Production:
└── Heartbeat only (minimal)
└── Post-deployment smoke tests
└── No scheduled load tests
Metrics and Observability
Prometheus Metrics
Performance tests export comprehensive metrics:
# Latency percentiles
perf_test_latency_p50_ms
perf_test_latency_p95_ms
perf_test_latency_p99_ms
# Request rates
perf_test_requests_total{status="success"}
perf_test_requests_total{status="blocked"}
perf_test_requests_total{status="error"}
# Test metadata
perf_test_info{layer="scheduled", interval="hourly"}
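The latency gauges above are derived from raw per-request samples. As a rough sketch of that derivation, here is a nearest-rank percentile over a sample window; the real exporter may bucket or interpolate differently:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile of the samples, the
// kind of figure behind the perf_test_latency_p*_ms gauges.
func percentile(samplesMs []float64, p float64) float64 {
	s := append([]float64(nil), samplesMs...) // avoid mutating the caller's slice
	sort.Float64s(s)
	rank := int(math.Ceil(p/100*float64(len(s)))) - 1 // 0-based nearest rank
	if rank < 0 {
		rank = 0
	}
	if rank >= len(s) {
		rank = len(s) - 1
	}
	return s[rank]
}

func main() {
	// 100 synthetic latency samples: 1ms..100ms
	samples := make([]float64, 100)
	for i := range samples {
		samples[i] = float64(i + 1)
	}
	fmt.Println("p50:", percentile(samples, 50)) // 50
	fmt.Println("p95:", percentile(samples, 95)) // 95
	fmt.Println("p99:", percentile(samples, 99)) // 99
}
```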
Grafana Dashboards
Standard dashboards include:
- Real-Time Performance
  - Live latency percentiles
  - Request rate graphs
  - Error rate monitoring
- Historical Trends
  - Daily/weekly comparisons
  - Regression detection
  - Capacity utilization
- Test Execution
  - Scheduled test status
  - Manual test results
  - Alert history
Alerting
Automated Alerts
| Condition | Severity | Action |
|---|---|---|
| P99 > threshold | Warning | Investigate |
| Error rate > 0.1% | Critical | Page on-call |
| Heartbeat failure | Critical | Auto-escalate |
| Daily test failure | Warning | Review before deploy |
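The alert table above can be encoded as a simple severity classifier. Thresholds come from the tables in this guide (P99 < 25ms, error rate 0.1%); the function and field names are illustrative assumptions:

```go
package main

import "fmt"

type Severity string

const (
	None     Severity = "none"
	Warning  Severity = "warning"
	Critical Severity = "critical"
)

// classify applies the alert table: heartbeat and error-rate breaches
// are critical, P99 breaches and daily-test failures are warnings.
func classify(p99Ms, errorRatePct float64, heartbeatOK, dailyTestOK bool) Severity {
	switch {
	case !heartbeatOK: // auto-escalate
		return Critical
	case errorRatePct > 0.1: // page on-call
		return Critical
	case p99Ms > 25: // P99 target from the Key Metrics table
		return Warning
	case !dailyTestOK: // review before deploy
		return Warning
	}
	return None
}

func main() {
	fmt.Println(classify(12, 0.05, true, true))  // none
	fmt.Println(classify(30, 0.05, true, true))  // warning
	fmt.Println(classify(12, 0.5, true, true))   // critical
	fmt.Println(classify(12, 0.05, false, true)) // critical
}
```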
Alert Routing
Layer 1 (Heartbeat) failures → Immediate page
Layer 2 (Scheduled) failures → Slack notification
Layer 3 (Manual) failures → Test runner notified
Best Practices
1. Start with Staging
Never run load tests against production without explicit approval:
✅ Staging: Full test suite
✅ Production: Heartbeat + smoke tests only
❌ Production: Scheduled load tests
2. Review Before Scaling
Before increasing test intensity:
- Verify system can handle current load
- Check infrastructure costs
- Confirm no external rate limits
3. Maintain Historical Data
Keep test results for trend analysis:
- Minimum 90 days of daily data
- Minimum 1 year of weekly data
- Archive before deleting
4. Document Anomalies
When tests show unexpected results:
- Capture full metrics
- Note environmental factors
- Document investigation findings
- Update runbooks if needed
Next Steps
- Load Testing Methodology - Detailed load testing guide
- Testing Overview - Complete testing pyramid
