Performance Testing Architecture
AxonFlow's performance testing infrastructure uses a three-layer design that balances continuous monitoring with cost efficiency. This guide explains the architecture and how each layer contributes to system reliability.
Methodology
Performance testing follows a four-phase cycle:
1. Baseline
Establish current performance numbers under known conditions. Run a standard workload and record P50/P95/P99 latency, throughput, and error rate. This is your reference point for comparison.
2. Benchmark
Run the same workload after a code change, infrastructure change, or scaling event. Compare against the baseline to detect regressions or improvements.
3. Profile
When a regression is detected, use profiling tools to identify the root cause. See the Go Profiling section below.
4. Optimize
Make targeted changes based on profiling data. Re-run benchmarks to confirm the improvement and update the baseline.
Baseline → Benchmark → Profile → Optimize → (new Baseline)
Go Profiling
AxonFlow exposes Go pprof endpoints for CPU and memory profiling on both the Agent (port 8080) and Orchestrator (port 8081).
Enabling pprof Endpoints
Set the environment variable AXONFLOW_PPROF_ENABLED=true to expose profiling endpoints. These are disabled by default in production.
CPU Profiling
# Capture a 30-second CPU profile from the Agent
go tool pprof 'http://localhost:8080/debug/pprof/profile?seconds=30'
# From the Orchestrator
go tool pprof 'http://localhost:8081/debug/pprof/profile?seconds=30'
# Interactive commands inside pprof:
# top20 — show top 20 functions by CPU time
# web — open call graph in browser (requires graphviz)
# list <func> — show annotated source for a function
Memory Profiling
# Capture a heap profile
go tool pprof http://localhost:8080/debug/pprof/heap
# Check for goroutine leaks
go tool pprof http://localhost:8080/debug/pprof/goroutine
Trace Analysis
# Capture a 5-second execution trace
curl -o trace.out 'http://localhost:8080/debug/pprof/trace?seconds=5'
go tool trace trace.out
Comparing Profiles
Use benchstat to compare benchmark results before and after a change:
# Before change
go test -bench=BenchmarkPolicyEvaluation -count=10 ./platform/agent/... > before.txt
# After change
go test -bench=BenchmarkPolicyEvaluation -count=10 ./platform/agent/... > after.txt
# Compare
benchstat before.txt after.txt
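For readers unfamiliar with Go benchmarks, the shape of a benchmark like BenchmarkPolicyEvaluation can be sketched with the standard testing package. The policy logic below is a hypothetical stand-in, not the real engine; testing.Benchmark runs the same auto-tuned loop that `go test -bench` drives inside a test binary:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// evaluatePolicy is an illustrative stand-in for the real policy engine:
// it checks a request against a small blocklist.
func evaluatePolicy(request string, blocklist []string) bool {
	for _, term := range blocklist {
		if strings.Contains(request, term) {
			return false // blocked
		}
	}
	return true // allowed
}

func main() {
	blocklist := []string{"drop table", "exfiltrate", "rm -rf"}

	// testing.Benchmark auto-tunes b.N, just like `go test -bench`.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			evaluatePolicy("summarize the quarterly report", blocklist)
		}
	})
	fmt.Printf("%d iterations, %d ns/op\n", res.N, res.NsPerOp())
}
```

Running with -count=10, as above, gives benchstat enough samples to report statistical significance.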
Key Metrics
| Metric | Target | How to Measure |
|---|---|---|
| P50 latency | < 3ms | Prometheus perf_test_latency_p50_ms |
| P95 latency | < 10ms | Prometheus perf_test_latency_p95_ms |
| P99 latency | < 25ms | Prometheus perf_test_latency_p99_ms |
| CPU per request | < 0.5ms | pprof CPU profile |
| Allocs per request | < 50 | go test -benchmem |
| Goroutine count | < 1000 idle | pprof goroutine profile |
| GC pause | < 1ms | GODEBUG=gctrace=1 or runtime/metrics |
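The "Allocs per request" target in the table can also be checked programmatically with testing.AllocsPerRun, which reports the same figure `go test -benchmem` prints as allocs/op. The handler below is hypothetical, used only to demonstrate the accounting:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// buildResponse is a hypothetical request-handling step, not AxonFlow code.
func buildResponse(parts []string) string {
	var b strings.Builder
	for _, p := range parts {
		b.WriteString(p)
	}
	return b.String()
}

func main() {
	parts := []string{"status=ok;", "latency=2ms;", "policy=allow"}

	// Average heap allocations per call over 1000 runs.
	allocs := testing.AllocsPerRun(1000, func() {
		_ = buildResponse(parts)
	})
	fmt.Printf("allocs/op: %.0f\n", allocs)

	const budget = 50 // per-request target from the metrics table
	if allocs > budget {
		fmt.Println("over allocation budget")
	} else {
		fmt.Println("within allocation budget")
	}
}
```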
Optimization Checklist
Before and after any performance optimization, verify the following:
- Baseline benchmark recorded with go test -bench=. -count=10
- CPU profile captured and hotspot identified
- Memory profile checked for excessive allocations
- Goroutine profile checked for leaks
- Change made and new benchmark recorded
- benchstat comparison shows statistically significant improvement
- No regression in other benchmarks
- Unit tests still pass
- Coverage thresholds still met (orchestrator 76%, agent 76%, connectors 76%)
Three-Layer Design
┌─────────────────────────────────────────────────────────────┐
│ LAYER 1: HEARTBEAT │
│ ───────────────────────────────────────────────────────── │
│ • Continuous health monitoring │
│ • Keeps dashboards active │
│ • Minimal resource usage │
│ • Alternates between test modes │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ LAYER 2: SCHEDULED │
│ ───────────────────────────────────────────────────────── │
│ • Hourly: Baseline verification │
│ • Daily: Extended stress testing │
│ • Weekly: Comprehensive benchmark suite │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ LAYER 3: MANUAL │
│ ───────────────────────────────────────────────────────── │
│ • Ad-hoc testing for specific scenarios │
│ • Custom RPS and duration │
│ • Pre-deployment validation │
│ • Incident investigation │
└─────────────────────────────────────────────────────────────┘
Layer 1: Heartbeat
The heartbeat layer provides continuous, lightweight monitoring.
Purpose
- Keep observability dashboards populated with data
- Detect system availability issues quickly
- Minimal cost (single request every few seconds)
- Confirm basic connectivity and authentication
Behavior
Mode A: Policy-only requests (no LLM costs)
↓
[Wait interval]
↓
Mode B: Health check requests
↓
[Wait interval]
↓
(repeat)
Characteristics
| Aspect | Value |
|---|---|
| Frequency | Every few seconds |
| Request Volume | Minimal |
| Cost Impact | Negligible |
| Data Retention | Real-time only |
Layer 2: Scheduled Tests
Scheduled tests run at fixed intervals to establish performance baselines.
Hourly Tests
Purpose: Verify system maintains baseline performance throughout the day.
Characteristics:
- Short duration
- Moderate request volume
- Captures hourly trends
- Alerts on regression
Daily Tests
Purpose: Extended stress testing during low-traffic windows.
Characteristics:
- Longer duration
- Higher request volume
- Tests sustained load capability
- Generates daily benchmark data
Weekly Tests
Purpose: Comprehensive benchmark suite for trend analysis.
Characteristics:
- Full test coverage
- All test categories
- Historical comparison
- Capacity planning data
Scheduling Strategy
Hour 00 01 02 03 04 05 06 07 08 09 10 11 12 ...
│ │ │ │ │ │ │ │ │ │ │ │ │
Hourly ● ● ● ● ● ● ● ● ● ● ● ● ● ...
Daily ● (low traffic window)
Weekly ● (once per week)
Layer 3: Manual Tests
Manual tests are triggered on-demand for specific purposes.
Use Cases
- Pre-Deployment Validation
  - Run before production deployments
  - Verify new code doesn't regress performance
  - Gate for release approval
- Incident Investigation
  - Reproduce reported performance issues
  - Validate fixes before deployment
  - Compare before/after behavior
- Capacity Planning
  - Test higher-than-normal loads
  - Plan for growth scenarios
  - Validate scaling configurations
- Custom Scenarios
  - Test specific request patterns
  - Validate new features under load
  - Security testing validation
Execution
# Pre-deployment validation
./perf-test --mode pre-deploy --environment staging
# Capacity test
./perf-test --mode stress --rps 500 --duration 10m
# Custom scenario
./perf-test --mode manual --category policy --rps 100
Cost Optimization
Performance testing can generate significant costs if not managed carefully.
Policy-Only Mode
Most tests run in "policy-only" mode:
Standard Request:
Client → Agent → Orchestrator → LLM Provider → Response
Cost: $0.001-0.01 per request (LLM tokens)
Policy-Only Request:
Client → Agent → Policy Engine → Response
Cost: $0 (no LLM calls)
When to use policy-only:
- Heartbeat monitoring
- Policy performance validation
- Most scheduled tests
When to use full requests:
- LLM integration testing
- End-to-end validation
- Weekly benchmarks (limited scope)
Environment Strategy
Development:
└── Local Docker (free)
Staging:
└── All scheduled tests run here
└── Pre-deployment validation
└── Cost: Included in staging infrastructure
Production:
└── Heartbeat only (minimal)
└── Post-deployment smoke tests
└── No scheduled load tests
Metrics and Observability
Prometheus Metrics
Performance tests export comprehensive metrics:
# Latency percentiles
perf_test_latency_p50_ms
perf_test_latency_p95_ms
perf_test_latency_p99_ms
# Request rates
perf_test_requests_total{status="success"}
perf_test_requests_total{status="blocked"}
perf_test_requests_total{status="error"}
# Test metadata
perf_test_info{layer="scheduled", interval="hourly"}
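The latency gauges above are derived from raw per-request samples. As a rough sketch of that derivation, here is a nearest-rank percentile over a sample window; the real exporter may bucket or interpolate differently:

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile of the samples, the
// kind of figure behind the perf_test_latency_p*_ms gauges.
func percentile(samplesMs []float64, p float64) float64 {
	s := append([]float64(nil), samplesMs...) // avoid mutating the caller's slice
	sort.Float64s(s)
	rank := int(math.Ceil(p/100*float64(len(s)))) - 1 // 0-based nearest rank
	if rank < 0 {
		rank = 0
	}
	if rank >= len(s) {
		rank = len(s) - 1
	}
	return s[rank]
}

func main() {
	// 100 synthetic latency samples: 1ms..100ms
	samples := make([]float64, 100)
	for i := range samples {
		samples[i] = float64(i + 1)
	}
	fmt.Println("p50:", percentile(samples, 50)) // 50
	fmt.Println("p95:", percentile(samples, 95)) // 95
	fmt.Println("p99:", percentile(samples, 99)) // 99
}
```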
Grafana Dashboards
Standard dashboards include:
- Real-Time Performance
  - Live latency percentiles
  - Request rate graphs
  - Error rate monitoring
- Historical Trends
  - Daily/weekly comparisons
  - Regression detection
  - Capacity utilization
- Test Execution
  - Scheduled test status
  - Manual test results
  - Alert history
Alerting
Automated Alerts
| Condition | Severity | Action |
|---|---|---|
| P99 > threshold | Warning | Investigate |
| Error rate > 0.1% | Critical | Page on-call |
| Heartbeat failure | Critical | Auto-escalate |
| Daily test failure | Warning | Review before deploy |
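The alert table above can be encoded as a simple severity classifier. Thresholds come from the tables in this guide (P99 < 25ms, error rate 0.1%); the function and field names are illustrative assumptions:

```go
package main

import "fmt"

type Severity string

const (
	None     Severity = "none"
	Warning  Severity = "warning"
	Critical Severity = "critical"
)

// classify applies the alert table: heartbeat and error-rate breaches
// are critical, P99 breaches and daily-test failures are warnings.
func classify(p99Ms, errorRatePct float64, heartbeatOK, dailyTestOK bool) Severity {
	switch {
	case !heartbeatOK: // auto-escalate
		return Critical
	case errorRatePct > 0.1: // page on-call
		return Critical
	case p99Ms > 25: // P99 target from the Key Metrics table
		return Warning
	case !dailyTestOK: // review before deploy
		return Warning
	}
	return None
}

func main() {
	fmt.Println(classify(12, 0.05, true, true))  // none
	fmt.Println(classify(30, 0.05, true, true))  // warning
	fmt.Println(classify(12, 0.5, true, true))   // critical
	fmt.Println(classify(12, 0.05, false, true)) // critical
}
```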
Alert Routing
Layer 1 (Heartbeat) failures → Immediate page
Layer 2 (Scheduled) failures → Slack notification
Layer 3 (Manual) failures → Test runner notified
Best Practices
1. Start with Staging
Never run load tests against production without explicit approval:
✅ Staging: Full test suite
✅ Production: Heartbeat + smoke tests only
❌ Production: Scheduled load tests
2. Review Before Scaling
Before increasing test intensity:
- Verify system can handle current load
- Check infrastructure costs
- Confirm no external rate limits
3. Maintain Historical Data
Keep test results for trend analysis:
- Minimum 90 days of daily data
- Minimum 1 year of weekly data
- Archive before deleting
4. Document Anomalies
When tests show unexpected results:
- Capture full metrics
- Note environmental factors
- Document investigation findings
- Update runbooks if needed
Next Steps
- Load Testing Methodology - Detailed load testing guide
- Testing Overview - Complete testing pyramid
