Performance Testing Architecture
AxonFlow's performance testing infrastructure uses a three-layer design that balances continuous monitoring with cost efficiency. This guide explains the architecture and how each layer contributes to system reliability.
Methodology
Performance testing follows a four-phase cycle:
1. Baseline
Establish current performance numbers under known conditions. Run a standard workload and record P50/P95/P99 latency, throughput, and error rate. This is your reference point for comparison.
2. Benchmark
Run the same workload after a code change, infrastructure change, or scaling event. Compare against the baseline to detect regressions or improvements.
3. Profile
When a regression is detected, use focused benchmarks plus runtime metrics to identify the root cause.
4. Optimize
Make targeted changes based on profiling data. Re-run benchmarks to confirm the improvement and update the baseline.
Baseline → Benchmark → Profile → Optimize → (new Baseline)
Benchmarking And Runtime Analysis
AxonFlow’s current public/community workflow centers on Go benchmarks plus Prometheus and Grafana; there is no documented pprof endpoint. For most regressions, the fastest path is:
- reproduce with a focused benchmark
- compare before/after with benchstat
- inspect runtime metrics in Prometheus/Grafana while the workload runs
Benchmark Comparison
Use benchstat to compare benchmark results before and after a change:
# Before change
go test -bench=BenchmarkPolicyEvaluation -count=10 ./platform/agent/... > before.txt
# After change
go test -bench=BenchmarkPolicyEvaluation -count=10 ./platform/agent/... > after.txt
# Compare
benchstat before.txt after.txt
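benchstat's headline number is the relative change in mean ns/op between the two files. A sketch of that arithmetic (the real tool additionally runs a statistical significance test across the -count=10 samples; the 1250/1100 figures are illustrative):

```go
package main

import "fmt"

// percentDelta mirrors the headline number benchstat reports:
// the relative change in mean ns/op between two benchmark runs.
// A negative result means the change made things faster.
func percentDelta(beforeNsOp, afterNsOp float64) float64 {
	return (afterNsOp - beforeNsOp) / beforeNsOp * 100
}

func main() {
	// e.g. 1250 ns/op before the change, 1100 ns/op after
	fmt.Printf("%.1f%%\n", percentDelta(1250, 1100)) // prints "-12.0%"
}
```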
What To Watch While Benchmarking
- request latency percentiles in Prometheus and Grafana
- policy engine time versus end-to-end request time
- connector error rates and saturation
- database latency if the benchmark exercises policy storage or execution persistence
Key Metrics
| Metric | Target | How to Measure |
|---|---|---|
| P50 latency | < 3ms | Prometheus / Grafana latency panels |
| P95 latency | < 10ms | Prometheus / Grafana latency panels |
| P99 latency | < 25ms | Prometheus / Grafana latency panels |
| Allocs per request | Track trend | go test -benchmem |
| GC pressure | Track trend | benchmark output plus runtime metrics |
| Connector saturation | 0 sustained backlog | Prometheus plus dashboard panels |
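The latency targets above are percentiles of a sample. A nearest-rank sketch of how such a number is computed from raw latencies (Prometheus derives its values from histogram buckets rather than raw samples, so dashboard percentiles are approximations of this):

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the pth percentile (0 < p <= 100) of a latency
// sample using the nearest-rank method.
func percentile(latenciesMs []float64, p float64) float64 {
	s := append([]float64(nil), latenciesMs...)
	sort.Float64s(s)
	idx := int(math.Ceil(p*float64(len(s))/100)) - 1
	if idx < 0 {
		idx = 0
	}
	return s[idx]
}

func main() {
	// 100 synthetic samples: 1ms, 2ms, ..., 100ms
	sample := make([]float64, 100)
	for i := range sample {
		sample[i] = float64(i + 1)
	}
	fmt.Printf("P50=%vms P95=%vms P99=%vms\n",
		percentile(sample, 50), percentile(sample, 95), percentile(sample, 99))
	// prints "P50=50ms P95=95ms P99=99ms"
}
```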
Optimization Checklist
Before and after any performance optimization, verify the following:
- Baseline benchmark recorded with go test -bench -count=10
- Request and connector metrics captured during the test window
- Allocation trend checked with -benchmem
- Change made and new benchmark recorded
- benchstat comparison shows a statistically significant improvement
- No regression in other benchmarks
- Unit tests still pass
- Coverage thresholds still met (orchestrator 76%, agent 76%, connectors 76%)
Three-Layer Design
┌─────────────────────────────────────────────────────────────┐
│ LAYER 1: HEARTBEAT │
│ ───────────────────────────────────────────────────────── │
│ • Continuous health monitoring │
│ • Keeps dashboards active │
│ • Minimal resource usage │
│ • Alternates between test modes │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ LAYER 2: SCHEDULED │
│ ───────────────────────────────────────────────────────── │
│ • Hourly: Baseline verification │
│ • Daily: Extended stress testing │
│ • Weekly: Comprehensive benchmark suite │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ LAYER 3: MANUAL │
│ ───────────────────────────────────────────────────────── │
│ • Ad-hoc testing for specific scenarios │
│ • Custom RPS and duration │
│ • Pre-deployment validation │
│ • Incident investigation │
└─────────────────────────────────────────────────────────────┘
Layer 1: Heartbeat
The heartbeat layer provides continuous, lightweight monitoring.
Purpose
- Keep observability dashboards populated with data
- Detect system availability issues quickly
- Minimal cost (single request every few seconds)
- Confirm basic connectivity and authentication
Behavior
Mode A: Policy-only requests (no LLM costs)
↓
[Wait interval]
↓
Mode B: Health check requests
↓
[Wait interval]
↓
(repeat)
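The alternating loop above can be sketched in a few lines; the mode names and the four-iteration demo are illustrative, not the actual heartbeat implementation:

```go
package main

import "fmt"

type mode string

const (
	policyOnly  mode = "policy-only"  // Mode A: no LLM cost
	healthCheck mode = "health-check" // Mode B: basic connectivity
)

// nextMode flips the heartbeat between its two modes, matching the
// Mode A -> wait -> Mode B -> wait cycle above.
func nextMode(m mode) mode {
	if m == policyOnly {
		return healthCheck
	}
	return policyOnly
}

func main() {
	m := policyOnly
	for i := 0; i < 4; i++ {
		fmt.Println(m) // the real loop sends the request, then sleeps the interval
		m = nextMode(m)
	}
}
```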
Characteristics
| Aspect | Value |
|---|---|
| Frequency | Every few seconds |
| Request Volume | Minimal |
| Cost Impact | Negligible |
| Data Retention | Real-time only |
Layer 2: Scheduled Tests
Scheduled tests run at fixed intervals to establish performance baselines.
Hourly Tests
Purpose: Verify system maintains baseline performance throughout the day.
Characteristics:
- Short duration
- Moderate request volume
- Captures hourly trends
- Alerts on regression
Daily Tests
Purpose: Extended stress testing during low-traffic windows.
Characteristics:
- Longer duration
- Higher request volume
- Tests sustained load capability
- Generates daily benchmark data
Weekly Tests
Purpose: Comprehensive benchmark suite for trend analysis.
Characteristics:
- Full test coverage
- All test categories
- Historical comparison
- Capacity planning data
Scheduling Strategy
Hour 00 01 02 03 04 05 06 07 08 09 10 11 12 ...
│ │ │ │ │ │ │ │ │ │ │ │ │
Hourly ● ● ● ● ● ● ● ● ● ● ● ● ● ...
Daily ● (low traffic window)
Weekly ● (once per week)
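The scheduling strategy above can be expressed as a small dispatch function. This is a sketch: the 03:00 daily window and the Sunday weekly slot are illustrative assumptions, not AxonFlow defaults:

```go
package main

import "fmt"

// testsDueAt returns which scheduled layers fire in a given hour,
// mirroring the diagram above.
func testsDueAt(weekday, hour int) []string {
	due := []string{"hourly"} // every hour, on the hour
	if hour == 3 { // assumed low-traffic window
		due = append(due, "daily")
	}
	if weekday == 0 && hour == 3 { // assumed weekly slot: Sunday 03:00
		due = append(due, "weekly")
	}
	return due
}

func main() {
	fmt.Println(testsDueAt(0, 3)) // [hourly daily weekly]
}
```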
Layer 3: Manual Tests
Manual tests are triggered on-demand for specific purposes.
Use Cases
- Pre-Deployment Validation
  - Run before production deployments
  - Verify new code doesn't regress performance
  - Gate for release approval
- Incident Investigation
  - Reproduce reported performance issues
  - Validate fixes before deployment
  - Compare before/after behavior
- Capacity Planning
  - Test higher-than-normal loads
  - Plan for growth scenarios
  - Validate scaling configurations
- Custom Scenarios
  - Test specific request patterns
  - Validate new features under load
  - Security testing validation
Execution
# Focused benchmark in the Agent
go test -bench=BenchmarkPolicyEvaluation -benchmem ./platform/agent/...
# Focused benchmark in the Orchestrator
go test -bench=. -benchmem ./platform/orchestrator/...
# Compare two benchmark snapshots
benchstat before.txt after.txt
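The snapshots benchstat compares are plain-text benchmark lines. A sketch of pulling ns/op out of one such line, to show what the tool is actually consuming (the example line and its numbers are illustrative):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseBenchLine extracts the benchmark name and ns/op from a single
// `go test -bench` output line, the format benchstat reads from
// before.txt and after.txt.
func parseBenchLine(line string) (name string, nsPerOp float64, err error) {
	fields := strings.Fields(line)
	for i, f := range fields {
		if f == "ns/op" && i > 0 {
			v, e := strconv.ParseFloat(fields[i-1], 64)
			return fields[0], v, e
		}
	}
	return "", 0, fmt.Errorf("not a benchmark line: %q", line)
}

func main() {
	name, ns, _ := parseBenchLine(
		"BenchmarkPolicyEvaluation-8 1000000 1234 ns/op 128 B/op 3 allocs/op")
	fmt.Println(name, ns) // BenchmarkPolicyEvaluation-8 1234
}
```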
Cost Optimization
Performance testing can generate significant costs if not managed carefully.
Policy-Only Mode
Most tests run in "policy-only" mode:
Standard Request:
Client → Agent → Orchestrator → LLM Provider → Response
Cost: $0.001-0.01 per request (LLM tokens)
Policy-Only Request:
Client → Agent → Policy Engine → Response
Cost: $0 (no LLM calls)
When to use policy-only:
- Heartbeat monitoring
- Policy performance validation
- Most scheduled tests
When to use full requests:
- LLM integration testing
- End-to-end validation
- Weekly benchmarks (limited scope)
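The per-request difference above compounds quickly at continuous-test frequencies. A rough back-of-the-envelope, using an assumed 5-second interval and the low end of the quoted per-request cost range:

```go
package main

import "fmt"

// monthlyCost estimates the LLM spend a continuously running test loop
// would generate. Interval and per-request cost are illustrative numbers
// from the range quoted above, not measured AxonFlow figures.
func monthlyCost(intervalSec, costPerRequest float64) float64 {
	requestsPerMonth := 30 * 24 * 3600 / intervalSec
	return requestsPerMonth * costPerRequest
}

func main() {
	// a heartbeat every 5s at $0.001/request, if it made real LLM calls
	fmt.Printf("$%.0f/month\n", monthlyCost(5, 0.001))
	// the same heartbeat in policy-only mode costs $0
}
```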
Environment Strategy
Development:
└── Local Docker (free)
Staging:
└── All scheduled tests run here
└── Pre-deployment validation
└── Cost: Included in staging infrastructure
Production:
└── Heartbeat only (minimal)
└── Post-deployment smoke tests
└── No scheduled load tests
Metrics and Observability
Prometheus Metrics
For local and staging performance work, rely on the Prometheus endpoints scraped by the bundled Prometheus and Grafana stack. The exact metric inventory evolves with the platform, so treat the dashboards and /metrics output as the source of truth rather than memorizing a frozen list in prose.
Grafana Dashboards
Standard dashboards include:
- Real-Time Performance
  - Live latency percentiles
  - Request rate graphs
  - Error rate monitoring
- Historical Trends
  - Daily/weekly comparisons
  - Regression detection
  - Capacity utilization
- Test Execution
  - Scheduled test status
  - Manual test results
  - Alert history
Alerting
Automated Alerts
Use performance alerts to catch regressions in the environment where you actually run governed traffic:
| Condition | Severity | Action |
|---|---|---|
| sustained latency regression | Warning | investigate before release |
| elevated error rate | Critical | page the owning team |
| blocked-rate anomaly | Warning | check policy rollout or connector behavior |
| heartbeat failure | Critical | treat as service health issue |
Alert Routing
Layer 1 (Heartbeat) failures → Immediate page
Layer 2 (Scheduled) failures → Slack notification
Layer 3 (Manual) failures → Test runner notified
Best Practices
1. Start with Staging
Never run load tests against production without explicit approval:
✅ Staging: Full test suite
✅ Production: Heartbeat + smoke tests only
❌ Production: Scheduled load tests
2. Review Before Scaling
Before increasing test intensity:
- Verify system can handle current load
- Check infrastructure costs
- Confirm no external rate limits
3. Maintain Historical Data
Keep test results for trend analysis:
- Minimum 90 days of daily data
- Minimum 1 year of weekly data
- Archive before deleting
4. Document Anomalies
When tests show unexpected results:
- Capture full metrics
- Note environmental factors
- Document investigation findings
- Update runbooks if needed
Next Steps
- Load Testing Methodology - Detailed load testing guide
- Testing Overview - Complete testing pyramid
