
Performance Testing Architecture

AxonFlow's performance testing infrastructure uses a three-layer design that balances continuous monitoring with cost efficiency. This guide explains the architecture and how each layer contributes to system reliability.

Methodology

Performance testing follows a four-phase cycle:

1. Baseline

Establish current performance numbers under known conditions. Run a standard workload and record P50/P95/P99 latency, throughput, and error rate. This is your reference point for comparison.

2. Benchmark

Run the same workload after a code change, infrastructure change, or scaling event. Compare against the baseline to detect regressions or improvements.

3. Profile

When a regression is detected, use profiling tools to identify the root cause. See the Go Profiling section below.

4. Optimize

Make targeted changes based on profiling data. Re-run benchmarks to confirm the improvement and update the baseline.

Baseline → Benchmark → Profile → Optimize → (new Baseline)

Go Profiling

AxonFlow exposes Go pprof endpoints for CPU and memory profiling on both the Agent (port 8080) and Orchestrator (port 8081).

Enabling pprof Endpoints

Set the environment variable AXONFLOW_PPROF_ENABLED=true to expose profiling endpoints. These are disabled by default in production.

CPU Profiling

# Capture a 30-second CPU profile from the Agent
# (quote the URL so the shell does not treat "?" as a glob)
go tool pprof "http://localhost:8080/debug/pprof/profile?seconds=30"

# From the Orchestrator
go tool pprof "http://localhost:8081/debug/pprof/profile?seconds=30"

# Interactive commands inside pprof:
# top20 — show top 20 functions by CPU time
# web — open call-graph visualization in browser (requires graphviz)
# list <func> — show annotated source for a function
# For an interactive flame graph, run pprof with -http=:8082 instead

Memory Profiling

# Capture a heap profile
go tool pprof http://localhost:8080/debug/pprof/heap

# Check for goroutine leaks
go tool pprof http://localhost:8080/debug/pprof/goroutine

Trace Analysis

# Capture a 5-second execution trace
curl -o trace.out "http://localhost:8080/debug/pprof/trace?seconds=5"
go tool trace trace.out

Comparing Profiles

Use benchstat to compare benchmark results before and after a change:

# Before change (-run='^$' skips unit tests; -benchmem records allocations)
go test -bench=BenchmarkPolicyEvaluation -benchmem -run='^$' -count=10 ./platform/agent/... > before.txt

# After change
go test -bench=BenchmarkPolicyEvaluation -benchmem -run='^$' -count=10 ./platform/agent/... > after.txt

# Compare
benchstat before.txt after.txt
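
As a shape reference, a Go benchmark for a policy-style hot path looks like the following. The `evaluate` function is a stand-in for illustration, not AxonFlow's actual policy engine API:

```go
package main

import "testing"

// evaluate is a stand-in for the policy check under benchmark; the real
// AxonFlow policy engine API is not shown in this guide.
func evaluate(action string, allowed map[string]bool) bool {
	return allowed[action]
}

// BenchmarkPolicyEvaluation matches the -bench pattern used above.
// b.ReportAllocs surfaces allocs/op, the same data as -benchmem.
func BenchmarkPolicyEvaluation(b *testing.B) {
	allowed := map[string]bool{"read": true, "write": false}
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		evaluate("read", allowed)
	}
}
```

Running with `-count=10` gives benchstat enough samples to report statistical significance rather than one-off noise.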

Key Metrics

Metric              Target        How to Measure
P50 latency         < 3ms         Prometheus perf_test_latency_p50_ms
P95 latency         < 10ms        Prometheus perf_test_latency_p95_ms
P99 latency         < 25ms        Prometheus perf_test_latency_p99_ms
CPU per request     < 0.5ms       pprof CPU profile
Allocs per request  < 50          go test -benchmem
Goroutine count     < 1000 idle   pprof goroutine profile
GC pause            < 1ms         GODEBUG=gctrace=1 or runtime/metrics
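
The GC pause target can be checked in-process via the runtime/metrics package instead of parsing gctrace output. A small sketch:

```go
package main

import "runtime/metrics"

// gcPauseHistogram reads the runtime's histogram of stop-the-world GC
// pause latencies, an alternative to GODEBUG=gctrace=1 for checking
// the < 1ms pause target from inside the service.
func gcPauseHistogram() *metrics.Float64Histogram {
	sample := []metrics.Sample{{Name: "/gc/pauses:seconds"}}
	metrics.Read(sample)
	return sample[0].Value.Float64Histogram()
}
```

Bucket boundaries are in seconds, so the target corresponds to all counts landing below 0.001.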

Optimization Checklist

Before and after any performance optimization, verify the following:

  • Baseline benchmark recorded with go test -bench=. -count=10
  • CPU profile captured and hotspot identified
  • Memory profile checked for excessive allocations
  • Goroutine profile checked for leaks
  • Change made and new benchmark recorded
  • benchstat comparison shows statistically significant improvement
  • No regression in other benchmarks
  • Unit tests still pass
  • Coverage thresholds still met (orchestrator 76%, agent 76%, connectors 76%)

Three-Layer Design

┌───────────────────────────────────────────┐
│ LAYER 1: HEARTBEAT                        │
│ ───────────────────────────────────────── │
│ • Continuous health monitoring            │
│ • Keeps dashboards active                 │
│ • Minimal resource usage                  │
│ • Alternates between test modes           │
└───────────────────────────────────────────┘


┌───────────────────────────────────────────┐
│ LAYER 2: SCHEDULED                        │
│ ───────────────────────────────────────── │
│ • Hourly: Baseline verification           │
│ • Daily: Extended stress testing          │
│ • Weekly: Comprehensive benchmark suite   │
└───────────────────────────────────────────┘


┌───────────────────────────────────────────┐
│ LAYER 3: MANUAL                           │
│ ───────────────────────────────────────── │
│ • Ad-hoc testing for specific scenarios   │
│ • Custom RPS and duration                 │
│ • Pre-deployment validation               │
│ • Incident investigation                  │
└───────────────────────────────────────────┘

Layer 1: Heartbeat

The heartbeat layer provides continuous, lightweight monitoring.

Purpose

  • Keep observability dashboards populated with data
  • Detect system availability issues quickly
  • Minimal cost (single request every few seconds)
  • Confirm basic connectivity and authentication

Behavior

Mode A: Policy-only requests (no LLM costs)

[Wait interval]

Mode B: Health check requests

[Wait interval]

(repeat)

Characteristics

Aspect           Value
Frequency        Every few seconds
Request Volume   Minimal
Cost Impact      Negligible
Data Retention   Real-time only
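
The alternating Mode A / Mode B cycle above can be sketched as a simple loop. The probe signatures and interval are illustrative, not AxonFlow's actual implementation:

```go
package main

import "time"

// heartbeat cycles through the probe list (Mode A, Mode B, ...) with a
// fixed wait between probes, collecting any failures for alerting.
func heartbeat(interval time.Duration, probes []func() error, rounds int) []error {
	var failures []error
	for i := 0; i < rounds; i++ {
		probe := probes[i%len(probes)] // Mode A on even rounds, Mode B on odd
		if err := probe(); err != nil {
			failures = append(failures, err)
		}
		time.Sleep(interval) // [Wait interval]
	}
	return failures
}
```

In practice the probe functions would be a policy-only request and a health check, and a non-empty failure list would trigger the heartbeat alert path.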

Layer 2: Scheduled Tests

Scheduled tests run at fixed intervals to establish performance baselines.

Hourly Tests

Purpose: Verify system maintains baseline performance throughout the day.

Characteristics:

  • Short duration
  • Moderate request volume
  • Captures hourly trends
  • Alerts on regression

Daily Tests

Purpose: Extended stress testing during low-traffic windows.

Characteristics:

  • Longer duration
  • Higher request volume
  • Tests sustained load capability
  • Generates daily benchmark data

Weekly Tests

Purpose: Comprehensive benchmark suite for trend analysis.

Characteristics:

  • Full test coverage
  • All test categories
  • Historical comparison
  • Capacity planning data

Scheduling Strategy

Hour    00 01 02 03 04 05 06 07 08 09 10 11 12 ...
        │  │  │  │  │  │  │  │  │  │  │  │  │
Hourly  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ●  ...
Daily         ●   (low-traffic window)
Weekly  ●         (once per week)

Layer 3: Manual Tests

Manual tests are triggered on-demand for specific purposes.

Use Cases

  1. Pre-Deployment Validation

    • Run before production deployments
    • Verify new code doesn't regress performance
    • Gate for release approval
  2. Incident Investigation

    • Reproduce reported performance issues
    • Validate fixes before deployment
    • Compare before/after behavior
  3. Capacity Planning

    • Test higher-than-normal loads
    • Plan for growth scenarios
    • Validate scaling configurations
  4. Custom Scenarios

    • Test specific request patterns
    • Validate new features under load
    • Security testing validation

Execution

# Pre-deployment validation
./perf-test --mode pre-deploy --environment staging

# Capacity test
./perf-test --mode stress --rps 500 --duration 10m

# Custom scenario
./perf-test --mode manual --category policy --rps 100

Cost Optimization

Performance testing can generate significant costs if not managed carefully.

Policy-Only Mode

Most tests run in "policy-only" mode:

Standard Request:
Client → Agent → Orchestrator → LLM Provider → Response
Cost: $0.001-0.01 per request (LLM tokens)

Policy-Only Request:
Client → Agent → Policy Engine → Response
Cost: $0 (no LLM calls)
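
To see why this matters even at heartbeat volume, a back-of-the-envelope calculation, assuming one probe every 5 seconds (an illustrative rate, not one this guide specifies):

```go
package main

// Back-of-the-envelope cost model for heartbeat traffic. The 5-second
// interval and the per-request prices above are the only inputs.

func dailyHeartbeatRequests(intervalSeconds int) int {
	return 86400 / intervalSeconds // seconds per day / probe interval
}

func dailyLLMCost(requests int, costPerRequest float64) float64 {
	return float64(requests) * costPerRequest
}

// One probe every 5s is 17,280 requests/day: roughly $17-$173/day if
// every probe hit the LLM, versus $0/day in policy-only mode.
```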

When to use policy-only:

  • Heartbeat monitoring
  • Policy performance validation
  • Most scheduled tests

When to use full requests:

  • LLM integration testing
  • End-to-end validation
  • Weekly benchmarks (limited scope)

Environment Strategy

Development:
└── Local Docker (free)

Staging:
└── All scheduled tests run here
└── Pre-deployment validation
└── Cost: Included in staging infrastructure

Production:
└── Heartbeat only (minimal)
└── Post-deployment smoke tests
└── No scheduled load tests

Metrics and Observability

Prometheus Metrics

Performance tests export comprehensive metrics:

# Latency percentiles
perf_test_latency_p50_ms
perf_test_latency_p95_ms
perf_test_latency_p99_ms

# Request rates
perf_test_requests_total{status="success"}
perf_test_requests_total{status="blocked"}
perf_test_requests_total{status="error"}

# Test metadata
perf_test_info{layer="scheduled", interval="hourly"}

Grafana Dashboards

Standard dashboards include:

  1. Real-Time Performance

    • Live latency percentiles
    • Request rate graphs
    • Error rate monitoring
  2. Historical Trends

    • Daily/weekly comparisons
    • Regression detection
    • Capacity utilization
  3. Test Execution

    • Scheduled test status
    • Manual test results
    • Alert history

Alerting

Automated Alerts

Condition            Severity    Action
P99 > threshold      Warning     Investigate
Error rate > 0.1%    Critical    Page on-call
Heartbeat failure    Critical    Auto-escalate
Daily test failure   Warning     Review before deploy

Alert Routing

Layer 1 (Heartbeat) failures → Immediate page
Layer 2 (Scheduled) failures → Slack notification
Layer 3 (Manual) failures → Test runner notified

Best Practices

1. Start with Staging

Never run load tests against production without explicit approval:

✅ Staging: Full test suite
✅ Production: Heartbeat + smoke tests only
❌ Production: Scheduled load tests

2. Review Before Scaling

Before increasing test intensity:

  • Verify system can handle current load
  • Check infrastructure costs
  • Confirm no external rate limits

3. Maintain Historical Data

Keep test results for trend analysis:

  • Minimum 90 days of daily data
  • Minimum 1 year of weekly data
  • Archive before deleting

4. Document Anomalies

When tests show unexpected results:

  1. Capture full metrics
  2. Note environmental factors
  3. Document investigation findings
  4. Update runbooks if needed

Next Steps