# Load Testing Methodology

AxonFlow includes a custom load testing framework designed to validate system performance under realistic production conditions. This guide covers our load testing methodology and principles.
## Tools

AxonFlow supports load testing with several industry-standard tools:
| Tool | Best For | Installation |
|---|---|---|
| k6 (recommended) | Scriptable load tests, CI integration | `brew install k6` or k6.io |
| wrk | Simple HTTP benchmarking | `brew install wrk` |
| hey | Quick one-liner load tests | `brew install hey` |
| AxonFlow built-in | Custom framework with policy-aware test categories | Included in the repository (`./load-test`) |

**Recommendation:** Use k6 for scripted, reproducible load tests. Use the built-in framework for policy-specific validation.
## Target Metrics

These are the performance targets AxonFlow should meet under load:
| Metric | Target | Degraded | Critical |
|---|---|---|---|
| P50 latency | < 3ms | 3-5ms | > 5ms |
| P95 latency | < 10ms | 10-20ms | > 20ms |
| P99 latency | < 25ms | 25-50ms | > 50ms |
| Throughput | > 1,000 RPS per node | 500-1,000 RPS | < 500 RPS |
| Error rate | 0% | < 0.1% | > 0.1% |
| Policy block rate | ~20% (depends on test mix) | Varies | 0% (policy inactive) |
## Example k6 Load Test

The following k6 script targets the AxonFlow Agent execute endpoint with a ramp-up pattern:
```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

const errorRate = new Rate('errors');
const latency = new Trend('axonflow_latency');

export const options = {
  stages: [
    { duration: '30s', target: 10 },  // ramp up to 10 VUs
    { duration: '1m', target: 50 },   // ramp to 50 VUs
    { duration: '2m', target: 100 },  // push to 100 VUs
    { duration: '30s', target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<10', 'p(99)<25'],
    errors: ['rate<0.001'],
  },
};

const AGENT_URL = __ENV.AXONFLOW_ENDPOINT || 'http://localhost:8080';
const API_KEY = __ENV.AXONFLOW_CLIENT_ID || 'demo-org';

export default function () {
  const res = http.post(
    `${AGENT_URL}/v1/execute`,
    JSON.stringify({ prompt: 'What is 2+2?', mode: 'chat' }),
    {
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${API_KEY}`,
      },
    }
  );

  check(res, {
    'status is 200': (r) => r.status === 200,
    'latency < 10ms': (r) => r.timings.duration < 10,
  });

  errorRate.add(res.status !== 200);
  latency.add(res.timings.duration);
  sleep(0.1);
}
```

Note that `stages.target` in k6 controls virtual users (VUs), not RPS directly; the effective request rate also depends on the `sleep()` pacing and response latency.
Run with:

```shell
k6 run --env AXONFLOW_ENDPOINT=http://localhost:8080 load-test.js
```
## Baseline Results

Reference baseline from a standard deployment (2 Agent nodes, 2 Orchestrator nodes, `db.t3.medium` RDS, policy-only mode):
| RPS | P50 | P95 | P99 | Error Rate | Notes |
|---|---|---|---|---|---|
| 10 | 1.2ms | 2.8ms | 4.1ms | 0% | Warmup |
| 50 | 1.8ms | 4.2ms | 7.5ms | 0% | Normal load |
| 100 | 2.4ms | 4.8ms | 8.2ms | 0% | Target load |
| 200 | 3.1ms | 8.5ms | 15.3ms | 0% | Stress test |
| 500 | 5.2ms | 18.7ms | 42.1ms | 0.02% | Near saturation |
**Note:** These numbers are for policy-only requests (no LLM calls). End-to-end latency with LLM providers will be dominated by provider response time (typically 200ms-2s).
## Design Principles

### Even Distribution

Our load generator uses a ticker-based approach rather than burst patterns:
```text
Burst Pattern (Wrong):       Even Distribution (Correct):
─────────────────────        ─────────────────────────────
│████████░░░░░░░░░░│         │█░█░█░█░█░█░█░█░█░█░█░█░█░│
│░░░░░░░░████████░░│         │█░█░█░█░█░█░█░█░█░█░█░█░█░│
│░░░░░░░░░░░░░░████│         │█░█░█░█░█░█░█░█░█░█░█░█░█░│
─────────────────────        ─────────────────────────────
Unrealistic spikes           Real client behavior
```
**Why it matters:** Burst patterns can mask performance issues that only appear under sustained load. Even distribution simulates actual production traffic patterns.
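The ticker approach can be sketched as follows: compute one fixed inter-request gap from the target RPS and fire a request on every tick. This is a minimal illustration, not the framework's actual scheduler, and `requestSchedule` is a hypothetical helper.

```javascript
// Sketch: ticker-based request pacing. Instead of firing a burst at the top
// of each second, derive one even inter-request gap and emit send times on it.
function requestSchedule(rps, durationSeconds) {
  const intervalMs = 1000 / rps;   // even gap between consecutive requests
  const total = rps * durationSeconds;
  const schedule = [];
  for (let i = 0; i < total; i++) {
    schedule.push(i * intervalMs); // send time, in ms from test start
  }
  return schedule;
}

// 10 RPS for 2s -> 20 requests, one every 100ms
const ticks = requestSchedule(10, 2);
```

In a real generator each tick would dispatch one request on a worker; the key property is that the gaps are constant rather than clustered.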
### Percentile Accuracy

We track latency percentiles using mathematically correct calculations:

| Percentile | Meaning |
|---|---|
| P50 | Median: 50% of requests complete faster |
| P95 | 95th percentile: 95% of requests complete faster (1 in 20 is slower) |
| P99 | 99th percentile: 99% of requests complete faster (1 in 100 is slower) |
**Why percentiles matter:**
- Averages hide outliers
- P95/P99 reveal tail latency issues
- SLAs are typically defined using percentiles
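As a concrete illustration, the nearest-rank method computes a percentile directly from the sorted samples (a sketch; the framework's internal calculation may differ):

```javascript
// Nearest-rank percentile: the smallest recorded value such that at least
// p% of all samples are less than or equal to it.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

const latencies = [1.2, 1.8, 2.4, 2.9, 3.1, 4.8, 5.2, 8.2, 9.9, 42.0]; // ms
const p50 = percentile(latencies, 50); // 3.1
const p99 = percentile(latencies, 99); // 42.0
// The mean of this sample is 8.15ms — the 42ms outlier that P99 exposes
// is invisible in the average.
```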
### Realistic Client Behavior

Our load generator simulates real production clients:
- Connection pooling: Reuses connections like production clients
- HTTP/2: Modern protocol with multiplexing
- TLS 1.3: Full encryption overhead included
- Keep-alive: Long-lived connections
## Test Categories

Load tests are organized into categories that validate different behaviors:

### 1. Normal Queries

Standard requests that should succeed:

- Category: `normal`
- Expected: Success (200 OK)
- Purpose: Validate happy-path performance

### 2. Security Violation Tests

Requests that should be rejected by security policies:

- Category: `security`
- Expected: Rejection (403 Forbidden)
- Purpose: Validate security rules are enforced

### 3. Policy Violation Tests

Requests that violate governance policies:

- Category: `policy`
- Expected: Blocking (varies by policy)
- Purpose: Validate policy engine correctness

### 4. LLM Integration Tests

Requests that involve LLM providers:

- Category: `llm`
- Expected: Success (200 OK)
- Purpose: Validate LLM routing and response handling
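The categories above can be validated with a small expectation table, so that a 403 on a `security` request counts as a pass rather than an error. This is an illustrative sketch; the names and statuses follow the categories listed, not a fixed framework API.

```javascript
// Map each category to its expected HTTP outcome.
const CATEGORY_EXPECTATIONS = {
  normal:   { expectStatus: 200 },
  security: { expectStatus: 403 },
  policy:   { expectBlocked: true }, // exact status varies by policy
  llm:      { expectStatus: 200 },
};

// Returns true when the observed status is the *intended* outcome for the
// category — so a blocked security probe is not counted as a failure.
function isExpectedOutcome(category, status) {
  const rule = CATEGORY_EXPECTATIONS[category];
  if (!rule) return false;
  if (rule.expectStatus !== undefined) return status === rule.expectStatus;
  return status !== 200; // blocked: any non-200 policy rejection counts
}
```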
## Metrics Collection

### Prometheus Integration

Load test results are exported in a Prometheus-compatible format:
```text
# Latency quantiles
axonflow_load_test_latency_ms{quantile="0.5"} 2.4
axonflow_load_test_latency_ms{quantile="0.95"} 4.8
axonflow_load_test_latency_ms{quantile="0.99"} 8.2

# Request counters
axonflow_load_test_requests_total{status="success"} 15000
axonflow_load_test_requests_total{status="blocked"} 3500
axonflow_load_test_requests_total{status="error"} 0
```
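Producing that output is a matter of string formatting. A minimal sketch follows; the metric names mirror the example above, and `formatMetrics` is a hypothetical helper, not the framework's real exporter.

```javascript
// Render latency quantiles and request counters as Prometheus
// exposition-format lines.
function formatMetrics(quantiles, counters) {
  const lines = [];
  for (const [q, ms] of Object.entries(quantiles)) {
    lines.push(`axonflow_load_test_latency_ms{quantile="${q}"} ${ms}`);
  }
  for (const [status, count] of Object.entries(counters)) {
    lines.push(`axonflow_load_test_requests_total{status="${status}"} ${count}`);
  }
  return lines.join('\n');
}

const text = formatMetrics(
  { '0.5': 2.4, '0.95': 4.8 },
  { success: 15000, blocked: 3500 }
);
```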
### Labels and Dimensions

Results are tagged with context:

| Label | Purpose |
|---|---|
| `client` | Client identifier |
| `test_type` | Test category (`normal`, `security`, etc.) |
| `environment` | Target environment |
| `provider` | LLM provider (if applicable) |
## Running Load Tests

### Prerequisites

- Access to the target environment
- Valid authentication credentials
- Prometheus Pushgateway (for metrics export)

### Basic Execution
```shell
# Run load test against staging
./load-test \
  --target https://staging.example.com \
  --rps 50 \
  --duration 60s \
  --category normal

# Run with metrics export
./load-test \
  --target https://staging.example.com \
  --rps 100 \
  --duration 300s \
  --pushgateway http://prometheus:9091
```
### Test Parameters

| Parameter | Description | Example |
|---|---|---|
| `--target` | Target endpoint URL | `https://api.example.com` |
| `--rps` | Requests per second | `50`, `100`, `200` |
| `--duration` | Test duration | `30s`, `5m`, `1h` |
| `--category` | Test category | `normal`, `security`, `policy` |
| `--workers` | Concurrent workers | `10`, `50` |
## Results Interpretation

### Healthy Results

- ✅ P50 < 5ms (excellent)
- ✅ P95 < 10ms (within target)
- ✅ Error rate: 0%
- ✅ Blocked rate: ~20% (policy working)

### Warning Signs

- ⚠️ P95 > 20ms: investigate latency issues
- ⚠️ Error rate > 0.1%: check system logs
- ⚠️ P50 trending upward: possible degradation
- ⚠️ Blocked rate of 0%: policy might not be active
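These checks can be automated as a gate at the end of a run. A sketch using the warning thresholds above; the function name and input shape are hypothetical.

```javascript
// Evaluate a run summary against the warning thresholds; an empty result
// means the run is healthy.
function classifyRun({ p95Ms, errorRate, blockedRate }) {
  const warnings = [];
  if (p95Ms > 20) warnings.push('P95 > 20ms: investigate latency issues');
  if (errorRate > 0.001) warnings.push('Error rate > 0.1%: check system logs');
  if (blockedRate === 0) warnings.push('Blocked rate 0%: policy might not be active');
  return warnings;
}

const result = classifyRun({ p95Ms: 4.8, errorRate: 0, blockedRate: 0.2 });
// result.length === 0 -> healthy run
```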
## Best Practices

### 1. Start Small

Begin with low RPS and increase gradually:

1. Stage 1: 10 RPS for 30s (warmup)
2. Stage 2: 50 RPS for 60s (baseline)
3. Stage 3: 100 RPS for 60s (target load)
4. Stage 4: 200 RPS for 60s (stress test)
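Expressed as k6 `options`, the staged ramp above looks like the following. As in the earlier script, `target` values are virtual users, so the actual RPS depends on per-VU pacing.

```javascript
// k6 staged ramp matching the four stages above, plus a ramp-down.
export const options = {
  stages: [
    { duration: '30s', target: 10 },  // Stage 1: warmup
    { duration: '60s', target: 50 },  // Stage 2: baseline
    { duration: '60s', target: 100 }, // Stage 3: target load
    { duration: '60s', target: 200 }, // Stage 4: stress test
    { duration: '30s', target: 0 },   // ramp down
  ],
};
```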
### 2. Test Staging First

Always validate changes in staging before production:

1. Deploy to staging
2. Run load tests
3. Review metrics
4. If passing, deploy to production
5. Run lighter validation tests
### 3. Monitor During Tests

Watch these metrics during load tests:

- CPU and memory utilization
- Database connection pool
- Error rates and logs
- Latency percentiles
### 4. Clean Up After Tests

- Scale down test infrastructure
- Archive test results
- Document any anomalies
## Next Steps

- Performance Testing Architecture - Scheduled testing infrastructure
- Testing Overview - Complete testing pyramid
