Load Testing Methodology
AxonFlow includes a custom load testing framework designed to validate system performance under realistic production conditions. This guide covers our load testing methodology and principles.
Tools
AxonFlow supports load testing with several industry-standard tools:
| Tool | Best For | Installation |
|---|---|---|
| k6 (recommended) | Scriptable load tests, CI integration | brew install k6 or k6.io |
| wrk | Simple HTTP benchmarking | brew install wrk |
| hey | Quick one-liner load tests | brew install hey |
| AxonFlow internal harness | Sustained-load and staging validation used by the platform team | Available to internal teams; environment-specific |
Recommendation: Use k6 for scripted, reproducible community load tests. The internal harness is environment-specific and is not the right starting point for users outside the platform team.
Target Metrics
These are the target performance numbers that AxonFlow should meet under load. Targets assume a test client co-located in the agent's VPC with skip_llm=true unless otherwise noted. Internet-facing topologies and full-LLM runs add network and provider latency on top of these numbers.
| Metric | Target | Degraded | Critical |
|---|---|---|---|
| P50 latency | < 3ms | 3-5ms | > 5ms |
| P95 latency | < 10ms | 10-20ms | > 20ms |
| P99 latency | < 25ms | 25-50ms | > 50ms |
| Throughput | > 1,000 RPS per node | 500-1,000 RPS | < 500 RPS |
| Error rate | 0% | < 0.1% | > 0.1% |
| Policy block rate | ~20% (depends on test mix) | Varies | 0% (policy inactive) |
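The latency bands above can be encoded so that a run is graded mechanically rather than by eye. This is a minimal sketch, assuming the thresholds are exactly those in the table; `BANDS` and `classifyLatency` are hypothetical names, not part of AxonFlow, and grading boundary values (for example exactly 10 ms) as degraded is an assumption.

```javascript
// Latency bands in milliseconds, copied from the Target Metrics table.
// BANDS and classifyLatency are illustrative names, not an AxonFlow API.
const BANDS = {
  p50: { target: 3, critical: 5 },
  p95: { target: 10, critical: 20 },
  p99: { target: 25, critical: 50 },
};

// Returns 'target', 'degraded', or 'critical' for one percentile reading.
// Assumption: a value exactly on a boundary is graded as 'degraded'.
function classifyLatency(percentile, valueMs) {
  const band = BANDS[percentile];
  if (!band) throw new Error(`unknown percentile: ${percentile}`);
  if (valueMs < band.target) return 'target';
  if (valueMs > band.critical) return 'critical';
  return 'degraded';
}

console.log(classifyLatency('p95', 8.98));  // target
console.log(classifyLatency('p99', 46.94)); // degraded
```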
Example k6 Load Test
The following k6 script targets the current Agent request entrypoint, POST /api/request, with a simple ramp-up pattern. It is intentionally community-friendly and uses skip_llm: true so you can measure governance and routing overhead without paying provider latency.
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

// Custom metrics reported alongside k6's built-in ones
const errorRate = new Rate('errors');
const latency = new Trend('axonflow_latency');

export const options = {
  // Stage targets are virtual users (VUs), not RPS. With sleep(0.1) each VU
  // sends at most ~10 requests per second, so achieved RPS depends on latency.
  stages: [
    { duration: '30s', target: 10 },  // ramp up to 10 VUs
    { duration: '1m', target: 50 },   // ramp to 50 VUs
    { duration: '2m', target: 100 },  // ramp to 100 VUs
    { duration: '30s', target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<10', 'p(99)<25'], // milliseconds
    errors: ['rate<0.001'],                      // under 0.1% errors
  },
};

const AGENT_URL = __ENV.AXONFLOW_ENDPOINT || 'http://localhost:8080';
const CLIENT_ID = __ENV.AXONFLOW_CLIENT_ID || 'community';

export default function () {
  const res = http.post(
    `${AGENT_URL}/api/request`,
    JSON.stringify({
      query: 'What is 2+2?',
      client_id: CLIENT_ID,
      request_type: 'chat',
      user_token: 'load-test-user',
      skip_llm: true, // measure governance and routing only
    }),
    {
      headers: {
        'Content-Type': 'application/json',
      },
    }
  );

  check(res, {
    'status is 200': (r) => r.status === 200,
    'latency < 10ms': (r) => r.timings.duration < 10,
  });

  errorRate.add(res.status !== 200);
  latency.add(res.timings.duration);
  sleep(0.1);
}
Run with:
k6 run --env AXONFLOW_ENDPOINT=http://localhost:8080 load-test.js
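With k6's default VU-based stages, a stage `target` is a number of virtual users, so the achieved request rate depends on response time and on the `sleep` call. If you want the load expressed directly in requests per second, k6's `ramping-arrival-rate` executor is one option. The sketch below builds such an options object in plain JavaScript; `arrivalRateOptions` and the VU pool sizes are assumptions for illustration, not values from the AxonFlow benchmark.

```javascript
// Builds a k6 options object that drives load by arrival rate (iterations
// per second) instead of by virtual-user count.
// arrivalRateOptions and the VU pool sizes are illustrative assumptions.
function arrivalRateOptions(stages, { preAllocatedVUs = 50, maxVUs = 200 } = {}) {
  return {
    scenarios: {
      ramp: {
        executor: 'ramping-arrival-rate',
        startRate: 0,      // iterations per timeUnit at the start
        timeUnit: '1s',    // so stage targets read as iterations per second
        preAllocatedVUs,   // VUs k6 creates up front
        maxVUs,            // ceiling k6 may grow to if responses slow down
        stages,
      },
    },
    thresholds: {
      http_req_duration: ['p(95)<10', 'p(99)<25'],
    },
  };
}

const options = arrivalRateOptions([
  { duration: '30s', target: 10 },  // ramp to 10 iterations/s
  { duration: '1m', target: 50 },   // ramp to 50 iterations/s
  { duration: '2m', target: 100 },  // ramp to 100 iterations/s
  { duration: '30s', target: 0 },   // ramp down
]);

console.log(JSON.stringify(options.scenarios.ramp.stages));
```

In a k6 script you would write `export const options = arrivalRateOptions([...])` and remove `sleep` from the default function, since the executor controls pacing. One iteration equals one request only if the default function sends a single request, as the script above does.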
Measured Results
These are the latest published benchmark numbers as of May 2, 2026. The runtime and policy mix evolve between releases — if the date above is more than a few months old when you read this, treat the numbers as a floor rather than a current guarantee.
Summary — both topologies, HTTP/2 + TLS 1.3
The May 2, 2026 sweep ran against two purpose-built ephemeral benchmark stacks: an in-VPC topology (internal ALB) and a SaaS-shaped topology (internet-facing ALB). Both stacks share the same 5 agent + 10 orchestrator Fargate shape (1 vCPU / 2 GB each, multi-AZ) and db.t3.large Multi-AZ; the only varied dimension is the ALB scheme. HTTPS:443 listener with TLS 1.3 + ALPN h2. Test client co-located in the same VPC for both. Numbers below are steady-state (second-run) measurements — first-run numbers paid TLS handshake + cache warmup cost that dominated the tail at low RPS.
| RPS | Topology | LLM Mode | P50 | P95 | P99 | HTTP success | Policy correctness |
|---|---|---|---|---|---|---|---|
| 20 | in-VPC (internal ALB) | skip_llm=true | 6.30 ms | 8.98 ms | 24.80 ms | 100 % | 100 % |
| 20 | SaaS (internet-facing ALB) | skip_llm=true | 6.33 ms | 9.95 ms | 27.46 ms | 100 % | 100 % |
| 50 | in-VPC | skip_llm=true | 6.32 ms | 12.05 ms | 38.62 ms | 100 % | 100 % |
| 50 | SaaS | skip_llm=true | 6.50 ms | 10.30 ms | 26.12 ms | 100 % | 100 % |
| 100 | in-VPC | skip_llm=true | 6.35 ms | 13.61 ms | 46.94 ms | 100 % | 100 % |
| 100 | SaaS | skip_llm=true | 6.43 ms | 18.46 ms | 76.43 ms | 100 % | 100 % |
Policy correctness counts requests whose actual policy verdict matched the scenario's expected verdict — BLOCK on sql_injection_* and destructive-DML scenarios, ALLOW on normal and pii scenarios. HTTP-2xx alone is not sufficient for a benchmark headline; the May 2 sweep verifies both ends.
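The policy-correctness rule can be sketched as a comparison between each request's actual verdict and the verdict its scenario expects. The scenario names, the `dangerous` prefix for destructive-DML scenarios, and the result shape below are assumptions for illustration; the real harness may label things differently.

```javascript
// Expected verdict per scenario, following the rule stated above:
// BLOCK for sql_injection_* and destructive-DML scenarios, ALLOW otherwise.
// Scenario names and the 'dangerous' prefix are hypothetical.
function expectedVerdict(scenario) {
  if (scenario.startsWith('sql_injection') || scenario.startsWith('dangerous')) {
    return 'BLOCK';
  }
  return 'ALLOW'; // normal and pii scenarios
}

// Fraction of requests whose actual verdict matched the expected one.
function policyCorrectness(results) {
  if (results.length === 0) return 0;
  const matched = results.filter((r) => r.verdict === expectedVerdict(r.scenario));
  return matched.length / results.length;
}

const sample = [
  { scenario: 'normal_lookup', verdict: 'ALLOW' },
  { scenario: 'sql_injection_union', verdict: 'BLOCK' },
  { scenario: 'dangerous_drop_table', verdict: 'BLOCK' },
  { scenario: 'pii_email', verdict: 'BLOCK' }, // unexpected: should be ALLOW
];

console.log(policyCorrectness(sample)); // 0.75
```

A request that returns HTTP 200 with the wrong verdict still counts against this number, which is why it is tracked separately from HTTP success.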
P50 stays within ~7 ms across both topologies at every RPS level — the platform's policy + orchestration cost is independent of ALB scheme, as expected. P95 grows monotonically with RPS as it should. 20 RPS holds sub-10 ms P95 on both topologies; 50–100 RPS pays a few extra ms at P95 versus the November 2025 baseline, reflecting the cumulative cost of new policy-engine features and middleware shipped over the past 5 months.
Why these are second-run (warm) numbers
The first run of each profile against a freshly-deployed stack pays a cold-start tax that lands almost entirely in the tail percentiles: the initial TLS-session cache is empty, agent and orchestrator DB connection pools have to fill, and the policy engine's regex caches haven't been touched yet. With only 600 samples in the 20 RPS / 30 s profile, those first ~30 outlier handshakes land directly in the P95 number.
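The sample-count arithmetic behind that statement is worth making explicit: in a fixed-rate run, the number of samples above a percentile is small at low RPS, so a handful of slow handshakes is enough to set the reported value. A minimal sketch (`samplesAbovePercentile` is a hypothetical helper):

```javascript
// Number of samples that sit above a given percentile in a fixed-rate run.
function samplesAbovePercentile(rps, durationSec, percentile) {
  const total = rps * durationSec;
  // Multiply before dividing to limit floating-point drift.
  return Math.round((total * (100 - percentile)) / 100);
}

console.log(samplesAbovePercentile(20, 30, 95));   // 30
console.log(samplesAbovePercentile(20, 30, 99));   // 6
console.log(samplesAbovePercentile(100, 120, 99)); // 120
```

At 20 RPS for 30 s, only 30 samples lie above P95 and only 6 above P99, so roughly 30 cold handshakes are sufficient to move both numbers.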
Re-running the same profiles on the now-warm stack measured the actual steady-state cost. The drop at the tail is dramatic and isolates the warmup tax cleanly:
| Profile | First run (cold stack) | Second run (warm stack) | Drop |
|---|---|---|---|
| 20 RPS Light, P95 | 91.81 ms | 9.95 ms | 9.2× |
| 20 RPS Light, P99 | 917.32 ms | 27.46 ms | 33× |
| 50 RPS Hourly, P95 | 37.91 ms | 10.30 ms | 3.7× |
| 50 RPS Hourly, P99 | 747.23 ms | 26.12 ms | 29× |
(SaaS topology shown — same pattern on in-VPC, smaller in absolute terms because the internal ALB has lighter cold-start cost than the public one.)
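The Drop column is the cold-stack value divided by the warm-stack value. The snippet below recomputes it from the table so the same check can be applied to your own first-run and second-run numbers; `dropFactor` is a hypothetical helper name.

```javascript
// Ratio of cold-stack to warm-stack latency for one percentile.
function dropFactor(coldMs, warmMs) {
  return coldMs / warmMs;
}

// SaaS-topology values from the table above.
const rows = [
  { profile: '20 RPS Light, P95', cold: 91.81, warm: 9.95 },
  { profile: '20 RPS Light, P99', cold: 917.32, warm: 27.46 },
  { profile: '50 RPS Hourly, P95', cold: 37.91, warm: 10.30 },
  { profile: '50 RPS Hourly, P99', cold: 747.23, warm: 26.12 },
];

for (const row of rows) {
  console.log(`${row.profile}: ${dropFactor(row.cold, row.warm).toFixed(1)}x`);
}
```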
P50 was unaffected on every row (~6.5 ms cold, ~6.3 ms warm) — cold-start cost lives entirely in the tail, not the median. The published numbers above are warm-stack measurements. A real customer's traffic, hitting a stack that has been up for more than a few minutes, sees the warm percentiles continuously; cold-start cost only surfaces on the first few requests after a fresh deployment or autoscale-up event.
This is why benchmark methodology matters: a single-run-on-a-fresh-stack measurement makes P95 look 9× worse than steady-state, and presenting that as a headline would be misleading. Publishing the warm-stack numbers — clearly labelled as such — and showing the first-vs-second-run delta side-by-side is the honest framing.
What was running on the hot path
Every request flowed through inline policy enforcement before progressing — SQL-injection, PII, and prompt-injection regex patterns evaluated synchronously against the seeded system-policy library. Pattern counts vary by build, so the numbers above describe the cost of the inline-evaluation step, not a frozen pattern inventory. At 100 RPS the per-category breakdown was: 4,236 normal queries ALLOWed, 2,824 sql_injection payloads BLOCKed, 3,530 destructive-DML queries (DROP / TRUNCATE / DELETE-without-WHERE / ALTER / GRANT) BLOCKed, 1,410 pii queries ALLOWed. Zero unexpected verdicts across all 12,000 requests.
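As a quick consistency check, the per-category counts can be summed and compared against the profile's request budget (100 RPS for 120 s). The object keys below reuse the request-mix category names; treating `dangerous` as the destructive-DML category is an assumption.

```javascript
// Per-category counts at 100 RPS, as reported above.
const breakdown = {
  normal: { count: 4236, verdict: 'ALLOW' },
  sql_injection: { count: 2824, verdict: 'BLOCK' },
  dangerous: { count: 3530, verdict: 'BLOCK' }, // destructive-DML queries
  pii: { count: 1410, verdict: 'ALLOW' },
};

const total = Object.values(breakdown).reduce((sum, c) => sum + c.count, 0);
const blocked = Object.values(breakdown)
  .filter((c) => c.verdict === 'BLOCK')
  .reduce((sum, c) => sum + c.count, 0);

console.log(total);                                // 12000
console.log(total === 100 * 120);                  // true
console.log(blocked);                              // 6354
console.log(((100 * blocked) / total).toFixed(2)); // 52.95
```

Just over half of this benchmark mix is expected to be blocked, so the blocked share of a run depends heavily on the scenario mix you choose.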
Test conditions
- Duration: 30 s (light, 20 RPS), 60 s (hourly, 50 RPS), and 120 s (capacity, 100 RPS)
- Connection: HTTP/2 + TLS 1.3 on the ALB HTTPS:443 listener (ELBSecurityPolicy-TLS13-1-2-2021-06) for both in-VPC (internal scheme) and SaaS (internet-facing scheme) topologies
- Request mix: 17 test cases across normal, sql_injection, dangerous, and pii for the skip_llm=true capacity sweep; 20 test cases for the skip_llm=false realistic-load sweep once the 3 llm scenarios are added
- Test client placement: ECS Fargate task in the same VPC as the agent (true backend latency, no client-side internet hop in either topology)
- Stack warmup: numbers are second-run (warm-stack) measurements, see "Why these are second-run (warm) numbers" above
- Detection actions: SQLI_ACTION=block, SENSITIVE_DATA_ACTION=block, DANGEROUS_QUERY_ACTION=block on the agent. The default product profile (post-v6.2.0) relaxes security-sqli to warn; the benchmark overrides to block so the 100 % policy-correctness column above reflects an enforced verdict rather than a warning.
What the May 2, 2026 sweep also surfaced
- LLM-routing path saturates above ~5 RPS when the configured LLM provider reports unhealthy. The skip_llm=true capacity numbers above do not include LLM time-on-wire. A skip_llm=false realistic-load run at 20 RPS during the same sweep returned successful responses only for requests that BLOCKED at the policy layer before reaching the LLM (45 % of total = 9 of 20 scenarios, the sql_injection + dangerous categories). We treat LLM-path capacity as a separate measurement from policy-enforcement capacity for this reason; bring your own healthy LLM provider for the realistic-load number.
- Tail-latency at sustained 100 RPS stays tight on both topologies once warm: P99 46.94 ms (in-VPC) / 76.43 ms (SaaS) across 12,000 requests, both well inside the operational threshold expected of a synchronous policy-enforcement gateway.
Reproducing these numbers
The k6 script earlier on this page produces a comparable workload against a local stack. To get production-comparable numbers, run the test client inside the same VPC as the agent on a Fargate task — internet-routed RPS is dominated by network and load-balancer behavior, not by the runtime under measurement. Match the topology described above (5+10 replicas, ALB with HTTPS:443 + TLS 1.3, co-located client) and warm the stack with one discard-pass before measuring; otherwise the first-run cold-start tax dominates the tail at low RPS.
Capturing Your Own Baseline
For your own environment, generate baseline numbers and keep them with the workload definition you use. Internal sustained-load harnesses exist for staging validation by the platform team, but those are environment-specific and should not be copied blindly into public capacity commitments.
Capture at least:
- target RPS versus achieved RPS
- P50, P95, and P99 latency
- success, blocked, and unexpected-response counts
- connector and database saturation indicators when applicable
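One way to keep those fields together with the workload definition is to write a small baseline record per run. The sketch below is an assumption about shape, not an AxonFlow format: `buildBaseline`, the outcome labels, and the field names are hypothetical, and the latency values in the example are made up.

```javascript
// Summarizes one run into a baseline record covering the fields listed above.
// The input shape, outcome labels, and field names are hypothetical.
function buildBaseline(run) {
  const counts = { success: 0, blocked: 0, unexpected: 0 };
  for (const outcome of run.outcomes) {
    if (Object.prototype.hasOwnProperty.call(counts, outcome)) {
      counts[outcome] += 1;
    } else {
      counts.unexpected += 1; // anything unrecognized counts as unexpected
    }
  }
  return {
    workload: run.workload,
    targetRps: run.targetRps,
    achievedRps: run.outcomes.length / run.durationSec,
    latencyMs: run.latencyMs, // { p50, p95, p99 } as reported by your tool
    ...counts,
  };
}

// Example with made-up values.
const baseline = buildBaseline({
  workload: 'load-test.js, skip_llm=true',
  targetRps: 4,
  durationSec: 2,
  latencyMs: { p50: 6, p95: 9, p99: 25 },
  outcomes: ['success', 'success', 'blocked', 'success', 'blocked', 'timeout'],
});

console.log(JSON.stringify(baseline, null, 2));
```

Comparing `achievedRps` with `targetRps` shows whether the generator itself kept up; connector and database saturation indicators would come from your monitoring stack and can be attached to the same record.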
Design Principles
Even Distribution
Our load generator uses a ticker-based approach rather than burst patterns:
Burst Pattern (Wrong): Even Distribution (Correct):
───────────────────── ─────────────────────────────
│████████░░░░░░░░░░│ │█░█░█░█░█░█░█░█░█░█░█░█░█░│
│░░░░░░░░████████░░│ │█░█░█░█░█░█░█░█░█░█░█░█░█░│
│░░░░░░░░░░░░░░████│ │█░█░█░█░█░█░█░█░█░█░█░█░█░│
───────────────────── ─────────────────────────────
Unrealistic spikes Real client behavior
Why it matters: Burst patterns can mask performance issues that only appear under sustained load. Even distribution simulates actual production traffic patterns.
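The difference between the two patterns can be illustrated by comparing send schedules. This is a sketch of the scheduling idea only, not the internal generator; all function names are hypothetical.

```javascript
// Send-time offsets (ms from start) for an evenly paced run: one request
// every 1000 / rps milliseconds, the way a ticker would fire.
function evenSchedule(rps, durationSec) {
  const intervalMs = 1000 / rps;
  return Array.from({ length: rps * durationSec }, (_, i) => i * intervalMs);
}

// Same request count, but each second's requests all fire together at the
// start of that second.
function burstSchedule(rps, durationSec) {
  const offsets = [];
  for (let sec = 0; sec < durationSec; sec++) {
    for (let i = 0; i < rps; i++) offsets.push(sec * 1000);
  }
  return offsets;
}

// Largest number of requests scheduled within the same millisecond.
function peakSendsPerMs(offsets) {
  const perMs = new Map();
  for (const t of offsets) {
    const key = Math.floor(t);
    perMs.set(key, (perMs.get(key) || 0) + 1);
  }
  return Math.max(...perMs.values());
}

console.log(peakSendsPerMs(evenSchedule(100, 2)));  // 1
console.log(peakSendsPerMs(burstSchedule(100, 2))); // 100
```

Both schedules average 100 RPS, but the burst schedule asks the system to absorb 100 simultaneous requests and then sit idle, which measures queueing behavior rather than sustained throughput.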
Percentile Accuracy
We track latency percentiles using mathematically correct calculations:
| Percentile | Meaning |
|---|---|
| P50 | Median - 50% of requests complete faster |
| P95 | 95th percentile - 19 out of 20 requests complete faster |
| P99 | 99th percentile - 99 out of 100 requests complete faster |
Why percentiles matter:
- Averages hide outliers
- P95/P99 reveal tail latency issues
- SLAs are typically defined using percentiles
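To see why averages hide outliers, here is a minimal sketch using the nearest-rank percentile definition. The source does not say which method the AxonFlow harness uses, so nearest-rank is an assumption; other tools may interpolate between samples and report slightly different values.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort
  const rank = Math.ceil((p * sorted.length) / 100); // 1-based rank
  return sorted[Math.max(rank, 1) - 1];
}

// 19 fast requests (6 ms) and one slow outlier (900 ms).
const latencies = [...Array(19).fill(6), 900];
const mean = latencies.reduce((a, b) => a + b, 0) / latencies.length;

console.log(mean);                      // 50.7
console.log(percentile(latencies, 50)); // 6
console.log(percentile(latencies, 95)); // 6
console.log(percentile(latencies, 99)); // 900
```

The mean of 50.7 ms describes none of the 20 requests, while P50 and P99 together show that typical requests are fast and the tail contains one very slow response.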
Realistic Client Behavior
Our load generator simulates real production clients:
- Connection pooling: Reuses connections like production clients
- HTTP/2: Modern protocol with multiplexing
- TLS 1.3: Full encryption overhead included
- Keep-alive: Long-lived connections
Test Categories
Load tests are organized into categories that validate different behaviors:
1. Normal Queries
Standard requests that should succeed:
Category: normal
Expected: Success (200 OK)
Purpose: Validate happy path performance
2. Security Violation Tests
Requests that should be rejected by security policies:
Category: security
Expected: Rejection (403 Forbidden)
Purpose: Validate security rules are enforced
3. Policy Violation Tests
Requests that violate governance policies:
Category: policy
Expected: Blocking (varies by policy)
Purpose: Validate policy engine correctness
4. LLM Integration Tests
Requests that involve LLM providers:
Category: llm
Expected: Success (200 OK)
Purpose: Validate LLM routing and response handling
Metrics Collection
Prometheus Integration
Load-test results can be exported to Prometheus-compatible metrics or written to JSON for Grafana dashboards, depending on the harness you use:
# Latency quantiles (Prometheus summary)
load_test_latency_ms{quantile="0.5"} 2.4
load_test_latency_ms{quantile="0.95"} 4.8
load_test_latency_ms{quantile="0.99"} 8.2
# Request counters
load_test_requests_total{status="success"} 15000
load_test_requests_total{status="blocked"} 3500
load_test_requests_total{status="error"} 0
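If your harness does not export metrics itself, lines in this shape can be produced from a run summary. The sketch below renders the Prometheus text exposition format using the metric names shown above; `renderMetrics` and its input shape are hypothetical.

```javascript
// Renders run results in the Prometheus text exposition format, using the
// metric names shown above. The input shape is hypothetical.
function renderMetrics({ quantiles, counts }) {
  const lines = [];
  for (const [q, valueMs] of Object.entries(quantiles)) {
    lines.push(`load_test_latency_ms{quantile="${q}"} ${valueMs}`);
  }
  for (const [status, n] of Object.entries(counts)) {
    lines.push(`load_test_requests_total{status="${status}"} ${n}`);
  }
  return lines.join('\n') + '\n'; // the format expects a trailing newline
}

const text = renderMetrics({
  quantiles: { '0.5': 2.4, '0.95': 4.8, '0.99': 8.2 },
  counts: { success: 15000, blocked: 3500, error: 0 },
});

console.log(text);
```

The resulting text can be pushed to a Prometheus Pushgateway or written to a file for later import, depending on how your environment collects ad-hoc results.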
Labels and Dimensions
Results are tagged with context:
| Label | Purpose |
|---|---|
| client | Client identifier |
| test_type | Test category (normal, security, etc.) |
| environment | Target environment |
| provider | LLM provider (if applicable) |
Running Load Tests
Prerequisites
- Access to target environment
- Valid authentication credentials if your environment enforces them
- Prometheus Pushgateway if you want to export ad-hoc run results
Basic Execution
# Run the k6 script against local community mode
k6 run --env AXONFLOW_ENDPOINT=http://localhost:8080 --env AXONFLOW_CLIENT_ID=community load-test.js
Test Parameters
If you use the internal sustained-load harness, read its source and wrapper scripts first. That tooling expects a specific environment shape and is better treated as a platform-team tool than as a generic end-user CLI.
Results Interpretation
Healthy Results
✅ P50: <3ms (within target)
✅ P95: <10ms (within target)
✅ Error Rate: 0%
✅ Blocked Rate: ~20% (policy working)
Warning Signs
⚠️ P95 > 20ms: Investigate latency issues
⚠️ Error Rate > 0.1%: Check system logs
⚠️ P50 increasing: Possible degradation
⚠️ Blocked Rate 0%: Policy might not be active
Best Practices
1. Start Small
Begin with low RPS and increase gradually:
Stage 1: 10 RPS for 30s (warmup)
Stage 2: 50 RPS for 60s (baseline)
Stage 3: 100 RPS for 60s (target load)
Stage 4: 200 RPS for 60s (stress test)
2. Test Staging First
Always validate changes in staging before production:
1. Deploy to staging
2. Run load tests
3. Review metrics
4. If passing, deploy to production
5. Run lighter validation tests against production
3. Monitor During Tests
Watch these metrics during load tests:
- CPU and memory utilization
- Database connection pool
- Error rates and logs
- Latency percentiles
4. Clean Up After Tests
- Scale down test infrastructure
- Archive test results
- Document any anomalies
Next Steps
- Performance Testing Architecture - Scheduled testing infrastructure
- Testing Overview - Complete testing pyramid
From Local Work To Review
When local development becomes a shared rollout, carry the same work into:
- Runtime Surface Map for integration boundaries
- Deployment Mode Matrix for the target environment
- Security Control Matrix before sensitive data or regulated workflows enter the test path
