# Load Testing Methodology

AxonFlow includes a custom load testing framework designed to validate system performance under realistic production conditions. This guide covers our load testing methodology and principles.
## Tools

AxonFlow supports load testing with several industry-standard tools:
| Tool | Best For | Installation |
|---|---|---|
| k6 (recommended) | Scriptable load tests, CI integration | `brew install k6` or k6.io |
| wrk | Simple HTTP benchmarking | `brew install wrk` |
| hey | Quick one-liner load tests | `brew install hey` |
| AxonFlow built-in | Custom framework with policy-aware test categories | Included in the repository (`./load-test`) |

**Recommendation:** Use k6 for scripted, reproducible load tests. Use the built-in framework for policy-specific validation.
## Target Metrics

These are the performance targets AxonFlow should meet under load:
| Metric | Target | Degraded | Critical |
|---|---|---|---|
| P50 latency | < 3ms | 3-5ms | > 5ms |
| P95 latency | < 10ms | 10-20ms | > 20ms |
| P99 latency | < 25ms | 25-50ms | > 50ms |
| Throughput | > 1,000 RPS per node | 500-1,000 RPS | < 500 RPS |
| Error rate | 0% | < 0.1% | > 0.1% |
| Policy block rate | ~20% (depends on test mix) | Varies | 0% (policy inactive) |
## Example k6 Load Test

The following k6 script targets the AxonFlow Agent execute endpoint with a ramp-up pattern:
```javascript
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

const errorRate = new Rate('errors');
const latency = new Trend('axonflow_latency');

export const options = {
  stages: [
    { duration: '30s', target: 10 },  // ramp up to 10 VUs
    { duration: '1m', target: 50 },   // ramp to 50 VUs
    { duration: '2m', target: 100 },  // push to 100 VUs
    { duration: '30s', target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<10', 'p(99)<25'],
    errors: ['rate<0.001'],
  },
};

const AGENT_URL = __ENV.AXONFLOW_ENDPOINT || 'http://localhost:8080';
const API_KEY = __ENV.AXONFLOW_CLIENT_ID || 'demo-org';

export default function () {
  const res = http.post(
    `${AGENT_URL}/v1/execute`,
    JSON.stringify({ prompt: 'What is 2+2?', mode: 'chat' }),
    {
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${API_KEY}`,
      },
    }
  );

  check(res, {
    'status is 200': (r) => r.status === 200,
    'latency < 10ms': (r) => r.timings.duration < 10,
  });

  errorRate.add(res.status !== 200);
  latency.add(res.timings.duration);
  sleep(0.1);
}
```

Note that `stages.target` in k6 controls virtual users (VUs), not RPS directly; the effective request rate also depends on the `sleep()` pacing and response latency.
Run with:

```shell
k6 run --env AXONFLOW_ENDPOINT=http://localhost:8080 load-test.js
```
## Baseline Results

Reference baseline from a standard deployment (2 Agent nodes, 2 Orchestrator nodes, `db.t3.medium` RDS, policy-only mode):
| RPS | P50 | P95 | P99 | Error Rate | Notes |
|---|---|---|---|---|---|
| 10 | 1.2ms | 2.8ms | 4.1ms | 0% | Warmup |
| 50 | 1.8ms | 4.2ms | 7.5ms | 0% | Normal load |
| 100 | 2.4ms | 4.8ms | 8.2ms | 0% | Target load |
| 200 | 3.1ms | 8.5ms | 15.3ms | 0% | Stress test |
| 500 | 5.2ms | 18.7ms | 42.1ms | 0.02% | Near saturation |
**Note:** These numbers are for policy-only requests (no LLM calls). End-to-end latency with LLM providers will be dominated by provider response time (typically 200ms-2s).
## Design Principles

### Even Distribution

Our load generator uses a ticker-based approach rather than burst patterns:
```text
Burst Pattern (Wrong):       Even Distribution (Correct):
─────────────────────        ─────────────────────────────
│████████░░░░░░░░░░│         │█░█░█░█░█░█░█░█░█░█░█░█░█░│
│░░░░░░░░████████░░│         │█░█░█░█░█░█░█░█░█░█░█░█░█░│
│░░░░░░░░░░░░░░████│         │█░█░█░█░█░█░█░█░█░█░█░█░█░│
─────────────────────        ─────────────────────────────
Unrealistic spikes           Real client behavior
```
**Why it matters:** Burst patterns can mask performance issues that only appear under sustained load. Even distribution simulates actual production traffic patterns.
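The ticker approach can be sketched as follows: compute one fixed inter-request gap from the target RPS and fire a request on every tick. This is a minimal illustration, not the framework's actual scheduler, and `requestSchedule` is a hypothetical helper.

```javascript
// Sketch: ticker-based request pacing. Instead of firing a burst at the top
// of each second, derive one even inter-request gap and emit send times on it.
function requestSchedule(rps, durationSeconds) {
  const intervalMs = 1000 / rps;   // even gap between consecutive requests
  const total = rps * durationSeconds;
  const schedule = [];
  for (let i = 0; i < total; i++) {
    schedule.push(i * intervalMs); // send time, in ms from test start
  }
  return schedule;
}

// 10 RPS for 2s -> 20 requests, one every 100ms
const ticks = requestSchedule(10, 2);
```

In a real generator each tick would dispatch one request on a worker; the key property is that the gaps are constant rather than clustered.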
### Percentile Accuracy

We track latency percentiles using mathematically correct calculations:

| Percentile | Meaning |
|---|---|
| P50 | Median: 50% of requests complete faster |
| P95 | 95th percentile: 95% of requests complete faster (1 in 20 is slower) |
| P99 | 99th percentile: 99% of requests complete faster (1 in 100 is slower) |
**Why percentiles matter:**
- Averages hide outliers
- P95/P99 reveal tail latency issues
- SLAs are typically defined using percentiles
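As a concrete illustration, the nearest-rank method computes a percentile directly from the sorted samples (a sketch; the framework's internal calculation may differ):

```javascript
// Nearest-rank percentile: the smallest recorded value such that at least
// p% of all samples are less than or equal to it.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank - 1, 0)];
}

const latencies = [1.2, 1.8, 2.4, 2.9, 3.1, 4.8, 5.2, 8.2, 9.9, 42.0]; // ms
const p50 = percentile(latencies, 50); // 3.1
const p99 = percentile(latencies, 99); // 42.0
// The mean of this sample is 8.15ms — the 42ms outlier that P99 exposes
// is invisible in the average.
```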
### Realistic Client Behavior

Our load generator simulates real production clients:
- Connection pooling: Reuses connections like production clients
- HTTP/2: Modern protocol with multiplexing
- TLS 1.3: Full encryption overhead included
- Keep-alive: Long-lived connections
## Test Categories

Load tests are organized into categories that validate different behaviors:

### 1. Normal Queries

Standard requests that should succeed:

- Category: `normal`
- Expected: Success (200 OK)
- Purpose: Validate happy-path performance

### 2. Security Violation Tests

Requests that should be rejected by security policies:

- Category: `security`
- Expected: Rejection (403 Forbidden)
- Purpose: Validate security rules are enforced

### 3. Policy Violation Tests

Requests that violate governance policies:

- Category: `policy`
- Expected: Blocking (varies by policy)
- Purpose: Validate policy engine correctness

### 4. LLM Integration Tests

Requests that involve LLM providers:

- Category: `llm`
- Expected: Success (200 OK)
- Purpose: Validate LLM routing and response handling
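The categories above can be validated with a small expectation table, so that a 403 on a `security` request counts as a pass rather than an error. This is an illustrative sketch; the names and statuses follow the categories listed, not a fixed framework API.

```javascript
// Map each category to its expected HTTP outcome.
const CATEGORY_EXPECTATIONS = {
  normal:   { expectStatus: 200 },
  security: { expectStatus: 403 },
  policy:   { expectBlocked: true }, // exact status varies by policy
  llm:      { expectStatus: 200 },
};

// Returns true when the observed status is the *intended* outcome for the
// category — so a blocked security probe is not counted as a failure.
function isExpectedOutcome(category, status) {
  const rule = CATEGORY_EXPECTATIONS[category];
  if (!rule) return false;
  if (rule.expectStatus !== undefined) return status === rule.expectStatus;
  return status !== 200; // blocked: any non-200 policy rejection counts
}
```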
## Metrics Collection

### Prometheus Integration

Load test results are exported in a Prometheus-compatible format:
```text
# Latency quantiles
axonflow_load_test_latency_ms{quantile="0.5"} 2.4
axonflow_load_test_latency_ms{quantile="0.95"} 4.8
axonflow_load_test_latency_ms{quantile="0.99"} 8.2

# Request counters
axonflow_load_test_requests_total{status="success"} 15000
axonflow_load_test_requests_total{status="blocked"} 3500
axonflow_load_test_requests_total{status="error"} 0
```
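Producing that output is a matter of string formatting. A minimal sketch follows; the metric names mirror the example above, and `formatMetrics` is a hypothetical helper, not the framework's real exporter.

```javascript
// Render latency quantiles and request counters as Prometheus
// exposition-format lines.
function formatMetrics(quantiles, counters) {
  const lines = [];
  for (const [q, ms] of Object.entries(quantiles)) {
    lines.push(`axonflow_load_test_latency_ms{quantile="${q}"} ${ms}`);
  }
  for (const [status, count] of Object.entries(counters)) {
    lines.push(`axonflow_load_test_requests_total{status="${status}"} ${count}`);
  }
  return lines.join('\n');
}

const text = formatMetrics(
  { '0.5': 2.4, '0.95': 4.8 },
  { success: 15000, blocked: 3500 }
);
```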
### Labels and Dimensions

Results are tagged with context:

| Label | Purpose |
|---|---|
| `client` | Client identifier |
| `test_type` | Test category (`normal`, `security`, etc.) |
| `environment` | Target environment |
| `provider` | LLM provider (if applicable) |
## Running Load Tests

### Prerequisites

- Access to the target environment
- Valid authentication credentials
- Prometheus Pushgateway (for metrics export)

### Basic Execution
```shell
# Run load test against staging
./load-test \
  --target https://staging.example.com \
  --rps 50 \
  --duration 60s \
  --category normal

# Run with metrics export
./load-test \
  --target https://staging.example.com \
  --rps 100 \
  --duration 300s \
  --pushgateway http://prometheus:9091
```
### Test Parameters

| Parameter | Description | Example |
|---|---|---|
| `--target` | Target endpoint URL | `https://api.example.com` |
| `--rps` | Requests per second | `50`, `100`, `200` |
| `--duration` | Test duration | `30s`, `5m`, `1h` |
| `--category` | Test category | `normal`, `security`, `policy` |
| `--workers` | Concurrent workers | `10`, `50` |
## Results Interpretation

### Healthy Results

- ✅ P50 < 5ms (excellent)
- ✅ P95 < 10ms (within target)
- ✅ Error rate: 0%
- ✅ Blocked rate: ~20% (policy working)

### Warning Signs

- ⚠️ P95 > 20ms: investigate latency issues
- ⚠️ Error rate > 0.1%: check system logs
- ⚠️ P50 trending upward: possible degradation
- ⚠️ Blocked rate of 0%: policy might not be active
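These checks can be automated as a gate at the end of a run. A sketch using the warning thresholds above; the function name and input shape are hypothetical.

```javascript
// Evaluate a run summary against the warning thresholds; an empty result
// means the run is healthy.
function classifyRun({ p95Ms, errorRate, blockedRate }) {
  const warnings = [];
  if (p95Ms > 20) warnings.push('P95 > 20ms: investigate latency issues');
  if (errorRate > 0.001) warnings.push('Error rate > 0.1%: check system logs');
  if (blockedRate === 0) warnings.push('Blocked rate 0%: policy might not be active');
  return warnings;
}

const result = classifyRun({ p95Ms: 4.8, errorRate: 0, blockedRate: 0.2 });
// result.length === 0 -> healthy run
```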
## Best Practices

### 1. Start Small

Begin with low RPS and increase gradually:

1. Stage 1: 10 RPS for 30s (warmup)
2. Stage 2: 50 RPS for 60s (baseline)
3. Stage 3: 100 RPS for 60s (target load)
4. Stage 4: 200 RPS for 60s (stress test)
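Expressed as k6 `options`, the staged ramp above looks like the following. As in the earlier script, `target` values are virtual users, so the actual RPS depends on per-VU pacing.

```javascript
// k6 staged ramp matching the four stages above, plus a ramp-down.
export const options = {
  stages: [
    { duration: '30s', target: 10 },  // Stage 1: warmup
    { duration: '60s', target: 50 },  // Stage 2: baseline
    { duration: '60s', target: 100 }, // Stage 3: target load
    { duration: '60s', target: 200 }, // Stage 4: stress test
    { duration: '30s', target: 0 },   // ramp down
  ],
};
```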
### 2. Test Staging First

Always validate changes in staging before production:

1. Deploy to staging
2. Run load tests
3. Review metrics
4. If passing, deploy to production
5. Run lighter validation tests
### 3. Monitor During Tests

Watch these metrics during load tests:

- CPU and memory utilization
- Database connection pool
- Error rates and logs
- Latency percentiles
### 4. Clean Up After Tests

- Scale down test infrastructure
- Archive test results
- Document any anomalies
## Next Steps

- Performance Testing Architecture - Scheduled testing infrastructure
- Testing Overview - Complete testing pyramid
