Capacity Planning And Sizing

AxonFlow is a control plane, not just a single inference endpoint, and capacity planning should reflect that. The platform handles:

  • policy evaluation
  • request auditing
  • provider routing
  • workflow execution
  • MCP connector access
  • streaming and UI-style event delivery

That means sizing decisions should be driven by workflow shape, not only request count.

The Three Main Load Drivers

1. Request volume

How many governed requests hit the Agent and Orchestrator per second?

2. Execution complexity

How many steps, tools, or external systems are involved in each request?

3. Connector and provider fan-out

How many downstream calls does one user-facing request trigger?

For many teams, the biggest scaling surprise is not raw HTTP QPS. It is fan-out from multi-agent planning, workflow branching, or connector-heavy requests.
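The fan-out effect above can be sketched with simple arithmetic: downstream load is user-facing QPS multiplied by the average number of downstream calls each request triggers. The figures and function below are illustrative assumptions, not AxonFlow defaults.

```python
def downstream_qps(user_qps: float, model_calls: float, connector_calls: float) -> float:
    """Estimate total downstream calls per second for one request class.

    Each user-facing request is assumed to trigger `model_calls` provider
    calls and `connector_calls` MCP connector operations on average.
    """
    return user_qps * (model_calls + connector_calls)

# Example: 20 user requests/s, each running a 3-step plan (3 model calls)
# and touching 4 connector operations on average.
print(downstream_qps(20, model_calls=3, connector_calls=4))  # 140.0
```

A modest-looking 20 req/s of user traffic becomes 140 downstream calls per second, which is why sizing on HTTP QPS alone underestimates connector-heavy workloads.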

Community Baseline

The docs already recommend:

Minimum for local evaluation

  • 2 vCPU
  • 4 GB RAM
  • 10 GB free disk

Recommended for a shared pilot

  • 4+ vCPU
  • 8-16 GB RAM
  • persistent PostgreSQL storage
  • Prometheus and Grafana retained outside a laptop

That remains the right starting point. Once you move into shared environments, though, the question shifts from “what is the minimum?” to “what will bottleneck first?”

What Usually Bottlenecks First

Provider latency

External model calls often dominate total request time. This affects:

  • end-user latency
  • concurrency pressure
  • queue depth in workflows
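Little's Law makes the concurrency pressure concrete: average in-flight requests equal arrival rate times average latency, so slow provider calls raise concurrency even when request volume is flat. The numbers below are illustrative.

```python
def in_flight(arrival_rate_per_s: float, avg_latency_s: float) -> float:
    """Little's Law: average concurrent requests = arrival rate * latency."""
    return arrival_rate_per_s * avg_latency_s

# The same 10 req/s workload at 800 ms vs. 4 s average provider latency:
print(in_flight(10, 0.8))  # 8.0 concurrent requests
print(in_flight(10, 4.0))  # 40.0 concurrent requests
```

A 5x latency regression from a provider means 5x the in-flight requests holding connections, memory, and workflow queue slots.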

Database pressure

PostgreSQL is central to audit, workflow state, policy state, and portal-backed runtime data. Slow database behavior can make the whole platform feel degraded even when inference is healthy.

Connector fan-out

MCP-heavy systems can turn one request into many downstream queries or file operations. This amplifies:

  • latency
  • egress pressure
  • response-size risk

SSE and execution visibility

Execution streaming, dashboards, and UI listeners create connection pressure that differs from normal request-response traffic. That is why SSE limits appear explicitly in the tier model.
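Because streaming connections are long-lived, they accumulate differently from request-response traffic: steady-state open connections are roughly new streams per second times stream lifetime, plus any always-on dashboard listeners. The sketch below uses hypothetical figures, not tier limits from the docs.

```python
def open_sse_connections(streams_started_per_s: float,
                         avg_stream_seconds: float,
                         dashboard_listeners: int) -> float:
    """Rough steady-state count of open SSE connections.

    Short-lived execution streams contribute rate * lifetime; dashboard
    listeners are modeled as permanently connected.
    """
    return streams_started_per_s * avg_stream_seconds + dashboard_listeners

# 2 executions/s each streaming for ~30 s, plus 15 operator dashboards:
print(open_sse_connections(2, 30, 15))  # 75.0
```

Comparing this estimate against your tier's SSE limit tells you whether streaming, rather than raw request volume, will be the first ceiling you hit.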

Practical Sizing Questions

Ask these before choosing your deployment shape:

  1. How many concurrent executions do we expect?
  2. How many providers will be active at once?
  3. Will requests hit only one model or several models and connectors per run?
  4. Do we expect large connector payloads or high response-redaction costs?
  5. Will operators rely heavily on streaming execution views?

Monitoring Signals To Watch

Use Monitoring Overview and the built-in telemetry to watch:

  • request latency
  • provider latency
  • execution backlog
  • database health
  • blocked-request counts
  • token and cost activity
  • connector error rates
  • SSE or streaming connection pressure

If those signals trend poorly during a pilot, the answer is usually to scale the environment or simplify the workload shape before blaming the governance layer itself.
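One way to make that judgment less subjective during a pilot is to set explicit thresholds for the watch-list above and check observations against them. The metric names and limits below are hypothetical, not AxonFlow telemetry identifiers; wire them to your actual Prometheus queries.

```python
# Hypothetical pilot alert thresholds, keyed by signal name.
THRESHOLDS = {
    "provider_latency_p95_s": 5.0,
    "db_connection_utilization": 0.8,
    "connector_error_rate": 0.02,
    "sse_open_connections": 500,
}

def breached(observed: dict) -> list:
    """Return the names of signals whose observed value exceeds its threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if observed.get(name, 0) > limit]

sample = {"provider_latency_p95_s": 7.2, "db_connection_utilization": 0.55}
print(breached(sample))  # ['provider_latency_p95_s']
```

If the breached list is dominated by provider latency or database signals, scale or tune those layers first; if it is empty but users still report slowness, look at workload shape before the governance layer.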

When To Move Up Tiers

Capacity planning is one of the strongest reasons to move from Community to Evaluation or Enterprise. The tier change does not only unlock more features. It also unlocks room for:

  • more providers
  • more concurrent executions
  • more plans and history
  • more approvals and simulation runs