Capacity Planning And Sizing
AxonFlow is a control plane, not just a single inference endpoint, and capacity planning should reflect that. The platform handles:
- policy evaluation
- request auditing
- provider routing
- workflow execution
- MCP connector access
- streaming and UI-style event delivery
That means sizing decisions should be driven by workflow shape, not only request count.
The Three Main Load Drivers
1. Request volume
How many governed requests hit the Agent and Orchestrator per second?
2. Execution complexity
How many steps, tools, or external systems are involved in each request?
3. Connector and provider fan-out
How many downstream calls does one user-facing request trigger?
For many teams, the biggest scaling surprise is not raw HTTP QPS. It is fan-out from multi-agent planning, workflow branching, or connector-heavy requests.
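Fan-out is easy to underestimate because it multiplies rather than adds. A minimal sketch of the arithmetic, where the step and call counts are illustrative assumptions rather than AxonFlow defaults:

```python
# Rough fan-out estimate: how one user-facing request multiplies
# into downstream provider/connector calls. All inputs here are
# illustrative assumptions, not AxonFlow defaults.

def downstream_qps(user_qps: float,
                   steps_per_request: float,
                   calls_per_step: float) -> float:
    """Downstream calls per second generated by governed traffic."""
    return user_qps * steps_per_request * calls_per_step

# Example: a modest 5 req/s workload with multi-agent planning
# (4 workflow steps, each touching ~3 providers or connectors).
print(downstream_qps(5, 4, 3))  # 60 downstream calls per second
```

Even at 5 user-facing requests per second, the backend is handling an order of magnitude more downstream traffic, which is the number that actually drives sizing.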
Community Baseline
The docs already recommend:
Minimum for local evaluation
- 2 vCPU
- 4 GB RAM
- 10 GB free disk
Recommended for more serious team usage
- 4+ vCPU
- 8-16 GB RAM
- persistent PostgreSQL storage
- Prometheus and Grafana running on persistent infrastructure rather than a laptop
That remains the right starting point. But once you move into shared environments, the question shifts from "what is the minimum?" to "what will bottleneck first?"
What Usually Bottlenecks First
Provider latency
External model calls often dominate total request time. This affects:
- end-user latency
- concurrency pressure
- queue depth in workflows
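The link between provider latency and concurrency pressure follows from Little's Law: the average number of in-flight requests equals arrival rate times average latency. A short sketch, with made-up traffic numbers:

```python
# Little's Law: average in-flight requests = arrival rate x average
# latency. Slow provider calls therefore translate directly into
# concurrency and queue-depth pressure. Numbers are illustrative.

def in_flight(arrival_rate_qps: float, avg_latency_s: float) -> float:
    """Average number of requests in flight at steady state."""
    return arrival_rate_qps * avg_latency_s

# 10 req/s against a provider averaging 800 ms per call:
print(in_flight(10, 0.8))  # 8.0 requests in flight at any moment

# The same traffic against a slower 4 s reasoning-heavy model:
print(in_flight(10, 4.0))  # 40.0 in flight -> 5x the concurrency
```

Note that the request rate never changed; latency alone multiplied the concurrency the platform must hold open.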
Database pressure
PostgreSQL is central to audit, workflow state, policy state, and portal-backed runtime data. Slow database behavior can make the whole platform feel degraded even when inference is healthy.
Connector fan-out
MCP-heavy systems can turn one request into many downstream queries or file operations. This amplifies:
- latency
- egress pressure
- response-size risk
SSE and execution visibility
Execution streaming, dashboards, and UI listeners create connection pressure that differs from normal request-response traffic. That is why SSE limits appear explicitly in the tier model.
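Because SSE connections stay open for the life of a view or execution, it is worth estimating them separately from request throughput. A back-of-envelope sketch, where the operator counts and the tier limit are placeholder assumptions, not documented AxonFlow values:

```python
# Back-of-envelope SSE pressure: streaming views hold connections
# open for the life of an execution, unlike request-response calls.
# The tier limit below is a placeholder for illustration only.

def sse_connections(operators: int,
                    views_per_operator: int,
                    streaming_executions: int) -> int:
    """Long-lived connections from dashboards plus execution streams."""
    return operators * views_per_operator + streaming_executions

TIER_SSE_LIMIT = 100  # hypothetical cap, not a real tier value

open_conns = sse_connections(operators=8, views_per_operator=3,
                             streaming_executions=50)
print(open_conns, open_conns <= TIER_SSE_LIMIT)  # 74 True
```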
Practical Sizing Questions
Ask these before choosing your deployment shape:
- How many concurrent executions do we expect?
- How many providers will be active at once?
- Will requests hit only one model or several models and connectors per run?
- Do we expect large connector payloads or high response-redaction costs?
- Will operators rely heavily on streaming execution views?
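The answers to these questions can be combined into a rough sizing worksheet. Every input below is an assumption you supply for your own workload; none of the figures are AxonFlow defaults:

```python
# A sizing worksheet that turns the questions above into rough
# numbers. Every input is an assumption you supply; nothing here
# is an AxonFlow default or guarantee.

def sizing_estimate(user_qps: float, avg_latency_s: float,
                    steps: float, calls_per_step: float,
                    avg_payload_kb: float) -> dict:
    concurrent = user_qps * avg_latency_s           # Little's Law
    downstream = user_qps * steps * calls_per_step  # fan-out
    egress_kb_s = downstream * avg_payload_kb       # connector egress
    return {"concurrent_executions": concurrent,
            "downstream_qps": downstream,
            "egress_kb_per_s": egress_kb_s}

# Hypothetical pilot workload:
print(sizing_estimate(user_qps=5, avg_latency_s=2.0,
                      steps=3, calls_per_step=2, avg_payload_kb=64))
```

Comparing these outputs against your database, network, and tier limits gives a first-pass answer to "what will bottleneck first?" before any load testing.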
Monitoring Signals To Watch
Use Monitoring Overview and the built-in telemetry to watch:
- request latency
- provider latency
- execution backlog
- database health
- blocked-request counts
- token and cost activity
- connector error rates
- SSE or streaming connection pressure
If those signals trend poorly during a pilot, the answer is usually to scale the environment or simplify the workload shape before blaming the governance layer itself.
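"Trending poorly" can be made concrete with even a trivial check over pilot samples. A minimal sketch using made-up p95 latency data, not real AxonFlow telemetry:

```python
# A trivial trend check: if a watched signal (e.g. p95 request
# latency in seconds) keeps climbing across pilot samples, treat it
# as a cue to scale or simplify the workload. Sample data is made up.

def is_trending_up(samples: list[float], window: int = 3) -> bool:
    """True if the mean of the last `window` samples exceeds the
    mean of the earlier samples."""
    head, tail = samples[:-window], samples[-window:]
    return sum(tail) / len(tail) > sum(head) / len(head)

latency_p95 = [0.8, 0.9, 0.9, 1.1, 1.4, 1.7]  # hypothetical pilot data
print(is_trending_up(latency_p95))  # True -> investigate capacity
```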
When To Move Up Tiers
Capacity planning is one of the strongest reasons to move from Community to Evaluation or Enterprise. A tier change does not just unlock more features; it also unlocks room for:
- more providers
- more concurrent executions
- more plans and history
- more approvals and simulation runs
