Capacity Planning And Sizing

AxonFlow is a control plane, not just a single inference endpoint, and capacity planning should reflect that. The platform handles:

  • policy evaluation
  • request auditing
  • provider routing
  • workflow execution
  • MCP connector access
  • streaming and UI-style event delivery

That means sizing decisions should be driven by workflow shape, not only request count.

The Three Main Load Drivers

1. Request volume

How many governed requests hit the Agent and Orchestrator per second?

2. Execution complexity

How many steps, tools, or external systems are involved in each request?

3. Connector and provider fan-out

How many downstream calls does one user-facing request trigger?

For many teams, the biggest scaling surprise is not raw HTTP QPS. It is fan-out from multi-agent planning, workflow branching, or connector-heavy requests.
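The fan-out effect above can be sketched with simple arithmetic: downstream load is user-facing QPS multiplied by the average number of downstream calls each request triggers. The figures and function below are illustrative assumptions, not AxonFlow defaults.

```python
def downstream_qps(user_qps: float, model_calls: float, connector_calls: float) -> float:
    """Estimate total downstream calls per second for one request class.

    Each user-facing request is assumed to trigger `model_calls` provider
    calls and `connector_calls` MCP connector operations on average.
    """
    return user_qps * (model_calls + connector_calls)

# Example: 20 user requests/s, each running a 3-step plan (3 model calls)
# and touching 4 connector operations on average.
print(downstream_qps(20, model_calls=3, connector_calls=4))  # 140.0
```

A modest-looking 20 req/s of user traffic becomes 140 downstream calls per second, which is why sizing on HTTP QPS alone underestimates connector-heavy workloads.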

Community Baseline

The docs already recommend:

Minimum for local evaluation

  • 2 vCPU
  • 4 GB RAM
  • 10 GB free disk

Recommended for a shared pilot

  • 4+ vCPU
  • 8-16 GB RAM
  • persistent PostgreSQL storage
  • Prometheus and Grafana retained outside a laptop

That remains the right starting point. Once you move into shared environments, though, the question shifts from “what is the minimum?” to “what will bottleneck first?”

What Usually Bottlenecks First

Provider latency

External model calls often dominate total request time. This affects:

  • end-user latency
  • concurrency pressure
  • queue depth in workflows
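Little's Law makes the concurrency pressure concrete: average in-flight requests equal arrival rate times average latency, so slow provider calls raise concurrency even when request volume is flat. The numbers below are illustrative.

```python
def in_flight(arrival_rate_per_s: float, avg_latency_s: float) -> float:
    """Little's Law: average concurrent requests = arrival rate * latency."""
    return arrival_rate_per_s * avg_latency_s

# The same 10 req/s workload at 800 ms vs. 4 s average provider latency:
print(in_flight(10, 0.8))  # 8.0 concurrent requests
print(in_flight(10, 4.0))  # 40.0 concurrent requests
```

A 5x latency regression from a provider means 5x the in-flight requests holding connections, memory, and workflow queue slots.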

Database pressure

PostgreSQL is central to audit, workflow state, policy state, and portal-backed runtime data. Slow database behavior can make the whole platform feel degraded even when inference is healthy.

Connector fan-out

MCP-heavy systems can turn one request into many downstream queries or file operations. This amplifies:

  • latency
  • egress pressure
  • response-size risk

SSE and execution visibility

Execution streaming, dashboards, and UI listeners create connection pressure that differs from normal request-response traffic. That is why SSE limits appear explicitly in the tier model.
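Because streaming connections are long-lived, they accumulate differently from request-response traffic: steady-state open connections are roughly new streams per second times stream lifetime, plus any always-on dashboard listeners. The sketch below uses hypothetical figures, not tier limits from the docs.

```python
def open_sse_connections(streams_started_per_s: float,
                         avg_stream_seconds: float,
                         dashboard_listeners: int) -> float:
    """Rough steady-state count of open SSE connections.

    Short-lived execution streams contribute rate * lifetime; dashboard
    listeners are modeled as permanently connected.
    """
    return streams_started_per_s * avg_stream_seconds + dashboard_listeners

# 2 executions/s each streaming for ~30 s, plus 15 operator dashboards:
print(open_sse_connections(2, 30, 15))  # 75.0
```

Comparing this estimate against your tier's SSE limit tells you whether streaming, rather than raw request volume, will be the first ceiling you hit.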

Practical Sizing Questions

Ask these before choosing your deployment shape:

  1. How many concurrent executions do we expect?
  2. How many providers will be active at once?
  3. Will requests hit only one model or several models and connectors per run?
  4. Do we expect large connector payloads or high response-redaction costs?
  5. Will operators rely heavily on streaming execution views?

Monitoring Signals To Watch

Use Monitoring Overview and the built-in telemetry to watch:

  • request latency
  • provider latency
  • execution backlog
  • database health
  • blocked-request counts
  • token and cost activity
  • connector error rates
  • SSE or streaming connection pressure

If those signals trend poorly during a pilot, the answer is usually to scale the environment or simplify the workload shape before blaming the governance layer itself.
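One way to make that judgment less subjective during a pilot is to set explicit thresholds for the watch-list above and check observations against them. The metric names and limits below are hypothetical, not AxonFlow telemetry identifiers; wire them to your actual Prometheus queries.

```python
# Hypothetical pilot alert thresholds, keyed by signal name.
THRESHOLDS = {
    "provider_latency_p95_s": 5.0,
    "db_connection_utilization": 0.8,
    "connector_error_rate": 0.02,
    "sse_open_connections": 500,
}

def breached(observed: dict) -> list:
    """Return the names of signals whose observed value exceeds its threshold."""
    return [name for name, limit in THRESHOLDS.items()
            if observed.get(name, 0) > limit]

sample = {"provider_latency_p95_s": 7.2, "db_connection_utilization": 0.55}
print(breached(sample))  # ['provider_latency_p95_s']
```

If the breached list is dominated by provider latency or database signals, scale or tune those layers first; if it is empty but users still report slowness, look at workload shape before the governance layer.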

When To Move Up Tiers

Capacity planning is one of the strongest reasons to move from Community to Evaluation or Enterprise. The tier change does not only unlock more features. It also unlocks room for:

  • more providers
  • more concurrent executions
  • more plans and history
  • more approvals and simulation runs