Architecture Review Checklist
This page is intentionally a checklist, not a scorecard. Different teams run AxonFlow in different environments, so public docs should not claim one universal AWS review score or one fixed production SLO. Instead, use this page to assess whether your deployment is ready for serious AI workloads.
Operational Excellence
Ask:
- Can you deploy Agent and Orchestrator changes safely and roll them back?
- Do you have dashboards for request rate, policy outcomes, errors, and latency?
- Can operators tell whether failures are in the Agent, Orchestrator, providers, connectors, PostgreSQL, or Redis?
- Do you have a documented path for rotating secrets and provider credentials?
Signals of a strong setup:
- versioned infrastructure and application configs
- health checks and dashboards in place
- alerting on error rate, queueing, and dependency failures
- repeatable deployment process for config and runtime changes
- documented runbooks for `/health`, `/prometheus`, provider incidents, and connector failures
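The signals above assume operators can programmatically probe each component and tell which layer failed. A minimal sketch of such a probe, assuming illustrative internal hostnames and ports (the URLs are not documented AxonFlow defaults; substitute your real Agent and Orchestrator addresses):

```python
"""Probe per-component health endpoints so operators can see which layer failed."""
import json
import urllib.error
import urllib.request

# Hypothetical endpoints -- adjust to your own deployment topology.
COMPONENTS = {
    "agent": "http://agent.internal:8080/health",
    "orchestrator": "http://orchestrator.internal:8081/health",
}

def probe(name: str, url: str, timeout: float = 2.0) -> dict:
    """Return a small status record for one component."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"component": name, "ok": resp.status == 200, "detail": resp.status}
    except (urllib.error.URLError, OSError) as exc:
        # Failure detail is preserved so a dashboard or runbook can surface it.
        return {"component": name, "ok": False, "detail": str(exc)}

def run_probes(components: dict) -> list:
    return [probe(name, url) for name, url in components.items()]

if __name__ == "__main__":
    print(json.dumps(run_probes(COMPONENTS), indent=2))
```

Wiring a probe like this into your alerting gives you the "which layer failed" signal the Ask questions call for, without depending on any one dashboard.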
Security
Ask:
- Is the Agent the only public-facing entry point?
- Are PostgreSQL and Redis on private connectivity only?
- Are provider, database, and connector credentials stored in a secret manager or equivalent secure store?
- Are you reviewing policy outcomes and audit records for suspicious behavior?
Signals of a strong setup:
- TLS termination and authenticated ingress
- no plaintext secrets in repo configs
- least-privilege credentials for connectors and providers
- clear separation between public and internal service boundaries
- explicit handling for `AXONFLOW_INTERNAL_SERVICE_SECRET` and provider secret rotation
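One way to enforce "no plaintext secrets in repo configs" is to load all credentials from the environment (populated by your secret manager) and refuse to start when any are missing. A sketch; `AXONFLOW_INTERNAL_SERVICE_SECRET` comes from the checklist above, while the other variable names are illustrative assumptions:

```python
import os

# Only AXONFLOW_INTERNAL_SERVICE_SECRET is named in the checklist; the
# provider and database variable names below are illustrative assumptions.
REQUIRED_SECRETS = [
    "AXONFLOW_INTERNAL_SERVICE_SECRET",
    "PROVIDER_API_KEY",
    "DATABASE_PASSWORD",
]

def check_secrets(env: dict) -> list:
    """Return the names of required secrets that are missing or empty."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]

def require_secrets() -> None:
    """Fail fast at startup rather than discovering a missing credential mid-request."""
    missing = check_secrets(dict(os.environ))
    if missing:
        raise RuntimeError("missing secrets: " + ", ".join(missing))
```

Failing at startup keeps credential problems visible in deployment logs instead of surfacing as confusing runtime errors deep in a request path.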
Reliability
Ask:
- What happens if PostgreSQL is unavailable?
- What happens if Redis is unavailable?
- Can you restart one service without losing the whole workflow path?
- Are you backing up durable state that matters to your audit or workflow history?
Signals of a strong setup:
- redundant runtime instances where needed
- monitored database health and backup posture
- dependency failure handling tested in staging
- health checks wired into your deployment platform
- a plan for what happens when an expired license degrades higher-tier features back to Community limits
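Answering "what happens if PostgreSQL or Redis is unavailable" starts with being able to detect the outage quickly and distinctly. A minimal reachability sketch using plain TCP connects; the hostnames are assumptions, and 5432 / 6379 are the PostgreSQL and Redis defaults:

```python
import socket

# Hostnames are illustrative assumptions; ports are the PostgreSQL/Redis defaults.
DEPENDENCIES = {
    "postgresql": ("db.internal", 5432),
    "redis": ("cache.internal", 6379),
}

def tcp_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """A TCP connect is a cheap liveness signal; it does not prove the service is healthy."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def dependency_status(deps: dict) -> dict:
    """Report each dependency separately so an outage is attributable, not just 'down'."""
    return {name: tcp_reachable(host, port) for name, (host, port) in deps.items()}
```

In staging, deliberately blocking one of these ports and observing the runtime's behavior is a concrete way to exercise the "dependency failure handling tested in staging" signal above.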
Performance Efficiency
Ask:
- Are you measuring request latency separately for the Agent and Orchestrator?
- Do you understand whether latency is dominated by policy checks, connector calls, or LLM providers?
- Are you sizing the Agent and Orchestrator independently?
- Are long-running workflows isolated from low-latency policy and gateway traffic?
Signals of a strong setup:
- Prometheus and dashboard coverage for request timing
- separate scaling decisions for ingress and orchestration paths
- connector and provider latencies monitored independently
- load tests for the traffic shape you actually expect
- clear understanding of whether slowness originates in policy evaluation, providers, workflows, or MCP calls
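To know whether slowness originates in policy evaluation, connectors, or providers, you need per-stage latency summaries, not a single end-to-end number. A sketch with made-up sample data; real values would come from your tracing or Prometheus histograms:

```python
import statistics

def stage_latency_summary(samples: dict) -> dict:
    """Summarise per-stage latencies (ms) so the dominant stage is obvious."""
    summary = {}
    for stage, values in samples.items():
        ordered = sorted(values)
        summary[stage] = {
            "p50": statistics.median(ordered),
            # Nearest-rank p95; fine for a sketch, use your metrics backend in practice.
            "p95": ordered[max(0, int(len(ordered) * 0.95) - 1)],
            "mean": statistics.fmean(ordered),
        }
    return summary

# Illustrative numbers only -- not measured AxonFlow figures.
samples = {
    "policy_eval": [2.1, 2.4, 2.2, 3.0, 2.5],
    "connector": [40.0, 55.0, 48.0, 60.0, 52.0],
    "llm_provider": [800.0, 950.0, 1200.0, 700.0, 900.0],
}
```

Even with toy numbers, a breakdown like this makes the scaling question concrete: if the provider stage dominates, adding Orchestrator replicas will not move end-to-end latency.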
Cost Optimization
Ask:
- Are you tracking provider usage, token spend, and request volume over time?
- Do you know which workflows create the highest connector and LLM cost?
- Are you using the right integration mode for the job instead of routing everything through the most expensive path?
- Do you have retention and logging settings that match real operational needs?
Signals of a strong setup:
- clear cost attribution by workflow, tenant, or environment
- provider-routing decisions informed by workload needs
- infrastructure sized to real traffic rather than guesses
- storage and observability retention reviewed periodically
- a deliberate decision on when Community limits are enough and when Evaluation or Enterprise becomes operationally justified
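Cost attribution by workflow only requires that usage records carry a workflow (or tenant) label and that something aggregates them. A sketch with hypothetical records; the field names are assumptions, not an AxonFlow schema:

```python
from collections import defaultdict

# Hypothetical usage records; real ones would come from your audit or usage store.
RECORDS = [
    {"workflow": "support-triage", "tokens": 12000, "cost_usd": 0.18},
    {"workflow": "support-triage", "tokens": 8000, "cost_usd": 0.12},
    {"workflow": "report-gen", "tokens": 90000, "cost_usd": 1.35},
]

def cost_by_workflow(records: list) -> dict:
    """Aggregate token and dollar spend per workflow for cost attribution."""
    totals = defaultdict(lambda: {"tokens": 0, "cost_usd": 0.0})
    for r in records:
        totals[r["workflow"]]["tokens"] += r["tokens"]
        totals[r["workflow"]]["cost_usd"] += r["cost_usd"]
    return dict(totals)
```

A periodic report built this way answers the "which workflows create the highest cost" question directly, and the same grouping works for tenants or environments.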
Community vs Enterprise Considerations
Community is enough to build and validate real governed AI applications, but a hardened production posture often requires more than the base runtime:
- broader governance workflows
- regulated-process modules
- enterprise operations tooling
- protected deployment guidance
For that reason, this checklist is useful in both community and enterprise contexts, but the remediation path often differs. Community users may harden the deployment with their own platform tooling. Enterprise customers may rely on protected operational guidance and additional modules.
Suggested Review Cadence
Use this checklist:
- before your first production launch
- after major topology changes
- before onboarding regulated or high-risk workflows
- after incidents involving providers, connectors, or policy failures
