Architecture Review Checklist
This page is intentionally a checklist, not a scorecard. Different teams run AxonFlow in different environments, so public docs should not claim one universal AWS review score or one fixed production SLO. Instead, use this page to assess whether your deployment is ready for serious AI workloads.
Operational Excellence
Ask:
- Can you deploy Agent and Orchestrator changes safely and roll them back?
- Do you have dashboards for request rate, policy outcomes, errors, and latency?
- Can operators tell whether failures are in the Agent, Orchestrator, providers, connectors, PostgreSQL, or Redis?
- Do you have a documented path for rotating secrets and provider credentials?
Signals of a strong setup:
- versioned infrastructure and application configs
- health checks and dashboards in place
- alerting on error rate, queueing, and dependency failures
- repeatable deployment process for config and runtime changes
- documented runbooks for `/health`, `/prometheus`, provider incidents, and connector failures
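The signals above assume operators can programmatically probe each component and tell which layer failed. A minimal sketch of such a probe, assuming illustrative internal hostnames and ports (the URLs are not documented AxonFlow defaults; substitute your real Agent and Orchestrator addresses):

```python
"""Probe per-component health endpoints so operators can see which layer failed."""
import json
import urllib.error
import urllib.request

# Hypothetical endpoints -- adjust to your own deployment topology.
COMPONENTS = {
    "agent": "http://agent.internal:8080/health",
    "orchestrator": "http://orchestrator.internal:8081/health",
}

def probe(name: str, url: str, timeout: float = 2.0) -> dict:
    """Return a small status record for one component."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return {"component": name, "ok": resp.status == 200, "detail": resp.status}
    except (urllib.error.URLError, OSError) as exc:
        # Failure detail is preserved so a dashboard or runbook can surface it.
        return {"component": name, "ok": False, "detail": str(exc)}

def run_probes(components: dict) -> list:
    return [probe(name, url) for name, url in components.items()]

if __name__ == "__main__":
    print(json.dumps(run_probes(COMPONENTS), indent=2))
```

Wiring a probe like this into your alerting gives you the "which layer failed" signal the Ask questions call for, without depending on any one dashboard.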
Security
Ask:
- Is the Agent the only public-facing entry point?
- Are PostgreSQL and Redis on private connectivity only?
- Are provider, database, and connector credentials stored in a secret manager or equivalent secure store?
- Are you reviewing policy outcomes and audit records for suspicious behavior?
Signals of a strong setup:
- TLS termination and authenticated ingress
- no plaintext secrets in repo configs
- least-privilege credentials for connectors and providers
- clear separation between public and internal service boundaries
- explicit handling for `AXONFLOW_INTERNAL_SERVICE_SECRET` and provider secret rotation
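One way to enforce "no plaintext secrets in repo configs" is to load all credentials from the environment (populated by your secret manager) and refuse to start when any are missing. A sketch; `AXONFLOW_INTERNAL_SERVICE_SECRET` comes from the checklist above, while the other variable names are illustrative assumptions:

```python
import os

# Only AXONFLOW_INTERNAL_SERVICE_SECRET is named in the checklist; the
# provider and database variable names below are illustrative assumptions.
REQUIRED_SECRETS = [
    "AXONFLOW_INTERNAL_SERVICE_SECRET",
    "PROVIDER_API_KEY",
    "DATABASE_PASSWORD",
]

def check_secrets(env: dict) -> list:
    """Return the names of required secrets that are missing or empty."""
    return [name for name in REQUIRED_SECRETS if not env.get(name)]

def require_secrets() -> None:
    """Fail fast at startup rather than discovering a missing credential mid-request."""
    missing = check_secrets(dict(os.environ))
    if missing:
        raise RuntimeError("missing secrets: " + ", ".join(missing))
```

Failing at startup keeps credential problems visible in deployment logs instead of surfacing as confusing runtime errors deep in a request path.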
Reliability
Ask:
- What happens if PostgreSQL is unavailable?
- What happens if Redis is unavailable?
- Can you restart one service without losing the whole workflow path?
- Are you backing up durable state that matters to your audit or workflow history?
Signals of a strong setup:
- redundant runtime instances where needed
- monitored database health and backup posture
- dependency failure handling tested in staging
- health checks wired into your deployment platform
- a plan for what happens when an expired license degrades higher-tier features back to Community limits
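Answering "what happens if PostgreSQL or Redis is unavailable" starts with being able to detect the outage quickly and distinctly. A minimal reachability sketch using plain TCP connects; the hostnames are assumptions, and 5432 / 6379 are the PostgreSQL and Redis defaults:

```python
import socket

# Hostnames are illustrative assumptions; ports are the PostgreSQL/Redis defaults.
DEPENDENCIES = {
    "postgresql": ("db.internal", 5432),
    "redis": ("cache.internal", 6379),
}

def tcp_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """A TCP connect is a cheap liveness signal; it does not prove the service is healthy."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def dependency_status(deps: dict) -> dict:
    """Report each dependency separately so an outage is attributable, not just 'down'."""
    return {name: tcp_reachable(host, port) for name, (host, port) in deps.items()}
```

In staging, deliberately blocking one of these ports and observing the runtime's behavior is a concrete way to exercise the "dependency failure handling tested in staging" signal above.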
Performance Efficiency
Ask:
- Are you measuring request latency separately for the Agent and Orchestrator?
- Do you understand whether latency is dominated by policy checks, connector calls, or LLM providers?
- Are you sizing the Agent and Orchestrator independently?
- Are long-running workflows isolated from low-latency policy and gateway traffic?
Signals of a strong setup:
- Prometheus and dashboard coverage for request timing
- separate scaling decisions for ingress and orchestration paths
- connector and provider latencies monitored independently
- load tests for the traffic shape you actually expect
- clear understanding of whether slowness originates in policy evaluation, providers, workflows, or MCP calls
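To know whether slowness originates in policy evaluation, connectors, or providers, you need per-stage latency summaries, not a single end-to-end number. A sketch with made-up sample data; real values would come from your tracing or Prometheus histograms:

```python
import statistics

def stage_latency_summary(samples: dict) -> dict:
    """Summarise per-stage latencies (ms) so the dominant stage is obvious."""
    summary = {}
    for stage, values in samples.items():
        ordered = sorted(values)
        summary[stage] = {
            "p50": statistics.median(ordered),
            # Nearest-rank p95; fine for a sketch, use your metrics backend in practice.
            "p95": ordered[max(0, int(len(ordered) * 0.95) - 1)],
            "mean": statistics.fmean(ordered),
        }
    return summary

# Illustrative numbers only -- not measured AxonFlow figures.
samples = {
    "policy_eval": [2.1, 2.4, 2.2, 3.0, 2.5],
    "connector": [40.0, 55.0, 48.0, 60.0, 52.0],
    "llm_provider": [800.0, 950.0, 1200.0, 700.0, 900.0],
}
```

Even with toy numbers, a breakdown like this makes the scaling question concrete: if the provider stage dominates, adding Orchestrator replicas will not move end-to-end latency.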
Cost Optimization
Ask:
- Are you tracking provider usage, token spend, and request volume over time?
- Do you know which workflows create the highest connector and LLM cost?
- Are you using the right integration mode for the job instead of routing everything through the most expensive path?
- Do you have retention and logging settings that match real operational needs?
Signals of a strong setup:
- clear cost attribution by workflow, tenant, or environment
- provider-routing decisions informed by workload needs
- infrastructure sized to real traffic rather than guesses
- storage and observability retention reviewed periodically
- a deliberate decision on when Community limits are enough and when Evaluation or Enterprise becomes operationally justified
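Cost attribution by workflow only requires that usage records carry a workflow (or tenant) label and that something aggregates them. A sketch with hypothetical records; the field names are assumptions, not an AxonFlow schema:

```python
from collections import defaultdict

# Hypothetical usage records; real ones would come from your audit or usage store.
RECORDS = [
    {"workflow": "support-triage", "tokens": 12000, "cost_usd": 0.18},
    {"workflow": "support-triage", "tokens": 8000, "cost_usd": 0.12},
    {"workflow": "report-gen", "tokens": 90000, "cost_usd": 1.35},
]

def cost_by_workflow(records: list) -> dict:
    """Aggregate token and dollar spend per workflow for cost attribution."""
    totals = defaultdict(lambda: {"tokens": 0, "cost_usd": 0.0})
    for r in records:
        totals[r["workflow"]]["tokens"] += r["tokens"]
        totals[r["workflow"]]["cost_usd"] += r["cost_usd"]
    return dict(totals)
```

A periodic report built this way answers the "which workflows create the highest cost" question directly, and the same grouping works for tenants or environments.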
Community vs Enterprise Considerations
Community is enough to build and validate real governed AI applications, but a hardened production posture often requires more than the base runtime:
- broader governance workflows
- regulated-process modules
- enterprise operations tooling
- protected deployment guidance
For that reason, this checklist is useful in both community and enterprise contexts, but the remediation path often differs. Community users may harden the deployment with their own platform tooling. Enterprise customers may rely on protected operational guidance and additional modules.
Suggested Review Cadence
Use this checklist:
- before your first production launch
- after major topology changes
- before onboarding regulated or high-risk workflows
- after incidents involving providers, connectors, or policy failures
