Failure Modes And Recovery
AxonFlow is most useful when it makes AI systems more predictable. That means the docs should explain not only the happy path, but also how the control plane fails and how operators recover.
This page is the cross-system troubleshooting view for the most common failure patterns in production and pre-production environments.
Failure Mode 1: Provider Failure
Typical symptoms:
- routed requests time out or fail
- one provider becomes unhealthy while others still work
- latency spikes even though the application code did not change
What to check:
- provider credentials and configuration
- routing strategy and strict-provider settings
- provider-specific setup pages such as OpenAI or Azure OpenAI
- Provider Routing if requests are supposed to fail over
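To make the interaction between failover and strict-provider settings concrete, here is a minimal sketch of the routing idea. This is not AxonFlow's actual routing API; the function name `route_with_failover`, the `strict_provider` flag, and the provider names are all illustrative:

```python
def route_with_failover(providers, call, strict_provider=None):
    """Try providers in order; with strict_provider set, never fail over.

    providers: ordered list of provider names to try.
    call: callable that sends the request to one named provider.
    """
    candidates = [strict_provider] if strict_provider else providers
    errors = {}
    for name in candidates:
        try:
            return call(name)
        except Exception as exc:  # a timeout or provider error triggers failover
            errors[name] = exc
    raise RuntimeError(f"all providers failed: {errors}")
```

The sketch shows why a strict-provider setting turns a single unhealthy provider into a hard failure: the candidate list collapses to one entry, so there is nothing left to fail over to.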
Failure Mode 2: Connector Failure
Typical symptoms:
- MCP tool calls fail or hang
- one data source is unhealthy while the rest of the platform looks fine
- response size or query limits are hit
What to check:
- connector credentials and network reachability
- connector-specific configuration
- Runtime Configuration
- Connector Capability Matrix
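Two of these checks can be scripted before digging into connector internals: raw network reachability, and whether a response would trip a size limit. The sketch below is generic stdlib code, not an AxonFlow utility; the function names and the limit-check shape are illustrative:

```python
import socket

def check_connector(host, port, timeout=2.0):
    """Return True if the connector endpoint accepts a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def enforce_response_limit(payload, max_bytes):
    """Fail loudly when a connector response exceeds a configured size limit,
    instead of letting an oversized payload hang downstream processing."""
    if len(payload) > max_bytes:
        raise ValueError(
            f"connector response is {len(payload)} bytes, limit is {max_bytes}"
        )
    return payload
```

If the TCP check fails, the problem is network reachability or the connector process itself, and no amount of credential or configuration tuning will help until that is fixed.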
Failure Mode 3: Request Blocked By Policy
Typical symptoms:
- the runtime returns a governance denial
- the request never reaches the provider or connector
- the application reports a policy block instead of a provider error
What to check:
- whether the block happened on input or output
- current system and tenant policy coverage
- whether the correct tenant, organization, or user context reached the runtime
- Policy-as-Code and Configuring Policies
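The first check, telling a governance denial apart from a provider error, is worth automating in application code. The sketch below assumes a hypothetical response shape (the `blocked_by_policy`, `stage`, and `provider_error` fields are illustrative, not AxonFlow's actual payload):

```python
def classify_failure(response):
    """Classify a failed request so applications surface the right error.

    Assumed (hypothetical) response shape:
      blocked_by_policy: bool, True when governance denied the request
      stage: "input" or "output", where the block happened
      provider_error: present when the upstream provider failed
    """
    if response.get("blocked_by_policy"):
        stage = response.get("stage", "input")
        return f"policy_denial:{stage}"
    if response.get("provider_error"):
        return "provider_error"
    return "ok"
```

Distinguishing the two matters because the remedies differ: a policy denial is fixed in policy or tenant configuration, while a provider error is fixed in provider setup or routing.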
Failure Mode 4: Approval Stall
Typical symptoms:
- a workflow appears stuck
- execution status is pending approval
- users experience delays without a clean failure
What to check:
- whether the workflow or policy intentionally produced require_approval
- whether you are on a tier that supports the approval workflow you expect
- HITL Approval Gates and Workflow Control Plane
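Because an approval stall produces delay rather than a clean failure, callers should bound how long they wait. This is a generic polling sketch, not an AxonFlow client; `get_status` and the `"pending_approval"` status string are assumptions:

```python
import time

def wait_for_approval(get_status, deadline_s=300.0, poll_s=5.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll until the execution leaves 'pending_approval', or raise on timeout.

    get_status: callable returning the current execution status string.
    Raising TimeoutError converts a silent stall into a clean failure.
    """
    start = clock()
    while True:
        status = get_status()
        if status != "pending_approval":
            return status
        if clock() - start >= deadline_s:
            raise TimeoutError("approval still pending after deadline")
        sleep(poll_s)
```

The injected `clock` and `sleep` parameters exist so the loop is testable; in production code the defaults suffice.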
Failure Mode 5: Execution Failure
Typical symptoms:
- a MAP plan or workflow run aborts
- one step fails and the whole run stops
- execution history shows retries or partial progress
What to check:
- Execution Viewer
- provider and connector dependencies touched by the failing step
- request context such as tenant or org headers for workflow APIs
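The "retries or partial progress" symptom maps to a common execution pattern: each step is retried a bounded number of times, and the run aborts, keeping its history, when a step exhausts its retries. A minimal sketch of that pattern (not AxonFlow's actual executor; names and history format are illustrative):

```python
def run_plan(steps, max_retries=2):
    """Execute (name, fn) steps in order with bounded per-step retries.

    Returns a history of (step, attempt, outcome) tuples; on a permanent
    step failure the run aborts but the partial history is preserved.
    """
    history = []
    for name, fn in steps:
        for attempt in range(1, max_retries + 2):
            try:
                fn()
                history.append((name, attempt, "ok"))
                break
            except Exception as exc:
                if attempt > max_retries:
                    history.append((name, attempt, f"failed: {exc}"))
                    return history  # abort the run, keep partial progress
                history.append((name, attempt, "retrying"))
    return history
```

Reading execution history through this lens tells you whether a failure was transient (a step shows `retrying` then `ok`) or permanent (the history ends at a `failed` entry), which points at the dependency to inspect.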
Failure Mode 6: Deployment Or Monitoring Drift
Typical symptoms:
- services are up but dashboards are missing data
- health endpoints pass but workflows behave inconsistently
- local setups work while shared environments degrade
What to check:
- service-to-service connectivity
- Prometheus scraping and Grafana dashboards
- PostgreSQL and Redis availability
- Monitoring Overview and Deployment Troubleshooting
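A quick way to separate "services are up" from "services can reach each other" is a connectivity sweep over the backing dependencies. The sketch below is plain stdlib code; the dependency names and default ports are conventional assumptions, not AxonFlow configuration:

```python
import socket

# Hypothetical defaults: standard ports for PostgreSQL, Redis, and Prometheus.
DEFAULT_DEPS = {
    "postgres": ("127.0.0.1", 5432),
    "redis": ("127.0.0.1", 6379),
    "prometheus": ("127.0.0.1", 9090),
}

def sweep(deps=DEFAULT_DEPS, timeout=1.0):
    """Report TCP reachability for each named dependency."""
    report = {}
    for name, (host, port) in deps.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                report[name] = "reachable"
        except OSError as exc:
            report[name] = f"unreachable: {exc}"
    return report
```

Run from inside the same network segment as the services, this distinguishes the "local works, shared environment degrades" case: if the sweep passes locally but fails in the shared environment, the drift is in networking or deployment, not in AxonFlow itself.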
Recovery Mindset
The best recovery path is usually to identify which layer failed first:
- request auth or tenant context
- policy enforcement
- provider or connector dependency
- workflow execution state
- deployment or observability plumbing
That ordering makes incident triage faster because it follows the real control-plane path instead of jumping randomly between services.
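The ordering above can be encoded directly, so triage tooling or a runbook script always walks the control-plane path in the same sequence. A minimal sketch (layer names taken from the list above; the health-check inputs are assumptions):

```python
# Layers in control-plane order: check them first to last.
TRIAGE_ORDER = [
    "request auth or tenant context",
    "policy enforcement",
    "provider or connector dependency",
    "workflow execution state",
    "deployment or observability plumbing",
]

def first_failing_layer(checks):
    """Return the earliest unhealthy layer in control-plane order.

    checks: dict of layer name -> bool (True means healthy).
    Layers missing from the dict are assumed healthy; returns None
    when every layer checks out.
    """
    for layer in TRIAGE_ORDER:
        if not checks.get(layer, True):
            return layer
    return None
```

The payoff is that a failure in an early layer (say, tenant context) is reported before its downstream symptoms (a stalled workflow), so responders fix the cause rather than chasing effects.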
