Failure Modes And Recovery
AxonFlow is most useful when it makes AI systems more predictable. That means the docs should explain not only the happy path, but also how the control plane fails and how operators recover.
This page is the cross-system troubleshooting view for the most common failure patterns in production and pre-production environments.
Failure Mode 1: Provider Failure
Typical symptoms:
- routed requests time out or fail
- one provider becomes unhealthy while others still work
- latency spikes even though the application code did not change
What to check:
- provider credentials and configuration
- routing strategy and strict-provider settings
- provider-specific setup pages such as OpenAI or Azure OpenAI
- Provider Routing if requests are supposed to fail over
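To make the interaction between failover and strict-provider settings concrete, here is a minimal sketch of the routing idea. This is not AxonFlow's actual routing API; the function name `route_with_failover`, the `strict_provider` flag, and the provider names are all illustrative:

```python
def route_with_failover(providers, call, strict_provider=None):
    """Try providers in order; with strict_provider set, never fail over.

    providers: ordered list of provider names to try.
    call: callable that sends the request to one named provider.
    """
    candidates = [strict_provider] if strict_provider else providers
    errors = {}
    for name in candidates:
        try:
            return call(name)
        except Exception as exc:  # a timeout or provider error triggers failover
            errors[name] = exc
    raise RuntimeError(f"all providers failed: {errors}")
```

The sketch shows why a strict-provider setting turns a single unhealthy provider into a hard failure: the candidate list collapses to one entry, so there is nothing left to fail over to.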
Failure Mode 2: Connector Failure
Typical symptoms:
- MCP tool calls fail or hang
- one data source is unhealthy while the rest of the platform looks fine
- response size or query limits are hit
What to check:
- connector credentials and network reachability
- connector-specific configuration
- Runtime Configuration
- Connector Capability Matrix
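Two of these checks can be scripted before digging into connector internals: raw network reachability, and whether a response would trip a size limit. The sketch below is generic stdlib code, not an AxonFlow utility; the function names and the limit-check shape are illustrative:

```python
import socket

def check_connector(host, port, timeout=2.0):
    """Return True if the connector endpoint accepts a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def enforce_response_limit(payload, max_bytes):
    """Fail loudly when a connector response exceeds a configured size limit,
    instead of letting an oversized payload hang downstream processing."""
    if len(payload) > max_bytes:
        raise ValueError(
            f"connector response is {len(payload)} bytes, limit is {max_bytes}"
        )
    return payload
```

If the TCP check fails, the problem is network reachability or the connector process itself, and no amount of credential or configuration tuning will help until that is fixed.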
Failure Mode 3: Request Blocked By Policy
Typical symptoms:
- the runtime returns a governance denial
- the request never reaches the provider or connector
- the application reports a policy block instead of a provider error
What to check:
- whether the block happened on input or output
- current system and tenant policy coverage
- whether the correct tenant, organization, or user context reached the runtime
- Policy-as-Code and Configuring Policies
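The first check, telling a governance denial apart from a provider error, is worth automating in application code. The sketch below assumes a hypothetical response shape (the `blocked_by_policy`, `stage`, and `provider_error` fields are illustrative, not AxonFlow's actual payload):

```python
def classify_failure(response):
    """Classify a failed request so applications surface the right error.

    Assumed (hypothetical) response shape:
      blocked_by_policy: bool, True when governance denied the request
      stage: "input" or "output", where the block happened
      provider_error: present when the upstream provider failed
    """
    if response.get("blocked_by_policy"):
        stage = response.get("stage", "input")
        return f"policy_denial:{stage}"
    if response.get("provider_error"):
        return "provider_error"
    return "ok"
```

Distinguishing the two matters because the remedies differ: a policy denial is fixed in policy or tenant configuration, while a provider error is fixed in provider setup or routing.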
Failure Mode 4: Approval Stall
Typical symptoms:
- a workflow appears stuck
- execution status is pending approval
- users experience delays without a clean failure
What to check:
- whether the workflow or policy intentionally produced require_approval
- whether you are on a tier that supports the approval workflow you expect
- HITL Approval Gates and Workflow Control Plane
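Because an approval stall produces delay rather than a clean failure, callers should bound how long they wait. This is a generic polling sketch, not an AxonFlow client; `get_status` and the `"pending_approval"` status string are assumptions:

```python
import time

def wait_for_approval(get_status, deadline_s=300.0, poll_s=5.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll until the execution leaves 'pending_approval', or raise on timeout.

    get_status: callable returning the current execution status string.
    Raising TimeoutError converts a silent stall into a clean failure.
    """
    start = clock()
    while True:
        status = get_status()
        if status != "pending_approval":
            return status
        if clock() - start >= deadline_s:
            raise TimeoutError("approval still pending after deadline")
        sleep(poll_s)
```

The injected `clock` and `sleep` parameters exist so the loop is testable; in production code the defaults suffice.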
Failure Mode 5: Execution Failure
Typical symptoms:
- a MAP plan or workflow run aborts
- one step fails and the whole run stops
- execution history shows retries or partial progress
What to check:
- Execution Viewer
- provider and connector dependencies touched by the failing step
- request context such as tenant or org headers for workflow APIs
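The "retries or partial progress" symptom maps to a common execution pattern: each step is retried a bounded number of times, and the run aborts, keeping its history, when a step exhausts its retries. A minimal sketch of that pattern (not AxonFlow's actual executor; names and history format are illustrative):

```python
def run_plan(steps, max_retries=2):
    """Execute (name, fn) steps in order with bounded per-step retries.

    Returns a history of (step, attempt, outcome) tuples; on a permanent
    step failure the run aborts but the partial history is preserved.
    """
    history = []
    for name, fn in steps:
        for attempt in range(1, max_retries + 2):
            try:
                fn()
                history.append((name, attempt, "ok"))
                break
            except Exception as exc:
                if attempt > max_retries:
                    history.append((name, attempt, f"failed: {exc}"))
                    return history  # abort the run, keep partial progress
                history.append((name, attempt, "retrying"))
    return history
```

Reading execution history through this lens tells you whether a failure was transient (a step shows `retrying` then `ok`) or permanent (the history ends at a `failed` entry), which points at the dependency to inspect.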
Failure Mode 6: Deployment Or Monitoring Drift
Typical symptoms:
- services are up but dashboards are missing data
- health endpoints pass but workflows behave inconsistently
- local setups work while shared environments degrade
What to check:
- service-to-service connectivity
- Prometheus scraping and Grafana dashboards
- PostgreSQL and Redis availability
- Monitoring Overview and Deployment Troubleshooting
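A quick way to separate "services are up" from "services can reach each other" is a connectivity sweep over the backing dependencies. The sketch below is plain stdlib code; the dependency names and default ports are conventional assumptions, not AxonFlow configuration:

```python
import socket

# Hypothetical defaults: standard ports for PostgreSQL, Redis, and Prometheus.
DEFAULT_DEPS = {
    "postgres": ("127.0.0.1", 5432),
    "redis": ("127.0.0.1", 6379),
    "prometheus": ("127.0.0.1", 9090),
}

def sweep(deps=DEFAULT_DEPS, timeout=1.0):
    """Report TCP reachability for each named dependency."""
    report = {}
    for name, (host, port) in deps.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                report[name] = "reachable"
        except OSError as exc:
            report[name] = f"unreachable: {exc}"
    return report
```

Run from inside the same network segment as the services, this distinguishes the "local works, shared environment degrades" case: if the sweep passes locally but fails in the shared environment, the drift is in networking or deployment, not in AxonFlow itself.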
Recovery Mindset
The best recovery path is usually to identify which layer failed first:
- request auth or tenant context
- policy enforcement
- provider or connector dependency
- workflow execution state
- deployment or observability plumbing
That ordering makes incident triage faster because it follows the real control-plane path instead of jumping randomly between services.
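The ordering above can be encoded directly, so triage tooling or a runbook script always walks the control-plane path in the same sequence. A minimal sketch (layer names taken from the list above; the health-check inputs are assumptions):

```python
# Layers in control-plane order: check them first to last.
TRIAGE_ORDER = [
    "request auth or tenant context",
    "policy enforcement",
    "provider or connector dependency",
    "workflow execution state",
    "deployment or observability plumbing",
]

def first_failing_layer(checks):
    """Return the earliest unhealthy layer in control-plane order.

    checks: dict of layer name -> bool (True means healthy).
    Layers missing from the dict are assumed healthy; returns None
    when every layer checks out.
    """
    for layer in TRIAGE_ORDER:
        if not checks.get(layer, True):
            return layer
    return None
```

The payoff is that a failure in an early layer (say, tenant context) is reported before its downstream symptoms (a stalled workflow), so responders fix the cause rather than chasing effects.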
