Failure Modes And Recovery

AxonFlow is most useful when it makes AI systems more predictable. That means the docs should explain not only the happy path, but also how the control plane fails and how operators recover.

This page is the cross-system troubleshooting view for the most common failure patterns in production and pre-production environments.

Failure Mode 1: Provider Failure

Typical symptoms:

  • routed requests time out or fail
  • one provider becomes unhealthy while others still work
  • latency spikes even though the application code did not change

What to check:

  • provider credentials and configuration
  • routing strategy and strict-provider settings
  • provider-specific setup pages such as OpenAI or Azure OpenAI
  • Provider Routing if requests are supposed to fail over
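The failover behavior these checks cover can be sketched as pure routing logic. This is an illustrative sketch, not AxonFlow's actual routing API; the provider names, health map, and `strict_provider` parameter are assumptions:

```python
# Sketch of strict-provider vs. failover routing (names are illustrative).

def route(providers, healthy, strict_provider=None):
    """Return the provider a request should go to, or None if none is usable."""
    if strict_provider is not None:
        # Strict mode pins one provider and never fails over,
        # even when that provider is unhealthy.
        return strict_provider if healthy.get(strict_provider) else None
    for p in providers:
        # Failover mode: take the first healthy provider in routing order.
        if healthy.get(p):
            return p
    return None
```

The point of the sketch: if requests fail even though another provider is healthy, check whether a strict-provider setting is pinning traffic to the unhealthy one.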

Failure Mode 2: Connector Failure

Typical symptoms:

  • MCP tool calls fail or hang
  • one data source is unhealthy while the rest of the platform looks fine
  • response size or query limits are hit

What to check:

Failure Mode 3: Request Blocked By Policy

Typical symptoms:

  • the runtime returns a governance denial
  • the request never reaches the provider or connector
  • the application reports a policy block instead of a provider error

What to check:

  • whether the block happened on input or output
  • current system and tenant policy coverage
  • whether the correct tenant, organization, or user context reached the runtime
  • Policy-as-Code and Configuring Policies
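The first check, whether the block happened on input or output, can be read off the denial itself. A sketch assuming a hypothetical denial payload; the field names are not AxonFlow's real schema:

```python
# Hypothetical governance-denial payload; field names are assumptions.

def denial_stage(denial):
    """Classify whether a policy block happened on input or output.

    An input block means the request never reached the provider or
    connector; an output block means the upstream call succeeded but
    the response was suppressed.
    """
    stage = denial.get("stage")
    if stage in ("input", "output"):
        return stage
    # Fallback: if an upstream call was recorded, input must have passed.
    return "output" if denial.get("upstream_called") else "input"
```

Knowing the stage tells you which policy set to inspect and whether provider-side logs will even contain the request.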

Failure Mode 4: Approval Stall

Typical symptoms:

  • a workflow appears stuck
  • execution status is pending approval
  • users experience delays without a clean failure

What to check:

  • the pending approvals queue and the age of each request
  • who the approval was routed to and whether they were notified
  • approval timeout or escalation settings

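Because an approval stall produces delay rather than a clean failure, it helps to flag pending approvals by age. A sketch with an assumed record shape and threshold, not AxonFlow's actual approval schema:

```python
# Sketch of flagging stalled approvals by pending age.
# The record fields ("id", "requested_at") and threshold are illustrative.

def stalled_approvals(pending, now, max_age_seconds=3600):
    """Return IDs of approvals that have been pending longer than the threshold."""
    return [a["id"] for a in pending
            if now - a["requested_at"] >= max_age_seconds]
```

Anything this flags is a workflow that looks "stuck" to users but is actually waiting on a human.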
Failure Mode 5: Execution Failure

Typical symptoms:

  • a MAP plan or workflow run aborts
  • one step fails and the whole run stops
  • execution history shows retries or partial progress

What to check:

  • Execution Viewer
  • provider and connector dependencies touched by the failing step
  • request context such as tenant or org headers for workflow APIs
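The symptoms above (one step fails, the whole run stops, history shows retries and partial progress) follow from a fail-fast executor with per-step retries. A minimal sketch of that behavior; AxonFlow's actual MAP/workflow executor is more involved and these names are assumptions:

```python
# Minimal sketch of a fail-fast step runner with per-step retries.

def run_plan(steps, max_retries=2):
    """Run (name, fn) steps in order; abort the run on the first exhausted step.

    Returns (completed_step_names, failed_step_name_or_None).
    """
    completed = []
    for name, fn in steps:
        for attempt in range(max_retries + 1):
            try:
                fn()
                completed.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    # Retries exhausted: record partial progress and stop.
                    return completed, name
    return completed, None
```

This is why execution history can show both retries and partial progress for a single aborted run: earlier steps completed, the failing step retried, and everything after it never started.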

Failure Mode 6: Deployment Or Monitoring Drift

Typical symptoms:

  • services are up but dashboards are missing data
  • health endpoints pass but workflows behave inconsistently
  • local setups work while shared environments degrade

What to check:

  • version and configuration parity between local and shared environments
  • monitoring and metrics-export configuration feeding the dashboards
  • whether health endpoints actually exercise the failing workflow path

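When local setups work while shared environments degrade, the cause is often version or configuration drift between them. A sketch of a drift check over per-environment version maps; the component and environment names are assumptions:

```python
# Illustrative drift check: compare component versions across environments.

def find_drift(environments):
    """Return components whose version differs (or is missing) between environments.

    `environments` maps environment name -> {component: version}.
    """
    components = set()
    for versions in environments.values():
        components.update(versions.keys())
    drifted = []
    for c in sorted(components):
        # A component absent from one environment shows up as None,
        # which also counts as drift.
        seen = {env: v.get(c) for env, v in environments.items()}
        if len(set(seen.values())) > 1:
            drifted.append(c)
    return drifted
```

Running a check like this before chasing application bugs rules out the "it works on my machine" class of incidents quickly.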
Recovery Mindset

The best recovery path is usually to identify which layer failed first:

  1. request auth or tenant context
  2. policy enforcement
  3. provider or connector dependency
  4. workflow execution state
  5. deployment or observability plumbing

That ordering makes incident triage faster because it follows the real control-plane path instead of jumping randomly between services.
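The layered triage order above can be sketched as code that walks the layers in sequence and stops at the first unhealthy one. The layer names come from the list above; the health-result shape is an assumption:

```python
# The control-plane triage order, walked top to bottom.

TRIAGE_ORDER = [
    "request auth or tenant context",
    "policy enforcement",
    "provider or connector dependency",
    "workflow execution state",
    "deployment or observability plumbing",
]

def first_failing_layer(results):
    """Given {layer: healthy_bool}, return the first unhealthy layer, or None.

    Layers missing from `results` are treated as healthy.
    """
    for layer in TRIAGE_ORDER:
        if not results.get(layer, True):
            return layer
    return None
```

Stopping at the first failing layer matters because failures cascade downward: a missing tenant header can surface as a policy denial, and a policy denial can surface as a failed workflow step.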