Execution Operations Playbook

Once AxonFlow is running real multi-agent workflows, execution data stops being just a debugging aid. It becomes part of day-two operations. Operators need to know which workflow is stuck, which step failed, whether a policy blocked the run, whether a reviewer is holding it up, and how to prove what happened afterward.

This playbook explains how teams usually operate that execution surface in practice.

The Three Main Execution Surfaces

AxonFlow gives operators three complementary ways to work with executions:

  • the Execution Viewer for visual inspection
  • axonctl executions for terminal-based list, get, replay, and export workflows
  • execution APIs for application dashboards, operator tooling, and automation

The core API families are:

  • replay and export APIs under /api/v1/executions
  • unified status and streaming APIs under /api/v1/unified/executions

The right operating model is usually to use the UI for triage, the CLI for incident handling, and the APIs for automation.

What Operators Need To Watch

The most useful execution signals are:

  • workflow status and duration
  • blocked or approval-pending steps
  • provider failures and retry loops
  • connector latency and connector error spikes
  • token and cost growth by workflow or provider
  • policy decisions, including redactions and block reasons

That is why execution operations usually sit next to Grafana Dashboard, Token Usage, and WCP Tracing and Audit.

The Standard Incident Loop

When a workflow misbehaves, strong teams follow the same sequence.

1. Find The Run

Start with one of:

  • GET /api/v1/executions
  • GET /api/v1/unified/executions
  • axonctl executions list
  • the execution viewer UI

Use filters such as status, workflow ID, time range, or execution type to narrow the search quickly.
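The filtering step can be sketched in plain Python. This is an illustrative model of the query filters, not the documented API schema; the record field names (`workflow_id`, `started_at`, and so on) are assumptions.

```python
from datetime import datetime, timezone

# Hypothetical execution records, shaped roughly like what a list
# endpoint might return; field names are illustrative assumptions.
executions = [
    {"id": "exec-101", "workflow_id": "wf-billing", "status": "failed",
     "started_at": datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)},
    {"id": "exec-102", "workflow_id": "wf-billing", "status": "completed",
     "started_at": datetime(2024, 5, 1, 13, 0, tzinfo=timezone.utc)},
    {"id": "exec-103", "workflow_id": "wf-onboard", "status": "failed",
     "started_at": datetime(2024, 4, 1, 9, 0, tzinfo=timezone.utc)},
]

def find_runs(records, status=None, workflow_id=None, since=None):
    """Narrow a run list the way status / workflow / time filters would."""
    hits = records
    if status is not None:
        hits = [r for r in hits if r["status"] == status]
    if workflow_id is not None:
        hits = [r for r in hits if r["workflow_id"] == workflow_id]
    if since is not None:
        hits = [r for r in hits if r["started_at"] >= since]
    return hits

recent_failures = find_runs(
    executions, status="failed",
    since=datetime(2024, 4, 15, tzinfo=timezone.utc))
print([r["id"] for r in recent_failures])  # → ['exec-101']
```

The point is to combine filters rather than scroll: one status filter plus one time bound usually cuts a large run list down to a handful of candidates.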

2. Inspect The Timeline

Then inspect:

  • GET /api/v1/executions/{id}
  • GET /api/v1/executions/{id}/steps
  • GET /api/v1/executions/{id}/timeline

This is where you answer:

  • which step failed or stalled?
  • did a policy block it?
  • did it enter require_approval?
  • did a provider fail?
  • did a connector return bad or redacted data?
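Answering those questions usually means walking the timeline until the first abnormal step. A minimal sketch, assuming a hypothetical per-step payload shape (the real `/timeline` response may differ):

```python
# Illustrative timeline entries; statuses and field names are assumptions.
timeline = [
    {"step": "fetch_customer", "status": "completed"},
    {"step": "policy_check", "status": "completed"},
    {"step": "call_provider", "status": "failed", "error": "upstream timeout"},
    {"step": "write_result", "status": "skipped"},
]

def first_problem(steps):
    """Return the first step that failed, stalled, or is blocked."""
    bad = {"failed", "blocked", "require_approval", "stalled"}
    for entry in steps:
        if entry["status"] in bad:
            return entry
    return None

problem = first_problem(timeline)
print(problem["step"], "-", problem.get("error", "no error recorded"))
# → call_provider - upstream timeout
```

Everything after the first bad step is usually noise; later steps tend to be skipped or cascading failures.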

3. Decide Which Failure Class You Are In

In practice, most incidents fall into one of five buckets:

  • policy block: the control plane is behaving correctly, but the request violated a rule
  • approval pending: the run is waiting on a reviewer, not on infrastructure
  • provider failure: upstream LLM or routing behavior is degraded
  • connector failure: the orchestration is healthy, but the integration target is not
  • workflow logic bug: the agent or orchestrator state machine is wrong

That classification matters because the right next action is different in each case.
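The five buckets can be encoded as a simple decision function. The field names here (`policy_decision`, `failed_component`) are illustrative assumptions, not the documented execution schema; the ordering matters because a policy block should win over a generic failure:

```python
def classify_incident(execution):
    """Map an execution snapshot to one of the five failure buckets.
    Field names are assumptions, not the product schema."""
    if execution.get("policy_decision") == "block":
        return "policy block"
    if execution.get("status") == "require_approval":
        return "approval pending"
    if execution.get("failed_component") == "provider":
        return "provider failure"
    if execution.get("failed_component") == "connector":
        return "connector failure"
    # Nothing external failed, so suspect the workflow itself.
    return "workflow logic bug"

print(classify_incident({"status": "require_approval"}))
# → approval pending
print(classify_incident({"status": "failed", "failed_component": "provider"}))
# → provider failure
```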

4. Take The Correct Action

Typical actions are:

  • approve or reject the blocked step if the workflow is waiting for review
  • cancel a long-running or unsafe execution through POST /api/v1/unified/executions/{id}/cancel
  • replay or export the run for deeper debugging
  • re-run the workflow from the application side once the upstream issue is fixed

The execution control plane is strongest when teams use it to shorten mean time to understanding, not only mean time to restart.

Replay And Export

Replay and export are what make the execution surface operationally useful rather than just visually interesting.

Use replay when you need to understand:

  • how a run reached a bad state
  • which step changed the trajectory
  • whether the issue is deterministic

Use export when you need:

  • audit evidence
  • a debugging artifact to share with another engineer
  • a review package for compliance, governance, or product stakeholders

The most common tools are:

  • axonctl executions replay <id>
  • axonctl executions export <id>
  • GET /api/v1/executions/{id}/export
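The determinism question above has a mechanical answer: replay the run and diff the step outcomes. A sketch, assuming a hypothetical per-step status list extracted from the original run and its replay:

```python
def is_deterministic(original_steps, replayed_steps):
    """Compare step outcomes from an original run and its replay.
    If they diverge, the failure likely depends on external state
    (provider load, connector data, timing) rather than workflow logic."""
    if len(original_steps) != len(replayed_steps):
        return False
    return all(a["status"] == b["status"]
               for a, b in zip(original_steps, replayed_steps))

original = [{"step": "plan", "status": "completed"},
            {"step": "call_provider", "status": "failed"}]
replay = [{"step": "plan", "status": "completed"},
          {"step": "call_provider", "status": "failed"}]
print(is_deterministic(original, replay))  # True → the failure reproduces
```

A reproducing failure points at workflow logic or input data; a diverging one points at the provider or connector side.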

Approval-Aware Execution Handling

Approval-driven workflows need a slightly different operating loop. A pending execution is not always broken. It may be doing exactly what the governance design intended.

Operators should distinguish between:

  • a workflow waiting normally in an approval state
  • a queue that is backing up because reviewer operations are weak
  • a workflow that should never have required approval in the first place

That is why execution operations and Approvals And Exception Handling Patterns belong together.
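The three cases above can be separated with a wait-time check against a review SLA. Both the SLA value and the field names here are assumptions to be tuned per team, not product defaults:

```python
from datetime import datetime, timedelta, timezone

REVIEW_SLA = timedelta(hours=4)  # assumed SLA; tune per team

def triage_pending(run, now):
    """Distinguish a normal approval wait from a backed-up queue
    or a run that should not have needed approval at all."""
    if not run.get("approval_required", True):
        return "should not have required approval"
    waiting = now - run["entered_approval_at"]
    if waiting <= REVIEW_SLA:
        return "waiting normally"
    return "reviewer queue backing up"

now = datetime(2024, 5, 1, 18, 0, tzinfo=timezone.utc)
run = {"entered_approval_at": datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)}
print(triage_pending(run, now))  # → reviewer queue backing up (9h > 4h SLA)
```

Only the third case calls for changing the workflow; the second calls for fixing reviewer operations, and the first calls for doing nothing.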

What Good Teams Automate

Once the execution surface becomes important, teams usually automate:

  • live dashboards based on /api/v1/unified/executions/{id}/stream
  • alerts for abnormal failure or pending-approval spikes
  • exports for regulated review workflows
  • links from app incidents directly into execution detail pages
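The spike alert is the simplest of these to automate: poll a pending-approval count and compare the newest sample to a short baseline. The thresholds below are illustrative, not product defaults:

```python
def pending_spike(counts, window=3, factor=2.0):
    """Flag a pending-approval spike: the newest sample is at least
    `factor` times the average of the previous `window` samples."""
    if len(counts) < window + 1:
        return False  # not enough history to judge
    baseline = sum(counts[-window - 1:-1]) / window
    return baseline > 0 and counts[-1] >= factor * baseline

history = [4, 5, 4, 12]  # pending-approval counts per polling interval
print(pending_spike(history))  # True: 12 >= 2 x avg(4, 5, 4)
```

The same shape works for failure-rate spikes; only the counter being polled changes.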

That is the point where AxonFlow starts to behave like a control plane platform rather than a single-service runtime.

A Practical Maturity Curve

  • early stage: engineers use the execution viewer and axonctl directly
  • shared-team stage: dashboards and alerts are built around the unified execution APIs
  • enterprise stage: execution operations are combined with approvals, audit exports, portal workflows, and broader governance processes

Each stage builds on the same execution primitives. The difference is how much operator workflow grows around them.