Execution Operations Playbook
Once AxonFlow is running real multi-agent workflows, execution data stops being just a debugging aid. It becomes part of day-two operations. Operators need to know which workflow is stuck, which step failed, whether a policy blocked the run, whether a reviewer is holding it up, and how to prove what happened afterward.
This playbook explains how teams usually operate that execution surface in practice.
The Three Main Execution Surfaces
AxonFlow gives operators three complementary ways to work with executions:
- the Execution Viewer for visual inspection
- `axonctl executions` for terminal-based list, get, replay, and export workflows
- execution APIs for application dashboards, operator tooling, and automation
The core API families are:
- replay and export APIs under `/api/v1/executions`
- unified status and streaming APIs under `/api/v1/unified/executions`
The right operating model is usually to use the UI for triage, the CLI for incident handling, and the APIs for automation.
What Operators Need To Watch
The most useful execution signals are:
- workflow status and duration
- blocked or approval-pending steps
- provider failures and retry loops
- connector latency and connector error spikes
- token and cost growth by workflow or provider
- policy decisions, including redactions and block reasons
That is why execution operations usually sit alongside the Grafana Dashboard, Token Usage, and WCP Tracing and Audit surfaces.
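As a sketch of how one of these signals can be derived, the snippet below rolls token and cost totals up per workflow from raw execution records. The field names (`workflow_id`, `tokens`, `cost_usd`) are illustrative assumptions, not AxonFlow's documented schema.

```python
from collections import defaultdict

# Sample execution records; field names are assumptions for illustration.
executions = [
    {"workflow_id": "invoice-triage", "tokens": 1200, "cost_usd": 0.018},
    {"workflow_id": "invoice-triage", "tokens": 4100, "cost_usd": 0.061},
    {"workflow_id": "kb-sync", "tokens": 800, "cost_usd": 0.012},
]

def cost_by_workflow(records):
    """Roll token and cost totals up per workflow so growth stands out."""
    totals = defaultdict(lambda: {"tokens": 0, "cost_usd": 0.0})
    for rec in records:
        totals[rec["workflow_id"]]["tokens"] += rec["tokens"]
        totals[rec["workflow_id"]]["cost_usd"] += rec["cost_usd"]
    return dict(totals)

print(cost_by_workflow(executions))
```

The same aggregation works per provider by swapping the grouping key.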
The Standard Incident Loop
When a workflow misbehaves, experienced teams follow the same sequence.
1. Find The Run
Start with one of:
- `GET /api/v1/executions`
- `GET /api/v1/unified/executions`
- `axonctl executions list`
- the execution viewer UI
Use filters such as status, workflow ID, time range, or execution type to narrow the search quickly.
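A minimal sketch of building such a filtered query, using only the standard library. The base host and the filter parameter names (`status`, `workflow_id`, `since`) are assumptions; check your deployment's API documentation for the actual ones.

```python
from urllib.parse import urlencode

BASE = "https://axonflow.example.internal"  # hypothetical host

def list_executions_url(status=None, workflow_id=None, since=None, base=BASE):
    """Build a filtered query against GET /api/v1/executions.

    Parameter names here are illustrative assumptions.
    """
    params = {k: v for k, v in {
        "status": status,
        "workflow_id": workflow_id,
        "since": since,
    }.items() if v is not None}
    url = f"{base}/api/v1/executions"
    return f"{url}?{urlencode(params)}" if params else url

print(list_executions_url(status="failed", workflow_id="invoice-triage"))
```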
2. Inspect The Timeline
Then inspect:
- `GET /api/v1/executions/{id}`
- `GET /api/v1/executions/{id}/steps`
- `GET /api/v1/executions/{id}/timeline`
This is where you answer:
- which step failed or stalled?
- did a policy block it?
- did it enter `require_approval`?
- did a provider fail?
- did a connector return bad or redacted data?
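The first two questions can be answered mechanically from the steps payload. The sketch below assumes each step carries `name`, `status`, and an ISO-8601 `started_at` timestamp, which is a guess at the shape rather than the documented schema.

```python
import datetime as dt

def first_problem_step(steps, stall_after_s=300, now=None):
    """Find the first failed or stalled step in an execution's step list.

    The step shape ({'name', 'status', 'started_at'}) is an assumption.
    """
    now = now or dt.datetime.now(dt.timezone.utc)
    for step in steps:
        if step["status"] == "failed":
            return step["name"], "failed"
        if step["status"] == "running":
            started = dt.datetime.fromisoformat(step["started_at"])
            if (now - started).total_seconds() > stall_after_s:
                return step["name"], "stalled"
    return None

steps = [
    {"name": "fetch", "status": "completed", "started_at": "2025-01-01T00:00:00+00:00"},
    {"name": "summarize", "status": "running", "started_at": "2025-01-01T00:01:00+00:00"},
]
now = dt.datetime(2025, 1, 1, 0, 10, tzinfo=dt.timezone.utc)
print(first_problem_step(steps, now=now))  # → ('summarize', 'stalled')
```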
3. Decide Which Failure Class You Are In
In practice, most incidents fall into one of five buckets:
- policy block: the control plane is behaving correctly, but the request violated a rule
- approval pending: the run is waiting on a reviewer, not on infrastructure
- provider failure: upstream LLM or routing behavior is degraded
- connector failure: the orchestration is healthy, but the integration target is not
- workflow logic bug: the agent or orchestrator state machine is wrong
That classification matters because the right next action is different in each case.
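The classification above is simple enough to encode directly. In this sketch the field names (`policy_decision`, `status`, `failed_component`) are assumptions about what an execution snapshot might expose, not AxonFlow's actual schema; the bucket ordering mirrors the list above.

```python
def classify_incident(execution):
    """Map an execution snapshot to one of the five failure classes.

    Field names are illustrative assumptions.
    """
    if execution.get("policy_decision") == "blocked":
        return "policy block"
    if execution.get("status") == "pending_approval":
        return "approval pending"
    component = execution.get("failed_component")
    if component == "provider":
        return "provider failure"
    if component == "connector":
        return "connector failure"
    # Nothing external failed: suspect the agent / orchestrator itself.
    return "workflow logic bug"

print(classify_incident({"status": "pending_approval"}))  # → approval pending
```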
4. Take The Correct Action
Typical actions are:
- approve or reject the blocked step if the workflow is waiting for review
- cancel a long-running or unsafe execution through `POST /api/v1/unified/executions/{id}/cancel`
- replay or export the run for deeper debugging
- re-run the workflow from the application side once the upstream issue is fixed
The execution control plane is strongest when teams use it to shorten mean time to understanding, not only mean time to restart.
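One way to make the class-to-action mapping explicit is a small lookup table, plus a helper that builds the cancel path from the endpoint shown above. The action descriptions are a sketch of operator guidance, not prescribed runbook text.

```python
# Next-action guidance per failure class; wording is illustrative.
NEXT_ACTION = {
    "policy block": "review the policy decision; fix the request, not the infrastructure",
    "approval pending": "route to the reviewer queue; approve or reject the step",
    "provider failure": "check provider health, then replay once it recovers",
    "connector failure": "check the integration target, then re-run the workflow",
    "workflow logic bug": "export the run and debug the orchestrator state machine",
}

def cancel_path(execution_id):
    """Path for POST /api/v1/unified/executions/{id}/cancel."""
    return f"/api/v1/unified/executions/{execution_id}/cancel"

print(cancel_path("run-42"))  # → /api/v1/unified/executions/run-42/cancel
```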
Replay And Export
Replay and export are what make the execution surface operationally useful rather than just visually interesting.
Use replay when you need to understand:
- how a run reached a bad state
- which step changed the trajectory
- whether the issue is deterministic
Use export when you need:
- audit evidence
- a debugging artifact to share with another engineer
- a review package for compliance, governance, or product stakeholders
The most common tools are:
- `axonctl executions replay <id>`
- `axonctl executions export <id>`
- `GET /api/v1/executions/{id}/export`
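When an export is used as audit evidence or a shared debugging artifact, it helps to persist it with a predictable name. A minimal sketch, assuming the export payload is JSON-serializable (its actual shape is whatever the export API or `axonctl executions export` returns):

```python
import json
import pathlib
import tempfile

def save_export(execution_id, export_payload, out_dir):
    """Persist an execution export as a named audit artifact.

    Payload shape is an assumption; sort_keys keeps diffs stable for review.
    """
    out = pathlib.Path(out_dir) / f"execution-{execution_id}.json"
    out.write_text(json.dumps(export_payload, indent=2, sort_keys=True))
    return out

with tempfile.TemporaryDirectory() as d:
    path = save_export("run-42", {"status": "failed", "steps": []}, d)
    print(path.name)  # → execution-run-42.json
```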
Approval-Aware Execution Handling
Approval-driven workflows need a slightly different operating loop. A pending execution is not always broken. It may be doing exactly what the governance design intended.
Operators should distinguish between:
- a workflow waiting normally in an approval state
- a queue that is backing up because reviewer operations are weak
- a workflow that should never have required approval in the first place
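The first two cases can be separated by wait time. This sketch assumes each pending execution carries an ISO-8601 `entered_approval_at` timestamp and treats anything waiting past a threshold as reviewer backlog; the four-hour default is an illustrative choice, not an AxonFlow setting.

```python
import datetime as dt

def triage_pending(pending, slow_after_s=4 * 3600, now=None):
    """Split pending-approval runs into normal waits vs. reviewer backlog.

    The 'entered_approval_at' field and threshold are assumptions.
    """
    now = now or dt.datetime.now(dt.timezone.utc)
    normal, backlog = [], []
    for run in pending:
        entered = dt.datetime.fromisoformat(run["entered_approval_at"])
        waited = (now - entered).total_seconds()
        (backlog if waited > slow_after_s else normal).append(run["id"])
    return {"normal": normal, "backlog": backlog}

pending = [
    {"id": "a", "entered_approval_at": "2025-01-01T00:00:00+00:00"},
    {"id": "b", "entered_approval_at": "2025-01-01T05:00:00+00:00"},
]
now = dt.datetime(2025, 1, 1, 6, 0, tzinfo=dt.timezone.utc)
print(triage_pending(pending, now=now))  # → {'normal': ['b'], 'backlog': ['a']}
```

The third case, workflows that should never have required approval, is a policy-design question rather than something wait times can reveal.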
That is why execution operations and Approvals And Exception Handling Patterns belong together.
What Good Teams Automate
Once the execution surface becomes important, teams usually automate:
- live dashboards based on `/api/v1/unified/executions/{id}/stream`
- alerts for abnormal failure or pending-approval spikes
- exports for regulated review workflows
- links from app incidents directly into execution detail pages
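A spike alert of the kind listed above can be as simple as comparing the latest pending-approval count against a rolling baseline. The window, factor, and minimum-count thresholds below are illustrative defaults, not AxonFlow settings.

```python
def approval_spike(counts, window=6, factor=2.0, min_count=5):
    """Flag when the latest pending-approval count jumps above the recent average.

    counts: chronological samples, e.g. one per polling interval.
    Thresholds are illustrative assumptions.
    """
    if len(counts) < window + 1:
        return False  # not enough history for a baseline
    baseline = sum(counts[-window - 1:-1]) / window
    latest = counts[-1]
    return latest >= min_count and latest > factor * max(baseline, 1.0)

print(approval_spike([2, 3, 2, 3, 2, 3, 10]))  # → True
```

In practice the same check would run against counts sampled from the unified execution APIs on a schedule.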
That is the point where AxonFlow starts to behave like a control plane platform rather than a single-service runtime.
A Practical Maturity Curve
- early stage: engineers use the execution viewer and `axonctl` directly
- shared-team stage: dashboards and alerts are built around the unified execution APIs
- enterprise stage: execution operations are combined with approvals, audit exports, portal workflows, and broader governance processes
Each stage builds on the same execution primitives. The difference is how much operator workflow grows around them.
