Execution Operations Playbook

Once AxonFlow is running real multi-agent workflows, execution data stops being just a debugging aid. It becomes part of day-two operations. Operators need to know which workflow is stuck, which step failed, whether a policy blocked the run, whether a reviewer is holding it up, and how to prove what happened afterward.

This playbook explains how teams usually operate that execution surface in practice.

The Three Main Execution Surfaces

AxonFlow gives operators three complementary ways to work with executions:

  • the Execution Viewer for visual inspection
  • axonctl executions for terminal-based list, get, replay, and export workflows
  • execution APIs for application dashboards, operator tooling, and automation

The core API families are:

  • replay and export APIs under /api/v1/executions
  • unified status and streaming APIs under /api/v1/unified/executions

The right operating model is usually to use the UI for triage, the CLI for incident handling, and the APIs for automation.

What Operators Need To Watch

The most useful execution signals are:

  • workflow status and duration
  • blocked or approval-pending steps
  • provider failures and retry loops
  • connector latency and connector error spikes
  • token and cost growth by workflow or provider
  • policy decisions, including redactions and block reasons

That is why execution operations usually sit next to Grafana Dashboard, Token Usage, and WCP Tracing and Audit.

The Standard Incident Loop

When a workflow misbehaves, strong teams follow the same sequence.

1. Find The Run

Start with one of:

  • GET /api/v1/executions
  • GET /api/v1/unified/executions
  • axonctl executions list
  • the execution viewer UI

Use filters such as status, workflow ID, time range, or execution type to narrow the search quickly.
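The filtering step can be sketched in plain Python. This is an illustrative model of the query filters, not the documented API schema; the record field names (`workflow_id`, `started_at`, and so on) are assumptions.

```python
from datetime import datetime, timezone

# Hypothetical execution records, shaped roughly like what a list
# endpoint might return; field names are illustrative assumptions.
executions = [
    {"id": "exec-101", "workflow_id": "wf-billing", "status": "failed",
     "started_at": datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)},
    {"id": "exec-102", "workflow_id": "wf-billing", "status": "completed",
     "started_at": datetime(2024, 5, 1, 13, 0, tzinfo=timezone.utc)},
    {"id": "exec-103", "workflow_id": "wf-onboard", "status": "failed",
     "started_at": datetime(2024, 4, 1, 9, 0, tzinfo=timezone.utc)},
]

def find_runs(records, status=None, workflow_id=None, since=None):
    """Narrow a run list the way status / workflow / time filters would."""
    hits = records
    if status is not None:
        hits = [r for r in hits if r["status"] == status]
    if workflow_id is not None:
        hits = [r for r in hits if r["workflow_id"] == workflow_id]
    if since is not None:
        hits = [r for r in hits if r["started_at"] >= since]
    return hits

recent_failures = find_runs(
    executions, status="failed",
    since=datetime(2024, 4, 15, tzinfo=timezone.utc))
print([r["id"] for r in recent_failures])  # → ['exec-101']
```

The point is to combine filters rather than scroll: one status filter plus one time bound usually cuts a large run list down to a handful of candidates.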

2. Inspect The Timeline

Then inspect:

  • GET /api/v1/executions/{id}
  • GET /api/v1/executions/{id}/steps
  • GET /api/v1/executions/{id}/timeline

This is where you answer:

  • which step failed or stalled?
  • did a policy block it?
  • did it enter require_approval?
  • did a provider fail?
  • did a connector return bad or redacted data?
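Answering those questions usually means walking the timeline until the first abnormal step. A minimal sketch, assuming a hypothetical per-step payload shape (the real `/timeline` response may differ):

```python
# Illustrative timeline entries; statuses and field names are assumptions.
timeline = [
    {"step": "fetch_customer", "status": "completed"},
    {"step": "policy_check", "status": "completed"},
    {"step": "call_provider", "status": "failed", "error": "upstream timeout"},
    {"step": "write_result", "status": "skipped"},
]

def first_problem(steps):
    """Return the first step that failed, stalled, or is blocked."""
    bad = {"failed", "blocked", "require_approval", "stalled"}
    for entry in steps:
        if entry["status"] in bad:
            return entry
    return None

problem = first_problem(timeline)
print(problem["step"], "-", problem.get("error", "no error recorded"))
# → call_provider - upstream timeout
```

Everything after the first bad step is usually noise; later steps tend to be skipped or cascading failures.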

3. Decide Which Failure Class You Are In

In practice, most incidents fall into one of five buckets:

  • policy block: the control plane is behaving correctly, but the request violated a rule
  • approval pending: the run is waiting on a reviewer, not on infrastructure
  • provider failure: upstream LLM or routing behavior is degraded
  • connector failure: the orchestration is healthy, but the integration target is not
  • workflow logic bug: the agent or orchestrator state machine is wrong

That classification matters because the right next action is different in each case.
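The five buckets can be encoded as a simple decision function. The field names here (`policy_decision`, `failed_component`) are illustrative assumptions, not the documented execution schema; the ordering matters because a policy block should win over a generic failure:

```python
def classify_incident(execution):
    """Map an execution snapshot to one of the five failure buckets.
    Field names are assumptions, not the product schema."""
    if execution.get("policy_decision") == "block":
        return "policy block"
    if execution.get("status") == "require_approval":
        return "approval pending"
    if execution.get("failed_component") == "provider":
        return "provider failure"
    if execution.get("failed_component") == "connector":
        return "connector failure"
    # Nothing external failed, so suspect the workflow itself.
    return "workflow logic bug"

print(classify_incident({"status": "require_approval"}))
# → approval pending
print(classify_incident({"status": "failed", "failed_component": "provider"}))
# → provider failure
```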

4. Take The Correct Action

Typical actions are:

  • approve or reject the blocked step if the workflow is waiting for review
  • cancel a long-running or unsafe execution through POST /api/v1/unified/executions/{id}/cancel
  • replay or export the run for deeper debugging
  • re-run the workflow from the application side once the upstream issue is fixed

The execution control plane is strongest when teams use it to shorten mean time to understanding, not only mean time to restart.

Replay And Export

Replay and export are what make the execution surface operationally useful rather than just visually interesting.

Use replay when you need to understand:

  • how a run reached a bad state
  • which step changed the trajectory
  • whether the issue is deterministic

Use export when you need:

  • audit evidence
  • a debugging artifact to share with another engineer
  • a review package for compliance, governance, or product stakeholders

The most common tools are:

  • axonctl executions replay <id>
  • axonctl executions export <id>
  • GET /api/v1/executions/{id}/export
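The determinism question above has a mechanical answer: replay the run and diff the step outcomes. A sketch, assuming a hypothetical per-step status list extracted from the original run and its replay:

```python
def is_deterministic(original_steps, replayed_steps):
    """Compare step outcomes from an original run and its replay.
    If they diverge, the failure likely depends on external state
    (provider load, connector data, timing) rather than workflow logic."""
    if len(original_steps) != len(replayed_steps):
        return False
    return all(a["status"] == b["status"]
               for a, b in zip(original_steps, replayed_steps))

original = [{"step": "plan", "status": "completed"},
            {"step": "call_provider", "status": "failed"}]
replay = [{"step": "plan", "status": "completed"},
          {"step": "call_provider", "status": "failed"}]
print(is_deterministic(original, replay))  # True → the failure reproduces
```

A reproducing failure points at workflow logic or input data; a diverging one points at the provider or connector side.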

Approval-Aware Execution Handling

Approval-driven workflows need a slightly different operating loop. A pending execution is not always broken. It may be doing exactly what the governance design intended.

Operators should distinguish between:

  • a workflow waiting normally in an approval state
  • a queue that is backing up because reviewer operations are weak
  • a workflow that should never have required approval in the first place

That is why execution operations and Approvals And Exception Handling Patterns belong together.
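The three cases above can be separated with a wait-time check against a review SLA. Both the SLA value and the field names here are assumptions to be tuned per team, not product defaults:

```python
from datetime import datetime, timedelta, timezone

REVIEW_SLA = timedelta(hours=4)  # assumed SLA; tune per team

def triage_pending(run, now):
    """Distinguish a normal approval wait from a backed-up queue
    or a run that should not have needed approval at all."""
    if not run.get("approval_required", True):
        return "should not have required approval"
    waiting = now - run["entered_approval_at"]
    if waiting <= REVIEW_SLA:
        return "waiting normally"
    return "reviewer queue backing up"

now = datetime(2024, 5, 1, 18, 0, tzinfo=timezone.utc)
run = {"entered_approval_at": datetime(2024, 5, 1, 9, 0, tzinfo=timezone.utc)}
print(triage_pending(run, now))  # → reviewer queue backing up (9h > 4h SLA)
```

Only the third case calls for changing the workflow; the second calls for fixing reviewer operations, and the first calls for doing nothing.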

What Good Teams Automate

Once the execution surface becomes important, teams usually automate:

  • live dashboards based on /api/v1/unified/executions/{id}/stream
  • alerts for abnormal failure or pending-approval spikes
  • exports for regulated review workflows
  • links from app incidents directly into execution detail pages
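The spike alert is the simplest of these to automate: poll a pending-approval count and compare the newest sample to a short baseline. The thresholds below are illustrative, not product defaults:

```python
def pending_spike(counts, window=3, factor=2.0):
    """Flag a pending-approval spike: the newest sample is at least
    `factor` times the average of the previous `window` samples."""
    if len(counts) < window + 1:
        return False  # not enough history to judge
    baseline = sum(counts[-window - 1:-1]) / window
    return baseline > 0 and counts[-1] >= factor * baseline

history = [4, 5, 4, 12]  # pending-approval counts per polling interval
print(pending_spike(history))  # True: 12 >= 2 x avg(4, 5, 4)
```

The same shape works for failure-rate spikes; only the counter being polled changes.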

That is the point where AxonFlow starts to behave like a control plane platform rather than a single-service runtime.

A Practical Maturity Curve

  • early stage: engineers use the execution viewer and axonctl directly
  • shared-team stage: dashboards and alerts are built around the unified execution APIs
  • enterprise stage: execution operations are combined with approvals, audit exports, portal workflows, and broader governance processes

Each stage builds on the same execution primitives. The difference is how much operator workflow grows around them.