Observability Exporters

AxonFlow's decision tracer emits OpenTelemetry spans for every policy decision. Each span carries structured attributes -- verdict, stage, latency, policy IDs, org, and tenant -- that make governance decisions observable across your existing monitoring infrastructure.

The tracer ships spans to any OTLP/gRPC collector. This page covers how to route those spans to common backends and how to generate Prometheus RED metrics from them.

OTel is opt-in

AxonFlow runs fine without OTel. When AXONFLOW_OTEL_ENDPOINT is unset (the default), the agent uses a no-op tracer and emits nothing. All configurations on this page are additive overlays.

Span Attributes

Every axonflow.decision span carries these attributes:

| Attribute | Type | Example |
| --- | --- | --- |
| decision.id | string | 01J5K... (ULID) |
| decision.stage | string | llm, tool, or agent |
| decision.verdict | string | allow, deny, or needs_approval |
| decision.policy_ids | string[] | ["p_pii_us", "p_sqli"] |
| decision.latency_ms | int64 | 7 |
| decision.reasons | string | "clean" or policy match detail |
| org.id | string | acme-prod |
| tenant.id | string | tenant-rocket |
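
When testing a pipeline that consumes these spans, it can help to check exported attributes against the table above. The following is a minimal sketch, not part of AxonFlow: the function name check_decision_attributes and the wording of the returned messages are hypothetical, and it assumes the attributes have already been decoded into a plain Python dict.

```python
from __future__ import annotations

from typing import Any

# Allowed values and attribute names are taken from the attribute table above.
STAGES = {"llm", "tool", "agent"}
VERDICTS = {"allow", "deny", "needs_approval"}
REQUIRED = (
    "decision.id", "decision.stage", "decision.verdict", "decision.policy_ids",
    "decision.latency_ms", "decision.reasons", "org.id", "tenant.id",
)


def check_decision_attributes(attrs: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the attributes look valid."""
    problems = [f"missing attribute: {key}" for key in REQUIRED if key not in attrs]

    stage = attrs.get("decision.stage")
    if stage is not None and stage not in STAGES:
        problems.append(f"unexpected decision.stage: {stage!r}")

    verdict = attrs.get("decision.verdict")
    if verdict is not None and verdict not in VERDICTS:
        problems.append(f"unexpected decision.verdict: {verdict!r}")

    policy_ids = attrs.get("decision.policy_ids")
    if policy_ids is not None and not (
        isinstance(policy_ids, list) and all(isinstance(p, str) for p in policy_ids)
    ):
        problems.append("decision.policy_ids must be a list of strings")

    latency = attrs.get("decision.latency_ms")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if latency is not None and (
        isinstance(latency, bool) or not isinstance(latency, int) or latency < 0
    ):
        problems.append("decision.latency_ms must be a non-negative integer")

    return problems
```

A check like this is most useful in integration tests that read spans back from a Collector file or debug exporter.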

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| AXONFLOW_OTEL_ENDPOINT | (empty) | OTLP/gRPC endpoint (e.g. otel-collector:4317). Empty disables tracing. |
| AXONFLOW_OTEL_SERVICE_NAME | axonflow-agent | service.name resource attribute for dashboard keying. |
| AXONFLOW_OTEL_SAMPLE_RATE | 1.0 | Head sampling ratio [0.0, 1.0]. Reduce to 0.1 in high-RPS environments. |
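
To make the semantics of these variables concrete, here is a minimal sketch of how they could be resolved and how a head sampling ratio is typically applied. It illustrates the documented defaults and is not AxonFlow's implementation: OtelSettings, load_otel_settings, and should_sample are hypothetical names, clamping out-of-range rates is an assumption, and the sampling rule mirrors the common trace-ID-ratio approach (compare the low 64 bits of the trace ID against a bound), which AxonFlow may or may not use.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping

TRACE_ID_LOW_64_MASK = (1 << 64) - 1


@dataclass(frozen=True)
class OtelSettings:
    endpoint: str
    service_name: str
    sample_rate: float

    @property
    def enabled(self) -> bool:
        # An empty endpoint means tracing is off (no-op tracer).
        return self.endpoint != ""


def load_otel_settings(env: Mapping[str, str] = os.environ) -> OtelSettings:
    """Resolve the three variables using the defaults from the table above."""
    endpoint = env.get("AXONFLOW_OTEL_ENDPOINT", "").strip()
    service_name = env.get("AXONFLOW_OTEL_SERVICE_NAME", "") or "axonflow-agent"
    rate = float(env.get("AXONFLOW_OTEL_SAMPLE_RATE", "") or "1.0")
    # Assumption: out-of-range values are clamped into [0.0, 1.0].
    rate = min(1.0, max(0.0, rate))
    return OtelSettings(endpoint, service_name, rate)


def should_sample(trace_id: int, sample_rate: float) -> bool:
    """Trace-ID-ratio head sampling: keep a trace if its low 64 bits fall under the bound."""
    bound = round(sample_rate * (1 << 64))
    return (trace_id & TRACE_ID_LOW_64_MASK) < bound
```

Because the decision depends only on the trace ID, every service that applies the same ratio keeps or drops the same traces, which is why head sampling by ratio does not fragment a trace.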

Jaeger (Built-in Overlay)

The repository ships a Jaeger overlay out of the box:

docker compose -f docker-compose.yml -f docker-compose.otel.yml up -d
open http://localhost:16686

This uses the OTel Collector to forward spans to Jaeger's OTLP receiver. See the existing OTel tracing setup for trace correlation with WCP workflows.

Datadog APM

Route decision spans to Datadog APM via the OTel Collector's Datadog exporter.

Quick Start

export DD_API_KEY=<your-datadog-api-key>
export DD_SITE=datadoghq.com # or datadoghq.eu, us3.datadoghq.com, etc.

docker compose -f docker-compose.yml \
  -f docker-compose.otel-datadog.yml up -d

What Gets Exported

The Collector config at config/otel-collector-datadog.yaml maps all decision.*, org.id, and tenant.id span attributes to Datadog APM tags. In Datadog:

  • APM > Traces: filter by service:axonflow-agent, then facet on decision.verdict, decision.stage, or tenant.id.
  • Monitors: alert on decision.verdict:deny rate exceeding a threshold.
  • Dashboards: build widgets using the decision.* tags.

Collector Config

The config uses the attributes/datadog processor to ensure all AxonFlow-specific attributes are forwarded as APM tags. The span_name_as_resource_name: true setting maps the OTel span name (axonflow.decision) to the Datadog resource name, making it filterable in the trace explorer.

Requirements

  • A Datadog account with APM enabled.
  • DD_API_KEY set as an environment variable or in a .env file.
  • DD_SITE set to your Datadog region (defaults to datadoghq.com).

Grafana Tempo + Prometheus

Route traces to Grafana Tempo and generate RED (Rate, Error, Duration) metrics via the OTel Collector's spanmetrics connector.

Quick Start

docker compose -f docker-compose.yml \
  -f docker-compose.otel-grafana.yml up -d

open http://localhost:3000 # Grafana (admin/admin)

This brings up five services:

| Service | Port | Purpose |
| --- | --- | --- |
| OTel Collector | 4317, 4318 | Receives OTLP spans, exports to Tempo + generates metrics |
| Tempo | 3200 | Trace storage and query |
| Prometheus | 9090 | Scrapes spanmetrics from Collector on :8889 |
| Grafana | 3000 | Dashboards for traces + metrics |
| AxonFlow Agent | 8080 | Sends spans to Collector |

Decision Mode Dashboard

The repository ships a pre-provisioned Grafana dashboard at grafana/dashboards/decision-mode-overview.json with nine panels:

  1. Decision Rate -- decisions per second over time.
  2. Verdict Distribution -- donut chart of allow/deny/needs_approval.
  3. Error Rate -- ratio of STATUS_CODE_ERROR spans to total.
  4. Decision Latency (P50/P95/P99) -- histogram quantiles from duration_milliseconds_bucket.
  5. Decisions by Stage -- breakdown by llm, tool, agent.
  6. Policy Trigger Rate -- stacked bar chart by verdict.
  7. Per-Tenant Decision Volume -- per-tenant.id rate.
  8. Deny Rate by Tenant -- table of deny counts per tenant over the selected range.
  9. Latency Heatmap -- heatmap of duration_milliseconds_bucket.

Template variables $org_id and $tenant_id filter all panels.
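
The latency panel (panel 4) relies on quantiles estimated from cumulative histogram buckets. As an illustration of how such an estimate works, the sketch below reimplements the linear-interpolation approach used by PromQL's histogram_quantile in plain Python. It is a simplified model for intuition only, and the bucket counts in the usage note are invented for the example, not measured data.

```python
from __future__ import annotations

import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative buckets.

    buckets holds (upper_bound_ms, cumulative_count) pairs sorted by upper
    bound, and the last pair must use +Inf as its upper bound.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last with upper bound +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan

    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, cum) in enumerate(buckets) if cum >= rank)
    if idx == len(buckets) - 1:
        # The quantile falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]

    upper, cum = buckets[idx]
    lower, prev_cum = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    if cum == prev_cum:
        return lower
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - prev_cum) / (cum - prev_cum)
```

For example, with hypothetical cumulative counts [(1, 10), (2, 30), (5, 60), (10, 100), (math.inf, 100)], the median lands in the 2 to 5 ms bucket and the P95 in the 5 to 10 ms bucket. The estimate can never be more precise than the bucket boundaries, so tight buckets around your latency target matter more than the total number of buckets.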

Spanmetrics Details

The spanmetrics connector in config/otel-collector-grafana.yaml generates three metric families from axonflow.decision spans:

| Metric | Type | Description |
| --- | --- | --- |
| calls_total | counter | Total span count, labeled by decision_verdict, decision_stage, org_id, tenant_id |
| duration_milliseconds_bucket | histogram | Span duration distribution with buckets at 1, 2, 5, 10, 25, 50, 100, 250, 500, 1000 ms |
| duration_milliseconds_sum / _count | counter | Sum and count for average latency calculation |

These are standard Prometheus metrics, queryable with PromQL and usable in any Prometheus-compatible alerting system.
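
As a starting point for your own panels or alerts, the sketch below builds a few PromQL expressions over these metrics. The helper names are hypothetical, and the span_name label is an assumption based on the spanmetrics connector's default dimensions; check the label names your Collector actually exports before relying on these queries.

```python
from __future__ import annotations

SPAN_NAME = "axonflow.decision"


def selector(metric: str, labels: dict[str, str]) -> str:
    """Render metric{k="v",...} with labels in sorted order."""
    parts = ",".join(f'{key}="{value}"' for key, value in sorted(labels.items()))
    return f"{metric}{{{parts}}}"


def decision_rate(window: str = "5m", **labels: str) -> str:
    """Decisions per second, optionally filtered by labels such as tenant_id."""
    sel = selector("calls_total", {"span_name": SPAN_NAME, **labels})
    return f"sum(rate({sel}[{window}]))"


def deny_ratio(window: str = "5m", **labels: str) -> str:
    """Share of decisions with verdict deny, suitable for a threshold alert."""
    denied = decision_rate(window, decision_verdict="deny", **labels)
    total = decision_rate(window, **labels)
    return f"{denied} / {total}"


def latency_quantile(q: float, window: str = "5m", **labels: str) -> str:
    """Latency quantile in milliseconds from the duration histogram."""
    sel = selector("duration_milliseconds_bucket", {"span_name": SPAN_NAME, **labels})
    return f"histogram_quantile({q}, sum by (le) (rate({sel}[{window}])))"
```

For instance, deny_ratio(tenant_id="tenant-rocket") yields an expression you could paste into a Prometheus alert rule with a threshold of your choosing.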

Using with Existing Prometheus

If you already run Prometheus, add the OTel Collector's metrics endpoint to your scrape config:

scrape_configs:
  - job_name: 'otel-spanmetrics'
    static_configs:
      - targets: ['otel-collector:8889']

LangSmith Trace Correlation

AxonFlow's decision tracer generates a W3C-compliant trace_id and returns it in Decision API responses when tracing is enabled (AXONFLOW_OTEL_ENDPOINT is set). If you use LangSmith for LLM observability, you can correlate AxonFlow governance decisions with LangSmith runs by attaching that trace_id to each run as metadata.

How It Works

  1. Your application calls the Decision API (POST /api/v1/decide) before forwarding a request.
  2. AxonFlow returns a trace_id in the response (32-character lowercase hex W3C format).
  3. Your application passes that trace_id as metadata to LangSmith when starting the LLM run.
  4. In LangSmith, search by trace_id to see both the LLM execution and the governance decision side by side.

Get the trace_id

The Decision API response includes trace_id directly:

curl -X POST http://localhost:8080/api/v1/decide \
  -H "Content-Type: application/json" \
  -d '{
    "stage": "llm",
    "query": "What is the user SSN?",
    "target": "gpt-4o",
    "caller_id": "support-agent"
  }'

Response:

{
  "verdict": "deny",
  "decision_id": "01J5K...",
  "trace_id": "a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6",
  "reasons": ["PII detection: SSN pattern matched"],
  "evaluated_policies": ["p_pii_us"],
  "stage": "llm"
}
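
Before attaching the trace_id to another system, it can be worth confirming that the value is usable. The sketch below checks the W3C Trace Context rules for a trace ID (32 lowercase hex characters, not all zeros). The function names are hypothetical, and treating a missing or invalid value as "no correlation available" is an assumption about how you might want to handle responses produced while tracing is disabled.

```python
from __future__ import annotations

import re

_TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")


def is_valid_trace_id(value: object) -> bool:
    """W3C Trace Context: 32 lowercase hex characters, and not all zeros."""
    return (
        isinstance(value, str)
        and _TRACE_ID_RE.fullmatch(value) is not None
        and value != "0" * 32
    )


def extract_trace_id(decision: dict) -> str | None:
    """Return the trace_id from a Decision API response, or None if unusable."""
    trace_id = decision.get("trace_id")
    return trace_id if is_valid_trace_id(trace_id) else None
```

Returning None rather than raising lets the caller proceed with the LLM call and simply skip the correlation metadata.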

Pass trace_id to LangSmith (Python)

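import requests
from langsmith import traceable

query = "What is the user SSN?"

# Ask AxonFlow for a decision before calling the LLM.
decision = requests.post(
    "http://localhost:8080/api/v1/decide",
    json={"stage": "llm", "query": query, "target": "gpt-4o", "caller_id": "support-agent"},
    timeout=10,
).json()

trace_id = decision["trace_id"]

# Attach the AxonFlow trace_id to the LangSmith run as metadata.
@traceable(metadata={"axonflow_trace_id": trace_id})
def run_llm(query: str):
    pass  # your LLM call goes here

# Only run the LLM when the decision allows it.
if decision["verdict"] == "allow":
    run_llm(query)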

Pass trace_id to LangSmith (TypeScript)

const decision = await fetch("http://localhost:8080/api/v1/decide", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    stage: "llm",
    query: "What is the user SSN?",
    target: "gpt-4o",
    caller_id: "support-agent",
  }),
}).then((r) => r.json());

// Pass trace_id to your LangSmith-instrumented function
// (tracedLlmCall stands in for your own function).
await tracedLlmCall("What is the user SSN?", {
  metadata: { axonflow_trace_id: decision.trace_id },
});

No code changes are needed on the AxonFlow side. The trace_id is emitted by the decision tracer and included in the Decision API response when AXONFLOW_OTEL_ENDPOINT is configured.