AI Agent Observability and Evaluation Blueprint: Tracing, Testing, and Monitoring Multi-Step Workflows (2026)

A 30/60/90 execution blueprint for instrumenting, evaluating, and monitoring multi-step AI agent workflows in production through structured tracing, offline and online evaluation pipelines, and feedback-driven quality loops.

TL;DR for Engineering Leaders

Agent observability is the practice of capturing structured trace data at every decision point in a multi-step AI workflow - planning, tool selection, action execution, and result processing - so that teams can reconstruct what happened, why it happened, and what it cost. This is distinct from traditional application performance monitoring (APM), which tracks request-response pairs and infrastructure health but cannot answer whether an agent completed a task correctly or spent three times the expected token budget doing so. Teams that treat agent observability as a monitoring add-on rather than a purpose-built discipline will miss the failure modes that matter most: silent tool-call retries, reasoning drift across long chains, and cost amplification that only surfaces in monthly billing.

The operational gap is stark. According to the LangChain State of Agent Engineering survey (1,300+ respondents), 89% of organizations have implemented some form of observability for their agents, yet only 52.4% run offline evaluations on test sets. That 37-point gap means most production agents have visibility into what is happening but no regression detection for whether quality is degrading. Closing that gap requires three capabilities built in sequence: structured tracing (days 0-30), offline evaluation with CI/CD integration (days 31-60), and online evaluation with production feedback loops (days 61-90).

This blueprint provides the execution plan, metric framework, failure mode catalog, and team-level checklists for building all three capabilities. It is written for teams that already have agents in production or within one quarter of production deployment, and it assumes familiarity with distributed tracing concepts and CI/CD pipeline design.

Executive Context

The market signal is clear: organizations are deploying AI agents faster than they are building the infrastructure to measure whether those agents work. Gartner predicts that over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear business value, or inadequate risk controls. The common thread across these cancellation risks is measurement failure - teams cannot demonstrate that agents deliver value because they lack the instrumentation to prove it.

Among organizations with agents already in production, the numbers improve but still reveal gaps. The LangChain survey shows 94% of production teams have some form of observability and 71.5% have full tracing that allows inspection of individual agent steps and tool calls. These are encouraging adoption numbers, but tracing alone does not answer the question leadership cares about: is agent quality stable, improving, or degrading over time? Only evaluation pipelines - both offline against curated test sets and online against live production traffic - can answer that question with evidence.

The cost dimension adds urgency. Multi-agent coordination introduces multiplicative cost patterns where, in naive configurations, each agent receives accumulated conversation history, driving costs that scale with both agent count and conversation length. Teams that monitor token usage per request but not cost per completed task will miss amplification patterns until they appear as billing anomalies. StackAuthority's analysis of production agent deployments suggests that cost-per-task tracking from day one is the single most effective early-warning system for the cost escalation that drives project cancellation.

This blueprint addresses the full instrumentation-to-evaluation pipeline. It does not cover agent framework selection, prompt engineering, or model fine-tuning. Those are design-time concerns. This is an operations-time blueprint for teams that need to know whether their agents are working, how much they cost, and whether quality is holding steady.

Why Agent Observability Is Not Application Monitoring

Agent observability is the discipline of capturing, storing, and analyzing structured trace data from multi-step AI workflows where each step involves a decision - which tool to call, what parameters to use, whether to retry, when to stop. Traditional APM captures request latency, error rates, and infrastructure metrics at the service boundary. Agent observability captures the internal reasoning and action chain within a single user-facing task, where a request may trigger dozens of LLM invocations, tool calls, and intermediate decisions before producing an output.

The difference is structural, not cosmetic. A conventional web service handles a request and returns a response; the quality of that response is determined by the code path it followed, which is deterministic and testable with standard integration tests. An agent handles a task and may follow different tool-call sequences, retry failed operations, adjust its approach based on intermediate results, and produce outputs that require semantic evaluation rather than schema validation. APM can tell you that an agent request took 12 seconds and returned HTTP 200; it cannot tell you that the agent called the wrong API twice, recovered on the third attempt, and returned a plausible but incorrect answer.

The span model for agents differs from service-level tracing in important ways. OpenTelemetry semantic conventions for generative AI define standardized attributes for model parameters, response metadata, token usage, and agent operation spans. These conventions cover spans for individual model operations and agent operations, events for inputs and outputs, and metrics for request volume, latency, and token counts. An emerging proposal in the OpenTelemetry community extends these conventions specifically to agentic systems with multi-step workflows, though this proposal has not yet been ratified. Teams building trace infrastructure today should adopt the existing GenAI semantic conventions and watch the agentic extension proposal for future alignment.

The evaluation dimension is what separates agent observability from application monitoring entirely. Agent evaluation measures performance across every decision point - planning, tool selection, action execution, and result processing - rather than measuring a single response against an expected output. This requires assertion types that APM frameworks do not provide: semantic similarity scoring, tool-call sequence correctness, cost-per-task thresholds, and reasoning coherence across chain steps. Without these evaluation capabilities, teams have monitoring without quality measurement.

Teams that attempt to build agent observability on top of existing APM tooling will encounter three friction points that usually force a rebuild. First, APM span models do not capture LLM-specific metadata (model ID, token counts, prompt content) without custom instrumentation that defeats the purpose of using an existing tool. Second, APM dashboards answer infrastructure questions (is the service up? is latency acceptable?) rather than task questions (did the agent complete the task? was the answer correct?). Third, APM alerting thresholds are based on statistical patterns in request metrics, not on evaluation scores that require semantic judgment. These gaps do not mean APM is useless for agent systems - infrastructure monitoring still matters - but they mean APM alone is insufficient.

Scope and Non-Scope

In scope

This blueprint covers trace instrumentation for multi-step agent workflows, offline evaluation pipeline design and CI/CD integration, online evaluation and production quality monitoring, cost-per-task tracking and cost regression detection, feedback loops between production failures and evaluation datasets, and metric frameworks for measuring agent reliability and quality over time.

These capabilities form a single operating system for agent quality. Implementing tracing without evaluation gives visibility without accountability. Implementing evaluation without production feedback gives test coverage without drift detection. The value compounds only when all three layers operate together.

Out of scope

Out of scope are agent framework selection and architecture design, prompt engineering and model fine-tuning, security controls for agent tool execution (covered in the LLM Runtime Security Blueprint), and sector-specific compliance requirements for AI systems.

Out-of-scope does not mean unimportant. Security controls and observability share the tracing layer - both need structured span data from agent operations. The difference is purpose: security tracing asks whether the agent violated policy, while quality tracing asks whether the agent completed the task correctly. Teams building both should design a shared trace infrastructure with separate analysis pipelines.

Methodology Snapshot

This blueprint follows suitability-based guidance principles. Recommendations are framed around organizational context, team capability, and deployment maturity rather than absolute claims. The evaluation criteria, metric thresholds, and timeline targets in this blueprint should be adapted to each organization's agent complexity, risk tolerance, and operational maturity. For the full methodology behind StackAuthority's editorial approach, see Methodology.

Reference Architecture

Trace Pipeline

[Agent Framework]
      |
      v
[OTel-Compatible Trace Exporter]
      |
      +-- one span per: agent step, tool call, LLM invocation
      |-- attributes: model ID, token counts, tool parameters, step outcome
      |-- events: prompt input, model output, tool response
      |
      v
[Trace Storage]
      |
      +-- query by: workflow type, time range, outcome, cost range, user ID
      |
      v
[Query and Analysis Layer]
      |
      +-- platforms such as Langfuse, Arize Phoenix, Braintrust, or custom

The trace pipeline is the foundation layer. Every subsequent capability - evaluation, monitoring, cost tracking, feedback loops - depends on structured trace data being available and queryable. The span-per-step model means each agent operation emits its own span with parent-child relationships that reconstruct the full decision chain. An agent task that involves three LLM calls, two tool invocations, and one retry produces seven spans linked by trace context, not one aggregate request span.

Trace storage must support both real-time querying (for debugging production issues) and batch analysis (for evaluation pipeline execution). Tools such as Langfuse provide open-source LLM observability with agent graph visualization and OpenTelemetry-native ingestion. Arize Phoenix offers agent-level observability with visualization of agent behavior including prompts, tools, memory, routing, and LLM outputs. Datadog natively supports OpenTelemetry GenAI semantic conventions for LLM observability, which matters for teams already invested in that ecosystem. The choice of trace storage should be driven by existing infrastructure, query requirements, and whether the team needs self-hosted or managed deployment.

Token usage and cost attribution must be captured at the span level, not aggregated at the request level. A single agent task may invoke multiple models at different price points, call tools with varying latency costs, and retry operations that multiply both token and compute expenses. Per-span cost attribution is what makes cost-per-task tracking possible in later phases.
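As a minimal sketch, the span-per-step model and per-span cost attribution could look like the following. The field names here are illustrative, not a specific platform's schema; a real deployment would emit these through an OpenTelemetry SDK:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentSpan:
    # One span per agent step, tool call, or LLM invocation.
    span_id: str
    trace_id: str
    parent_id: Optional[str]   # links spans into the full decision chain
    kind: str                  # "agent_step", "llm_call", or "tool_call"
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0      # attributed at the span, never the request

def cost_per_task(spans: list[AgentSpan], trace_id: str) -> float:
    """Cost-per-task is the sum of per-span cost across one task trace."""
    return sum(s.cost_usd for s in spans if s.trace_id == trace_id)

# A task with three LLM calls, two tool calls, and one retry -> seven spans.
spans = [
    AgentSpan("s1", "t1", None, "agent_step"),
    AgentSpan("s2", "t1", "s1", "llm_call", 900, 150, 0.012),
    AgentSpan("s3", "t1", "s1", "tool_call"),
    AgentSpan("s4", "t1", "s1", "tool_call"),  # retry of s3: its own span
    AgentSpan("s5", "t1", "s1", "llm_call", 1200, 200, 0.016),
    AgentSpan("s6", "t1", "s1", "tool_call"),
    AgentSpan("s7", "t1", "s1", "llm_call", 800, 120, 0.010),
]
total = cost_per_task(spans, "t1")  # 0.038
```

Because the retry is its own span rather than a hidden attribute of its parent, the delta between first-attempt cost and total cost stays queryable.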

Evaluation Pipeline

[Golden Datasets]                [Agent Code Change (PR/Merge)]
      |                                    |
      +-- per workflow type                |
      |-- versioned alongside code         |
      |                                    |
      +------------------+-----------------+
                         |
                         v
                  [Eval Runner (CI/CD)]
                         |
                         v
                  [Assertion Engine]
                         |
         +---------------+---------------+
         |               |               |
         v               v               v
   [Structural]    [Semantic]      [Cost/Behavioral]
   output schema   answer relevance   tool-call sequence
   field presence  factual accuracy   cost-per-task threshold
   type correctness coherence score   retry count limits
         |               |               |
         +---------------+---------------+
                         |
                         v
                  [Pass/Fail Gate]
                         |
              +----------+----------+
              |                     |
              v                     v
        [Deploy Pipeline]    [Block + Report Regression]

The evaluation pipeline converts trace data and agent outputs into quality measurements that can gate deployment. Golden datasets are curated collections of inputs, expected outcomes, and evaluation criteria per workflow type. A golden dataset is the source of truth for measuring quality across the AI lifecycle - it defines what correct looks like for each workflow the agent handles.

The assertion engine runs four types of checks. Structural assertions verify output schema correctness and required field presence. Semantic assertions score answer relevance and factual accuracy using LLM-as-judge or embedding-similarity methods. Behavioral assertions verify tool-call sequence correctness - did the agent call the right tools in the right order? Cost assertions verify that cost-per-task and token usage stay within configured thresholds. All four assertion types must pass for a deployment to proceed.
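A skeletal version of the four assertion types might look like the following. The semantic scorer is passed in as a callable because it requires an LLM judge or embedding model; everything else here (field names, thresholds) is a hypothetical illustration:

```python
from typing import Callable

def structural_check(output: dict, required_fields: list[str]) -> bool:
    # Schema-level assertion: required fields present and non-null.
    return all(output.get(f) is not None for f in required_fields)

def behavioral_check(actual_tools: list[str], expected_tools: list[str]) -> bool:
    # Tool-call sequence assertion: right tools, right order.
    return actual_tools == expected_tools

def cost_check(cost_usd: float, max_cost_usd: float) -> bool:
    # Cost assertion: cost-per-task within the configured threshold.
    return cost_usd <= max_cost_usd

def run_assertions(case: dict, result: dict,
                   semantic_scorer: Callable[[str, dict], float],
                   min_semantic: float = 0.7) -> dict:
    """All four assertion types must pass for a deployment to proceed."""
    return {
        "structural": structural_check(result["output"], case["required_fields"]),
        "semantic": semantic_scorer(result["output"]["answer"], case) >= min_semantic,
        "behavioral": behavioral_check(result["tool_calls"], case["expected_tools"]),
        "cost": cost_check(result["cost_usd"], case["max_cost_usd"]),
    }
```

The gate then reduces this dict: any False blocks deployment.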

Integration with CI/CD means the eval runner executes on every pull request or merge that touches agent code, prompt templates, or model configuration. This catches regressions before they reach production. Teams that run evaluations only manually or on a schedule will discover regressions after users do.

Production Monitoring and Feedback Loop

[Production Traffic]
      |
      v
[Sampled Traces] (configurable rate, start 5-10%)
      |
      v
[Online Eval Scorer]
      |
      +-- task success/failure classification
      |-- output quality scoring
      |-- tool-call correctness on sampled traces
      |
      v
[Dashboard + Alerting]
      |
      +-- task success rate by workflow type
      |-- cost trends by workflow and step
      |-- error categorization and trend
      |-- latency distributions (p50/p95/p99)
      |
      v
[Divergence Detector]
      |
      +-- compares online eval scores vs. offline eval baselines
      |
      v
[Feedback Loop]
      |
      +-- failed production traces reviewed weekly
      |-- confirmed failures added to golden datasets
      |-- eval thresholds adjusted based on production signal
      |
      v
[Updated Golden Datasets] --> [Eval Runner] (cycle continues)

The production monitoring layer closes the gap between test-time quality and runtime quality. Online evaluation applies the same scoring logic used in offline evaluation to a configurable sample of live production traces. The sample rate should start at 5-10% and adjust based on cost and signal value - higher rates give faster signal but increase evaluation compute costs.

The divergence detector is critical infrastructure. When offline evaluation scores are stable but online scores drop, it means the golden datasets no longer represent production traffic patterns. When online scores are stable but offline scores drop after a code change, it means the change introduces a regression on known test cases. Both signals require different responses: divergence in the first direction triggers golden dataset updates, divergence in the second direction blocks deployment.
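The divergence logic reduces to a small comparison per workflow type. A sketch, assuming pass rates on a 0-100 scale and the 10-point threshold used elsewhere in this blueprint (function and label names are illustrative):

```python
def classify_divergence(offline_pass_rate: float, online_pass_rate: float,
                        threshold_pts: float = 10.0) -> str:
    """Compare offline vs. online eval pass rates for one workflow type."""
    gap = offline_pass_rate - online_pass_rate
    if gap > threshold_pts:
        # Offline stable, online lower: golden datasets drifted from production.
        return "update_golden_datasets"
    if gap < -threshold_pts:
        # Online stable, offline lower: a change regressed on known test cases.
        return "block_deployment"
    return "within_tolerance"
```

In practice the gap should be evaluated over a sustained window (e.g. one week of weekly comparisons), not on a single run.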

The feedback loop is what separates teams that improve agent quality over time from teams that fight the same failures repeatedly. Production failures feed back into golden datasets, which means the offline evaluation pipeline becomes more comprehensive with every production incident. This creates a quality flywheel: more production experience produces more evaluation coverage, which catches more regressions before production, which reduces production failures. Without this loop, golden datasets become stale and evaluation pipelines provide false confidence.

Metric Framework

Task completion rate
  • Definition: Percentage of agent tasks that produce a correct, complete output
  • Measurement method: Online eval scoring on sampled traces; production outcome classification
  • What good looks like: Above 90% for well-defined workflows; above 75% for exploratory tasks
  • Degradation signal: Drop of more than 5 percentage points over a rolling 7-day window

Tool-call success rate
  • Definition: Percentage of tool invocations that return a valid result on first attempt
  • Measurement method: Span-level tool-call outcome tracking
  • What good looks like: Above 95% for stable APIs; above 85% for external services
  • Degradation signal: Drop below 80% or sustained increase in retry rate

Cost-per-task
  • Definition: Total token and compute cost for one complete agent task execution
  • Measurement method: Sum of per-span cost attribution across all spans in a task trace
  • What good looks like: Within 1.5x of baseline for each workflow type
  • Degradation signal: Any workflow exceeding 2x baseline cost sustained over 24 hours

Reasoning coherence score
  • Definition: Quality of agent reasoning across multi-step chains
  • Measurement method: LLM-as-judge scoring on sampled traces; per-step output quality annotation
  • What good looks like: Above 0.8 on a 0-1 scale for production workflows
  • Degradation signal: Negative correlation between chain length and coherence score

Retry rate
  • Definition: Percentage of agent steps that required at least one retry
  • Measurement method: Span-level retry count tracking
  • What good looks like: Below 10% of total steps across all workflows
  • Degradation signal: Above 15% or concentrated retries in a single tool or step type

Eval-production divergence
  • Definition: Gap between offline eval scores and online eval scores for the same workflow type
  • Measurement method: Comparison of offline eval pass rate vs. online eval pass rate per workflow
  • What good looks like: Within 5 percentage points
  • Degradation signal: Divergence exceeding 10 percentage points sustained over 1 week

Latency-per-step
  • Definition: Wall-clock time for individual agent steps (LLM calls, tool calls)
  • Measurement method: Span duration tracking with p50/p95/p99 breakdowns
  • What good looks like: p95 below 5 seconds for LLM calls; p95 below 2 seconds for tool calls
  • Degradation signal: p95 exceeding 2x baseline for any step type

Days 0-30: Instrument and Define

Goal

Establish trace infrastructure, define evaluation metrics, and build a failure taxonomy that maps agent failure modes to observable signals.

Workstream A: Trace instrumentation

Deploy distributed tracing with a span-per-step model where each agent step, tool call, and LLM invocation produces its own span with parent-child relationships linking them into a complete task trace. Instrument token usage and cost attribution at the span level so that cost-per-task can be computed by summing across all spans in a trace. Implement trace context propagation across agent chains so that multi-step workflows produce coherent traces even when steps cross service boundaries.

Trace storage must support filtering by workflow type, time range, outcome status, and cost range within minutes of trace ingestion. Teams that defer query capability until after instrumentation is complete lose the debugging value of early traces when instrumentation itself introduces bugs. Start with a small number of representative workflows (3-5) rather than instrumenting everything simultaneously.

The span naming convention is an architectural decision, not a cosmetic one. Inconsistent span names create cardinality problems in trace storage and make dashboards unusable within weeks. Define a naming schema before the first span is emitted: agent.{workflow_type}.{step_type}.{operation} is a common pattern that balances specificity with queryability.
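Enforcing the convention in code is cheap insurance against cardinality drift. A minimal helper, assuming lowercase snake_case segments (the exact character policy is a team choice):

```python
import re

# agent.{workflow_type}.{step_type}.{operation}, lowercase snake_case segments.
NAME_PATTERN = re.compile(
    r"^agent\.[a-z0-9_]+\.[a-z0-9_]+\.[a-z0-9_]+$"
)

def span_name(workflow_type: str, step_type: str, operation: str) -> str:
    """Build a span name and reject anything that violates the schema."""
    name = f"agent.{workflow_type}.{step_type}.{operation}"
    if not NAME_PATTERN.match(name):
        raise ValueError(f"span name violates convention: {name}")
    return name
```

Calling this helper at every span-emission site turns the naming document into an enforced invariant rather than a code-review hope.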

Workstream B: Metric definition

Define evaluation metrics across four dimensions: quality (task completion rate, reasoning coherence), reliability (tool-call success rate, retry rate), cost (cost-per-task, token usage per step), and latency (latency-per-step at p50/p95/p99). Each metric needs an owner, a measurement method, a baseline value established from production traffic, and a degradation threshold that triggers investigation.

Establish baseline measurements on 3-5 representative production workflows before defining thresholds. Baselines set from test environments will underestimate production variance and produce noisy alerts. Run baselines for at least two weeks to capture weekly traffic patterns.

Document metric ownership explicitly: which team reviews which metric, at what cadence, and what action is expected when a threshold is breached. Metrics without owners become dashboard decorations.

Workstream C: Failure taxonomy

Categorize agent failure modes into operational categories: tool-call failures (silent retries, wrong tool selection, parameter errors), reasoning failures (drift, hallucination, scope escape), cost failures (retry amplification, unnecessary tool calls, verbose chains), and infrastructure failures (timeout, rate limiting, context window overflow). Map each failure mode to the trace signals that would detect it.

Define alert thresholds for critical failure modes based on business impact, not technical severity. A tool-call failure that silently degrades output quality but returns HTTP 200 may matter more than a timeout that returns an explicit error; prioritize by user impact rather than by how loudly the failure presents itself.

This taxonomy will evolve as production experience accumulates. Treat the day-30 version as a starting framework, not a complete catalog. The feedback loop in days 61-90 will add failure modes discovered in production.

Exit criteria

  • All agent steps emit structured spans with tool-call metadata, token counts, and cost attribution
  • Metric definitions documented and baselined on at least 3 live production workflows
  • Failure taxonomy with trace-signal mapping complete and reviewed by platform and application teams
  • Trace data queryable within minutes for any production workflow run
  • Span naming convention documented and enforced in code review

Treat day-30 exit as an instrumentation-readiness decision. If traces are incomplete, metrics lack baselines, or the failure taxonomy exists only as a document without trace-signal mapping, the evaluation pipeline built in days 31-60 will produce unreliable results.

Days 31-60: Build Offline Evaluation and CI/CD Integration

Goal

Build an automated evaluation pipeline that detects quality and cost regressions before code reaches production.

Workstream A: Golden dataset construction

Build curated test sets for each major workflow type with a minimum of 50 test cases per workflow. Each test case should include the input (user request or trigger), expected output characteristics (not a single correct answer, but evaluation criteria), expected tool-call sequence (which tools should be called, in what order), and cost bounds (maximum acceptable cost-per-task for this input). Include edge cases: multi-tool chains, long-running workflows, ambiguous inputs, and adversarial inputs that test guardrail behavior.
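The test-case structure described above could be captured in a schema like the following. Field names are hypothetical; the point is that each case carries evaluation criteria and cost bounds, not a single golden answer:

```python
from dataclasses import dataclass, field

@dataclass
class GoldenTestCase:
    case_id: str
    input: str                   # user request or trigger
    eval_criteria: list[str]     # expected output characteristics
    expected_tools: list[str]    # tools the agent should call, in order
    max_cost_usd: float          # maximum acceptable cost-per-task for this input
    tags: list[str] = field(default_factory=list)  # e.g. ["edge_case", "adversarial"]

def validate_dataset(cases: list[GoldenTestCase], minimum: int = 50) -> None:
    """Enforce the per-workflow minimum before the dataset is usable in CI."""
    if len(cases) < minimum:
        raise ValueError(f"dataset has {len(cases)} cases; minimum is {minimum}")
```

The validator makes the 50-case floor a hard check rather than a guideline that erodes over time.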

Version golden datasets alongside agent code in the same repository. Dataset drift is as dangerous as code regression - a test set that no longer represents production traffic provides false confidence. Each dataset should include metadata: creation date, last update date, production traffic coverage estimate, and known gaps.

Agent evaluation follows three phases as organizations mature: early development uses manual tracing and spot-checking, scaling uses offline evaluation with golden datasets, and production operations uses online evaluation with feedback loops. This blueprint covers the transition from phase one to phase three. Teams still in the manual tracing phase should complete days 0-30 before attempting golden dataset construction.

Workstream B: Evaluation pipeline

Build an eval runner that executes agent workflows against golden datasets and applies four assertion types. Structural assertions verify output format, required fields, and type correctness. Semantic assertions score answer relevance and factual accuracy using either embedding similarity or LLM-as-judge approaches. Behavioral assertions verify tool-call sequences against expected patterns. Cost assertions verify that cost-per-task stays within configured thresholds per workflow type.

Integrate the eval runner into the CI/CD pipeline so that every pull request touching agent code, prompt templates, or model configuration triggers a full evaluation run. Define regression thresholds per assertion type: a structural assertion failure should block deployment immediately, while a 2% drop in semantic similarity may trigger a warning rather than a block. These thresholds should be configurable per workflow type because tolerance for quality variation differs across use cases.
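The block-versus-warn distinction can be made explicit in the gate itself. A sketch with hypothetical per-workflow config keys, using the 2% warning example from above and an assumed 5% blocking threshold:

```python
def gate_decision(results: dict, config: dict) -> str:
    """Per-workflow CI gate.

    results: {"structural_pass": bool, "semantic_drop_pct": float}
    config:  per-workflow thresholds, e.g.
             {"semantic_warn_pct": 2.0, "semantic_block_pct": 5.0}
    """
    # Structural failures block immediately: the output contract is broken.
    if not results["structural_pass"]:
        return "block"
    drop = results["semantic_drop_pct"]
    if drop >= config["semantic_block_pct"]:
        return "block"
    if drop >= config["semantic_warn_pct"]:
        return "warn"   # surfaced in the PR, but deployment proceeds
    return "pass"
```

Keeping the thresholds in per-workflow config rather than code is what lets tolerance differ across use cases without forking the gate logic.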

Store eval results with run-over-run comparison capability. A single eval run tells you current quality; a sequence of runs tells you whether quality is trending up, down, or stable. Trend data is what makes evaluation useful for engineering leadership, not just for debugging.

Workstream C: Cost regression detection

Build cost-per-task tracking per workflow type that compares each run against the established baseline. Alert when cost-per-task drifts beyond threshold - a 2x baseline breach sustained over 24 hours is a common starting threshold, though teams with tight cost constraints may use 1.5x. Implement cost regression as a CI/CD gate alongside quality regression so that a code change that improves answer quality but triples cost-per-task is flagged before deployment.
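The sustained-breach rule is simple to express in code. A sketch assuming hourly cost-per-task samples; the function name and sampling cadence are illustrative:

```python
def cost_regression_alert(hourly_costs: list[float], baseline: float,
                          multiplier: float = 2.0, window: int = 24) -> bool:
    """Alert only when cost-per-task exceeds multiplier x baseline for every
    sample in the trailing window, filtering out transient spikes."""
    if len(hourly_costs) < window:
        return False  # not enough history to call it sustained
    recent = hourly_costs[-window:]
    return all(c > multiplier * baseline for c in recent)
```

Requiring every sample in the window to breach is a deliberately conservative choice; teams wanting faster signal could alert on, say, 20 of 24 samples instead.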

Track token usage distribution across agent steps to identify cost hotspots. Retry amplification is a particular risk in agentic workflows: the execute-observe-adjust loop that makes agents effective becomes, under backpressure, a cost-multiplying cycle. Per-step token tracking makes these loops visible in trace data before they appear in monthly billing.

Teams must shift from request-based cost monitoring to workflow-based cost tracking. Token cost per API call is a vendor metric; cost per completed task is a business metric. The gap between these two views is where cost amplification hides.

Exit criteria

  • Golden datasets exist for the top 5 production workflows with minimum 50 test cases each
  • Eval pipeline runs in CI/CD on every PR that touches agent code or configuration
  • Deployment is blocked on quality or cost regression beyond configured thresholds
  • Eval results stored with run-over-run comparison and trend visualization
  • Cost-per-task tracked per workflow type with alerting on threshold breach

Days 61-90: Online Evaluation, Production Monitoring, and Feedback Loops

Goal

Deploy production quality monitoring and close the gap between offline evaluation scores and actual production behavior.

Workstream A: Online evaluation

Deploy sampling-based quality scoring on live production traffic at a configurable sample rate, starting at 5-10%. Apply the same scoring dimensions used in offline evaluation - task success classification, output quality scoring, and tool-call correctness - to sampled production traces. This creates a direct comparison between offline and online quality scores.
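One reasonable way to implement the sampler is to hash the trace ID rather than draw random numbers, so a given trace's sampling decision is deterministic and reproducible across pipeline restarts. A sketch (the hashing scheme is one option, not a standard):

```python
import hashlib

def should_sample(trace_id: str, sample_rate: float = 0.10) -> bool:
    """Deterministic trace sampling: the same trace_id always yields the
    same decision, and raising sample_rate only adds traces, never swaps them."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate
```

Determinism matters here because online eval scores get compared against offline baselines; resampling different traces on every restart would add noise to that comparison.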

Compare online eval scores against offline eval baselines weekly. Divergence in either direction is a signal: if online scores are lower, golden datasets do not represent production traffic; if online scores are higher, golden datasets may include overly difficult edge cases that inflate perceived regression risk. Both conditions require golden dataset adjustment.

Online evaluation adds compute cost proportional to sample rate and scoring complexity. Teams should track evaluation cost as a line item alongside agent compute cost to avoid the ironic outcome of observability infrastructure costing more than the agents it monitors.

Workstream B: Production monitoring dashboards

Build dashboards organized around five primary views: task success rate by workflow type (the primary quality signal), error categorization and trend (to detect new failure modes), cost trends by workflow and agent step (to catch amplification early), latency distributions at p50/p95/p99 per step (to identify bottlenecks), and retry rate with retry cost attribution (to quantify recovery expense).

Implement alerting on four critical signals: task success rate drop below SLO threshold, cost-per-task spike above 2x baseline for any workflow, latency spike above 2x baseline at p95 for any step type, and new error category appearing that does not match the existing failure taxonomy. Alert routing should follow metric ownership defined in days 0-30 - cost alerts to the platform team, quality alerts to the application team, infrastructure alerts to SRE.

Dashboard design should answer one question for leadership: are agents working correctly? If the dashboard requires interpretation by an engineer to answer that question, it is a debugging tool, not a monitoring tool. Leadership dashboards should show task success rate and cost-per-task trends with clear SLO lines.

Workstream C: Feedback loop

Build the pipeline that connects production failures back to offline evaluation datasets. Production failure traces are sampled, reviewed weekly by the application team, and confirmed failures are added to golden datasets as new test cases. This ensures the evaluation pipeline grows more comprehensive with every production incident.

Implement a triage workflow: failed production traces are categorized by failure mode (from the taxonomy built in days 0-30), reviewed for root cause, and either added to golden datasets (if they represent a reproducible quality issue) or to the failure taxonomy (if they represent a new failure mode). Eval thresholds should be adjusted quarterly based on accumulated production signal.

Define SLOs for agent reliability per workflow type: task completion SLO (e.g., 95% for well-defined workflows), cost-per-task SLO (e.g., within 1.5x baseline), and latency SLO (e.g., p95 below 10 seconds end-to-end). SLOs should be reviewed in a weekly operations cadence alongside the feedback loop triage. Without SLOs, quality targets are implicit and non-enforceable.

Exit criteria

  • Online eval running on production traffic at 5-10% sample rate with configurable adjustment
  • Dashboards live with alerting on task success rate, cost, latency, and new error categories
  • Feedback loop operational: production failures flowing into golden datasets weekly
  • SLOs defined, tracked, and reviewed in weekly operations cadence
  • Online-to-offline eval divergence measured and reviewed weekly

By day 90, the system should be self-improving: production experience feeds evaluation coverage, evaluation coverage catches regressions earlier, and fewer regressions reach production. If teams cannot show this feedback cycle operating without manual intervention beyond the weekly triage review, the system is not yet in steady state.

Failure Modes

Failure mode 1: Silent tool-call failures

Silent tool-call failures occur when an agent retries a failed tool call internally, burns tokens on retry attempts, and eventually returns a degraded output without raising an error signal visible to monitoring. The user receives a response, the HTTP status code is 200, and infrastructure metrics show no anomaly. The degradation is invisible to everything except span-level tool-call tracking.

Detection requires span-level instrumentation that records the outcome of each tool call individually, including retries. Track retry count per span, tool-call success/failure status per span, and the delta between first-attempt cost and total cost including retries. A workflow where tool-call retries account for more than 20% of total cost is a candidate for investigation.

Mitigation involves explicit failure spans (a retry should produce its own span, not be hidden inside the parent span), retry budgets per tool call (maximum 2-3 retries before the agent escalates or fails explicitly), and circuit-breaker patterns that prevent repeated calls to a tool that is consistently failing. Teams should also instrument the quality difference between first-attempt outputs and post-retry outputs to measure whether retries actually improve results.
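The retry budget, explicit failure spans, and circuit breaker can be combined in one wrapper. The `ToolCaller` class, span-record shape, and thresholds below are a sketch under stated assumptions, not a specific tracing library's API:

```python
class CircuitOpen(Exception):
    """Raised when a tool has failed too many consecutive calls."""

class ToolCaller:
    def __init__(self, retry_budget=2, failure_threshold=3):
        self.retry_budget = retry_budget            # max retries per call
        self.failure_threshold = failure_threshold  # calls before circuit opens
        self.consecutive_failures: dict[str, int] = {}
        self.spans: list[dict] = []  # one record per attempt, retries included

    def call(self, tool_name, fn):
        if self.consecutive_failures.get(tool_name, 0) >= self.failure_threshold:
            raise CircuitOpen(tool_name)  # stop hammering a broken tool
        last_err = None
        for attempt in range(self.retry_budget + 1):
            try:
                result = fn()
                self.spans.append({"tool": tool_name, "attempt": attempt, "ok": True})
                self.consecutive_failures[tool_name] = 0
                return result
            except Exception as err:
                # Explicit failure span: the retry is visible, not hidden
                # inside the parent span.
                self.spans.append({"tool": tool_name, "attempt": attempt, "ok": False})
                last_err = err
        self.consecutive_failures[tool_name] = (
            self.consecutive_failures.get(tool_name, 0) + 1)
        raise last_err  # escalate instead of returning degraded output
```

Because every attempt produces its own span record, the delta between first-attempt cost and total cost falls out of the trace data directly.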

Failure mode 2: Evaluation-production divergence

Evaluation-production divergence is the condition where offline evaluations pass on curated test data while production quality degrades on real user inputs. This happens because production data distributions drift from test data over time, and golden datasets that are not updated with production signals become increasingly unrepresentative.

Detection requires running the same evaluation scoring on both offline test sets and sampled production traces, then comparing scores per workflow type. A divergence of more than 10 percentage points sustained over one week indicates that offline evaluations are no longer an accurate proxy for production quality. Track this divergence as a first-class metric, not as an occasional audit.
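A minimal sketch of that divergence check, assuming scores are fractions in [0, 1] keyed by workflow type; the 10-point threshold comes from the text, the function shape is illustrative:

```python
def eval_production_divergence(offline: dict[str, float],
                               online: dict[str, float],
                               threshold_pts: float = 10.0) -> dict[str, bool]:
    """Flag each shared workflow whose offline/online score gap
    exceeds the threshold (in percentage points)."""
    flags = {}
    for workflow in offline.keys() & online.keys():
        divergence_pts = abs(offline[workflow] - online[workflow]) * 100
        flags[workflow] = divergence_pts > threshold_pts
    return flags
```

Running this on a weekly cadence and alerting on any `True` flag turns divergence into the first-class metric the text calls for.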

Mitigation is the feedback loop described in days 61-90: confirmed production failures are added to golden datasets, expanding test coverage to include the input patterns that caused real failures. Teams should also audit golden datasets quarterly for stale test cases that no longer represent current production traffic patterns. A golden dataset that has not been updated in 90 days should be treated as a risk item, not a stable asset.

Failure mode 3: Cost amplification

Cost amplification occurs when retry loops, verbose reasoning chains, or unnecessary tool calls drive cost-per-task to 5-10x the expected baseline without triggering alerts. In multi-agent configurations, cost scales multiplicatively: each agent may receive accumulated conversation history, driving costs proportional to both agent count and conversation length.

Detection requires cost-per-task tracking with alerting on threshold breach, not just aggregate token monitoring. Per-step token tracking reveals which steps contribute disproportionately to total cost. Common amplification patterns include: retry loops where each retry includes the full conversation history, chains where the agent calls informational tools repeatedly to gather context it could have cached, and multi-agent handoffs where context is duplicated across agents.

Mitigation involves per-step token budgets (each step has a maximum token allocation), retry limits with exponential backoff, cost regression gates in CI/CD that block deployments when cost-per-task increases beyond threshold, and context management strategies that prevent history accumulation across long chains. Organizations must shift from per-request cost monitoring to per-session and per-task cost tracking from day one.
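Per-step token budgets and per-task cost rollup can be sketched together; the tracker class, step names, and pricing model below are illustrative assumptions:

```python
class TokenBudgetExceeded(Exception):
    """Raised when a step's cumulative tokens exceed its allocation."""

class StepBudgetTracker:
    def __init__(self, budgets: dict[str, int]):
        self.budgets = budgets            # max tokens per step name
        self.usage: dict[str, int] = {}   # cumulative tokens per step

    def record(self, step: str, tokens: int) -> None:
        total = self.usage.get(step, 0) + tokens
        # Steps without an explicit budget are unbounded here; a stricter
        # policy could default to a global cap instead.
        if total > self.budgets.get(step, float("inf")):
            raise TokenBudgetExceeded(f"{step}: {total} > {self.budgets[step]}")
        self.usage[step] = total

    def cost_per_task(self, price_per_1k_tokens: float) -> float:
        return sum(self.usage.values()) / 1000 * price_per_1k_tokens
```

Raising at the step level stops a retry loop before it compounds, and the per-step `usage` map shows which steps contribute disproportionately to total cost.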

Failure mode 4: Trace cardinality explosion

Trace cardinality explosion happens when over-instrumentation creates so many unique span names, attribute values, or tag combinations that trace storage becomes expensive and query performance degrades before teams extract value from the data. This is the observability equivalent of logging everything and reading nothing.

Detection involves monitoring trace storage growth rate, query latency trends, and the ratio of unique span names to total span volume. If unique span names grow linearly with traffic rather than staying constant, the naming convention includes high-cardinality values (such as user IDs or request parameters) in span names. Query latency above 10 seconds for simple trace lookups indicates a cardinality problem.

Mitigation starts with the span naming convention defined in days 0-30: structured names with controlled vocabulary, not variable strings built from request data. Implement sampling strategies that reduce trace volume for high-frequency, low-value workflows while keeping 100% sampling for error traces and high-cost traces. Define retention policies that age out detailed trace data after 30-90 days while keeping aggregated metrics indefinitely. Teams should set a storage budget for trace data and treat budget breach as a design problem, not an infrastructure scaling problem.
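A span-name lint enforcing a controlled vocabulary can be a handful of lines; the vocabulary and the high-cardinality pattern below are illustrative, not a standard:

```python
import re

# Hypothetical controlled vocabulary for span-name components.
ALLOWED_COMPONENTS = {"agent", "tool", "llm", "plan", "act", "call", "observe"}

# Crude detector for high-cardinality values leaking into names:
# long numeric IDs or UUID-like prefixes.
HIGH_CARDINALITY = re.compile(r"\d{4,}|[0-9a-f]{8}-[0-9a-f]{4}")

def span_name_ok(name: str) -> bool:
    """Accept only dot-joined names built from the controlled vocabulary."""
    if HIGH_CARDINALITY.search(name):
        return False  # variable data belongs in attributes, not the name
    return all(part in ALLOWED_COMPONENTS for part in name.split("."))
```

Wiring a check like this into code review or CI keeps unique span names constant as traffic grows, which is the property the detection paragraph monitors for.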

Failure mode 5: Reasoning drift

Reasoning drift occurs when agent reasoning quality degrades across long chains as context accumulates errors, irrelevant information, or conflicting signals from earlier steps. The agent's early steps may be correct, but by step eight or ten, accumulated context noise degrades output quality in ways that are not detectable from the final output alone.

Detection requires per-step output quality scoring, not just end-of-chain evaluation. Track the correlation between chain length and output quality score - a negative correlation indicates reasoning drift. Monitor for patterns where the same workflow type produces lower quality scores when it requires more steps to complete.
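The chain-length correlation can be computed with a plain Pearson coefficient; this sketch assumes at least two runs with non-zero variance in both series:

```python
from statistics import mean

def length_quality_correlation(lengths: list[int],
                               scores: list[float]) -> float:
    """Pearson correlation between chain length and quality score.
    A clearly negative value suggests reasoning drift."""
    lx, ly = mean(lengths), mean(scores)
    cov = sum((a - lx) * (b - ly) for a, b in zip(lengths, scores))
    sd_x = sum((a - lx) ** 2 for a in lengths) ** 0.5
    sd_y = sum((b - ly) ** 2 for b in scores) ** 0.5
    return cov / (sd_x * sd_y)  # undefined if either series is constant
```

Computed per workflow type over a rolling window, a sustained negative value is the drift signal the text describes.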

Mitigation involves context summarization checkpoints (periodically summarize accumulated context to remove noise), chain-length limits (fail explicitly rather than producing low-quality output from an excessively long chain), and intermediate quality gates (evaluate reasoning quality at checkpoints within the chain and abort or restart if quality drops below threshold). Teams should also measure whether splitting long workflows into shorter sub-workflows with explicit handoff points improves end-to-end quality.

Failure mode 6: Metric theater

Metric theater is the condition where teams track and dashboard vanity metrics - token counts, request latency, invocation volume - while missing the metric that matters: did the agent complete the task correctly? Teams in this failure mode can produce impressive monitoring dashboards that cannot answer a simple question from leadership about agent reliability.

Detection is straightforward: review whether existing dashboards can answer "what percentage of agent tasks completed successfully this week?" without additional data processing. If the answer requires manual trace review, log analysis, or custom queries, task-level success metrics are missing. Another signal: the team discusses agent performance in terms of infrastructure metrics (uptime, latency) rather than outcome metrics (task success rate, cost per completed task).

Mitigation requires making task-level success rate the primary metric on every agent dashboard, not a derived metric buried in a secondary view. Define what success means for each workflow type in terms the business cares about (order completed, question answered correctly, document generated with required content), and instrument that definition as a production metric. Infrastructure metrics remain important for debugging but should not be the primary lens through which leadership evaluates agent performance.
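The metric itself is simple to derive once traces carry a task-level outcome flag; the trace record shape below is a hypothetical export format:

```python
from collections import defaultdict

def task_success_rate(traces: list[dict]) -> dict[str, float]:
    """Task-level success rate per workflow type - the metric every
    agent dashboard should answer directly."""
    totals: dict[str, int] = defaultdict(int)
    wins: dict[str, int] = defaultdict(int)
    for t in traces:
        totals[t["workflow"]] += 1
        wins[t["workflow"]] += int(t["task_succeeded"])
    return {w: wins[w] / totals[w] for w in totals}
```

If producing this number requires anything more than a query over existing trace data, the outcome flag is missing from instrumentation, which is the gap this failure mode describes.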

Implementation Checklist by Team

Platform engineering team

The platform team owns trace infrastructure, evaluation pipeline tooling, and cost tracking systems. Responsibilities include deploying and maintaining the trace exporter and storage layer, building the eval runner and assertion engine, implementing cost-per-task computation from span-level data, and maintaining the CI/CD integration that blocks deployment on regression. The platform team should also own the span naming convention and enforce it through linting or code review automation. Without platform ownership of these foundational capabilities, application teams will build ad hoc solutions that fragment observability across the organization.

Application engineering team

The application team owns golden dataset construction, assertion definitions per workflow type, agent instrumentation within application code, and the weekly triage of production failures from the feedback loop. Application engineers know which tool-call sequences are correct, which outputs constitute success, and which edge cases matter - this domain knowledge cannot be delegated to the platform team. Each workflow type should have a named application engineer responsible for maintaining its golden dataset and reviewing its evaluation results.

SRE team

The SRE team owns production monitoring dashboards, alerting configuration, SLO tracking, and incident response for agent reliability issues. Responsibilities include defining alert thresholds based on SLOs, routing alerts to the correct team based on metric ownership, maintaining dashboard infrastructure, and conducting incident reviews for agent failures. SRE should also own trace retention policies and storage budgets to prevent cardinality explosion from becoming an unplanned cost center.

Governance and leadership team

The governance team owns metric review cadence, SLO approval, cost budget approval, and the decision framework for when agent quality issues require escalation. Responsibilities include quarterly review of SLOs and evaluation thresholds, approval of cost budgets for agent operations and observability infrastructure, and maintaining the relationship between agent quality metrics and business outcomes. Without governance ownership, quality targets remain engineering preferences rather than organizational commitments.

Decision Metrics for Leadership

Each metric is paired with its operational interpretation:

  • Task completion rate by workflow type: The primary quality metric. If this is below SLO, agents are not delivering value regardless of other metrics. Review weekly, investigate any drop above 5 percentage points.
  • Cost-per-task vs. baseline: Tracks whether agent operations are economically sustainable. Cost above 2x baseline indicates amplification. Review weekly alongside task completion rate - a quality improvement that triples cost may not be acceptable.
  • Eval-production divergence: Measures whether the evaluation pipeline is trustworthy. Divergence above 10 points means offline evals are not predicting production quality. Treat as an infrastructure reliability issue, not a testing issue.
  • Golden dataset coverage: Percentage of production workflow types with maintained golden datasets and active evaluation. Below 80% means significant production traffic has no regression detection.
  • Feedback loop throughput: Number of production failures added to golden datasets per week. Zero throughput for two or more weeks suggests the feedback loop is not operational, regardless of what documentation says.
  • Online eval sample rate: The percentage of production traffic receiving quality scoring. Below 5% produces noisy signal; above 20% may cost more than the value it provides. Adjust based on traffic volume and quality stability.
  • Mean time to regression detection: Time between a quality regression being deployed and the evaluation or monitoring system detecting it. Above 48 hours means the detection pipeline is too slow to prevent user impact at scale.

Review these metrics in the same meeting as agent development velocity and incident metrics. Isolated reporting allows tradeoffs between speed and quality to remain invisible until they cause a production incident.

Test Catalog

Test family A: Trace completeness and accuracy

Goal

Prove that trace instrumentation captures all agent steps with correct metadata and cost attribution.

  • Execute a known multi-step workflow and verify that the trace contains one span per agent step, tool call, and LLM invocation with correct parent-child relationships
  • Verify that each span includes required attributes: model ID, token input count, token output count, cost attribution, step outcome status, and tool parameters where applicable
  • Execute a workflow that triggers retries and verify that retry spans are distinct from original attempt spans with their own metadata
  • Execute a workflow across service boundaries and verify trace context propagation produces a single coherent trace

Expected evidence

  • Complete trace with span count matching expected step count
  • Span attribute validation report showing all required fields present and correctly typed
  • Cost-per-task computation matching expected value within 5% tolerance
  • Cross-service trace showing unbroken parent-child chain
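A trace-completeness check in the spirit of test family A can be expressed as a validator over an exported trace; the span shape, attribute names, and export format below are hypothetical:

```python
# Required span attributes per the test family A checklist.
REQUIRED_ATTRS = {"model_id", "tokens_in", "tokens_out", "cost_usd", "status"}

def check_trace(trace: list[dict], expected_spans: int) -> list[str]:
    """Return a list of problems; an empty list means the trace passes."""
    problems = []
    if len(trace) != expected_spans:
        problems.append(f"span count {len(trace)} != expected {expected_spans}")
    ids = {s["span_id"] for s in trace}
    for s in trace:
        missing = REQUIRED_ATTRS - s["attributes"].keys()
        if missing:
            problems.append(f"{s['span_id']} missing {sorted(missing)}")
        parent = s.get("parent_id")
        if parent is not None and parent not in ids:
            # Broken parent-child chain, e.g. failed context propagation.
            problems.append(f"{s['span_id']} orphaned: parent {parent} absent")
    return problems
```

Asserting `check_trace(trace, n) == []` in a test suite turns the evidence list above into an automated gate rather than a manual review.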

Test family B: Evaluation pipeline correctness

Goal

Prove that the evaluation pipeline detects known regressions and does not produce false passes.

  • Submit a known-good agent output against golden dataset and verify pass result across all assertion types
  • Submit a known-bad agent output with incorrect tool-call sequence and verify behavioral assertion failure
  • Submit an agent output with correct content but schema violation and verify structural assertion failure
  • Submit an agent output that exceeds cost-per-task threshold and verify cost assertion failure
  • Introduce a code change that degrades semantic similarity by 10% and verify the CI/CD gate blocks deployment

Expected evidence

  • Eval runner results showing pass/fail aligned with expected outcomes for each test case
  • CI/CD pipeline log showing deployment blocked on regression detection
  • Run-over-run comparison showing the introduced regression as a trend break
  • Alert or notification generated for the blocking regression
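The CI/CD gate exercised by test family B can be sketched as a run-over-run comparison; the metric names and the 5-point tolerance below are illustrative assumptions:

```python
def gate(baseline: dict[str, float], candidate: dict[str, float],
         max_drop: float = 0.05) -> tuple[bool, list[str]]:
    """Block deployment when any baseline metric regresses beyond
    the tolerance. Returns (passed, list_of_regressed_metrics)."""
    regressions = [m for m in baseline
                   if candidate.get(m, 0.0) < baseline[m] - max_drop]
    return (len(regressions) == 0, regressions)
```

The test cases in this family then reduce to asserting that a known 10% semantic-similarity degradation yields `passed == False`, and that a clean run yields `passed == True`.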

Test family C: Online evaluation and feedback loop

Goal

Prove that production quality scoring operates correctly and that the feedback loop adds production failures to golden datasets.

  • Run online evaluation on a set of production traces with known quality scores and verify scorer accuracy within 5% of expected values
  • Inject a production trace with a confirmed failure and verify it surfaces in the weekly triage queue
  • Complete the triage workflow for an injected failure trace and verify it appears in the golden dataset for the appropriate workflow type
  • Trigger an online-to-offline divergence above threshold and verify alerting fires

Expected evidence

  • Online eval scorer accuracy report against labeled production traces
  • Triage queue showing injected failure traces with correct categorization
  • Golden dataset diff showing newly added test cases from production failures
  • Divergence alert log with correct threshold breach details

Test family D: Cost and resource controls

Goal

Prove that cost tracking detects amplification patterns and that resource controls prevent runaway execution.

  • Execute a workflow that triggers retry amplification and verify cost-per-task alert fires before budget is exceeded
  • Execute a workflow that exceeds per-step token budget and verify the budget enforcement mechanism activates
  • Simulate a tool that consistently fails and verify the circuit breaker activates after the configured retry limit
  • Execute a workflow where cost-per-task exceeds the CI/CD gate threshold and verify the gate blocks deployment

Expected evidence

  • Cost-per-task alert log with threshold breach details and time-to-detection
  • Token budget enforcement log showing step termination at budget limit
  • Circuit breaker activation log with retry count and tool failure details
  • CI/CD gate log showing deployment blocked on cost regression

Decision Questions for Leadership

What question should a CTO ask first when evaluating agent observability maturity?

Ask whether the team can show task completion rate by workflow type for the past 30 days without manual data processing. If that metric is not available as a production metric, observability is infrastructure monitoring, not agent quality measurement. Task-level success rate is the single metric that connects engineering operations to business outcomes.

How do we know if our evaluation pipeline is actually catching regressions?

Track mean time to regression detection - the time between a quality-degrading code change being deployed and the evaluation system detecting the degradation. If this number is measured in days rather than hours, the pipeline is too slow. Also verify that the pipeline has blocked at least one deployment in the past quarter; a gate that never blocks is either perfectly calibrated or not functioning.

When should we invest in online evaluation versus relying on offline evaluation alone?

Invest in online evaluation when golden datasets are stable and offline evaluation has been running for at least 30 days. Online evaluation without offline evaluation is premature - you need a baseline to detect divergence. The trigger for investment is usually the first production incident where the failure mode was not covered by existing golden datasets, proving that offline evaluation alone has coverage gaps.

What is the right cost budget for agent observability infrastructure?

A reasonable starting target is 5-10% of total agent compute cost allocated to observability and evaluation infrastructure. This includes trace storage, evaluation compute (both offline and online), and dashboard infrastructure. If observability costs exceed 15% of agent compute costs, the team is likely over-instrumenting or running online evaluation at too high a sample rate. If costs are below 3%, coverage is likely insufficient.

How do we avoid metric theater - tracking numbers that look good but do not reflect agent quality?

Require that every agent dashboard answers the question "did agents complete tasks correctly?" before any infrastructure metric appears. Task completion rate, cost-per-task, and eval-production divergence should be the top three metrics visible to leadership. Token counts, request latency, and invocation volume are debugging tools, not quality indicators.

What organizational structure supports sustained agent quality improvement?

Agent quality requires three ownership lanes operating in coordination: platform engineering owns trace infrastructure and evaluation tooling, application engineering owns golden datasets and assertion definitions, and SRE owns production monitoring and SLO enforcement. The feedback loop requires weekly coordination across all three teams. Organizations that assign agent observability entirely to one team will discover that the other two teams treat it as someone else's problem.

Limitations

This blueprint defines an execution framework for agent observability and evaluation infrastructure, not sector-specific compliance guidance, procurement advice, or legal counsel. The metric thresholds, timeline targets, and tooling references should be adapted to each organization's agent complexity, traffic volume, risk tolerance, and regulatory obligations. The framework is strongest when paired with regular production incident reviews, disciplined golden dataset maintenance, and cross-team ownership of quality metrics.

About

Author: Talia Rune
Reviewed by: StackAuthority Editorial Team
Review cadence: Quarterly (90-day refresh cycle)

About Talia Rune

Talia Rune is a Research Analyst at StackAuthority with 10 years of experience in security governance and buyer-side risk analysis. She completed an M.P.P. at Harvard Kennedy School and writes on how engineering leaders evaluate controls, accountability, and implementation risk under real operating constraints. Outside research work, she does documentary photography and coastal birdwatching.

