2026-06-11 · harness

Chapter 9: Observability

Add tracing, token and cost tracking to your agent so you can debug a real run instead of rerunning it blind.

Xiaoman · The Glass Room

You can test it, but you still cannot see inside its head. Today you open it up.

Draft chapter. First cut to prove the format; it will be hardened before it is indexed.

What you’ll build

Tracing around your PR-reviewer loop so that every run produces a tree of spans: the top level task, each model call, and each tool call, each carrying inputs, outputs, token counts, and timing. When the reviewer posts a nonsense comment or stalls, you open the trace and read the exact path it took instead of rerunning it blind and hoping the bug shows up again.

By the end you will have a span schema, a logging schema, a per-run cost rollup, and a way to point at one span and say “the failure happened here, because of this.”

Prerequisites

  • The agent loop from earlier chapters (a think -> call tool -> observe loop).
  • A tracing approach. The concepts here map onto OpenTelemetry style spans (a span is a timed unit of work with attributes, nested under a parent). Pick any backend you like. See the official OpenTelemetry and Hugging Face Agents Course docs for setup.
  • Access to the model response metadata that reports token usage. The Anthropic API returns a usage object with input_tokens and output_tokens on every response; see the official docs.

Steps

1. Open a root span per run

Wrap one full agent task in a single root span. This span is the anchor everything else hangs off, and its attributes are what you search by later: a stable run_id, the PR identifier, the goal, the model name, and the git SHA of the agent code. The run_id is the single most important field. Without it you cannot connect a user’s bug report (“the reviewer approved a PR with a SQL injection”) to the trace that explains it.

import anyio
from claude_agent_sdk import query, ClaudeAgentOptions

async def review_pr(pr_id: str, goal: str):
    run_id = uuid4().hex
    with tracer.start_span("agent.run") as root:
        root.set_attributes({
            "run_id": run_id,
            "pr.id": pr_id,
            "goal": goal,
            "model": MODEL,
            "agent.version": GIT_SHA,
        })
        options = ClaudeAgentOptions(
            system_prompt=CONTRACT,
            allowed_tools=["Read"],
            cwd="./repo",
        )
        try:
            result = await trace_run(root, run_id, query(prompt=goal, options=options))
            root.set_attribute("outcome", "success")
            return result
        except AgentError as e:
            root.set_attributes({"outcome": "failure",
                                 "failure.step": e.step_id,
                                 "failure.reason": str(e)})
            raise
import { query } from "@anthropic-ai/claude-agent-sdk";

async function reviewPr(prId: string, goal: string) {
  const runId = crypto.randomUUID();
  const root = tracer.startSpan("agent.run");
  root.setAttributes({
    run_id: runId,
    "pr.id": prId,
    goal,
    model: MODEL,
    "agent.version": GIT_SHA,
  });
  const options = {
    systemPrompt: CONTRACT,
    allowedTools: ["Read"],
    cwd: "./repo",
  };
  try {
    const result = await traceRun(root, runId, query({ prompt: goal, options }));
    root.setAttribute("outcome", "success");
    return result;
  } catch (e: any) {
    root.setAttributes({ outcome: "failure", "failure.step": e.stepId,
                         "failure.reason": String(e) });
    throw e;
  } finally {
    root.end();
  }
}

2. Make each step a child span

The SDK runs that loop for you; you just iterate the message stream it yields. Each AssistantMessage maps to a model.call span, and if it carries a ToolUseBlock (the model asked for a tool) you open a tool.call span too, with the SDK executing the tool. Nesting them under the root makes the trace mirror the loop one to one, so reading top to bottom is the same as replaying the run. Give each iteration an incrementing step index. That index is how you say “it broke at step 7” instead of “it broke somewhere.”

from claude_agent_sdk import AssistantMessage, TextBlock, ToolUseBlock, ResultMessage

async def trace_run(root, run_id, stream):
    step, final = 0, None
    async for message in stream:
        if isinstance(message, AssistantMessage):
            for block in message.content:
                if isinstance(block, ToolUseBlock):
                    with tracer.start_span("tool.call", parent=root) as ts:
                        ts.set_attributes({"run_id": run_id, "step": step,
                                           "tool.name": block.name})
                        record_tool_span(ts, block)         # step 3
                elif isinstance(block, TextBlock):
                    final = block.text
            with tracer.start_span("model.call", parent=root) as ms:
                ms.set_attributes({"run_id": run_id, "step": step})
                record_model_span(ms, message)              # step 3 + 4
            step += 1
        elif isinstance(message, ResultMessage):
            record_run_totals(root, message)               # step 4
    return final
async function traceRun(root, runId, stream) {
  let step = 0, final = null;
  for await (const message of stream) {
    if (message.type === "assistant") {
      for (const block of message.message.content) {
        if (block.type === "tool_use") {
          const ts = tracer.startSpan("tool.call", { parent: root });
          ts.setAttributes({ run_id: runId, step, "tool.name": block.name });
          recordToolSpan(ts, block);          // step 3
          ts.end();
        } else if (block.type === "text") {
          final = block.text;
        }
      }
      const ms = tracer.startSpan("model.call", { parent: root });
      ms.setAttributes({ run_id: runId, step });
      recordModelSpan(ms, message);           // step 3 + 4
      ms.end();
      step += 1;
    } else if (message.type === "result") {
      recordRunTotals(root, message);         // step 4
    }
  }
  return final;
}

3. Record inputs and outputs on the span

A span with only timing is useless for debugging. Record what went in and what came out: for a model call, the text or tool request returned; for a tool call, the arguments the model passed (the SDK feeds the result back on the next message, which you can record on the ToolResultBlock if you want it). This is the difference between “the lint tool failed” and “the lint tool was called with path=/etc/passwd, which is why it failed.” Redact secrets first. Tokens, cookies, and Authorization headers must never land in a trace, because traces get shipped to third party backends and shared in bug reports.

SENSITIVE = ("authorization", "token", "api_key", "password", "secret")

def redact(obj):
    if isinstance(obj, dict):
        return {k: ("***" if any(s in k.lower() for s in SENSITIVE)
                    else redact(v)) for k, v in obj.items()}
    return obj

def record_tool_span(span, block: ToolUseBlock):
    # block.input is the model's arguments; the SDK runs the tool itself
    span.set_attribute("tool.args", json.dumps(redact(block.input)))
    span.set_attribute("tool.id", block.id)
const SENSITIVE = ["authorization", "token", "api_key", "password", "secret"];

function redact(obj) {
  if (obj && typeof obj === "object" && !Array.isArray(obj)) {
    return Object.fromEntries(Object.entries(obj).map(([k, v]) =>
      [k, SENSITIVE.some((s) => k.toLowerCase().includes(s)) ? "***" : redact(v)]));
  }
  return obj;
}

function recordToolSpan(span, block) {
  // block.input is the model's arguments; the SDK runs the tool itself
  span.setAttribute("tool.args", JSON.stringify(redact(block.input)));
  span.setAttribute("tool.id", block.id);
}

Truncate large payloads (an 8 KB cap is a reasonable default). A full repository diff can be megabytes; you want the shape and the head of it, not the whole thing in every span.

4. Track tokens and cost

The SDK hangs a usage dict (with input_tokens and output_tokens) on every AssistantMessage; read each step’s numbers onto its model.call span. The authoritative per-run totals live on the terminal ResultMessage: usage, total_cost_usd, duration_ms, num_turns, which you roll straight up to the root instead of summing by hand. Cost is just tokens times your published rate. Tracking it per span (not only per run) is what lets you answer “why did this run cost 12x the median?” The usual answer is that the diff or the conversation history blew up the input tokens on one step.

def record_model_span(span, message: AssistantMessage):
    usage = message.usage or {}
    span.set_attributes({
        "llm.input_tokens": usage.get("input_tokens", 0),
        "llm.output_tokens": usage.get("output_tokens", 0),
    })

def record_run_totals(root, result: ResultMessage):
    usage = result.usage or {}
    root.set_attributes({
        "llm.input_tokens": usage.get("input_tokens", 0),
        "llm.output_tokens": usage.get("output_tokens", 0),
        "llm.cost_usd": result.total_cost_usd or 0.0,   # SDK reports USD directly
        "run.duration_ms": result.duration_ms,
        "run.num_turns": result.num_turns,
    })
function recordModelSpan(span, message) {
  const usage = message.message.usage ?? {};
  span.setAttributes({
    "llm.input_tokens": usage.input_tokens ?? 0,
    "llm.output_tokens": usage.output_tokens ?? 0,
  });
}

function recordRunTotals(root, result) {
  const usage = result.usage ?? {};
  root.setAttributes({
    "llm.input_tokens": usage.input_tokens ?? 0,
    "llm.output_tokens": usage.output_tokens ?? 0,
    "llm.cost_usd": result.total_cost_usd ?? 0,   // SDK reports USD directly
    "run.duration_ms": result.duration_ms,
    "run.num_turns": result.num_turns,
  });
}

ResultMessage.total_cost_usd is the USD cost the SDK already computed, which is more reliable than multiplying tokens by a rate yourself; when you do compute per-token cost, take the rates and field names from the official pricing and API docs rather than hardcoding numbers that drift.

5. Tag the outcome and attribute the failure

On the root span, record success or failure. On failure, record which step and which span broke (see step 1). Attribution is the payoff for doing all this: with it, a week of production runs becomes a sortable table (“3% of runs fail, 80% of those at the lint step, all on PRs over 5000 lines”). That sentence tells you what to fix. Without attribution all you have is “sometimes it fails.”

A compact logging schema for one PR-reviewer run, emitted alongside the spans:

{
  "run_id": "a1b2...",
  "pr_id": "org/repo#412",
  "model": "claude-...",          // see official docs for current ids
  "outcome": "failure",
  "failure": { "step": 7, "span": "tool.call", "reason": "lint timeout" },
  "totals": { "steps": 8, "input_tokens": 41200,
              "output_tokens": 5300, "cost_usd": 0.21, "wall_ms": 14300 },
  "steps": [
    { "step": 0, "kind": "model.call", "out_tokens": 180 },
    { "step": 0, "kind": "tool.call", "tool": "read_file",
      "args": { "path": "src/auth.py" }, "ok": true },
    { "step": 7, "kind": "tool.call", "tool": "run_linter",
      "ok": false, "error": "timeout after 30s" }
  ]
}

Learned: spelling out the runEvery time Xiaoman runs, it records each step it took as a tree of timed, token-counted spans, so you can read top to bottom and see what it called at each step and where it broke.

How to verify

  • Run one PR review and open its trace. You should be able to walk the whole reasoning path top to bottom and tell, at each step, what the model decided and what came back.
  • Force a tool error (point the linter at a missing binary). Confirm the failing span is marked ok: false, the root shows outcome: failure, and failure.step points at the right iteration.
  • Sum input_tokens + output_tokens across the child spans and confirm it matches the per-run total reported on the ResultMessage. A mismatch means you are dropping spans somewhere.

Learned: pointing at a spanYou can open the trace of a real run, confirm the failing step is marked ok: false with failure on the root, and add up the per-step tokens to check they match the run total.

Why it works

A trace rebuilds a run into something you can still read after the fact. Here is why it beats print debugging: an agent run is nondeterministic, so rerunning it may not reproduce the bug, but the trace of that original run is frozen evidence. Nest the spans and you get causality for free (this tool call happened because that model call requested it), and attributes turn thousands of runs into a queryable dataset instead of a pile of logs.

Recap

Your agent is no longer a black box. For any run you can see the path it took, the tokens and dollars it spent, and the exact step where it broke. That visibility is the precondition for the next chapter: you cannot handle errors you cannot see.

Common pitfalls

  • Logging only the final answer. The interesting failures happen mid-loop, three tool calls deep. Trace every step or you will be debugging the symptom, not the cause.
  • Leaking secrets into traces. Redact tokens and credentials before recording. Traces leave your machine.
  • No run id. Without a stable id you cannot correlate a user report with its trace, and the whole system degrades into “works on my machine.”
  • Token totals that do not add up. Usually a sign of dropped or unparented spans; fix it before you trust the cost numbers.

Xiaoman lays its train of thought open for the first time, and in the traces and logs you see it hesitated a long while at one step. So that is what the half-second delay was. The Glass Room lights up.

Just lit The Glass Room · 10 / 16 lit

Now you can see, but it still shatters the moment it trips. Next: the Rising Slope.

Sources

  1. Hugging Face Agents Course: Observability and evaluation · official
  2. Anthropic Cookbook · official
UP NEXT · CHAPTER 10 Error handling & reliability