2026-06-11 · harness

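Chapter 10: Error Handling and Reliability

Keep your agent alive through tool failures, timeouts, and bad arguments, without corrupting state or looping forever.

Xiaoman · Never-Fall Slope

You can see what it is doing now, but it crashes at the first error. Today you teach it to pick itself up after a fall.

Draft chapter. This is a first version written to exercise the format; it will be polished before formal indexing.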

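Chapter goals

Build a reliability layer around the PR reviewer loop that handles the four failures every real agent runs into: a tool errors out, a tool hangs, the model hallucinates wrong arguments, and a multi-step operation fails halfway. By the end, the reviewer degrades gracefully (it says clearly what it could not do, then stops) instead of crashing, hanging, or posting duplicate comments.

The structure of this layer is simple: a guarding wrapper sits between the loop and each tool, with a few policies layered on top (retry budget, idempotency, human gate).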

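Prerequisites

The agent loop and tracing from Chapter 9. Here you will mark failures on the spans.
At least one tool with a side effect, for example posting a review comment to a PR.
Tool schemas (one JSON Schema or pydantic model per tool) for validating arguments. See the official tool-use docs for the schema format.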

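Hands-on

1. Validate arguments before acting

The model can hallucinate a field, a path, or a type. A failure caught before any side effect happens is the cheapest failure to handle. The SDK gives you a pre-execution checkpoint: the can_use_tool callback, which is invoked before every tool execution. When validation fails, return PermissionResultDeny(message=...); the SDK passes that readable message back to the model as an observation instead of raising an exception that kills the run. The model usually corrects itself on the next turn.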

from claude_agent_sdk import (
    ClaudeAgentOptions, ClaudeSDKClient,
    PermissionResultAllow, PermissionResultDeny, ToolPermissionContext,
)

# Assumed to be defined elsewhere in your project:
#   TOOLS        maps tool name -> spec whose .schema.validate(input) raises SchemaError
#   SchemaError  the exception your validator raises (jsonschema / pydantic)
#   CURRENT_PR   id of the PR under review

async def can_use_tool(tool_name: str, tool_input: dict, context: ToolPermissionContext):
    spec = TOOLS.get(tool_name)
    if spec:
        try:
            spec.schema.validate(tool_input)      # JSON Schema / pydantic
        except SchemaError as e:
            return PermissionResultDeny(
                message=f"invalid arguments for {tool_name}: {e}. Re-read the schema and retry.")
    # domain checks the schema cannot express:
    if tool_name == "post_comment" and tool_input.get("pr_id") != CURRENT_PR:
        return PermissionResultDeny(
            message=f"refusing: pr_id {tool_input.get('pr_id')} is not the PR under review ({CURRENT_PR}).")
    return PermissionResultAllow()

options = ClaudeAgentOptions(can_use_tool=can_use_tool, permission_mode="default")

import { query } from "@anthropic-ai/claude-agent-sdk";

async function canUseTool(toolName, toolInput, context) {
  const spec = TOOLS[toolName];
  if (spec) {
    try {
      spec.schema.validate(toolInput);      // JSON Schema / zod
    } catch (e) {
      return { behavior: "deny",
               message: `invalid arguments for ${toolName}: ${e}. Re-read the schema and retry.` };
    }
  }
  // domain checks the schema cannot express:
  if (toolName === "post_comment" && toolInput.pr_id !== CURRENT_PR) {
    return { behavior: "deny",
             message: `refusing: pr_id ${toolInput.pr_id} is not the PR under review (${CURRENT_PR}).` };
  }
  return { behavior: "allow", updatedInput: toolInput };
}

const options = { canUseTool, permissionMode: "default" };
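
The callbacks above rely on a TOOLS registry and a SchemaError that this chapter never defines. As a rough, standard-library-only sketch of one shape they could take (Schema, ToolSpec, and the field types for post_comment are assumptions made for illustration; in practice you would more likely plug in jsonschema or pydantic):

```python
from dataclasses import dataclass


class SchemaError(Exception):
    """Raised when tool arguments do not match the declared schema."""


@dataclass
class Schema:
    # field name -> expected Python type; every listed field is required
    fields: dict

    def validate(self, tool_input: dict) -> None:
        unknown = set(tool_input) - set(self.fields)
        if unknown:
            raise SchemaError(f"unknown fields: {sorted(unknown)}")
        for name, expected in self.fields.items():
            if name not in tool_input:
                raise SchemaError(f"missing field: {name}")
            value = tool_input[name]
            # bool is a subclass of int, so reject it explicitly for non-bool fields
            wrong_type = not isinstance(value, expected) or (
                isinstance(value, bool) and expected is not bool
            )
            if wrong_type:
                raise SchemaError(
                    f"field {name} must be {expected.__name__}, got {type(value).__name__}"
                )


@dataclass
class ToolSpec:
    schema: Schema


# Hypothetical registry entry; the real argument types depend on your tool.
TOOLS = {
    "post_comment": ToolSpec(Schema({"pr_id": int, "path": str, "line": int, "body": str})),
}
```

The error text names the offending field, which is what lets the model repair the call on its next turn rather than guess.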

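2. Wrap each tool in a timeout, then add a retry budget with backoff

A run that hangs or drops its connection should not take down the whole process. The SDK sorts low-level failures into explicit exception types: CLINotFoundError (the CLI is not installed; this is permanent, so retrying is pointless), ProcessError (carries exit_code and stderr), CLIConnectionError, and CLIJSONDecodeError. Handle each by type: transient failures are worth retrying, permanent ones are not. Retry with exponential backoff plus jitter so you do not hammer a service that is already struggling, and cap the number of attempts. Backoff plus a retry cap keeps a flaky dependency from setting off a cascade of retries.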

import random

import anyio
from claude_agent_sdk import (
    query, CLINotFoundError, ProcessError, CLIConnectionError, CLIJSONDecodeError,
)

async def run_with_retry(prompt, options, *, attempts=3):
    for i in range(attempts):
        try:
            results = []
            async for message in query(prompt=prompt, options=options):
                results.append(message)
            return results
        except CLINotFoundError:
            raise                                       # permanent: do not retry, re-raise
        except (ProcessError, CLIConnectionError, CLIJSONDecodeError):
            if i == attempts - 1:
                raise                                   # retry budget exhausted
            await anyio.sleep(min(2 ** i, 8) + random.random() * 0.5)   # backoff + jitter

import { query } from "@anthropic-ai/claude-agent-sdk";

async function runWithRetry(prompt, options, attempts = 3) {
  for (let i = 0; i < attempts; i++) {
    try {
      const results = [];
      for await (const message of query({ prompt, options })) results.push(message);
      return results;
    } catch (e: any) {
      // permanent errors such as CLINotFoundError are not retried; transient ones are retried after backoff
      const permanent = e?.name === "CLINotFoundError";
      if (permanent || i === attempts - 1) throw e;
      await new Promise((r) => setTimeout(r, Math.min(2 ** i, 8) * 1000 + Math.random() * 500));
    }
  }
}

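Timeouts and retries inside a single tool (an HTTP deadline, for example) are still handled in the tool's own implementation; this layer retries the whole query() run. Because a retried run can repeat tool calls that already succeeded, make side effects idempotent (step 4) before you rely on it.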

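3. Treat errors as observations, not crashes

In the loop the SDK runs, a tool failure is just one more message in the stream: the failed tool result is fed back to the model as a ToolResultBlock (is_error=True) inside a UserMessage, and the model adapts by trying another file, skipping the broken linter, or giving up. Your job is to iterate over the stream without letting any single tool failure crash the loop. max_turns caps the total number of turns, so a model that keeps retrying the same doomed call cannot spin forever. Whether the run as a whole succeeded is decided by is_error and subtype on the final ResultMessage. This is the most important reliability rule: while iterating the stream, a tool failure never raises; it is only recorded as an observation.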

from claude_agent_sdk import (
    query, AssistantMessage, UserMessage, ToolUseBlock, ToolResultBlock, ResultMessage,
)

# tracer: the tracer set up in Chapter 9 (assumed to be OpenTelemetry-style, with span.end())

async def run_loop(prompt, options):
    open_spans = {}                                   # tool_use id -> span still in flight
    async for message in query(prompt=prompt, options=options):
        if isinstance(message, AssistantMessage):
            for block in message.content:
                if isinstance(block, ToolUseBlock):
                    span = tracer.start_span("tool.call")
                    span.set_attribute("tool.name", block.name)
                    open_spans[block.id] = span
        elif isinstance(message, UserMessage):
            for block in message.content:
                if isinstance(block, ToolResultBlock):
                    span = open_spans.pop(block.tool_use_id, None)
                    if span is None:
                        continue
                    if block.is_error:
                        # a tool failure is an observation, not a crash; the model adapts next turn
                        span.set_attribute("error", str(block.content))
                    span.end()                        # close the span so it gets exported
        elif isinstance(message, ResultMessage):
            return {"ok": not message.is_error, "subtype": message.subtype}

async function runLoop(prompt, options) {
  const openSpans = new Map();                       // tool_use id -> span still in flight
  for await (const message of query({ prompt, options })) {
    if (message.type === "assistant") {
      for (const block of message.message.content) {
        if (block.type === "tool_use") {
          const span = tracer.startSpan("tool.call");
          span.setAttribute("tool.name", block.name);
          openSpans.set(block.id, span);
        }
      }
    } else if (message.type === "user") {
      for (const block of message.message.content) {
        if (block.type !== "tool_result") continue;
        const span = openSpans.get(block.tool_use_id);
        if (!span) continue;
        openSpans.delete(block.tool_use_id);
        // a tool failure is an observation, not a crash; the model adapts next turn
        if (block.is_error) span.setAttribute("error", String(block.content));
        span.end();                                  // close the span so it gets exported
      }
    } else if (message.type === "result") {
      return { ok: !message.is_error, subtype: message.subtype };
    }
  }
}

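Set max_turns (py) / maxTurns (ts) to cap the loop; reuse the options from step 1 that carry can_use_tool. Check the SDK docs for your version: the Python SDK may require streaming input (ClaudeSDKClient or an async-iterable prompt) rather than a plain string prompt when can_use_tool is set.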
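
The same rule applies inside your own tool bodies: an exception raised there should come back as an error result, not escape into the loop. Here is a minimal sketch of a decorator that does this. It assumes custom tool handlers take an args dict and return a dict with a "content" list and an optional "is_error" flag; verify that shape against the SDK docs, since errors_as_observations and run_linter are hypothetical names:

```python
import asyncio
import functools


def errors_as_observations(handler):
    """Wrap an async tool handler so exceptions become error results."""

    @functools.wraps(handler)
    async def wrapper(args: dict) -> dict:
        try:
            return await handler(args)
        except Exception as exc:
            text = f"{handler.__name__} failed: {type(exc).__name__}: {exc}"
            # Assumed result shape: a text block plus an error flag the model can see.
            return {"content": [{"type": "text", "text": text}], "is_error": True}

    return wrapper


@errors_as_observations
async def run_linter(args: dict) -> dict:
    # Stand-in for a real linter call that blows up on a missing file.
    raise FileNotFoundError(args["path"])
```

Calling asyncio.run(run_linter({"path": "missing.py"})) returns an error result naming the exception type, which gives the model something concrete to react to.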

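4. Make side effects idempotent

If a retry re-invokes a tool that posts a comment (the run-level retry from step 2, or the model re-attempting a call in step 3), you must never post the same comment twice. Give every side-effecting action an idempotency key, computed deterministically from its meaningful inputs, and have the tool (or the remote API) return the original result and do nothing else when it sees a key it has already handled. Same key, same result, no duplicate post.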

from hashlib import sha256

posted_keys = {}    # idempotency key -> earlier result (in-memory, for illustration)

# Body of the post_comment tool you register as an SDK custom tool (@tool)
def post_comment(pr_id, path, line, body):
    key = sha256(f"{pr_id}:{path}:{line}:{body}".encode()).hexdigest()
    if key in posted_keys:                  # or: send it as an Idempotency-Key header
        return posted_keys[key]
    res = github.create_review_comment(pr_id, path, line, body)
    posted_keys[key] = res
    return res

// Body of the post_comment tool you register as an SDK custom tool
function postComment(prId, path, line, body) {
  const key = sha256(`${prId}:${path}:${line}:${body}`);
  if (postedKeys.has(key)) return postedKeys.get(key);   // or: send it as an Idempotency-Key header
  const res = github.createReviewComment(prId, path, line, body);
  postedKeys.set(key, res);
  return res;
}
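
One detail worth checking in your own key function: joining fields with a delimiter can map two different inputs to the same string when a field contains that delimiter. A possible variant, offered as a sketch rather than the chapter's method, hashes a canonical JSON encoding instead. Both idempotency_key and naive_key are illustrative names:

```python
import hashlib
import json


def naive_key(pr_id, path, line, body) -> str:
    """Delimiter-joined key, shown only to illustrate the ambiguity."""
    return f"{pr_id}:{path}:{line}:{body}"


def idempotency_key(**fields) -> str:
    """Hash a canonical JSON encoding so field boundaries stay unambiguous."""
    canonical = json.dumps(fields, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Two different comments that the delimiter-joined form cannot tell apart:
first = dict(pr_id=1, path="a", line=1, body="2:x")
second = dict(pr_id=1, path="a:1", line=2, body="x")
```

Sorting the keys makes the result independent of argument order, so the same logical comment always hashes to the same key.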

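5. Roll back partial failures

A "submit review" action might post five inline comments and then a summary. If the summary step fails, you are left with five orphaned comments and a confused author. So within one unit of work, record which steps have succeeded and undo them on failure; or design the steps so that a half-finished unit can safely be re-run (combined with the idempotency from step 4, that often means you can simply retry the whole unit). If you do both, clear the idempotency keys of the comments you delete, otherwise a later retry will get the cached result back and never re-post them.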

# post_summary, delete_comment and AgentError are assumed to be defined elsewhere
def submit_review(pr_id, comments, summary):
    done = []
    try:
        for c in comments:
            done.append(post_comment(pr_id, **c))       # idempotent
        post_summary(pr_id, summary)
    except Exception as e:
        for res in done:
            delete_comment(res["id"])                   # compensating action
        raise AgentError("review rolled back", step="submit_review") from e

function submitReview(prId, comments, summary) {
  const done = [];
  try {
    for (const c of comments) done.push(postComment(prId, c.path, c.line, c.body)); // idempotent
    postSummary(prId, summary);
  } catch (e) {
    for (const res of done) deleteComment(res.id);      // compensating action
    throw new AgentError("review rolled back", "submit_review");
  }
}
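
To make the interaction between rollback and idempotency concrete, here is a self-contained sketch that runs against an in-memory stand-in for the review API. FakeGitHub, its methods, and the use of the comment body as the idempotency key are simplifications invented for this example:

```python
class FakeGitHub:
    """In-memory stand-in for a review API (hypothetical, for illustration)."""

    def __init__(self, fail_summary: bool = False):
        self.comments = {}
        self.fail_summary = fail_summary
        self._next_id = 1

    def create_comment(self, body: str) -> int:
        cid = self._next_id
        self._next_id += 1
        self.comments[cid] = body
        return cid

    def delete_comment(self, cid: int) -> None:
        self.comments.pop(cid, None)

    def create_summary(self, body: str) -> None:
        if self.fail_summary:
            raise RuntimeError("summary endpoint unavailable")


def submit_review(gh, posted_keys, comments, summary):
    done = []  # (key, comment id) for every step that succeeded in this unit
    try:
        for body in comments:
            if body in posted_keys:      # already posted by an earlier attempt
                continue
            cid = gh.create_comment(body)
            posted_keys[body] = cid
            done.append((body, cid))
        gh.create_summary(summary)
    except Exception:
        # Compensate in reverse order and forget the keys, so a retry re-posts.
        for key, cid in reversed(done):
            gh.delete_comment(cid)
            posted_keys.pop(key, None)
        raise
```

After a failed summary, both the fake PR and posted_keys are empty again, so retrying the whole unit later posts each comment exactly once.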

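6. Add a human checkpoint

Some actions are too costly to leave entirely to automation. Before any action that is irreversible or has a large blast radius (approving and auto-merging a PR, posting to a public repository), pause and ask for explicit approval. Microsoft's "Building Trustworthy AI Agents" lesson describes this as keeping a human in the loop for consequential decisions. In practice it is a gate inside can_use_tool that the agent cannot pass without an out-of-band "yes": on refusal, return PermissionResultDeny, and the model treats it as an observation and moves on to something else. Merge this snippet into the can_use_tool from step 1.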

HIGH_RISK = {"approve_and_merge", "force_push", "post_to_public_repo"}

async def can_use_tool(tool_name, tool_input, context):
    if tool_name in HIGH_RISK:
        if not await approvals.request(run_id, tool_name, tool_input):   # blocks until a human says yes/no
            return PermissionResultDeny(message="action declined by reviewer")
    # ... the schema / domain checks from step 1 go here ...
    return PermissionResultAllow()

const HIGH_RISK = new Set(["approve_and_merge", "force_push", "post_to_public_repo"]);

async function canUseTool(toolName, toolInput, context) {
  if (HIGH_RISK.has(toolName)) {
    if (!(await approvals.request(runId, toolName, toolInput))) {        // blocks until a human says yes/no
      return { behavior: "deny", message: "action declined by reviewer" };
    }
  }
  // ... the schema / domain checks from step 1 go here ...
  return { behavior: "allow", updatedInput: toolInput };
}

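Skill learned: "Fall-proof". When Xiaoman hits a tool error, a hang, or bad arguments, it no longer crashes: it takes the failure in as an observation, tries another approach, and stops to wait for your nod before a dangerous action.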

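How to verify

Kill a tool mid-call (point it at an unreachable host). Confirm that after the timeout the run reports a clean ok: false observation, the model adapts, and the process does not hang.
Feed it a hallucinated argument (a pr_id that belongs to a different PR). Confirm that validation rejects it with a readable message and the model recovers on the next turn.
Trigger post_comment twice with exactly the same input. Confirm that the idempotency key blocks the duplicate and both calls return the same comment id.
Force post_summary to fail and confirm that the inline comments are rolled back and the PR stays clean.

Skill learned: "Confirm it can get back up". You can deliberately cause a timeout, feed a hallucinated argument, and trigger the same comment post twice, then watch it retry, correct itself, avoid duplicates, and roll back cleanly after a failure partway through.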

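Principles

One idea runs through the whole chapter: the agent loop should treat the mess of the real world as input, not as an excuse to crash. Validation, timeouts, and retries turn that mess into structured observations the model can reason about; idempotency and rollback make side effects safe to repeat; human gates bound the blast radius of actions you do not fully trust yet. Each one is a small policy. Together they are the difference between a demo and a system you can leave running for the long haul.

Summary

The reviewer can now withstand the four failure modes, using six practices: validate before acting, time out and retry on transient errors, feed every failure back as an observation, never post a comment twice, roll back half-finished work, and stop to wait for a human before dangerous actions. Combined with the observability from Chapter 9, you can both see failures and absorb them.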

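Common pitfalls

Retrying non-idempotent actions. Without a key, every retry repeats the side effect. Add idempotency first, then retries.
Retrying permanent errors. A 400 retried three times fails the same way three times. Retry only retryable errors, and back off when you do.
Silently swallowing errors. Always surface failures to both the trace and the model. A silent failure is the bug that costs you hours later.
No step cap. A model that keeps retrying a doomed call will loop until the budget is gone. Cap both loop steps and retry attempts.
No human gate on destructive steps. Some actions (auto-merge, force push) are too costly to leave entirely to automation.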
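
The second pitfall comes down to a classification step before each retry. A rough sketch for HTTP-backed tools follows; the status set reflects common practice, not anything this chapter or the SDK prescribes, and both function names are illustrative:

```python
# Statuses that commonly indicate a transient condition (an assumption; tune per API).
RETRYABLE_STATUS = {408, 429, 500, 502, 503, 504}


def is_retryable(status: int) -> bool:
    """True for timeouts, rate limits, and server-side failures."""
    return status in RETRYABLE_STATUS


def backoff_delays(attempts: int, base: float = 1.0, cap: float = 8.0) -> list:
    """Upper bounds of the exponential backoff schedule, before jitter is added.

    With N attempts there are N - 1 waits, one between each pair of attempts.
    """
    return [min(base * 2 ** i, cap) for i in range(attempts - 1)]
```

A 400 is classified as permanent and fails immediately, while a 503 gets the capped schedule; with five attempts the waits top out at the 8-second cap used in step 2.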

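After its first fall, Xiaoman picked itself up: it retried, degraded, and did not crash. But you have also noticed it is stubborn. Even after failing, it insists in wordy phrasing that everything is fine, and it cannot quite hide its unease. Never-Fall Slope is lit.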


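Just lit: Never-Fall Slope · Map lit 11 / 16

It is capable enough now, capable enough to cause trouble. Next stop: the Boundary Marker Forbidden Zone.

Sources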

Anthropic Claude Docs: Tool use · official
Microsoft AI Agents for Beginners: Building Trustworthy AI Agents · official

Next chapter · Chapter 11: Guardrails and Sandboxing