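Stage 7: Going to Production (Pydantic AI Tutorial)

1. Testing: Run Tests Without Spending Money

The pain point: testing AI code is expensive

Every test run calls the API, which is slow, costs money, and gives nondeterministic results (the same question may get a different answer each time). What can you do?

Pydantic AI provides two "fake models" built specifically for writing tests:

TestModel: automatically generates fake data that matches the expected format, without calling a real API
FunctionModel: you write a function that decides what to return

Step 1: prevent accidental calls to the real API during tests

Add this line to your test file so that you won't spend money even if you forget to mock the model: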

from pydantic_ai import models

# Block all real model requests
models.ALLOW_MODEL_REQUESTS = False

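In plain terms: it's like putting a lock on your wallet. Even if the code says openai:gpt-4o, tests won't call OpenAI; they raise an error to alert you instead.

TestModel: the simplest way to test

TestModel automatically generates fake data based on your output_type, so you don't have to decide what it returns: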

import pytest
from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

# Let pytest run the async tests below (uses the anyio pytest plugin)
pytestmark = pytest.mark.anyio

class CityInfo(BaseModel):
    name: str
    country: str
    population: int

agent = Agent('openai:gpt-4o', output_type=CityInfo)

async def test_city_agent():
    with agent.override(model=TestModel()):
        result = await agent.run('Tell me about Beijing')
        # TestModel automatically generates data that matches the CityInfo schema
        assert isinstance(result.output, CityInfo)
        assert isinstance(result.output.name, str)
        assert isinstance(result.output.population, int)

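What does TestModel do? It inspects the JSON Schema of your output_type and generates data that satisfies it. No AI is involved; the data is produced purely by code, so it is fast and deterministic.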
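To make "generated from the schema, no AI involved" concrete, here is a minimal standard-library sketch. `placeholder_for` is a hypothetical helper written for this tutorial, not TestModel's actual implementation, and it only covers a handful of JSON Schema types.

```python
from typing import Any


def placeholder_for(schema: dict[str, Any]) -> Any:
    """Return a deterministic placeholder that satisfies a (simplified) JSON Schema."""
    kind = schema.get('type')
    if kind == 'object':
        props = schema.get('properties', {})
        return {name: placeholder_for(sub) for name, sub in props.items()}
    if kind == 'array':
        return [placeholder_for(schema.get('items', {}))]
    if kind == 'string':
        return 'a'
    if kind == 'integer':
        return 0
    if kind == 'number':
        return 0.0
    if kind == 'boolean':
        return False
    return None  # unknown or missing type


# Hand-written schema mirroring the CityInfo model (an assumption for this sketch)
city_schema = {
    'type': 'object',
    'properties': {
        'name': {'type': 'string'},
        'country': {'type': 'string'},
        'population': {'type': 'integer'},
    },
    'required': ['name', 'country', 'population'],
}

fake_city = placeholder_for(city_schema)
print(fake_city)  # {'name': 'a', 'country': 'a', 'population': 0}
```

Because the same schema always yields the same value, tests built on this kind of generator never flake, which is the property the paragraph above relies on.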

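Customizing the returned text

# When output_type is plain text (the default)
agent = Agent('openai:gpt-4o')

async def test_custom_output_text():
    with agent.override(model=TestModel(custom_output_text='Hello world')):
        result = await agent.run('Say hello')
        assert result.output == 'Hello world'

TestModel also calls tools

TestModel does not skip tools to save effort: by default it calls every registered tool, and for a plain-text agent it encodes the tool return values as JSON and uses that as the response. This means your tool logic gets exercised too.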

FunctionModel: fine-grained control

If you want precise control over what the model returns (for example, to test specific edge cases), use FunctionModel:

from pydantic_ai import ModelMessage, ModelResponse, TextPart
from pydantic_ai.models.function import AgentInfo, FunctionModel

def my_fake_model(
    messages: list[ModelMessage],
    info: AgentInfo,
) -> ModelResponse:
    # Decide what to return based on the input
    user_msg = str(messages[-1])
    if 'weather' in user_msg:
        return ModelResponse(parts=[TextPart(content='Sunny today')])
    else:
        return ModelResponse(parts=[TextPart(content="I don't know")])

async def test_weather_question():
    with agent.override(model=FunctionModel(my_fake_model)):
        result = await agent.run("What's the weather like in Beijing?")
        assert 'Sunny' in result.output

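FunctionModel vs TestModel:
TestModel: auto-generated, low effort, suitable for most tests
FunctionModel: manual control, flexible, suitable for testing specific scenarios

agent.override(): swap out anything

override() can replace not only the model but also dependencies and tools: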

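from pydantic_ai import Agent
from pydantic_ai.models.test import TestModel

# MyDeps and fake_deps stand for your own dependency type and a test instance of it
agent = Agent('openai:gpt-4o', deps_type=MyDeps)

# Replace the model
with agent.override(model=TestModel()):
    ...

# Replace the dependencies (e.g. swap the real database for an in-memory one)
with agent.override(deps=fake_deps):
    ...

# Replace several things at once
with agent.override(model=TestModel(), deps=fake_deps):
    ...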

capture_run_messages(): peek at the conversation

Want to know exactly what the Agent and the model said to each other? Use capture_run_messages():

from pydantic_ai import capture_run_messages

async def test_inspect_messages():
    with capture_run_messages() as messages:
        with agent.override(model=TestModel()):
            result = await agent.run('Hello')

    # messages is a list containing every message exchanged
    # [ModelRequest(...), ModelResponse(...), ...]
    for msg in messages:
        print(type(msg).__name__, msg.parts)

pytest fixture pattern (recommended)

Turn the override into a fixture so that every test that requests it runs against the fake model:

import pytest
from pydantic_ai.models.test import TestModel

# The agent from your application code
from myapp.agent import my_agent

# Let pytest run the async tests below (uses the anyio pytest plugin)
pytestmark = pytest.mark.anyio

@pytest.fixture
def mock_agent():
    with my_agent.override(model=TestModel()):
        yield

async def test_feature_a(mock_agent):
    result = await my_agent.run('Test input A')
    assert result.output is not None

async def test_feature_b(mock_agent):
    result = await my_agent.run('Test input B')
    assert result.output is not None

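2. Evaluation: Scoring the AI

Unit tests vs. evaluation

Aspect	Unit tests	Evaluation (Evals)
Target	Deterministic code	Probabilistic AI output
Result	Pass/fail	Score from 0.0 to 1.0
Frequency	Run on every commit	Run periodically or when the model is upgraded
Goal	Code has no bugs	AI answers are good enough

In plain terms: unit tests check whether the code is correct; evaluation checks how good the AI is. Correctness is a yes-or-no answer, while quality is a score.

Core concepts of the evaluation framework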

Dataset
  → Case 1 (test case)
      → inputs: "What is the capital of France?"
      → expected_output: "Paris"
  → Case 2 (test case)
      → inputs: "2 + 2 = ?"
      → expected_output: "4"
  → Evaluators
      → EqualsExpected (exact match against expected_output)
      → MyCustomEval (custom evaluator)
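The structure above can be sketched with plain dataclasses to show how cases, evaluators, and the task fit together. `MiniCase`, `MiniDataset`, and `canned_task` are hypothetical names invented for this illustration; pydantic-evals has its own richer classes, and this is not how they are implemented.

```python
from dataclasses import dataclass, field
from typing import Callable

# A scoring function: (output, expected_output) -> score between 0.0 and 1.0
ScoreFn = Callable[[str, str], float]


@dataclass
class MiniCase:
    name: str
    inputs: str
    expected_output: str


def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0


def contains_expected(output: str, expected: str) -> float:
    return 1.0 if expected.lower() in output.lower() else 0.0


@dataclass
class MiniDataset:
    cases: list[MiniCase]
    evaluators: dict[str, ScoreFn] = field(default_factory=dict)

    def evaluate(self, task: Callable[[str], str]) -> dict[str, dict[str, float]]:
        """Run the task on every case and score each output with every evaluator."""
        report: dict[str, dict[str, float]] = {}
        for case in self.cases:
            output = task(case.inputs)
            report[case.name] = {
                label: score(output, case.expected_output)
                for label, score in self.evaluators.items()
            }
        return report


def canned_task(inputs: str) -> str:
    # Stand-in for a real agent call so the sketch runs offline
    answers = {
        'What is the capital of France?': 'The capital of France is Paris.',
        '2 + 2 = ?': '4',
    }
    return answers.get(inputs, 'I do not know')


dataset = MiniDataset(
    cases=[
        MiniCase('capital', 'What is the capital of France?', 'Paris'),
        MiniCase('math', '2 + 2 = ?', '4'),
    ],
    evaluators={'exact': exact_match, 'contains': contains_expected},
)
report = dataset.evaluate(canned_task)
print(report)
```

Note how the "capital" case scores 0.0 on exact match but 1.0 on containment: the answer is right but phrased as a sentence, which is why free-text outputs usually need something looser than strict equality.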

Writing a simple evaluation

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

# Step 1: define an evaluator
class ContainsAnswer(Evaluator):
    """Check whether the output contains the expected answer."""

    async def evaluate(
        self, ctx: EvaluatorContext
    ) -> float:
        if ctx.expected_output is None:
            return 0.5
        if ctx.expected_output.lower() in ctx.output.lower():
            return 1.0
        return 0.0

# Step 2: define the test cases
dataset = Dataset(
    cases=[
        Case(
            name='capital question',
            inputs='What is the capital of France?',
            expected_output='Paris',
        ),
        Case(
            name='math question',
            inputs='What is 2 + 2?',
            expected_output='4',
        ),
        Case(
            name='general knowledge',
            inputs='What is the chemical formula of water?',
            expected_output='H2O',
        ),
    ],
    evaluators=[ContainsAnswer()],
)

# Step 3: define the task to evaluate
from pydantic_ai import Agent

agent = Agent('openai:gpt-4o')

async def my_task(inputs: str) -> str:
    result = await agent.run(inputs)
    return result.output

# Step 4: run the evaluation
report = dataset.evaluate_sync(my_task)
report.print(include_input=True, include_output=True)

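The output looks something like this:

┌───────────────────┬────────────────────────────────────────┬───────┬─────────────────────────┐
│ Case              │ Input                                  │ Score │ Output                  │
├───────────────────┼────────────────────────────────────────┼───────┼─────────────────────────┤
│ capital question  │ What is the capital of France?         │ 1.0   │ Paris is the capital... │
│ math question     │ What is 2 + 2?                         │ 1.0   │ 4                       │
│ general knowledge │ What is the chemical formula of water? │ 1.0   │ H2O                     │
└───────────────────┴────────────────────────────────────────┴───────┴─────────────────────────┘

Built-in evaluators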

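from pydantic_evals import Dataset
from pydantic_evals.evaluators import (
    IsInstance,      # check the output type
    EqualsExpected,  # output must equal expected_output exactly
    Contains,        # output must contain a given value
)

dataset = Dataset(
    cases=[...],
    evaluators=[
        IsInstance(type_name='str'),
        EqualsExpected(),
    ],
)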

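LLM judge: let an AI evaluate an AI

Some answers can't be judged as simply right or wrong (writing quality, for example), so you can have another AI do the grading: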

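from pydantic_evals import Dataset
from pydantic_evals.evaluators import LLMJudge

judge = LLMJudge(
    # The rubric describes what a good answer looks like
    rubric='The answer is accurate, addresses the question, and agrees with the expected answer.',
    model='openai:gpt-4o',
    include_input=True,            # show the judge the question
    include_expected_output=True,  # show the judge the expected answer
)
# By default the judge reports a pass/fail verdict together with its reason

dataset = Dataset(
    cases=[...],
    evaluators=[judge],
)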

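Note: LLM judges aren't 100% accurate either, but they are much faster than reviewing outputs in bulk by hand. It is best to use deterministic evaluators and LLM judges together.

3. Logfire Observability: See What the AI Is Doing

Why do you need monitoring?

AI applications differ from traditional applications:

Slow: a single request takes anywhere from a few seconds to tens of seconds
Expensive: every request costs money
Nondeterministic: the same question may get different answers
Black box: you don't know why it answered the way it did

Logfire is an observability platform developed by the Pydantic team. It is built on the OpenTelemetry standard and helps you see what is happening inside your AI application.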
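For readers new to tracing, the following standard-library sketch shows the shape of the data such a platform collects: nested spans, each with a name, a duration, and attributes. `MiniTracer` and `Span` are hypothetical classes for illustration only; they are not Logfire's or OpenTelemetry's API, and the token count is a made-up number.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass
class Span:
    name: str
    attributes: dict[str, Any] = field(default_factory=dict)
    duration_s: float = 0.0
    children: list['Span'] = field(default_factory=list)


class MiniTracer:
    def __init__(self) -> None:
        self.roots: list[Span] = []
        self._stack: list[Span] = []

    @contextmanager
    def span(self, name: str, **attributes: Any) -> Iterator[Span]:
        """Open a span nested under whichever span is currently active."""
        current = Span(name, dict(attributes))
        siblings = self._stack[-1].children if self._stack else self.roots
        siblings.append(current)
        self._stack.append(current)
        start = time.perf_counter()
        try:
            yield current
        finally:
            current.duration_s = time.perf_counter() - start
            self._stack.pop()


tracer = MiniTracer()
with tracer.span('agent run', agent='geography'):
    with tracer.span('model request', model='fake-model') as request:
        request.attributes['total_tokens'] = 42  # made-up value for the sketch
    with tracer.span('tool call', tool='get_weather'):
        pass

run = tracer.roots[0]
print(run.name, [child.name for child in run.children])
```

A trace like this answers the "slow", "expensive", and "black box" questions at once: durations show where time went, token attributes show where money went, and the nesting shows which step triggered which.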

Enable monitoring in three steps

import logfire
from pydantic_ai import Agent

# Step 1: configure Logfire
logfire.configure()

# Step 2: enable Pydantic AI instrumentation
logfire.instrument_pydantic_ai()

# Step 3: write code as usual; monitoring takes effect automatically
agent = Agent('openai:gpt-4o', instructions='Be concise.')
result = agent.run_sync('Where does the "Hello World" meme come from?')
print(result.output)

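After running it, log in to the Logfire console to see the full call trace: which Agent ran, which tools were called, how many tokens were used, and how much it cost.

Installation and authentication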

# Install (the full pydantic-ai package usually already includes it)
pip install 'pydantic-ai[logfire]'

# Authenticate (obtain a token)
logfire auth

# Create a project
logfire projects new

Monitoring HTTP requests too

Want to see the raw HTTP requests and responses sent to the model? Add one line:

import logfire

logfire.configure()
logfire.instrument_pydantic_ai()
logfire.instrument_httpx(capture_all=True)  # capture all HTTP requests

Privacy controls

In production you may not want to send users' questions to the monitoring platform:

from pydantic_ai import Agent, InstrumentationSettings

settings = InstrumentationSettings(
    include_content=False,  # do not record conversation content
)

agent = Agent('openai:gpt-4o', instrument=settings)

Using other monitoring backends

Logfire is built on OpenTelemetry, so it can connect to any OTel-compatible backend:

import os

import logfire

# Send to a self-hosted OTel Collector
os.environ['OTEL_EXPORTER_OTLP_ENDPOINT'] = 'http://localhost:4318'

logfire.configure(send_to_logfire=False)  # send only to OTel, not to Logfire
logfire.instrument_pydantic_ai()

Supported backends include Langfuse, Arize, SigNoz, mlflow, Braintrust, W&B Weave, and others.

Global vs. per-Agent

# Option 1: enable globally (every Agent is monitored)
logfire.instrument_pydantic_ai()

# Option 2: monitor only a specific Agent
agent = Agent('openai:gpt-4o', instrument=True)

# Option 3: instrument every Agent (no logfire required)
Agent.instrument_all()

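4. Durable Execution: Pick Up Where You Left Off After a Failure

What's the problem?

AI requests fail often: network timeouts, API rate limits, service outages. If your workflow runs 10 steps and step 8 fails, you certainly don't want to start over from scratch.

Durable execution saves the result of each step so that, after a failure, the workflow resumes from the last successful step.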
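The checkpoint-and-resume idea can be sketched with the standard library alone. `run_with_checkpoints` is a hypothetical helper for illustration; real engines such as Temporal, Prefect, and DBOS use far more robust mechanisms than a JSON file.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_with_checkpoints(steps: list[Callable[[], str]], path: Path) -> list[str]:
    """Run steps in order, persisting each result so a rerun skips finished steps."""
    done: list[str] = json.loads(path.read_text()) if path.exists() else []
    for index in range(len(done), len(steps)):
        done.append(steps[index]())       # may raise; earlier results stay on disk
        path.write_text(json.dumps(done))
    return done


calls = {'count': 0, 'fail_once': True}


def make_step(n: int) -> Callable[[], str]:
    def step() -> str:
        calls['count'] += 1
        if n == 2 and calls['fail_once']:
            calls['fail_once'] = False
            raise TimeoutError('simulated network timeout')
        return f'result-{n}'
    return step


steps = [make_step(n) for n in range(4)]
checkpoint = Path(tempfile.mkdtemp()) / 'checkpoint.json'

try:
    run_with_checkpoints(steps, checkpoint)   # fails at step 2
except TimeoutError:
    pass                                      # steps 0 and 1 are already saved

results = run_with_checkpoints(steps, checkpoint)  # resumes at step 2
print(results, calls['count'])
```

The step functions ran 5 times in total (3 in the first attempt, 2 in the second); restarting from scratch would have taken 7. With model calls that cost money, that difference is the whole point.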

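Three options

Option	Characteristics	Best for
Temporal	Replay-based, requires an external server	Enterprise-grade, high-reliability systems
Prefect	Python-native, cloud support	Data pipelines, ML workflows
DBOS	Database checkpoints, lightweight	Lightweight apps, quick start

Temporal example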

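import uuid

from temporalio import workflow
from temporalio.client import Client
from temporalio.worker import Worker

from pydantic_ai import Agent
from pydantic_ai.durable_exec.temporal import (
    AgentPlugin,
    PydanticAIPlugin,
    TemporalAgent,
)

# 1. Create the Agent
agent = Agent(
    'openai:gpt-4o',
    instructions='You are a geography expert.',
    name='geography',  # a name is required
)

# 2. Wrap it as a TemporalAgent
temporal_agent = TemporalAgent(agent)

# 3. Define the Temporal workflow
@workflow.defn
class GeographyWorkflow:
    @workflow.run
    async def run(self, prompt: str) -> str:
        result = await temporal_agent.run(prompt)
        return result.output

# 4. Start a worker and run the workflow
async def main():
    client = await Client.connect(
        'localhost:7233',
        plugins=[PydanticAIPlugin()],
    )
    # A worker must be polling the task queue, or the workflow never runs
    async with Worker(
        client,
        task_queue='default',
        workflows=[GeographyWorkflow],
        plugins=[AgentPlugin(temporal_agent)],
    ):
        handle = await client.start_workflow(
            GeographyWorkflow.run,
            'What is the capital of Mexico?',
            id=str(uuid.uuid4()),
            task_queue='default',
        )
        result = await handle.result()
        print(result)  # e.g. Mexico City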

Prefect example

from pydantic_ai import Agent
from pydantic_ai.durable_exec.prefect import PrefectAgent

agent = Agent(
    'openai:gpt-4o',
    instructions='You are a geography expert.',
    name='geography',
)

prefect_agent = PrefectAgent(agent)

async def main():
    result = await prefect_agent.run('What is the capital of Mexico?')
    print(result.output)  # e.g. Mexico City

Prefect is the simplest: it needs no external server, you just wrap the agent and use it. It automatically turns model requests and tool calls into Prefect Tasks, with retries and caching.
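What "a task with retries and caching" means can be shown with a small decorator. This is a generic standard-library sketch, not Prefect's implementation or API; the `task` decorator and `flaky_model_request` function are hypothetical, and the cache here is a simple in-memory dict keyed on the arguments.

```python
import functools
from typing import Any, Callable


def task(retries: int = 2) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Wrap a function so it is retried on failure and its results are cached."""
    def decorator(fn: Callable[..., Any]) -> Callable[..., Any]:
        cache: dict[tuple, Any] = {}

        @functools.wraps(fn)
        def wrapper(*args: Any) -> Any:
            if args in cache:                  # cached: skip the call entirely
                return cache[args]
            for attempt in range(retries + 1):
                try:
                    cache[args] = fn(*args)
                    return cache[args]
                except Exception:
                    if attempt == retries:     # out of retries: give up
                        raise
        return wrapper
    return decorator


attempts = {'count': 0}


@task(retries=2)
def flaky_model_request(prompt: str) -> str:
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise ConnectionError('simulated rate limit')
    return f'answer to {prompt!r}'


first = flaky_model_request('capital of Mexico?')   # fails twice, then succeeds
second = flaky_model_request('capital of Mexico?')  # served from the cache
print(first, attempts['count'])
```

The second call never reaches the underlying function, so a rerun of a partially completed flow does not pay for the same model request twice.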

DBOS example

from dbos import DBOS, DBOSConfig
from pydantic_ai import Agent
from pydantic_ai.durable_exec.dbos import DBOSAgent

# Configure the database (SQLite or PostgreSQL)
dbos_config: DBOSConfig = {
    'name': 'my_ai_app',
    'system_database_url': 'sqlite:///app.sqlite',
}
DBOS(config=dbos_config)

agent = Agent(
    'openai:gpt-4o',
    instructions='You are a geography expert.',
    name='geography',
)

dbos_agent = DBOSAgent(agent)

async def main():
    DBOS.launch()
    result = await dbos_agent.run('What is the capital of Mexico?')
    print(result.output)  # e.g. Mexico City

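How to choose among the three?

Need enterprise-grade reliability? → Temporal

Already using Prefect? → Prefect

Want to get started quickly? → DBOS (SQLite is enough)

Don't need durability? → Use the Agent directly, no wrapper needed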

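5. CLI and Web UI: Try It Out Quickly

The clai command-line tool

Chat with the AI directly in the terminal, no code required:

# Install
pip install clai

# Ask a single question
clai "What is the capital of France?"

# Specify a model
clai --model anthropic:claude-sonnet-4-5 "Hello"

# Interactive mode
clai

Special commands in interactive mode:

Command	Effect
`/exit`	Exit
`/markdown`	Show the last answer as Markdown
`/multiline`	Toggle multi-line input mode
`/cp`	Copy the last answer to the clipboard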

Using your own Agent

# my_agent.py
from pydantic_ai import Agent

agent = Agent(
    'openai:gpt-4o',
    instructions='You may only answer in the form of poetry.',
)

# Use it from the command line (shell)
clai --agent my_agent:agent "What's the weather like today?"

Launching the CLI from code

from pydantic_ai import Agent

agent = Agent('openai:gpt-4o', instructions='Answer in Chinese.')
agent.to_cli_sync()  # enter interactive mode

Web UI: chat in the browser

# Start the Web UI
clai web -m openai:gpt-4o

# Offer several models to choose from
clai web -m openai:gpt-4o -m anthropic:claude-sonnet-4-5

# Add built-in tools
clai web -m openai:gpt-4o -t web_search -t code_execution

# Add system instructions
clai web -m openai:gpt-4o -i 'You are a programming assistant'

Open http://127.0.0.1:7932 to chat in the browser.

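Creating a Web UI from code

# my_app.py
from pydantic_ai import Agent

agent = Agent('openai:gpt-4o', instructions='You are a programming assistant.')

@agent.tool_plain
def run_python(code: str) -> str:
    """Evaluate a Python expression."""
    # Demo only: eval() on model-generated input is unsafe outside a sandbox
    try:
        return str(eval(code))
    except Exception as e:
        return f'Error: {e}'

# Create the web application
app = agent.to_web(
    models=['openai:gpt-4o', 'anthropic:claude-sonnet-4-5'],
)

# Start the server (shell command, run from the terminal)
uvicorn my_app:app --host 127.0.0.1 --port 7932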

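Summary

Production concern	Tools	One-line summary
Testing	TestModel / FunctionModel	Run deterministic tests without spending money
Evaluation	pydantic-evals	Score the quality of AI output
Monitoring	Logfire / OTel	See what the AI is doing internally
Fault tolerance	Temporal / Prefect / DBOS	Resume after a failure
Try-out	clai / Web UI	Quick experiments and demos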

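Pre-launch checklist

Production Readiness Checklist

All tests pass (TestModel + pytest)
ALLOW_MODEL_REQUESTS = False (so tests cannot spend money)
Evaluation scores meet the bar (key scenarios > 0.8)
Monitoring is enabled (at minimum, track token usage and error rate)
Important workflows are fault-tolerant (Temporal / Prefect / DBOS)
Sensitive content is redacted (InstrumentationSettings)
A fallback plan exists (what happens when the API is down?)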
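The last checklist item, a fallback plan, is the only one the tutorial has not illustrated. One common shape is a provider chain ending in a canned reply. The sketch below uses plain functions as stand-in providers; `ask_with_fallback`, `primary`, and `backup` are hypothetical names. Pydantic AI also ships a FallbackModel for chaining models, which this sketch does not use.

```python
from typing import Callable

# A provider is a (name, callable) pair; the callable takes a prompt and returns an answer
Provider = tuple[str, Callable[[str], str]]


def ask_with_fallback(
    prompt: str,
    providers: list[Provider],
    canned_reply: str,
) -> tuple[str, str]:
    """Try each provider in order and return (source, answer).

    If every provider fails, return the canned reply instead of raising.
    """
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:
            print(f'{name} failed: {exc}')  # in production, log this and alert
    return 'canned', canned_reply


def primary(prompt: str) -> str:
    raise ConnectionError('simulated outage')


def backup(prompt: str) -> str:
    return f'backup answer to: {prompt}'


source, answer = ask_with_fallback(
    'What is the capital of France?',
    [('primary', primary), ('backup', backup)],
    canned_reply='Sorry, the assistant is temporarily unavailable.',
)
all_down = ask_with_fallback(
    'hi',
    [('primary', primary)],
    canned_reply='Sorry, the assistant is temporarily unavailable.',
)
print(source, '|', answer)
```

Returning the source alongside the answer makes it easy to record in monitoring how often the fallback path is taken.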

Hands-on exercises

Exercise 1: test your Agent

import pytest
from pydantic import BaseModel
from pydantic_ai import Agent, models
from pydantic_ai.models.test import TestModel

# Let pytest run the async test below (uses the anyio pytest plugin)
pytestmark = pytest.mark.anyio

# Make sure tests cannot spend money
models.ALLOW_MODEL_REQUESTS = False

class WeatherResponse(BaseModel):
    city: str
    temperature: float
    description: str

agent = Agent('openai:gpt-4o', output_type=WeatherResponse)

@agent.tool_plain
def get_weather(city: str) -> str:
    """Get the weather."""
    return f'{city}: 25°C, sunny'

async def test_weather_agent():
    with agent.override(model=TestModel()):
        result = await agent.run("What's the weather like in Beijing?")
        assert isinstance(result.output, WeatherResponse)

Exercise 2: a simple evaluation

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext

class ScoreRelevance(Evaluator):
    async def evaluate(self, ctx: EvaluatorContext) -> float:
        # Simple check: what fraction of the expected keywords appear in the answer?
        if not ctx.expected_output:
            return 0.0
        keywords = [kw.strip() for kw in ctx.expected_output.split(',')]
        hits = sum(1 for kw in keywords if kw in ctx.output)
        return hits / len(keywords)

dataset = Dataset(
    cases=[
        Case(
            name='Python intro',
            inputs='Describe Python in one sentence',
            expected_output='programming,language,concise',
        ),
    ],
    evaluators=[ScoreRelevance()],
)

# report = dataset.evaluate_sync(my_task)
# report.print()
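The scoring arithmetic in this exercise can be checked without any model call. The helper below, `keyword_score`, is a hypothetical standalone function that mirrors the keyword logic; the sample answers are made up for the illustration.

```python
def keyword_score(expected_keywords: str, output: str) -> float:
    """Return the fraction of comma-separated keywords that appear in the output."""
    keywords = [kw.strip() for kw in expected_keywords.split(',') if kw.strip()]
    if not keywords:
        return 0.0  # nothing to look for, so nothing can match
    hits = sum(1 for kw in keywords if kw in output)
    return hits / len(keywords)


full = keyword_score(
    'programming,language,concise',
    'Python is a concise programming language.',
)
partial = keyword_score('programming,language,concise', 'Python is a language.')
print(full, round(partial, 2))  # 1.0 0.33
```

Three of three keywords gives 1.0 and one of three gives about 0.33, so against the checklist's 0.8 bar the second answer would fail. Keyword matching is case-sensitive and literal here, which is one reason to pair it with an LLM judge.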

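Next: Stage 8 · Integration and Practice

Congratulations on finishing the production stage! You now have the knowledge to deploy AI applications to production safely and reliably.
Next, you will consolidate what you have learned through complete application case studies (a Slack bot, the AG-UI protocol, and more).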