Elasticsearch: the LLM-as-judge pattern (quality control)

March 26, 2026   |   by mebius
  • Task: validate an agent's output with a second model
  • Pattern: generate and evaluate

In this Elastic Workflow example, the agent generates a response with one model and then evaluates its quality with a second model, adding a validation layer on top of the result. Elastic Workflows, the automation engine built into Elasticsearch, lets developers combine reliable scripted automation with AI-driven steps for tasks that require reasoning.

name: LLM-as-judge demo (compat mode)
enabled: true

inputs:
  - name: question
    type: string
    required: true
  - name: context
    type: string
    required: false
    default: ""

triggers:
  - type: manual

steps:
  - name: generate_answer
    type: ai.prompt
    connector-id: 921257e8-8037-48fc-beee-be1c7e6d23d3 #Anthropic Claude Sonnet 4.5
    with:
      temperature: 0.2
      prompt: |
        You are a helpful assistant.

        Task: Answer the user's question using ONLY the provided context.
        If the context is empty or insufficient, output exactly:
        INSUFFICIENT_CONTEXT: 

        Question:
        {{inputs.question}}

        Context:
        {{inputs.context}}

  - name: judge_answer
    type: ai.prompt
    connector-id: 9b54f87b-2e95-4211-9507-80097f1da325 # OpenAI 4.1
    with:
      temperature: 0.2
      prompt: |
        You are the judge model.

        Evaluate the generator answer against rules:
        1) Must be grounded in the provided context (no unsupported claims).
        2) If context insufficient, must output:
           INSUFFICIENT_CONTEXT: 
        3) Must be clear and directly address the question.

        Return EXACTLY one token: PASS or FAIL (no punctuation, no extra text).

        Question:
        {{inputs.question}}

        Context:
        {{inputs.context}}

        Generator answer:
        {{steps.generate_answer.output.content}}

  - name: route_on_verdict
    type: if
    condition: "steps.judge_answer.output.content: PASS"
    steps:
      - name: approved
        type: console
        with:
          message: |
            ✅ APPROVED

            Answer:
            {{steps.generate_answer.output.content}}
    else:
      - name: rejected
        type: console
        with:
          message: |
            ❌ REJECTED
            Judge:
            {{steps.judge_answer.output.content}}

            Answer:
            {{steps.generate_answer.output.content}}

As shown above, the workflow uses two different LLM connectors: an Anthropic Claude Sonnet 4.5 connector for the generate_answer step and an OpenAI 4.1 connector for the judge_answer step.

You can obtain the connector IDs with the following request:

curl --request GET "http://localhost:5601/api/actions/connectors" \
  --header "Authorization: ApiKey Ty1Jank1d0ItU3Vvem5YZkhXNy06ZzV6czZvQ1VyaS1fODA5NEpxMkV4QQ==" \
  --header "kbn-xsrf: true"

Replace the API key above with one from your own installation.
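
If jq is installed, you can filter the response down to just the fields needed for the connector-id lines in the workflow. A minimal sketch, assuming the standard connectors API response shape (each entry carries id, name and connector_type_id) and using a placeholder API key:

# list every connector and keep only its id, name and type
curl -s --request GET "http://localhost:5601/api/actions/connectors" \
  --header "Authorization: ApiKey <your-api-key>" \
  --header "kbn-xsrf: true" \
  | jq '.[] | {id, name, connector_type_id}'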

Now run the workflow with the following input:

{
  "question": "What is the SLA for priority-1 incidents?",
  "context": "Support Policy v3.2nnPriority 1 (P1): Critical outage. nInitial Response SLA: 1 hour. InResolution Target: 4 hours. InCoverage: 24/7."
}
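
With this input, the generate_answer step renders its prompt by substituting the two {{…}} placeholders, so the first model receives roughly the following (an illustrative rendering with the \n escapes expanded, not captured output):

You are a helpful assistant.

Task: Answer the user's question using ONLY the provided context.
If the context is empty or insufficient, output exactly:
INSUFFICIENT_CONTEXT: 

Question:
What is the SLA for priority-1 incidents?

Context:
Support Policy v3.2

Priority 1 (P1): Critical outage.
Initial Response SLA: 1 hour.
Resolution Target: 4 hours.
Coverage: 24/7.

Since the context states a 1-hour initial-response SLA for P1 incidents, the generator can answer from it; the judge should then return PASS and the approved branch prints the answer.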

Let's try another test:

{
  "question": "What is the longest river in China?",
  "context": tgcode"Support Policy v3.2nnPriority 1 (P1): Critical outage. nInitial Response SLA: 1 hour. InResolution Target: 4 hours. InCoverage: 24/7."
}

In this question, the answer clearly cannot be found in the context. Let's see how the execution turns out.

The run above prints INSUFFICIENT_CONTEXT, meaning the context cannot answer our question. This is the first model's response. The second model then checks whether the first LLM's answer is correct: if it is, the verdict is PASS, otherwise FAIL.

Clearly the first LLM handled the situation correctly, so it passes the second LLM's check and the workflow takes the approved branch for the final output.
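
Based on the route_on_verdict templates, the approved branch prints a message along these lines (illustrative only; whatever the generator emitted after INSUFFICIENT_CONTEXT: is passed through verbatim):

✅ APPROVED

Answer:
INSUFFICIENT_CONTEXT: …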

The full demo video is available at: https://www.bilibili.com/video/BV1GjPez5EZY/

Happy learning!
