Build RAG quickly with minimal code in Elastic 8.15

September 5, 2024   |   by mebius

Author: Han Xiang Choong, Elastic


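Elastic 8.15 has been released, and semantic search is easier to set up than ever.

We will walk through how to do all of the following in 15 minutes:

  1. Store your documents in a data storage service, such as an AWS S3 bucket
  2. Set up an Elasticsearch data connector
  3. Upload an embedding model with the eland library and set up an inference API in Elastic
  4. Connect it to an index that uses the semantic_text data type
  5. Add your inference API to that index
  6. Sync the data connector
  7. Use Elastic Playground right away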

You will need:

  1. An Elastic Cloud deployment updated to Elastic 8.15
  2. An S3 bucket
  3. An LLM API service (Anthropic, Azure, OpenAI, Gemini)

That's it! Let's get it done.

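Gathering data

To follow along with this demo, I have uploaded a zip file containing the data used here. It is roughly the first 60 pages of The Silmarillion, with each page as a separate PDF file. I am on a Lord of the Rings kick at the moment. Feel free to download it and upload it to your S3 bucket!

Large documents sometimes need to be split into individual pages, because the Elastic S3 data connector does not extract content from files larger than 10MB.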
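
A quick way to find out which files need splitting is to check their sizes before uploading. The sketch below is a hypothetical helper, not part of the connector: the function name is mine, and reading "10MB" as 10 * 1024 * 1024 bytes is an assumption.

```python
from pathlib import Path

# Assumption: "10MB" is read as 10 * 1024 * 1024 bytes.
SIZE_LIMIT_BYTES = 10 * 1024 * 1024


def files_over_limit(folder, limit=SIZE_LIMIT_BYTES):
    """Return the sorted names of files directly inside `folder` larger than `limit` bytes."""
    return sorted(
        path.name
        for path in Path(folder).iterdir()
        if path.is_file() and path.stat().st_size > limit
    )


if __name__ == "__main__":
    # "." stands in for the folder holding the files you plan to upload.
    for name in files_over_limit("."):
        print(f"{name} is over the limit and should be split")
```

Anything this reports is a candidate for page-by-page splitting.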

I used this Python script to split the PDF into individual pages:

import os
from PyPDF2 import PdfReader, PdfWriter

def split_pdf(input_pdf_path, output_folder, filename=''):
    # Create the output folder if it doesn't exist
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)

    # Open the PDF file
    with open(input_pdf_path, 'rb') as file:
        pdf_reader = PdfReader(file)
        
        # Iterate through each page
        for page_num in range(len(pdf_reader.pages)):
            pdf_writer = PdfWriter()
            pdf_writer.add_page(pdf_reader.pages[page_num])
            
            # Generate the output file name
            output_filename = f'{filename}pg_{page_num + 1}.pdf'
            output_path = os.path.join(output_folder, output_filename)
            
            # Save the page as a new PDF
            with open(output_path, 'wb') as output_file:
                pdf_writer.write(output_file)
            
            print(f'Saved {output_filename}')

# Example usage
input_pdf = 'The Silmarillion (Illustrated) - J. R. R. Tolkien; Ted Nasmith;.pdf'
output_folder = './silmarillion_pages/'

split_pdf(input_pdf, output_folder, "Silmarillion_")
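
Once the pages exist locally, they still have to reach the bucket. The post does not say how the upload was done; one option, sketched here under the assumption that boto3 is installed and AWS credentials are configured, is to script it. The helper names and the bucket name are placeholders.

```python
from pathlib import Path


def page_uploads(folder, key_prefix=""):
    """Map each local PDF in `folder` to the S3 object key it would be uploaded under."""
    return {
        str(path): f"{key_prefix}{path.name}"
        for path in sorted(Path(folder).glob("*.pdf"))
    }


def upload_pages(folder, bucket, key_prefix=""):
    """Upload every PDF in `folder` to `bucket`. Needs boto3 and AWS credentials."""
    import boto3  # third-party; imported here so page_uploads works without it

    s3 = boto3.client("s3")
    for local_path, key in page_uploads(folder, key_prefix).items():
        s3.upload_file(local_path, bucket, key)
        print(f"Uploaded {key}")


# Example usage (the bucket name is a placeholder):
# upload_pages("./silmarillion_pages/", "my-example-bucket")
```

Uploading through the AWS console works just as well for a one-off batch like this.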

Setting up the S3 connector

Data connectors can ingest many types of data. Here, we stick with an S3 bucket loaded with PDF pages.

[Figure: My S3 bucket]

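I just hop over to my Elastic Cloud deployment, go to Search -> Content -> Connectors, and create a new connector named aws-connector with all the default settings. Then I open the configuration and add the name of my bucket, along with the secret key and access key tagged to my AWS user.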

[Figure: Elastic Cloud S3 connector configuration]

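Run a quick sync to verify that everything works. A sync ingests every file in the data source that has not yet been ingested, extracts its content, and stores it as a unique document in the index. Each document contains its original filename. Data source files with the same filename as an already indexed document are not re-ingested, so no need to worry! Syncs can also be scheduled to run regularly; the method is described in the documentation. If everything works, and assuming my AWS credentials and permissions are correct, the data lands in an index named aws-connector.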

[Figure: Our first successful S3 connector sync]

Everything looks fine. Let's grab our embedding model!

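Uploading the embedding model

Eland is a Python Elasticsearch client that makes it easy to convert numpy, pandas, and scikit-learn functions into Elasticsearch-backed equivalents. For our purposes, it is how we upload a Hugging Face model for deployment in our Elasticsearch cluster. You can install eland like this: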

python -m pip install eland

Now open an editor and create this small .sh script, filling in each parameter appropriately:

MODEL_ID="sentence-transformers/all-MiniLM-L6-v2"
ELASTIC_USERNAME=""
ELASTIC_PASSWORD=""
CLOUD_ID=""

eland_import_hub_model \
    --cloud-id "$CLOUD_ID" \
    --es-username "$ELASTIC_USERNAME" \
    --es-password "$ELASTIC_PASSWORD" \
    --hub-model-id "$MODEL_ID" \
    --task-type text_embedding \
    --start

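MODEL_ID refers to a model taken from Hugging Face. I chose all-MiniLM-L6-v2 mainly because it performs well, is very small, and runs easily on a CPU. Run the bash script; once it finishes, your model should appear in your Elastic deployment under Machine Learning -> Model Management -> Trained Models.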

[Figure: Deploying the model just uploaded with eland]

Just click the circled play button to deploy the model, and you are done.

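Setting up the semantic_text index

Time to set up semantic search. Navigate to Management -> Dev Tools and delete the index, since it does not have the semantic_text data type enabled.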

DELETE aws-connector

Check the model_id of your uploaded model with:

GET _ml/trained_models
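
The ID to look for is derived from the Hugging Face name. Judging from the model used in this post, slashes become double underscores and the name is lowercased. The helper below is a hypothetical sketch of that convention, not eland's own code, so treat the output of GET _ml/trained_models as the source of truth.

```python
def expected_model_id(hub_model_id):
    """Guess the Elasticsearch model_id for a Hugging Face model name.

    Assumption: '/' is replaced by '__' and the result is lowercased,
    which matches the model used in this post.
    """
    return hub_model_id.replace("/", "__").lower()


print(expected_model_id("sentence-transformers/all-MiniLM-L6-v2"))
# prints: sentence-transformers__all-minilm-l6-v2
```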

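Now create an inference endpoint named minilm-l6 and pass it the correct model_id. We do not need to worry about num_allocations and num_threads, because this is not production and MiniLM-L6 is not a big model.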

PUT _inference/text_embedding/minilm-l6
{
  "service": "elasticsearch",
  "service_settings": {
    "num_allocations": 1,
    "num_threads": 1,
    "model_id": "sentence-transformers__all-minilm-l6-v2" 
  }
}

Now recreate the aws-connector index. Set the body property to type semantic_text and add the ID of the new inference endpoint.

PUT aws-connector
{
  "mappings": {
    "properties": {
      "body": {
        "type": "semantic_text",
        "inference_id": "minilm-l6"
      }
    }
  }
}
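
If you would rather script these two Dev Tools calls, the same requests can be sent over the Elasticsearch REST API. This is a minimal sketch using only the Python standard library. ES_URL and API_KEY are placeholders, API-key authentication is an assumption about your setup, and build_put and send are hypothetical helper names.

```python
import json
import urllib.request

ES_URL = "https://localhost:9200"  # placeholder: your deployment's Elasticsearch endpoint
API_KEY = "<encoded-api-key>"      # placeholder: an encoded Elasticsearch API key

INFERENCE_BODY = {
    "service": "elasticsearch",
    "service_settings": {
        "num_allocations": 1,
        "num_threads": 1,
        "model_id": "sentence-transformers__all-minilm-l6-v2",
    },
}

INDEX_BODY = {
    "mappings": {
        "properties": {
            "body": {"type": "semantic_text", "inference_id": "minilm-l6"}
        }
    }
}


def build_put(path, body):
    """Build (but do not send) a PUT request for `path` with a JSON body."""
    return urllib.request.Request(
        url=f"{ES_URL}/{path}",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"ApiKey {API_KEY}",
        },
        method="PUT",
    )


def send(request):
    """Send a prepared request and return the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# The endpoint must exist before the index mapping can reference it.
requests_in_order = [
    build_put("_inference/text_embedding/minilm-l6", INFERENCE_BODY),
    build_put("aws-connector", INDEX_BODY),
]

# Uncomment once ES_URL and API_KEY are filled in:
# for request in requests_in_order:
#     print(send(request))
```

Nothing is sent until the last lines are uncommented, so the script is safe to run while you fill in the placeholders.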

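Go back to the connector and run another full content sync (for real this time!). Incoming documents are automatically split into chunks of 250 words with an overlap of 100 words. You do not have to do anything explicitly. Now that is convenient!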
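
To picture what a 250-word window with a 100-word overlap looks like, here is a minimal sliding-window sketch. It splits on whitespace and is only an illustration of those two numbers, not Elasticsearch's actual chunking implementation; the function name is mine.

```python
def chunk_words(text, size=250, overlap=100):
    """Split `text` into chunks of up to `size` words, each sharing `overlap` words with the previous one."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap  # each new chunk starts this many words after the previous one
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(400))
chunks = chunk_words(sample)
print(len(chunks))            # prints: 2
print(chunks[1].split()[0])   # prints: w150
```

With these settings a 400-word page yields two chunks: words 0 to 249 and words 150 to 399.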

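[Figure: Sync your S3 connector for real this time!]

And that's it. Check your aws-connector index: it will contain 140 documents, each of which is now an embedded chunk.

[Figure: Index full of chunked documents]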
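
To sanity-check the index outside the Playground, one option is a semantic query against the body field. The sketch below only builds the request body; the helper name, the question text, and the result size are example values of mine.

```python
import json


def semantic_search_body(question, field="body", size=3):
    """Build a _search body that runs a `semantic` query on a semantic_text field."""
    return {
        "size": size,
        "query": {"semantic": {"field": field, "query": question}},
    }


body = semantic_search_body("Who is Melkor?")
# Paste the printed JSON into Dev Tools under: POST aws-connector/_search
print(json.dumps(body, indent=2))
```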

RAG with Elastic Playground

Head over to Search -> Build -> Playground and add the LLM connector of your choice. I am using Azure OpenAI:

[Figure: Setting the endpoint and API key]

Now let's set up the chat experience. Click Add Data Sources and select aws-connector.


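Check the Query tab of your new chat experience. Assuming everything is set up correctly, it will automatically be set to this hybrid search query, with model_id set to minilm-l6.

[Figure: Default hybrid search query]

Let's ask a question! We will take three documents as context and add my special RAG prompt: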

[Figure: Adding the prompt and selecting the number of search results for context]
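
Conceptually, this step combines the instructions, the retrieved passages, and the question into a single prompt for the LLM. The sketch below illustrates that idea with numbered passages so the answer can cite them. It is not how Playground builds its prompt internally; the function name, the instruction wording, and the placeholder passages are all mine.

```python
def build_rag_prompt(question, passages, instructions):
    """Combine instructions, numbered context passages, and the question into one prompt."""
    context = "\n\n".join(
        f"[{number}] {text}" for number, text in enumerate(passages, start=1)
    )
    return f"{instructions}\n\nContext:\n{context}\n\nQuestion: {question}"


prompt = build_rag_prompt(
    question="Describe the fall from Grace of Melkor",
    passages=["<top hit>", "<second hit>", "<third hit>"],  # placeholders for retrieved chunks
    instructions="Answer using only the context and cite passages by number.",  # example wording
)
print(prompt)
```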

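Query: Describe the fall from Grace of Melkor

We will use a relatively open-ended RAG query. Answering it well requires pulling information from multiple parts of the text, which makes it a good indication of whether RAG is working as intended.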

Summary
The fall from grace of Melkor is a central theme in the mythology of J.R.R. Tolkien's legendarium. According to the provided context, Melkor's descent into discord and rebellion can be summarized as follows:

1. **Initial Greatness and Ambition**:
   - Melkor was one of the Ainur, the powerful spirits created by Ilúvatar. He was endowed with the greatest gifts of power and knowledge among the Ainur, having a share in all the gifts of his brethren [2].
   - His ambition led him to seek the Imperishable Flame, desiring to bring into Being things of his own. However, he did not find the Flame, as it was with Ilúvatar [2].

2. **Introduction of Discord**:
   - During the Great Music, which was the Ainur's act of creation through song, Melkor began to interweave his own themes that were not in accord with Ilúvatar's original theme. This caused discord and disrupted the harmony of the music [2].
   - Melkor's discord spread, and many Ainur attuned their music to his, leading to a turbulent and chaotic sound. Despite this, Ilúvatar introduced new themes to counter Melkor's discord, but Melkor's influence continued to cause strife [2].

3. **Conflict and Rebellion**:
   - When the Ainur entered the World (Eä), Melkor continued to meddle in their work, turning it to his own desires and purposes. He kindled great fires and sought to claim the Earth as his own kingdom [3].
   - Manwë, who was the chief instrument of Ilúvatar's second theme, opposed Melkor and called upon other spirits to aid him in preventing Melkor from hindering their labor. This led to strife between Melkor and the other Valar [3].

4. **Continued Strife and Withdrawal**:
   - Despite being opposed by Manwë and the other Valar, Melkor did not abandon his desire for dominion over Arda (the Earth). He withdrew to other regions but continued to harbor the desire to rule [3].

In summary, Melkor's fall from grace was marked by his initial ambition and desire for power, his introduction of discord into the Great Music, his continued rebellion and meddling in the creation of the World, and his persistent desire to claim dominion over Arda despite opposition from the other Valar. His actions led to ongoing conflict and strife, setting the stage for much of the subsequent history and mythology of Middle-earth.

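Well, I'm convinced. It even has citations! One more for good luck:

Query: Who were the greatest students of Aule the Smith?

This particular query is not too difficult; I am just looking for a reference to one very specific quote in the text. Let's see how it does!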

According to the provided context, the greatest students of Aulë the Smith were the Noldor. The text states, "The Noldor learned most of him, and he was ever their friend" [1]. This indicates that the Noldor, one of the clans of the Elves, were particularly skilled in crafts and knowledge due to their learning from Aulë.

Aulë, being a master of all crafts and a smith, imparted his knowledge and skills to the Noldor, making them renowned for their craftsmanship and artistry in Middle-earth. This relationship between Aulë and the Noldor is significant in the lore, as it highlights the transmission of knowledge and skill from a Vala to the Elves.

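Yep, that's right. Looks like RAG is working well.

Conclusion

That was incredibly convenient and took almost no effort. We truly live in the future. I can definitely work with this. I hope you are as excited to try it out as I am to show it off.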


Original article: https://search-labs.elastic.co/search-labs/blog/rag-minimal-code-in-815
