Federated SharePoint searches with Azure OpenAI Service On Your Data
November 29, 2024 | by mebius
Author: Gustavo Llermaly, Elastic
Use Azure OpenAI Service On Your Data with Elastic as the vector database.
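In this article, we explore Azure OpenAI Service "On Your Data" with Elasticsearch as the data source. We use the Elastic SharePoint Online native connector to index our SharePoint documents and keep them in sync.
Suppose we have a SharePoint site with information about the company and its employees, and we want to chat with it from a custom application. Designing and building that architecture usually takes a while: you have to handle ingestion, then set up a search engine, then configure a RAG system that reads from the data source and passes the information to an LLM to answer questions. Fortunately, Elastic and Azure can speed this up!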
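Steps
- Set up the SharePoint connector
- Set up Azure OpenAI Service
- Advanced usage
- Document Level Security (DLS)
Set up the SharePoint connector
We will create a SharePoint site containing the following documents: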
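Planet_Express.docx
This file contains public information about the Planet Express company.
PE_Payslip.docx
This file contains a Planet Express payslip, specifically the CEO's. We do not want everyone to see this information.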
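To get our site's documents into Elastic and keep them in sync as files are added or modified, we will use the Elastic SharePoint Online connector. The first step is to prepare your SharePoint environment. The connector documentation has detailed setup instructions.
Once the SharePoint Online application is created and configured, you can go on to create the connector index in Elastic:
Create the SharePoint Online connector
The next step is to use the Kibana Content UI to vectorize the document body so we can run vector searches against it:
Create the vector field
We will use Elastic's out-of-the-box e5-multilingual model, but you can bring in any compatible embedding model you like, or use an external provider such as OpenAI through the Open Inference Service. You can also repeat this process to add more fields as needed.
With the connector index configured, we can run a sync to start indexing documents:
If everything went well, you should see your documents in the "Documents" tab:
By default, the connector extracts different types of documents, such as lists, list items, and sites. In this article we only care about documents, so let's apply a filter in the connector for that.
This filter excludes lists and collection-related documents. You must run a full content sync for the filter to take effect.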
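Set up Azure OpenAI Service
The simplest setup is to go to Azure AI Studio and add Elasticsearch as the chat data source:
Check the custom mapping box to align with the connector's settings
We can choose between keyword and vector; let's start with keyword.
Now we can align the connector's mapping with the fields Azure will use for querying.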
We can start asking questions about our documents! Let's begin by asking about Planet Express:
What if we ask about the CEO's salary?
We probably don't want to expose this information.
Let's fix that!
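Advanced usage
The Azure AI Studio chat is not the only way to use this service. Azure OpenAI On Your Data can be deployed to Copilot or Teams, or used through the API/SDK. We will take the last option.
Prerequisites:
- Configure a role assignment from your user to the Azure OpenAI resource. Required role: Cognitive Services OpenAI User.
- Install the Az CLI and run az login, which opens a web page to authenticate you and lets you select your subscription.
- Define the following variables: AzureOpenAIEndpoint, ChatCompletionsDeploymentName, SearchEndpoint, IndexName, Key.
To find AzureOpenAIEndpoint and ChatCompletionsDeploymentName, click the "View Code" tab in the chat:
Copy the endpoint and deployment values from here.
SearchEndpoint, IndexName, and Key are the Elasticsearch URL, the index name, and the API key.
We will use Python. First, install the required dependencies: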
pip install openai azure-identity
Now add the values you collected. You can also store them as environment variables:
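import os
from openai import AzureOpenAI
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
# Azure OpenAI settings, copied from the "View Code" tab.
endpoint = os.getenv("ENDPOINT_URL", "")
deployment = os.getenv("DEPLOYMENT_NAME", "")
# Elasticsearch settings used by the API call; these variable names are placeholders.
search_endpoint = os.getenv("SEARCH_ENDPOINT", "")
index_name = os.getenv("INDEX_NAME", "")
key = os.getenv("SEARCH_KEY", "")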
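An empty default hides a missing variable until the API call fails, so it can help to validate the settings up front. The sketch below is one way to do that; the environment variable names SEARCH_ENDPOINT, INDEX_NAME and SEARCH_KEY, and the load_settings helper itself, are assumptions for illustration and not part of the original script.

```python
import os

# Maps each setting to an environment variable name.
# The last three names are assumptions made for this sketch.
REQUIRED_VARS = {
    "endpoint": "ENDPOINT_URL",
    "deployment": "DEPLOYMENT_NAME",
    "search_endpoint": "SEARCH_ENDPOINT",
    "index_name": "INDEX_NAME",
    "key": "SEARCH_KEY",
}


def load_settings(env=None):
    """Return all settings as a dict, raising if any is missing or empty."""
    env = os.environ if env is None else env
    settings = {name: env.get(var, "") for name, var in REQUIRED_VARS.items()}
    missing = [REQUIRED_VARS[name] for name, value in settings.items() if not value]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return settings
```

Calling load_settings() at the top of the script fails fast with the names of any variables that are not set.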
Now let's add the API call to the file:
# Authenticate with the credentials obtained through `az login`.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default")

client = AzureOpenAI(
    azure_endpoint=endpoint,
    azure_ad_token_provider=token_provider,
    api_version="2024-02-15-preview",
)

completion = client.chat.completions.create(
    model=deployment,
    messages=[
        {
            "role": "user",
            "content": "What is the CEO salary?",
        },
    ],
    # The Elasticsearch data source is passed as an extension to the request body.
    extra_body={
        "data_sources": [
            {
                "type": "elasticsearch",
                "parameters": {
                    "endpoint": search_endpoint,
                    "index_name": index_name,
                    "authentication": {
                        "type": "encoded_api_key",
                        "encoded_api_key": key
                    }
                },
                "query_type": "simple",
                "fields_mapping": {
                    "content_fields_separator": "\n",
                    "content_fields": [
                        "body"
                    ],
                    "filepath_field": "name",
                    "title_field": "Title",
                    "url_field": "webUrl",
                    "vector_fields": [
                        "ml.inference.body.predicted_value"
                    ]
                },
            }
        ]
    }
)

print(completion.model_dump_json(indent=2))
Run the script:
python myscript.py
We get the same answer.
{
"id": "01b421b9-212c-4a95-b4b8-072bbd2972dc",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "The CEO's salary is 1 jillion per month .",
"refusal": null,
"role": "assistant",
"function_call": null,
"tool_calls": null,
"end_turn": true,
"context": {
"citations": [
{
"content": "https://1fbkbs.sharepoint.com/_layouts/15/download.aspx?UniqueId=51621a8b-cede-4396-b42c-7f4bd54607b6&Translate=false&tempauth=v1.eyJzaXRlaWQiOiJiN2Q3NzBjMC03ZTgwLTQ5OTMtOTZjZC1hOGY3YjMxZWUyYmQiLCJhcHBfZGlzcGxheW5hbWUiOiJzcC1sYWJzIiwiYXVkIjoiMDAwMDAwMDMtMDAwMC0wZmYxLWNlMDAtMDAwMDAwMDAwMDAwLzFmYmticy5zaGFyZXBvaW50LmNvbUA5MTVkYzNkOS04NTI2LTRhODYtYTc4My02MDc1OTVkMzMxZjUiLCJleHAiOiIxNzI3OTY1MTczIn0.CgoKBHNuaWQSAjQ4EgsI-Jr32YHusT0QBRoNMjAuMTkwLjEzMi40MCosdW02Ym9VYzVMdTZGRXNuc1hrL2UwenA3QW1iWFRlM1BkQUovd2RTakNHbz0wdTgBQhChVktf74AAYI8XRwPrhkUMShBoYXNoZWRwcm9vZnRva2VuegExugE3Z3JvdXAucmVhZCBhbGxzaXRlcy5yZWFkIGFsbGZpbGVzLnJlYWQgYWxscHJvZmlsZXMucmVhZMIBSTIyNjk4YTdkLTRhZmQtNGJhNS1iMzMyLTNiMzA2NGRkYjFiNkA5MTVkYzNkOS04NTI2LTRhODYtYTc4My02MDc1OTVkMzMxZjXIAQE.kCyzpMNSnJKjdCpubfkQ_L7XvMZBFMseOjZQwHl_EEkn#microsoft.graph.driveItemnPlanet Express Interplanetary Payslip Employee name: Philip J. Fry Position: CEO Pay period: July 2024 Currency: Jillions PAYMENTS DEDUCTIONS YEAR TO DATE Basic Salary 1 jillion Taxes 0 Total pay to date: 1 jillion Taxable pay to date: 0 Tax paid to date: 0 THIS MONTH Gross pay: 1 jillion Income tax: 0 Total gross payments: 1 jillion Total deductions: 0 Net pay: 1 jillionn01BV67HZ4LDJRFDXWOSZB3ILD7JPKUMB5WnPE_payslip.docxndrive_itemnhttps://1fbkbs.sharepoint.com/_layouts/15/Doc.aspx?sourcedoc=%7B51621A8B-CEDE-4396-B42C-7F4BD54607B6%7D&file=PE_payslip.docx&action=default&mobileredirect=true",
"title": null,
"url": null,
"filepath": null,
"chunk_id": "0"
},
{
"content": "https://1fbkbs.sharepoint.com/_layouts/15/download.aspx?UniqueId=d771ceaa-e47f-4fc1-bda1-64a1c55d6e48&Translate=false&tempauth=v1.eyJzaXRlaWQiOiJiN2Q3NzBjMC03ZTgwLTQ5OTMtOTZjZC1hOGY3YjMxZWUyYmQiLCJhcHBfZGlzcGxheW5hbWUiOiJzcC1sYWJzIiwiYXVkIjoiMDAwMDAwMDMtMDAwMC0wZmYxLWNlMDAtMDAwMDAwMDAwMDAwLzFmYmticy5zaGFyZXBvaW50LmNvbUA5MTVkYzNkOS04NTI2LTRhODYtYTc4My02MDc1OTVkMzMxZjUiLCJleHAiOiIxNzI3OTY1MTczIn0.CgoKBHNuaWQSAjQ4EgsI-Jr32YHusT0QBRoNMjAuMTkwLjEzMi40MCosc1NiQlBlQU9sZEVMWUUxMmVodnNTK3NSMmx4blJsOGoybGR0N1Zxeko5Zz0wdTgBQhChVktf74AAYI8XRwPrhkUMShBoYXNoZWRwcm9vZnRva2VuegExugE3Z3JvdXAucmVhZCBhbGxzaXRlcy5yZWFkIGFsbGZpbGVzLnJlYWQgYWxscHJvZmlsZXMucmVhZMIBSTIyNjk4YTdkLTRhZmQtNGJhNS1iMzMyLTNiMzA2NGRkYjFiNkA5MTVkYzNkOS04NTI2LTRhODYtYTc4My02MDc1OTVkMzMxZjXIAQE.eN2threRzN2AZvYmPTCsNsKy1x-MLV_RbDq_yzSexG8n#microsoft.graph.driveItemnPlanet Express Interplanetary Our Company Planet Express, Inc. is an intergalactic delivery company owned and operated by Professor Farnsworth to fund his research. Founded in 2961, its headquarters is located in New New York, and its crew includes many important characters of the series. The current crew reached their 100th delivery in September 3010, and to celebrate, Bender threw a 100th-delivery party. The inaugural delivery crew, which disappeared on its first interplanetary mission, was found alive in June 3011. The company scrapes by, in spite of fierce competition from the leader in package delivery, Mom's Friendly Delivery Company. They stay in business thanks to their complete disregard for safety and minimum wage laws, and the Professor's unscrupulous acceptance of the occasional bribe.n01BV67HZ5KZZY5O77EYFH33ILEUHCV23SInPlanet_Express.docxndrive_itemnhttps://1fbkbs.sharepoint.com/_layouts/15/Doc.aspx?sourcedoc=%7BD771CEAA-E47F-4FC1-BDA1-64A1C55D6E48%7D&file=Planet_Express.docx&action=default&mobileredirect=true",
"title": null,
"url": null,
"filepath": null,
"chunk_id": "0"
}
],
"intent": "[\"CEO salary\", \"current CEO salary\", \"CEO compensation\"]"
}
}
}
],
"created": 1727985189,
"model": "gpt-4o",
"object": "extensions.chat.completion",
"service_tier": null,
"system_fingerprint": "fp_67802d9a6d",
"usage": {
"completion_tokens": 28,
"prompt_tokens": 3658,
"total_tokens": 3686,
"completion_tokens_details": null,
"prompt_tokens_details": null
}
}
The difference is that we can now override the Elasticsearch settings on every request.
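The full response is verbose, so in an application you would usually pull out just the answer and a few details about the retrieved context. The helper below is a minimal sketch that works on the response shape shown above; summarize_completion is a hypothetical name, and it assumes the "intent" value is a JSON-encoded list, as it appears in the output.

```python
import json


def summarize_completion(response: dict) -> dict:
    """Extract the answer, citation count, intents and token usage."""
    message = response["choices"][0]["message"]
    context = message.get("context") or {}
    citations = context.get("citations") or []
    # "intent" arrives as a JSON-encoded string, so decode it when present.
    raw_intent = context.get("intent")
    intents = json.loads(raw_intent) if raw_intent else []
    return {
        "answer": message["content"],
        "citation_count": len(citations),
        "intents": intents,
        "total_tokens": response["usage"]["total_tokens"],
    }
```

With the script above, the input can be obtained with json.loads(completion.model_dump_json()).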
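Document Level Security (DLS)
How do we protect our documents? Elastic provides tools to mirror SharePoint security permissions, so depending on who is asking, we retrieve or withhold documents based on the permissions that person has in SharePoint. We will use Document Level Security (DLS) for this.
In fact, the payslip is not shared with site members; only site owners and admins can access it:
Let's first run an access control sync in the connector to populate the security index:
Now we can expect the security index to hold the permissions for each document/user.
Let's grab a user. Go to Kibana DevTools and run the following: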
GET .search-acl-filter-sharepoint-labs/_search
Response:
{
"_index": ".search-acl-filter-sharepoint-labs",
"_id": "LeeG@1fbkbs.onmicrosoft.com",
"_score": 1,
"_source": {
"created_at": "2024-07-16T07:28:22",
"id": "LeeG@1fbkbs.onmicrosoft.com",
"_timestamp": "2024-08-05T00:42:48.058411+00:00",
"identity": {
"user_id": "user_id:2f7a1527-da11-4738-ad9d-0f6be1acb6a7",
"email": "email:LeeG@1fbkbs.onmicrosoft.com",
"username": "user:LeeG@1fbkbs.onmicrosoft.com"
},
"query": {
"template": {
"source": """{
"bool": {
"should": [
{
"bool": {
"must_not": {
"exists": {
"field": "_allow_access_control"
}
}
}
},
{
"terms": {
"_allow_access_control.enum": {{#toJson}}access_control{{/toJson}}
}
}
]
}
}""",
"params": {
"access_control": [
"group:038fae1d-6ea3-485a-83b9-4362b54a14f5",
"user_id:2f7a1527-da11-4738-ad9d-0f6be1acb6a7",
"group:d11975c2-4fe8-45fd-9789-cbf37d4f115d",
"group:c0c350fa-37b0-476a-829d-733800cfbeea",
"group:70ddf71e-c04e-4202-b0ab-d4fd78921b72",
"group:829ee542-eb93-40f5-9790-688457a2b0f5",
"email:LeeG@1fbkbs.onmicrosoft.com",
"user:LeeG@1fbkbs.onmicrosoft.com",
"group:62ab5abe-bac2-4fc7-9b5f-92985b8ae69c"
]
}
}
}
}
}
This user is a site member, which makes it a good candidate for testing permissions.
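To see what the query template in this response does, here is a small plain-Python simulation of its logic: a document matches if it has no _allow_access_control field, or if that field shares at least one identity with the user's access_control list. This is only a sketch of the matching rule, not how Elasticsearch evaluates it, and the document permission values are hypothetical because the article does not show them.

```python
def is_visible(doc: dict, access_control: list) -> bool:
    """Mimic the role query: match documents without access control,
    or documents that list at least one of the caller's identities."""
    allowed = doc.get("_allow_access_control")
    if not allowed:  # missing or empty field: no restriction recorded
        return True
    return bool(set(allowed) & set(access_control))


# A few identities taken from the access control document above.
lee_access = [
    "user_id:2f7a1527-da11-4738-ad9d-0f6be1acb6a7",
    "group:038fae1d-6ea3-485a-83b9-4362b54a14f5",
    "email:LeeG@1fbkbs.onmicrosoft.com",
]

# Hypothetical documents; the real permission values are not shown in the article.
public_doc = {
    "name": "Planet_Express.docx",
    "_allow_access_control": ["group:038fae1d-6ea3-485a-83b9-4362b54a14f5"],
}
private_doc = {
    "name": "PE_payslip.docx",
    "_allow_access_control": ["group:hypothetical-site-owners"],
}
open_doc = {"name": "no_acl.docx"}

visible = [
    doc["name"]
    for doc in (public_doc, private_doc, open_doc)
    if is_visible(doc, lee_access)
]
print(visible)
```

Under these assumed permissions, the payslip is filtered out for this user while the other two documents remain visible.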
From the previous response we can take the query.template section and create an API key for the user LeeG as follows:
POST /_security/api_key
{
"name": "LeeG-api-key",
"expiration": "30d",
"role_descriptors": {
"sharepoint-online-role": {
"index": [
{
"names": [
"sharepoint-labs"
],
"privileges": [
"read",
"view_index_metadata"
],
"query": {
"template": {
"params": {
"access_control": [
"group:038fae1d-6ea3-485a-83b9-4362b54a14f5",
"user_id:2f7a1527-da11-4738-ad9d-0f6be1acb6a7",
"group:d11975c2-4fe8-45fd-9789-cbf37d4f115d",
"group:c0c350fa-37b0-476a-829d-733800cfbeea",
"group:70ddf71e-c04e-4202-b0ab-d4fd78921b72",
"group:829ee542-eb93-40f5-9790-688457a2b0f5",
"email:LeeG@1fbkbs.onmicrosoft.com",
"user:LeeG@1fbkbs.onmicrosoft.com",
"group:62ab5abe-bac2-4fc7-9b5f-92985b8ae69c"
]
},
"source":"""{
"bool": {
"should": [
{
"bool": {
"must_not": {
"exists": {
"field": "_allow_access_control"
}
}
}
},
{
"terms": {
"_allow_access_control.enum": {{#toJson}}access_control{{/toJson}}
}
}
]
}
}"""
}
}
}
]
}
}
}
The response is an API key with LeeG's group permissions:
{
"id": "rpgMIJEBvlvLsU6BeL5O",
"name": "LeeG-api-key",
"expiration": 1725411573838,
"api_key": "S3Q4XCNuTeu9fPITZNmLfA",
"encoded": "cnBnTUlKRUJ2bHZMc1U2QmVMNU86UzNRNFhDTnVUZXU5ZlBJVFpObUxmQQ=="
}
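Take the encoded value from this response to use in future calls to Azure OpenAI On Your Data. With this API key, you will only see the documents in the sharepoint-labs connector index that the user LeeG has permission to access.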
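The encoded value is the Base64 encoding of the key id and the API key joined by a colon, which is the form the encoded_api_key authentication type expects. The short sketch below rebuilds it from the id and api_key fields of the response above; encode_api_key is a hypothetical helper name.

```python
import base64


def encode_api_key(key_id: str, api_key: str) -> str:
    """Base64-encode "id:api_key" as UTF-8."""
    raw = f"{key_id}:{api_key}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


# Values copied from the API key response above.
encoded = encode_api_key("rpgMIJEBvlvLsU6BeL5O", "S3Q4XCNuTeu9fPITZNmLfA")
print(encoded)
```

This is handy if you stored only the id and api_key fields and need to rebuild the value later.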
Let's try again, this time with the LeeG-api-key API key:
completion = client.chat.completions.create(
    model=deployment,
    messages=[
        {
            "role": "user",
            "content": "What is the CEO Salary?",
        },
    ],
    extra_body={
        "data_sources": [
            {
                "type": "elasticsearch",
                "parameters": {
                    "endpoint": search_endpoint,
                    "index_name": index_name,
                    "authentication": {
                        "type": "encoded_api_key",
                        "encoded_api_key": key  # Our new API key goes here
                    }
                },
                "query_type": "simple",
                "fields_mapping": {
                    "content_fields_separator": "\n",
                    "content_fields": [
                        "body"
                    ],
                    "filepath_field": "name",
                    "title_field": "Title",
                    "url_field": "webUrl",
                    "vector_fields": [
                        "ml.inference.body.predicted_value"
                    ]
                },
            }
        ]
    }
)

print(completion.model_dump_json(indent=2))
Response:
{
"id": "564eb9d5-5321-41d9-97c5-5abd9323b2d2",
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "The requested information is not available in the retrieved data. Please try another query or topic.",
"refusal": null,
"role": "assistant",
"function_call": null,
"tool_calls": null,
"end_turn": true,
"context": {
"citations": [
{
"content": "https://1fbkbs.sharepoint.com/_layouts/15/download.aspx?UniqueId=d771ceaa-e47f-4fc1-bda1-64a1c55d6e48&Translate=false&tempauth=v1.eyJzaXRlaWQiOiJiN2Q3NzBjMC03ZTgwLTQ5OTMtOTZjZC1hOGY3YjMxZWUyYmQiLCJhcHBfZGlzcGxheW5hbWUiOiJzcC1sYWJzIiwiYXVkIjoiMDAwMDAwMDMtMDAwMC0wZmYxLWNlMDAtMDAwMDAwMDAwMDAwLzFmYmticy5zaGFyZXBvaW50LmNvbUA5MTVkYzNkOS04NTI2LTRhODYtYTc4My02MDc1OTVkMzMxZjUiLCJleHAiOiIxNzI3OTY1MTczIn0.CgoKBHNuaWQSAjQ4EgsI-Jr32YHusT0QBRoNMjAuMTkwLjEzMi40MCosc1NiQlBlQU9sZEVMWUUxMmVodnNTK3NSMmx4blJsOGoybGR0N1Zxeko5Zz0wdTgBQhChVktf74AAYI8XRwPrhkUMShBoYXNoZWRwcm9vZnRva2VuegExugE3Z3JvdXAucmVhZCBhbGxzaXRlcy5yZWFkIGFsbGZpbGVzLnJlYWQgYWxscHJvZmlsZXMucmVhZMIBSTIyNjk4YTdkLTRhZmQtNGJhNS1iMzMyLTNiMzA2NGRkYjFiNkA5MTVkYzNkOS04NTI2LTRhODYtYTc4My02MDc1OTVkMzMxZjXIAQE.eN2threRzN2AZvYmPTCsNsKy1x-MLV_RbDq_yzSexG8n#microsoft.graph.driveItemnPlanet Express Interplanetary Our Company Planet Express, Inc. is an intergalactic delivery company owned and operated by Professor Farnsworth to fund his research. Founded in 2961, its headquarters is located in New New York, and its crew includes many important characters of the series. The current crew reached their 100th delivery in September 3010, and to celebrate, Bender threw a 100th-delivery party. The inaugural delivery crew, which disappeared on its first interplanetary mission, was found alive in June 3011. The company scrapes by, in spite of fierce competition from the leader in package delivery, Mom's Friendly Delivery Company. They stay in business thanks to their complete disregard for safety and minimum wage laws, and the Professor's unscrupulous acceptance of the occasional bribe.n01BV67HZ5KZZY5O77EYFH33ILEUHCV23SInPlanet_Express.docxndrive_itemnhttps://1fbkbs.sharepoint.com/_layouts/15/Doc.aspx?sourcedoc=%7BD771CEAA-E47F-4FC1-BDA1-64A1C55D6E48%7D&file=Planet_Express.docx&action=default&mobileredirect=true",
"title": null,
"url": null,
"filepath": null,
"chunk_id": "0"
}
],
"intent": "[\"CEO salary\", \"current CEO salary\", \"CEO compensation\"]"
}
}
}
],
"created": 1727985521,
"model": "gpt-4o",
"object": "extensions.chat.completion",
"service_tier": null,
"system_fingerprint": "fp_67802d9a6d",
"usage": {
"completion_tokens": 31,
"prompt_tokens": 2952,
"total_tokens": 2983,
"completion_tokens_details": null,
"prompt_tokens_details": null
}
}
Great! The CEO's private file is now protected.
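Writing the API key request by hand for every user does not scale, so the body can be assembled from the user's access control document instead. The function below is a minimal sketch of that idea: build_api_key_request is a hypothetical helper, and it assumes the document has the same shape as the search hit shown earlier (an id and a query.template under _source). The role name, privileges and expiration mirror the request used for LeeG.

```python
def build_api_key_request(acl_hit: dict, index_name: str, expiration: str = "30d") -> dict:
    """Build the body for POST /_security/api_key from an access control hit."""
    source = acl_hit["_source"]
    # Use the part of the id before "@" as a short name, e.g. "LeeG".
    user = source["id"].split("@")[0]
    return {
        "name": f"{user}-api-key",
        "expiration": expiration,
        "role_descriptors": {
            "sharepoint-online-role": {
                "index": [
                    {
                        "names": [index_name],
                        "privileges": ["read", "view_index_metadata"],
                        # Reuse the user's query template as the DLS filter.
                        "query": source["query"],
                    }
                ]
            }
        },
    }
```

The returned dict can be serialized to JSON and sent to the same endpoint used above, with sharepoint-labs as the index name.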
We can try another question, this time about a document Lee can see, such as the Planet Express information document.
Run the Python file again, but switch the message to:
{
"role": "user",
"content": "What is Planet Express?"
}
Answer:
“Planet Express, Inc. is an intergalactic delivery company owned and operated by Professor Farnsworth to fund his research. Founded in 2961, its headquarters is located in New New York. The company has a crew that includes many important characters from the series it is featured in. Despite fierce competition from the leading package delivery company, Mom’s Friendly Delivery Company, Planet Express manages to stay in business by disregarding safety and minimum wage laws and occasionally accepting bribes…”
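Bonus: vector search
We also stored the documents as vectors, so we can take advantage of the Azure vector query type. You can select vector as the query type in the UI:
As you can see, it automatically detects the model from Elasticsearch.
It also detects the vector field automatically. You must fill in the remaining fields for citations.
Or use the SDK as follows: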
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default")

client = AzureOpenAI(
    azure_endpoint=endpoint,
    azure_ad_token_provider=token_provider,
    api_version="2024-02-15-preview",
)

completion = client.chat.completions.create(
    model=deployment,
    messages=[
        {
            "role": "user",
            "content": "What is Planet Express?",
        },
    ],
    extra_body={
        "data_sources": [
            {
                "type": "elasticsearch",
                "parameters": {
                    "endpoint": search_endpoint,
                    "index_name": index_name,
                    "authentication": {
                        "type": "encoded_api_key",
                        "encoded_api_key": key
                    }
                },
                # Vector search needs the embedding model deployed in Elasticsearch.
                "query_type": "vector",
                "embedding_dependency": {
                    "type": "model_id",
                    "model_id": ".multilingual-e5-small_linux-x86_64"
                },
                "fields_mapping": {
                    "content_fields_separator": "\n",
                    "content_fields": [
                        "body"
                    ],
                    "filepath_field": "name",
                    "title_field": "Title",
                    "url_field": "webUrl",
                    "vector_fields": [
                        "ml.inference.body.predicted_value"
                    ]
                },
            }
        ]
    }
)

print(completion.model_dump_json(indent=2))
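The keyword and vector requests differ only in query_type and embedding_dependency, so the data source entry can be produced by one function. The sketch below does that; build_data_source is a hypothetical helper, and it keeps the same key layout and field names as the requests in this article rather than asserting anything about the service schema.

```python
def build_data_source(search_endpoint, index_name, key,
                      query_type="simple", model_id=None):
    """Return one entry for the "data_sources" list in extra_body."""
    if query_type == "vector" and not model_id:
        raise ValueError("vector queries need the id of the embedding model")
    source = {
        "type": "elasticsearch",
        "parameters": {
            "endpoint": search_endpoint,
            "index_name": index_name,
            "authentication": {
                "type": "encoded_api_key",
                "encoded_api_key": key,
            },
        },
        "query_type": query_type,
        "fields_mapping": {
            "content_fields_separator": "\n",
            "content_fields": ["body"],
            "filepath_field": "name",
            "title_field": "Title",
            "url_field": "webUrl",
            "vector_fields": ["ml.inference.body.predicted_value"],
        },
    }
    if query_type == "vector":
        # Only vector queries carry the embedding model reference.
        source["embedding_dependency"] = {"type": "model_id", "model_id": model_id}
    return source
```

A call then becomes extra_body={"data_sources": [build_data_source(search_endpoint, index_name, key, "vector", ".multilingual-e5-small_linux-x86_64")]}, and passing a different key per user switches the permissions applied to the search.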
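Conclusion
The Azure OpenAI "On Your Data" service lets you chat with your data quickly, without training or fine-tuning a model. Combined with the Elastic SharePoint connector, it lets you control who can access your data and so prevent security leaks, making the two a great combination for grounded chat.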
Original: Federated SharePoint searches with Azure OpenAI Service On your data – Search Labs