Elasticsearch: Using Elasticsearch to find similar documents: more_like_this
July 26, 2021 | by mebius
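The More Like This query finds documents that are "like" a given set of documents. To do so, it selects a set of representative terms from the input documents, forms a query from those terms, executes it, and returns the results. The user controls the input documents, how the terms are selected, and how the query is formed.
The simplest use case asks for documents that are similar to a provided piece of text. Here we ask for all movies whose "title" and "description" fields contain text similar to "Once upon a time", limiting the number of selected terms to 12.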
GET /_search
{
"query": {
"more_like_this" : {
"fields" : ["title", "description"],
"like" : "Once upon a time",
"min_term_freq" : 1,
"max_query_terms" : 12
}
}
}
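For readers who prefer to assemble the request in code, the following is a minimal Python sketch that builds the same body as a plain dictionary. The helper name build_mlt_query is made up for this illustration, and sending the body to a cluster (for example with a client library) is deliberately left out.

```python
import json


def build_mlt_query(like, fields, min_term_freq=1, max_query_terms=12):
    """Return a search body containing a more_like_this query (hypothetical helper)."""
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),           # fields compared for similarity
                "like": like,                     # free text, or a list of texts/documents
                "min_term_freq": min_term_freq,
                "max_query_terms": max_query_terms,
            }
        }
    }


body = build_mlt_query("Once upon a time", ["title", "description"])
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted into the Kibana console after a GET /_search line.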
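A more complicated use case mixes text with documents that already exist in the index. In this case, the syntax for specifying a document is similar to the one used in the Multi GET API.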
GET /_search
{
"query": {
"more_like_this": {
"fields": [ "title", "description" ],
"like": [
{
"_index": "imdb",
"_id": "1"
},
{
"_index": "imdb",
"_id": "2"
},
"and potentially some more text here as well"
],
"min_term_freq": 1,
"max_query_terms": 12
}
}
}
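Finally, users can mix some text and a chosen set of documents, and can also provide documents that are not necessarily present in the index. Documents that are not in the index are supplied inline as artificial documents, under the doc key.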
GET /_search
{
"query": {
"more_like_this": {
"fields": [ "name.first", "name.last" ],
"like": [
{
"_index": "marvel",
"doc": {
"name": {
"first": "Ben",
"last": "Grimm"
},
"_doc": "You got no idea what I'd... what I'd give to be invisible."
}
},
{
"_index": "marvel",
"_id": "2"
}
],
"min_term_freq": 1,
"max_query_terms": 12
}
}
}
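The three kinds of like entries (an indexed document, an artificial document, and free text) can be mixed freely in one list. Here is a small Python sketch of that idea; the helper names indexed_doc and artificial_doc are invented for this example, and the values are taken from the requests above.

```python
import json


def indexed_doc(index, doc_id):
    """Reference a document that already exists in an index."""
    return {"_index": index, "_id": str(doc_id)}


def artificial_doc(index, source):
    """Describe a document that is not indexed, using the "doc" key."""
    return {"_index": index, "doc": source}


like = [
    artificial_doc("marvel", {"name": {"first": "Ben", "last": "Grimm"}}),
    indexed_doc("marvel", 2),
    "and potentially some more text here as well",  # plain free text
]

body = {
    "query": {
        "more_like_this": {
            "fields": ["name.first", "name.last"],
            "like": like,
            "min_term_freq": 1,
            "max_query_terms": 12,
        }
    }
}
print(json.dumps(body, indent=2))
```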
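Hands-on practice
Below, I will use a simple example to show how to use the more_like_this query to find similar documents. Although this query is a very interesting feature, many developers probably do not choose it: partly because they do not understand it well, and partly because they fall back on traditional queries such as match, term, and range. I hope that after reading this article you will consider the more_like_this query in your future work whenever it fits your use case.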
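Preparing the data
For this demonstration, we will use the movies dataset.
Click the Download link above to download the dataset, and save it in the data subdirectory of the project.
Next, download the full source code from https://github.com/liu-xiao-guo/searchflix and copy out the following files:
- copy pipeline/movies.conf into the project root directory
- copy elastic/elasticsearch/mappings/movies.mapping into the project root directory
- copy dictionaries/countries_geo.csv into the dictionaries subdirectory
After these steps, the files look like this: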
$ pwd
/Users/liuxg/data/morelikethis
$ ls
data dictionaries movies.conf movies.mapping
$ tree -L 3
.
├── data
│ └── movies_metadata.csv
├── dictionaries
│ ├── countries_geo.csv
│ └── source.txt
├── movies.conf
└── movies.mapping
Change into the project directory and run the following command in a terminal:
curl -XPUT -H 'Content-Type: application/json' localhost:9200/movies -d @movies.mapping
Next, we import the data:
sudo /bin/logstash -f movies.conf
Here we have to use sudo because movies.conf makes use of /dev/null. Once the import has finished, we can check the imported documents in Kibana:
GET movies/_count
{
"count" : 45432,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
}
}
A typical document in the movies index looks like this:
GET movies/_search
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 1,
"successful" : 1,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : {
"value" : 10000,
"relation" : "gte"
},
"max_score" : 1.0,
"hits" : [
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "12110",
"_score" : 1.0,
"_source" : {
"original_title" : "Dracula: Dead and Loving It",
"adult" : false,
"vote_average" : 5.7,
"genres" : [
{
"id" : 35,
"name" : "Comedy"
},
{
"id" : 27,
"name" : "Horror"
}
],
"tagline" : null,
"production_companies" : [
{
"id" : 5,
"name" : "Columbia Pictures"
},
{
"id" : 97,
"name" : "Castle Rock Entertainment"
},
{
"id" : 6368,
"name" : "Enigma Pictures"
}
],
"imdb_id" : "tt0112896",
"spoken_languages" : [
{
"iso_639_1" : "en",
"name" : "English"
},
{
"iso_639_1" : "de",
"name" : "Deutsch"
}
],
"production_countries_name_list" : [
"France",
"United States of America"
],
"@version" : "1",
"title" : "Dracula: Dead and Loving It",
"homepage" : null,
"original_language" : "en",
"belongs_to_collection" : null,
"production_countries_location_list" : [
"46.227638,2.213749",
"37.09024,-95.712891"
],
"popularity" : 5.430331,
"budget" : 0.0,
"revenue" : 0.0,
"production_countries" : [
{
"iso_3166_1" : "FR",
"location" : "46.227638,2.213749",
"name" : "France"
},
{
"iso_3166_1" : "US",
"location" : "37.09024,-95.712891",
"name" : "United States of America"
}
],
"release_date" : "1995-12-22",
"poster_path" : "/xve4cgfYItnOhtzLYoTwTVy5FGr.jpg",
"@timestamp" : "1995-12-21T16:00:00.000Z",
"id" : 12110,
"runtime" : 88.0,
"status" : "Released",
"genres_list" : [
"Comedy",
"Horror"
],
"overview" : "When a lawyer shows up at the vampire's doorstep, he falls prey to his charms and joins him in his search for fresh blood. Enter Dr. van Helsing, who may be the only one able to vanquish the count.",
"vote_count" : 210,
"video" : "false"
}
}
...
Above, we can see that there is a field called overview.
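The More Like This query
The purpose of the more_like_this query is to find, among the indexed documents, those that are similar to some input supplied by the user. It does this by selecting relevant terms from the supplied input and then building a query from those terms.
The supplied input can be free text or other indexed documents. That is, you can easily search for documents similar to any document already indexed in the same index or in another one. With that in mind, I want to demonstrate the query with one use case: offering the user synopses of movies similar to a movie they selected or just watched.
The only required parameter of this query is like: either the text for which you want to find similar documents, or an array of objects giving the index and ID of the documents to search by. In the second case, you can also mix existing indexed documents with artificial documents, that is, documents simulated from free text. Here is an example: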
GET movies/_search
{
"fields": ["overview"],
"query": {
"more_like_this": {
"fields": [
"overview"
],
"like": "Princess Leia is captured and held hostage by the evil Imperial forces in their effort to take over the galactic Empire. Venturesome Luke Skywalker and dashing captain Han Solo team together with the loveable robot duo R2-D2 and C-3PO to rescue the beautiful princess and restore peace and justice in the Empire.",
"min_term_freq": 1,
"max_query_terms": 12
}
},
"_source": false
}
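Usually, although it is not mandatory, you will also supply the fields parameter, an array with the names of the fields on which similarity is checked. Another interesting parameter is unlike, which is used together with like (they are not mutually exclusive) and follows the same syntax. It narrows the results by excluding documents similar to the ones we say we do not want. Basically: (like X) AND (NOT like Y).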
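As a rough illustration of combining like and unlike, the Python sketch below builds a request body with both. The helper name and the sample texts are illustrative assumptions only (the like text is the opening of the synopsis used above, and the unlike text is an arbitrary pick); nothing here has been run against the movies index.

```python
import json


def build_like_unlike_query(like, unlike, fields):
    """Build a more_like_this body with both like and unlike (hypothetical helper)."""
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": like,      # what the results should resemble
                "unlike": unlike,  # same syntax as like; what they should not resemble
                "min_term_freq": 1,
                "max_query_terms": 12,
            }
        }
    }


body = build_like_unlike_query(
    like="Princess Leia is captured and held hostage by the evil Imperial forces",
    unlike=["loveable robot duo"],  # illustrative text to steer away from
    fields=["overview"],
)
print(json.dumps(body, indent=2))
```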
The remaining parameters of this query fall into two groups.
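Term selection parameters
- max_query_terms: The maximum number of terms to select. The more terms we have, the higher the accuracy, at the cost of performance. Defaults to 25.
- min_term_freq: The minimum term frequency; terms that occur less often than this in the input document/text are ignored. Defaults to 2.
- min_doc_freq: The minimum document frequency; terms that appear in fewer documents than this are ignored. Defaults to 5.
- max_doc_freq: The maximum document frequency; terms that appear in more documents than this are ignored. This is useful for ignoring very frequent terms such as stop words. By default it is unbounded.
- min_word_length: The minimum term length; shorter terms are ignored. Defaults to 0.
- max_word_length: The maximum term length; longer terms are ignored. The old name max_word_len is deprecated. By default it is unbounded (0).
- stop_words: An array of stop words, that is, terms to ignore.
- analyzer: The analyzer applied to the input text. Defaults to the analyzer associated with the first field listed in fields.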
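To get a feel for how these parameters interact, here is a deliberately simplified Python sketch of term selection. It is not the actual Lucene implementation: the whitespace tokenizer, the TF-IDF formula, the function name select_terms, and the document frequencies in the example are all assumptions made up for illustration.

```python
import math
from collections import Counter


def select_terms(text, doc_freq, num_docs, *, min_term_freq=2, min_doc_freq=5,
                 max_doc_freq=None, min_word_length=0, max_word_length=0,
                 stop_words=(), max_query_terms=25):
    """Pick the highest-scoring terms of `text` after applying the filters (toy model)."""
    term_freq = Counter(text.lower().split())  # naive whitespace "analyzer"
    scored = []
    for term, tf in term_freq.items():
        df = doc_freq.get(term, 0)
        if tf < min_term_freq or df < min_doc_freq:
            continue  # too rare in the input or in the index
        if max_doc_freq is not None and df > max_doc_freq:
            continue  # too common in the index
        if min_word_length and len(term) < min_word_length:
            continue
        if max_word_length and len(term) > max_word_length:
            continue
        if term in stop_words:
            continue
        idf = math.log(1 + num_docs / (df + 1))  # one of many possible IDF variants
        scored.append((tf * idf, term))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first
    return [term for _, term in scored[:max_query_terms]]


# Hypothetical document frequencies for an index of 45432 documents.
doc_freq = {"shark": 40, "beast": 300, "the": 45000, "island": 3}
terms = select_terms("shark shark beast the the the island", doc_freq, 45432,
                     min_term_freq=1, max_doc_freq=10000, max_query_terms=2)
print(terms)  # -> ['shark', 'beast']
```

In this toy run, "the" is dropped by max_doc_freq and "island" by min_doc_freq, which leaves the two terms that end up in the query.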
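Query formation parameters
- minimum_should_match: Once the query has been formed, controls how many of the selected terms must match. It uses the same syntax as the minimum_should_match parameter of other queries. Defaults to "30%".
- fail_on_unsupported_field: Controls whether the query should fail if any of the specified fields is not of a supported type (text or keyword). Defaults to true.
- boost_terms: Each term in the formed query can be further boosted by its TF-IDF score. This sets the boost factor to use. It is disabled by default (0); any positive value activates the feature.
- include: Specifies whether the input documents should also be returned in the search results. Defaults to false.
- boost: Sets the boost value of the whole query. Defaults to 1.0.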
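As a worked example of the default minimum_should_match, the short Python sketch below computes how many terms must match for a positive percentage, which is rounded down. The function name is made up, and it assumes the query really selected the given number of terms (it may select fewer than max_query_terms); other minimum_should_match forms such as negative values or combinations are not handled.

```python
def required_matches(num_terms, minimum_should_match="30%"):
    """Number of terms that must match for a positive percentage (rounded down)."""
    percent = int(minimum_should_match.rstrip("%"))
    return (num_terms * percent) // 100


print(required_matches(12))  # 12 selected terms -> 3
print(required_matches(25))  # 25 selected terms -> 7
```

So with max_query_terms set to 12, as in the requests in this article, a document needs to contain about 3 of the selected terms to be returned.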
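In practice
Returning to the use case mentioned earlier (finding similar movies to recommend to a user), let's run a few experiments.
In the example below, I use the synopsis of the movie "Jaws" and try to find similar movies: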
GET movies/_search
{
"size": 5,
"_source": [
"title",
"overview"
],
"query": {
"more_like_this": {
"fields": [
"overview"
],
"min_term_freq": 1,
"max_query_terms": 12,
"like": "An insatiable great white shark terrorizes the townspeople of Amity Island, The police chief, an oceanographer and a grizzled shark hunter seek to destroy the bloodthirsty beast."
}
}
}
Here are the top 5 results:
"hits" : [
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "578",
"_score" : 100.41875,
"_source" : {
"overview" : "An insatiable great white shark terrorizes the townspeople of Amity Island, The police chief, an oceanographer and a grizzled shark hunter seek to destroy the bloodthirsty beast.",
"title" : "Jaws"
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "52454",
"_score" : 19.65117,
"_source" : {
"overview" : "When the prehistoric warm-water beast the Crocosaurus crosses paths with that cold-water monster the Mega Shark, all hell breaks loose in the oceans as the world's top scientists explore every option to halt the aquatic frenzy. Swallowing everything in their paths -- including a submarine or two -- Croc and Mega lead an explorer and an oceanographer on a wild chase. Eventually, the desperate men turn to a volcano for aid.",
"title" : "Mega Shark vs. Crocosaurus"
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "246594",
"_score" : 18.51667,
"_source" : {
"overview" : "When another Mega Shark returns from the depths of the sea, world militaries go on high alert. Ocean traffic grinds to a standstill as everyone lives in fear of the insatiable beast. Out of options, the US government unleashes the top secret Mecha Shark project -- a mechanical shark built to have the same exact characteristics as Mega. A pair of scientists pilot the mechanical creature as they fight Mega in a pitched battle to save the planet. But when faulty mechanics cause the Mecha to go after humans, the scientists must somehow guide Mega to Mecha in hopes that the two titans will kill each other - or risk untold worldwide destruction.",
"title" : "Mega Shark vs. Mecha Shark"
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "43084",
"_score" : 14.461939,
"_source" : {
"overview" : "Wealthy big game hunter, Wilson Frields, funds an expedition going deep into the Florida Everglades to search for the Calusa: a lost tribe of Native Americans. When the team discover the gruesome remains of another expedition, Friels admits he is searching for the Calusa's Fountain of Youth and its guardian, a mythical and deadly beast. As they delve deeper into the Everglades, the bloodthirsty beast begins to stalk and kill members of the group and, in one struggle, their leader Brinson Thomas is injured and begins to metamorphose into a creature himself. His only hope: to drink from the waters of the Fountain. The terrible truth behind the Calusa must be discovered if any of them are going to get out of there alive!",
"title" : "Deadly Species"
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "385232",
"_score" : 11.46097,
"_source" : {
"overview" : "When the powerful wizard, Lord Tensley, is jilted by Princess Ennogard, he vows to rid the land of love. He commands his fire-breathing dragon to destroy any sign of affection seen throughout the kingdom. As the death toll rises, Camilan, a brave but arrogant warrior seeks to marry his true love despite the curse upon the land. In order to fulfill his destiny, he seeks the help of his estranged brother Ramicus, a bounty hunter with no desire to get involved. It takes an enchanted distress message and the promise of great reward from the beautiful Princess Ennogard, to lure Ramicus into the quest to defeat the wizard and his terrible beast.",
"title" : "Dudes & Dragons"
}
}
]
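Note that the first result is "Jaws" itself. This is because I did not run the query by pointing at the movie's document (had I done so, the document itself would not have been returned, since the include parameter defaults to false). Instead, I passed in the like parameter the synopsis exactly as it appears in the indexed document, and no document in the index can be more similar to it than that document itself.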
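To make the include behaviour concrete, here is a small Python sketch of the alternative request: it references the indexed "Jaws" document by its ID (578 in the results above) instead of pasting its synopsis. The helper name is hypothetical, and the body is only built here, not sent to a cluster.

```python
import json


def similar_to_document(index, doc_id, fields, include=False, size=5):
    """Build a more_like_this body that points at an indexed document (illustrative helper)."""
    return {
        "size": size,
        "_source": ["title", "overview"],
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": [{"_index": index, "_id": str(doc_id)}],
                "min_term_freq": 1,
                "max_query_terms": 12,
                # False is the default: the referenced document is left out of the hits.
                "include": include,
            }
        },
    }


without_self = similar_to_document("movies", 578, ["overview"])
with_self = similar_to_document("movies", 578, ["overview"], include=True)
print(json.dumps(with_self["query"], indent=2))
```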
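As for the other results, it is no surprise that the top ones are about sharks, since that is certainly a relevant term in the supplied text; the lower-ranked hits share other terms with it, such as "beast" and "hunter".
Now let's try supplying a document that represents a movie the user has just watched: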
GET movies/_search
{
"fields": [
"overview"
],
"query": {
"match": {
"title": "rocky"
}
},
"_source": false
}
The request above searches for the movie by title. Here are the results:
"hits" : [
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "1366",
"_score" : 11.072304,
"fields" : {
"overview" : [
"When world heavyweight boxing champion, Apollo Creed wants to give an unknown fighter a shot at the title as a publicity stunt, his handlers choose palooka Rocky Balboa, an uneducated collector for a Philadelphia loan shark. Rocky teams up with trainer Mickey Goldmill to make the most of this once in a lifetime break."
]
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "1371",
"_score" : 9.325746,
"fields" : {
"overview" : [
"Now the world champion, Rocky Balboa is living in luxury and only fighting opponents who pose no threat to him in the ring. His lifestyle of wealth and idleness is shaken when a powerful young fighter known as Clubber Lang challenges him to a bout. After taking a pounding from Lang, the humbled champ turns to former bitter rival Apollo Creed to help him regain his form for a rematch with Lang."
]
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "41288",
"_score" : 9.325746,
"fields" : {
"overview" : [
"""Step into the ring with one of America's greatest legends...and stand a couple of rounds with greatness! "Pulling no punches" (LA Daily News), Jon Favreau (Swingers) and Oscar(r) winner* George C. Scott give TKO performances in this outstanding biography of the only undefeated world heavyweight champion in the history of boxing! In the small blue-collar town of Brockton, Massachusetts, young Rocky Marciano (Favreau) turns to the ring as his ticket out. Training twice as hard and twice as long as anyone else, he pounds his way to victory and his reputation quickly spreads as "the guy to beat." But behind the gloves Rocky is unhappy with his gift and he's thinking of retiring. So, with the fate of his career hanging in the balance, he finds a way to unleash his thunder againthis time against his biggest hero: Joe Louis!"""
]
}
}
...
]
Above we can see a document whose _id is 1366. Next we look for documents similar to it, with a query like this:
GET movies/_search
{
"size": 5,
"fields": [
"overview"
],
"query": {
"more_like_this": {
"fields": [
"overview"
],
"like": {
"_index": "movies",
"_id": "1366"
},
"min_term_freq": 1,
"max_query_terms": 12
}
},
"_source": false
}
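The two steps above (look the movie up by title, then ask for documents like the top hit) can be chained in code. The Python sketch below works on a response dictionary shaped like the search results shown earlier; the function name is invented for this example and the sample response is trimmed down by hand, so treat it as an outline rather than a finished recommender.

```python
import json


def mlt_from_top_hit(search_response, index, fields, size=5):
    """Build a more_like_this body from the best hit of a previous search, or None."""
    hits = search_response["hits"]["hits"]
    if not hits:
        return None  # nothing matched the title search
    top = hits[0]
    return {
        "size": size,
        "fields": list(fields),
        "_source": False,
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": [{"_index": index, "_id": top["_id"]}],
                "min_term_freq": 1,
                "max_query_terms": 12,
            }
        },
    }


# Hand-trimmed response shaped like the "rocky" title search above.
response = {
    "hits": {
        "hits": [
            {"_index": "movies", "_id": "1366", "_score": 11.072304},
            {"_index": "movies", "_id": "1371", "_score": 9.325746},
        ]
    }
}
body = mlt_from_top_hit(response, "movies", ["overview"])
print(json.dumps(body, indent=2))
```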
The user may well be interested in other movies of the franchise, and these are the results we get:
"hits" : [
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "312221",
"_score" : 57.172073,
"fields" : {
"overview" : [
"The former World Heavyweight Champion Rocky Balboa serves as a trainer and mentor to Adonis Johnson, the son of his late friend and former rival Apollo Creed."
]
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "184741",
"_score" : 29.255241,
"fields" : {
"overview" : [
"A chorus girl (Marion Davies) and a heavyweight boxer (Clark Gable) are paired romantically as a publicity stunt."
]
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "1371",
"_score" : 27.638262,
"fields" : {
"overview" : [
"Now the world champion, Rocky Balboa is living in luxury and only fighting opponents who pose no threat to him in the ring. His lifestyle of wealth and idleness is shaken when a powerful young fighter known as Clubber Lang challenges him to a bout. After taking a pounding from Lang, the humbled champ turns to former bitter rival Apollo Creed to help him regain his form for a rematch with Lang."
]
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "1246",
"_score" : 26.655176,
"fields" : {
"overview" : [
"When he loses a highly publicized virtual boxing match to ex-champ Rocky Balboa, reigning heavyweight titleholder, Mason Dixon retaliates by challenging Rocky to a nationally televised, 10-round exhibition bout. To the surprise of his son and friends, Rocky agrees to come out of retirement and face an opponent who's faster, stronger and thirty years his junior."
]
}
},
{
"_index" : "movies",
"_type" : "_doc",
"_id" : "1367",
"_score" : 25.732662,
"fields" : {
"overview" : [
"""After Rocky goes the distance with champ Apollo Creed, both try to put the fight behind them and move on. Rocky settles down with Adrian but can't put his life together outside the ring, while Creed seeks a rematch to restore his reputation. Soon enough, the "Master of Disaster" and the "Italian Stallion" are set on a collision course for a climactic battle that is brutal and unforgettable."""
]
}
}
]
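Summary
The more_like_this query has great potential to add extra capabilities to our search solutions: on the one hand it is very simple to use, and on the other it offers useful parameters for tuning how similar documents are found. This query is also used in NLP contexts, more specifically for text classification.
In any case, I hope this article has sparked interest in experimenting with this and other Elasticsearch query types that are a little less commonly used.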