Elasticsearch: Keep words token filter
February 12, 2023 | by mebius
The keep words token filter keeps only tokens that are contained in a specified word list, even though your text may produce many more tokens than the list contains. In some cases we have a field containing many words, but turning every word in the field into a token may not be useful. This filter uses Lucene's KeepWordFilter, and its behavior is exactly the opposite of the stop filter we use so often. For the stop filter, see my earlier article "Elasticsearch: Token filter usage examples in analyzers".
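As a quick contrast, here is a minimal sketch of the stop filter, which removes the listed words instead of keeping them (the word list here is chosen purely for illustration):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stop",
      "stopwords": [ "a", "the", "of" ]
    }
  ],
  "text": "the use of dream-sharing technology"
}

This returns use, dream, sharing, and technology; the keep filter below does the inverse and returns only the listed words.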
Example
The following _analyze API request uses the keep filter to keep only the "thief", "corporate", "technology", and "project" tokens (the keep list also contains "elephant", which never appears in the text):
GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep",
      "keep_words": [ "thief", "corporate", "technology", "project", "elephant" ]
    }
  ],
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}
The command above returns:
{
  "tokens": [
    {
      "token": "thief",
      "start_offset": 2,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "corporate",
      "start_offset": 19,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "technology",
      "start_offset": 70,
      "end_offset": 80,
      "type": "<ALPHANUM>",
      "position": 12
    },
    {
      "token": "project",
      "start_offset": 187,
      "end_offset": 194,
      "type": "<ALPHANUM>",
      "position": 35
    }
  ]
}
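Note that the keep filter preserves the original token positions (1, 4, 12, 35 above) rather than renumbering them, so position-sensitive features such as phrase queries still see the gaps left by the removed tokens.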
From the result above, we can see that although the text field contains a long passage, the response includes only those tokens from the keep filter's keep_words list that actually occur in the text. Without the keep filter, the result looks like this:
GET _analyze
{
  "tokenizer": "standard",
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}
The command above returns the following:
{
  "tokens": [
    {
      "token": "A",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "thief",
      "start_offset": 2,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "who",
      "start_offset": 8,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "steals",
      "start_offset": 12,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 3
    },
    {
      "token": "corporate",
      "start_offset": 19,
      "end_offset": 28,
      "type": "<ALPHANUM>",
      "position": 4
    },
    {
      "token": "secrets",
      "start_offset": 29,
      "end_offset": 36,
      "type": "<ALPHANUM>",
      "position": 5
    },
    {
      "token": "through",
      "start_offset": 37,
      "end_offset": 44,
      "type": "<ALPHANUM>",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 45,
      "end_offset": 48,
      "type": "<ALPHANUM>",
      "position": 7
    },
    {
      "token": "use",
      "start_offset": 49,
      "end_offset": 52,
      "type": "<ALPHANUM>",
      "position": 8
    },
    {
      "token": "of",
      "start_offset": 53,
      "end_offset": 55,
      "type": "<ALPHANUM>",
      "position": 9
    },
    {
      "token": "dream",
      "start_offset": 56,
      "end_offset": 61,
      "type": "<ALPHANUM>",
      "position": 10
    },
    {
      "token": "sharing",
      "start_offset": 62,
      "end_offset": 69,
      "type": "<ALPHANUM>",
      "position": 11
    },
    {
      "token": "technology",
      "start_offset": 70,
      "end_offset": 80,
      "type": "<ALPHANUM>",
      "position": 12
    },
    {
      "token": "is",
      "start_offset": 81,
      "end_offset": 83,
      "type": "<ALPHANUM>",
      "position": 13
    },
    {
      "token": "given",
      "start_offset": 84,
      "end_offset": 89,
      "type": "<ALPHANUM>",
      "position": 14
    },
    {
      "token": "the",
      "start_offset": 90,
      "end_offset": 93,
      "type": "<ALPHANUM>",
      "position": 15
    },
    {
      "token": "inverse",
      "start_offset": 94,
      "end_offset": 101,
      "type": "<ALPHANUM>",
      "position": 16
    },
    {
      "token": "task",
      "start_offset": 102,
      "end_offset": 106,
      "type": "<ALPHANUM>",
      "position": 17
    },
    {
      "token": "of",
      "start_offset": 107,
      "end_offset": 109,
      "type": "<ALPHANUM>",
      "position": 18
    },
    {
      "token": "planting",
      "start_offset": 110,
      "end_offset": 118,
      "type": "<ALPHANUM>",
      "position": 19
    },
    {
      "token": "an",
      "start_offset": 119,
      "end_offset": 121,
      "type": "<ALPHANUM>",
      "position": 20
    },
    {
      "token": "idea",
      "start_offset": 122,
      "end_offset": 126,
      "type": "<ALPHANUM>",
      "position": 21
    },
    {
      "token": "into",
      "start_offset": 127,
      "end_offset": 131,
      "type": "<ALPHANUM>",
      "position": 22
    },
    {
      "token": "the",
      "start_offset": 132,
      "end_offset": 135,
      "type": "<ALPHANUM>",
      "position": 23
    },
    {
      "token": "mind",
      "start_offset": 136,
      "end_offset": 140,
      "type": "<ALPHANUM>",
      "position": 24
    },
    {
      "token": "of",
      "start_offset": 141,
      "end_offset": 143,
      "type": "<ALPHANUM>",
      "position": 25
    },
    {
      "token": "a",
      "start_offset": 144,
      "end_offset": 145,
      "type": "<ALPHANUM>",
      "position": 26
    },
    {
      "token": "C.E.O",
      "start_offset": 146,
      "end_offset": 151,
      "type": "<ALPHANUM>",
      "position": 27
    },
    {
      "token": "but",
      "start_offset": 154,
      "end_offset": 157,
      "type": "<ALPHANUM>",
      "position": 28
    },
    {
      "token": "his",
      "start_offset": 158,
      "end_offset": 161,
      "type": "<ALPHANUM>",
      "position": 29
    },
    {
      "token": "tragic",
      "start_offset": 162,
      "end_offset": 168,
      "type": "<ALPHANUM>",
      "position": 30
    },
    {
      "token": "past",
      "start_offset": 169,
      "end_offset": 173,
      "type": "<ALPHANUM>",
      "position": 31
    },
    {
      "token": "may",
      "start_offset": 174,
      "end_offset": 177,
      "type": "<ALPHANUM>",
      "position": 32
    },
    {
      "token": "doom",
      "start_offset": 178,
      "end_offset": 182,
      "type": "<ALPHANUM>",
      "position": 33
    },
    {
      "token": "the",
      "start_offset": 183,
      "end_offset": 186,
      "type": "<ALPHANUM>",
      "position": 34
    },
    {
      "token": "project",
      "start_offset": 187,
      "end_offset": 194,
      "type": "<ALPHANUM>",
      "position": 35
    },
    {
      "token": "and",
      "start_offset": 195,
      "end_offset": 198,
      "type": "<ALPHANUM>",
      "position": 36
    },
    {
      "token": "his",
      "start_offset": 199,
      "end_offset": 202,
      "type": "<ALPHANUM>",
      "position": 37
    },
    {
      "token": "team",
      "start_offset": 203,
      "end_offset": 207,
      "type": "<ALPHANUM>",
      "position": 38
    },
    {
      "token": "to",
      "start_offset": 208,
      "end_offset": 210,
      "type": "<ALPHANUM>",
      "position": 39
    },
    {
      "token": "disaster",
      "start_offset": 211,
      "end_offset": 219,
      "type": "<ALPHANUM>",
      "position": 40
    }
  ]
}
Clearly, this list is much longer than the one we got with the keep filter.
Adding the filter to an index
We can define an index as follows:
PUT keep_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_keep"
          ]
        }
      },
      "filter": {
        "my_keep": {
          "type": "keep",
          "keep_words": [
            "thief",
            "corporate",
            "technology",
            "project",
            "elephant"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
We can test it with the following command:
GET keep_example/_analyze
{
  "analyzer": "my_analyzer",
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}
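Since my_analyzer combines the standard tokenizer with the same keep list as before, this request should return exactly the same four tokens as the first _analyze example: thief, corporate, technology, and project.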
Configuration parameters
Parameter | Description
---|---
keep_words | (Required*, array of strings) List of words to keep. Only tokens that match words in this list are included in the output. Either this parameter or keep_words_path must be specified.
keep_words_path | (Required*, string) Path to a file that contains a list of words to keep. Only tokens that match words in this list are included in the output. The path must be absolute or relative to the Elasticsearch config location, and the file must be UTF-8 encoded; each word in the file must be separated by a line break. Either this parameter or keep_words must be specified.
keep_words_case | (Optional, Boolean) If true, lowercase all keep words. Defaults to false.
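Matching against keep_words is case-sensitive, so a common pattern is to lowercase the token stream and enable keep_words_case so that a mixed-case keep list still matches. A minimal sketch (the word list is purely illustrative):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    "lowercase",
    {
      "type": "keep",
      "keep_words": [ "Thief", "PROJECT" ],
      "keep_words_case": true
    }
  ],
  "text": "The Thief and his Project"
}

With keep_words_case enabled, the keep words are lowercased to thief and project, so they match the lowercased tokens and both are kept.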
In practice, our keep_words list can get quite long. Putting it inline in the command is inconvenient and hard to read, so we can instead place the list in a file referenced by keep_words_path, for example:
PUT keep_words_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_keep_word_array": {
          "tokenizer": "standard",
          "filter": [ "keep_word_array" ]
        },
        "standard_keep_word_file": {
          "tokenizer": "standard",
          "filter": [ "keep_word_file" ]
        }
      },
      "filter": {
        "keep_word_array": {
          "type": "keep",
          "keep_words": [ "one", "two", "three" ]
        },
        "keep_word_file": {
          "type": "keep",
          "keep_words_path": "analysis/example_word_list.txt"
        }
      }
    }
  }
}
As shown above, we place example_word_list.txt at the following location under the config directory of our Elasticsearch installation:
$ pwd
/Users/liuxg/elastic/elasticsearch-8.6.1/config/analysis
$ ls
example_word_list.txt
$ cat example_word_list.txt
thief
corporate
technology
project
elephant
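With the file in place, we can verify the file-based analyzer the same way (the sample text is just an illustration):

GET keep_words_example/_analyze
{
  "analyzer": "standard_keep_word_file",
  "text": "A thief may doom the project"
}

Only thief and project should come back, since they are the only tokens from example_word_list.txt that appear in the text.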