Elasticsearch: Keep words token filter

February 12, 2023   |   by mebius

The keep words token filter keeps only the tokens that are contained in a specified word list, even though your text may contain many more tokens than the list does. In some cases we have a field that contains many words, but turning every word in the field into a token may not be useful. This filter uses Lucene's KeepWordFilter. Its behavior is exactly the opposite of the commonly used stop filter. For the stop filter, see my earlier article "Elasticsearch: Usage examples of token filters in analyzers".
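As a quick contrast, a stop filter configured with the same word list does the inverse: it removes the listed words and keeps everything else. Here is a minimal sketch (the word list and text are illustrative):

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "stop",
      "stopwords": [ "thief", "corporate", "technology", "project" ]
    }
  ],
  "text": "A thief who steals corporate secrets"
}

This should return the tokens A, who, steals, and secrets, which are exactly the ones the keep filter below would drop.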

Example

The following _analyze API request uses the keep filter to keep only the "thief", "corporate", "technology", and "project" tokens:

GET _analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "keep",
      "keep_words": [ "thief", "corporate", "technology", "project", "elephant" ]
    }
  ],
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom ttgcodehe project and his team to disaster."
}

The command above returns the following result:

{
  "tokens": [
    {
      "token": "thief",
      "start_offset": 2,
      "end_offset": 7,
      "type": "",
      "position": 1
    },
    {
      "token": "corporate",
      "start_offset": 19,
      "end_offset": 28,
      "type": "",
      "position": 4
    },
    {
      "token": "technology",
      "start_offset": 70,
      "end_offset": 80,
      "type": "",
      "position": 12
    },
    {
      "token": "project",
      "start_offset": 187,
      "end_offset": 194,
      "type": "",
      "position": 35
    }
  ]
}

From the result above we can see that, although the text field contains a long passage, the response includes only a subset of the tokens, namely those listed in the keep filter's keep_words. If we run the same request normally, without the keep filter, the result looks like this:

GET _analyze
{
  "tokenizer": "standard",
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}

The command above returns:

{
  "tokens": [
    {
      "token": "A",
      "start_offset": 0,
      "end_offset": 1,
      "type": "",
      "position": 0
    },
    {
      "token": "thief",
      "start_offset": 2,
      "end_offset": 7,
      "type": "",
      "position": 1
    },
    {
      "token": "who",
      "start_offset": 8,
      "end_offset": 11,
      "type": "",
      "position": 2
    },
    {
      "token": "steals",
      "start_offset": 12,
      "end_offset": 18,
      "type": "",
      "position": 3
    },
    {
      "token": "corporate",
      "start_offset": 19,
      "end_offset": 28,
      "type": "",
      "position": 4
    },
    {
      "token": "secrets",
      "start_offset": 29,
      "end_offset": 36,
      "type": "",
      "position": 5
    },
    {
      "token": "through",
      "start_offset": 37,
      "end_offset": 44,
      "type": "",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 45,
      "end_offset": 48,
      "type": "",
      "position": 7
    },
    {
      "token": "use",
      "start_offset": 49,
      "end_offset": 52,
      "type": "",
      "position": 8
    },
    {
      "token": "of",
      "start_offset": 53,
      "end_offset": 55,
      "type": "",
      "position": 9
    },
    {
      "token": "dream",
      "start_offset": 56,
      "end_offset": 61,
      "type": "",
      "position": 10
    },
    {
      "token": "sharing",
      "start_offset": 62,
      "end_offset": 69,
      "type": "",
      "position": 11
    },
    {
      "token": "technology",
      "start_offset": 70,
      "end_offset": 80,
      "type": "",
      "position": 12
    },
    {
      "token": "is",
      "start_offset": 81,
      "end_offset": 83,
      "type": "",
      "position": 13
    },
    {
      "token": "given",
      "start_offset": 84,
      "end_offset": 89,
      "type": "",
      "position": 14
    },
    {
      "token": "the",
      "start_offset": 90,
      "end_offset": 93,
      "type": "",
      "position": 15
    },
    {
      "token": "inverse",
      "start_offset": 94,
      "end_offset": 101,
      "type": "",
      "position": 16
    },
    {
      "token": "task",
      "start_offset": 102,
      "end_offset": 106,
      "type": "",
      "position": 17
    },
    {
      "token": "of",
      "start_offset": 107,
      "end_offset": 109,
      "type": "",
      "position": 18
    },
    {
      "token": "planting",
      "start_offset": 110,
      "end_offset": 118,
      "type": "",
      "position": 19
    },
    {
      "token": "an",
      "start_offset": 119,
      "end_offset": 121,
      "type": "",
      "position": 20
    },
    {
      "token": "idea",
      "start_offset": 122,
      "end_offset": 126,
      "type": "",
      "position": 21
    },
    {
      "token": "into",
      "start_offset": 127,
      "end_offset": 131,
      "type": "",
      "position": 22
    },
    {
      "token": "the",
      "start_offset": 132,
      "end_offset": 135,
      "type": "",
      "position": 23
    },
    {
      "token": "mind",
      "start_offset": 136,
      "end_offset": 140,
      "type": "",
      "position": 24
    },
    {
      "token": "of",
      "start_offset": 141,
      "end_offset": 143,
      "type": "",
      "position": 25
    },
    {
      "token": "a",
      "start_offset": 144,
      "end_offset": 145,
      "type": "",
      "position": 26
    },
    {
      "token": "C.E.O",
      "start_offset": 146,
      "end_offset": 151,
      "type": "",
      "position": 27
    },
    {
      "token": "but",
      "start_offset": 154,
      "end_offset": 157,
      "type": "<ALPHANUM>",
      "position": 28
    },
    {
      "token": "his",
      "start_offset": 158,
      "end_offset": 161,
      "type": "",
      "position": 29
    },
    {
      "token": "tragic",
      "start_offset": 162,
      "end_offset": 168,
      "type": "",
      "position": 30
    },
    {
      "token": "past",
      "start_offset": 169,
      "end_offset": 173,
      "type": "",
      "position": 31
    },
    {
      "token": "may",
      "start_offset": 174,
      "end_offset": 177,
      "type": "",
      "position": 32
    },
    {
      "token": "doom",
      "start_offset": 178,
      "end_offset": 182,
      "type": "",
      "position": 33
    },
    {
      "token": "the",
      "start_offset": 183,
      "end_offset": 186,
      "type": "",
      "position": 34
    },
    {
      "token": "project",
      "start_offset": 187,
      "end_offset": 194,
      "type": "",
      "position": 35
    },
    {
      "token": "and",
      "start_offset": 195,
      "end_offset": 198,
      "type": "",
      "position": 36
    },
    {
      "token": "his",
      "start_offset": 199,
      "end_offset": 202,
      "type": "",
      "position": 37
    },
    {
      "token": "team",
      "start_offset": 203,
      "end_offset": 207,
      "type": "",
      "position": 38
    },
    {
      "token": "to",
      "start_offset": 208,
      "end_offset": 210,
      "type": "",
      "position": 39
    },
    {
      "token": "disaster",
      "start_offset": 211,
      "end_offset": 219,
      "type": "",
      "position": 40
    }
  ]
}

Clearly, this list is much longer than the one we got when using the filter.

Adding it to an index

We can define an index as follows:

PUT keep_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_keep"
          ]
        }
      },
      "filter": {
        "my_keep": {
          "type": "keep",
          "stopwords": [
            "thief",
            "corporate",
            "technology",
            "project",
            "elephant"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}

We can test it with the following command:

GET keep_example/_analyze
{
  "analyzer": "my_analyzer", 
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology is given the inverse task of planting an idea into the mind of a C.E.O., but his tragic past may doom the project and his team to disaster."
}
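Based on the first _analyze result above, this request should return only the four tokens thief, corporate, technology, and project. We can also verify the effect at search time. Below is a minimal sketch, assuming the keep_example index above; the document ID and query term are illustrative:

PUT keep_example/_doc/1
{
  "text": "A thief who steals corporate secrets"
}

GET keep_example/_search
{
  "query": {
    "match": {
      "text": "secrets"
    }
  }
}

Because "secrets" is not in the keep_words list, it is never indexed, and the analyzed query produces no tokens, so this search should return no hits. Searching for "thief" instead should match the document.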

Configuration parameters

keep_words

(Required*, array of strings) A list of words to keep. Only tokens that match words in this list are included in the output.

Either this parameter or keep_words_path must be specified.

keep_words_path

(Required*, string) Path to a file that contains the list of words to keep. Only tokens that match words in this list are included in the output.

This path must be absolute or relative to the Elasticsearch config location, and the file must be UTF-8 encoded. Each word in the file must be separated by a line break.

Either this parameter or keep_words must be specified.

keep_words_case

(Optional, Boolean) If true, lowercase all keep words. Defaults to false.
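Note that by default the keep filter compares tokens against the list as-is, so Thief and thief are treated as different words. Below is a minimal sketch of keep_words_case in use; the index, analyzer, and filter names are illustrative:

PUT keep_case_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_case_analyzer": {
          "tokenizer": "standard",
          "filter": [ "my_keep_ignore_case" ]
        }
      },
      "filter": {
        "my_keep_ignore_case": {
          "type": "keep",
          "keep_words": [ "Thief", "Corporate" ],
          "keep_words_case": true
        }
      }
    }
  }
}

With keep_words_case set to true, the keep words are lowercased before matching, so a lowercase token like thief should still match the Thief entry. To make matching fully case-insensitive, a lowercase filter can also be placed before the keep filter in the analyzer's filter chain.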

In practice, our keep_words list can be quite long. Putting it inline in the command is inconvenient and hard to read. Instead, we can put the list in a file referenced by keep_words_path, for example:

PUT keep_words_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_keep_word_array": {
          "tokenizer": "standard",
          "filter": [ "keep_word_array" ]
        },
        "standard_keep_word_file": {
          "tokenizer": "standard",
          "filter": [ "keep_word_file" ]
        }
      },
      "filter": {
        "keep_word_array": {
          "type": "keep",
          "keep_words": [ "one", "two", "three" ]
        },
        "keep_word_file": {
          "type": "keep",
          "keep_words_path": "analysis/example_word_list.txt"
        }
      }
    }
  }
}

As shown above, we place example_word_list.txt at the following location inside our Elasticsearch installation directory:

$ pwd
/Users/liuxg/elastic/elasticsearch-8.6.1/config/analysis
$ ls
example_word_list.txt
$ cat example_word_list.txt 
thief
corporate
technology
project
elephant
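With the file in place, the file-based analyzer can be tested the same way as the inline one. Only the words listed in example_word_list.txt should survive:

GET keep_words_example/_analyze
{
  "analyzer": "standard_keep_word_file",
  "text": "A thief who steals corporate secrets through the use of dream-sharing technology"
}

This should return the tokens thief, corporate, and technology.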
