8.2.1


Elasticsearch ships with eight built-in analyzers by default; you can pick a different analyzer for different scenarios.

1. standard analyzer

1.1 The standard type and its tokenization output

When no analyzer is explicitly specified, the standard analyzer is the default. It splits text on grammar boundaries (using the Unicode Text Segmentation algorithm) and performs well for most languages.

```
// Test the default tokenization of the standard analyzer
// Request
POST _analyze
{
  "analyzer": "standard",
  "text": "transimission control protocol is a transport layer protocol"
}

// Response
{
  "tokens" : [
    { "token" : "transimission", "start_offset" : 0,  "end_offset" : 13, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "control",       "start_offset" : 14, "end_offset" : 21, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "protocol",      "start_offset" : 22, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "is",            "start_offset" : 31, "end_offset" : 33, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "a",             "start_offset" : 34, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "transport",     "start_offset" : 36, "end_offset" : 45, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "layer",         "start_offset" : 46, "end_offset" : 51, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "protocol",      "start_offset" : 52, "end_offset" : 60, "type" : "<ALPHANUM>", "position" : 7 }
  ]
}
```

Tokenizing the sentence above yields the terms: [transimission, control, protocol, is, a, transport, layer, protocol]
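The behavior can be approximated in a few lines of Python. This is only a rough sketch of what the standard analyzer does (the real implementation follows the full Unicode Text Segmentation algorithm, UAX #29, inside Lucene); the function name is illustrative:

```python
import re

def approx_standard_analyze(text):
    # Rough stand-in for the standard analyzer: split the text into
    # word-character runs, then lowercase each token. The real analyzer
    # uses the Unicode Text Segmentation algorithm rather than \w+.
    return [t.lower() for t in re.findall(r"\w+", text)]

print(approx_standard_analyze(
    "transimission control protocol is a transport layer protocol"))
```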

1.2 Configurable parameters of the standard type

| # | Parameter | Description |
|---|-----------|-------------|
| 1 | max_token_length | Maximum length allowed for a single token split from the original string. If a token exceeds this length, it is split at the maximum length and the remainder becomes another token. Default: 255. |
| 2 | stopwords | Predefined stop words; zero or more, e.g. `_english_` or an array of words. Default: `_none_`. |
| 3 | stopwords_path | Path to a file containing stop words. |

The following example configures the max_token_length parameter:

```
// Define an analyzer with standard parameters configured
PUT standard_analyzer_token_length_conf_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

// Test the configured parameters
POST standard_analyzer_token_length_conf_index/_analyze
{
  "analyzer": "english_analyzer",
  "text": "transimission control protocol is a transport layer protocol"
}

// Response
{
  "tokens" : [
    { "token" : "trans", "start_offset" : 0,  "end_offset" : 5,  "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "imiss", "start_offset" : 5,  "end_offset" : 10, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "ion",   "start_offset" : 10, "end_offset" : 13, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "contr", "start_offset" : 14, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "ol",    "start_offset" : 19, "end_offset" : 21, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "proto", "start_offset" : 22, "end_offset" : 27, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "col",   "start_offset" : 27, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "trans", "start_offset" : 36, "end_offset" : 41, "type" : "<ALPHANUM>", "position" : 9 },
    { "token" : "port",  "start_offset" : 41, "end_offset" : 45, "type" : "<ALPHANUM>", "position" : 10 },
    { "token" : "layer", "start_offset" : 46, "end_offset" : 51, "type" : "<ALPHANUM>", "position" : 11 },
    { "token" : "proto", "start_offset" : 52, "end_offset" : 57, "type" : "<ALPHANUM>", "position" : 12 },
    { "token" : "col",   "start_offset" : 57, "end_offset" : 60, "type" : "<ALPHANUM>", "position" : 13 }
  ]
}
```

Tokenizing the sentence above now yields: [trans, imiss, ion, contr, ol, proto, col, trans, port, layer, proto, col]. Note that "is" and "a" were removed by the `_english_` stop words (hence the gap in positions), and every token longer than 5 characters was split at the 5-character mark.
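The max_token_length splitting can be sketched in Python. This is an illustrative approximation of the behavior shown in the response above, not the Lucene implementation; the function name and the tiny stop word set are assumptions:

```python
def split_long_tokens(tokens, max_len=5):
    # Mimic max_token_length: any token longer than max_len is cut
    # into consecutive chunks of at most max_len characters.
    out = []
    for tok in tokens:
        for i in range(0, len(tok), max_len):
            out.append(tok[i:i + max_len])
    return out

STOPWORDS = {"is", "a"}  # tiny subset of _english_, for illustration only

text = "transimission control protocol is a transport layer protocol"
tokens = [t for t in text.split() if t not in STOPWORDS]
print(split_long_tokens(tokens))
```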

1.3 Composition of the standard analyzer

| # | Component | Definition |
|---|-----------|------------|
| 1 | Tokenizer | standard tokenizer |
| 2 | Token Filters | lowercase token filter, stop token filter (disabled by default) |
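Conceptually, an analyzer is just a tokenizer followed by a chain of token filters. A minimal Python sketch of that composition, with simplified stand-ins for the standard tokenizer and the lowercase/stop filters (the names and regex are assumptions, not Elasticsearch internals):

```python
import re

def analyze(text, tokenizer, token_filters):
    # An analyzer pipeline: one tokenizer, then zero or more token filters
    # applied in order to the token stream.
    tokens = tokenizer(text)
    for f in token_filters:
        tokens = f(tokens)
    return tokens

# Simplified stand-ins for the standard analyzer's components.
standard_tokenizer = lambda s: re.findall(r"\w+", s)
lowercase_filter = lambda ts: [t.lower() for t in ts]
stop_filter = lambda ts: [t for t in ts if t not in {"is", "a"}]

print(analyze("Transport Layer Protocol", standard_tokenizer, [lowercase_filter]))
```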

If you want to define a custom analyzer similar to standard, you can copy standard's components wholesale into a custom definition and adjust only what you need, as in the following example:

```
// Define a custom analyzer rebuilt from standard's components
PUT custom_rebuild_standard_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuild_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

// Test request (the analyzer must be named, otherwise the index default is used)
POST custom_rebuild_standard_analyzer_index/_analyze
{
  "analyzer": "rebuild_analyzer",
  "text": "transimission control protocol is a transport layer protocol"
}

// Response
{
  "tokens" : [
    { "token" : "transimission", "start_offset" : 0,  "end_offset" : 13, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "control",       "start_offset" : 14, "end_offset" : 21, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "protocol",      "start_offset" : 22, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "is",            "start_offset" : 31, "end_offset" : 33, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "a",             "start_offset" : 34, "end_offset" : 35, "type" : "<ALPHANUM>", "position" : 4 },
    { "token" : "transport",     "start_offset" : 36, "end_offset" : 45, "type" : "<ALPHANUM>", "position" : 5 },
    { "token" : "layer",         "start_offset" : 46, "end_offset" : 51, "type" : "<ALPHANUM>", "position" : 6 },
    { "token" : "protocol",      "start_offset" : 52, "end_offset" : 60, "type" : "<ALPHANUM>", "position" : 7 }
  ]
}
```

If a custom analyzer is meant to use the built-in standard analyzer's configuration parameters, its type must be standard; otherwise those parameters have no effect. Example:

```
// Custom analyzer
PUT custom_rebuild_standard_analyzer_index_1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuild_analyzer": {
          // If type here is "standard", max_token_length takes effect;
          // if it is "custom", the parameter is ignored
          "type": "custom",
          "tokenizer": "standard",
          "max_token_length": 8,
          "filter": ["lowercase"]
        }
      }
    }
  }
}

// Verify
POST custom_rebuild_standard_analyzer_index_1/_analyze
{
  "analyzer": "rebuild_analyzer",
  "text": "transimission control protocol is a transport layer protocol"
}
```

You can verify all of the examples above on your own cluster.

2. simple analyzer

2.1 The simple type and its tokenization output

The simple analyzer splits text on any character that is not a letter, and lowercases every resulting token.

```
// Test the default tokenization of the simple analyzer
// Request
POST _analyze
{
  "analyzer": "simple",
  "text": "Transimission Control Protocol is a transport layer protocol"
}

// Response
{
  "tokens" : [
    { "token" : "transimission", "start_offset" : 0,  "end_offset" : 13, "type" : "word", "position" : 0 },
    { "token" : "control",       "start_offset" : 14, "end_offset" : 21, "type" : "word", "position" : 1 },
    { "token" : "protocol",      "start_offset" : 22, "end_offset" : 30, "type" : "word", "position" : 2 },
    { "token" : "is",            "start_offset" : 31, "end_offset" : 33, "type" : "word", "position" : 3 },
    { "token" : "a",             "start_offset" : 34, "end_offset" : 35, "type" : "word", "position" : 4 },
    { "token" : "transport",     "start_offset" : 36, "end_offset" : 45, "type" : "word", "position" : 5 },
    { "token" : "layer",         "start_offset" : 46, "end_offset" : 51, "type" : "word", "position" : 6 },
    { "token" : "protocol",      "start_offset" : 52, "end_offset" : 60, "type" : "word", "position" : 7 }
  ]
}
```

Tokenizing the sentence above yields the terms: [transimission, control, protocol, is, a, transport, layer, protocol]. Note that the uppercase words were lowercased.
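A minimal Python sketch of the simple analyzer's behavior, restricted to ASCII letters for brevity (the real lowercase tokenizer handles the full Unicode letter categories; the function name is illustrative):

```python
import re

def approx_simple_analyze(text):
    # Approximate the simple analyzer: keep maximal runs of letters,
    # splitting on everything else (digits, punctuation, whitespace),
    # and lowercase each token.
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

print(approx_simple_analyze("Transimission Control Protocol-2"))
```

Because digits are not letters, a string like "Protocol-2" produces only the token "protocol", which is one practical difference from the standard analyzer.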

2.2 Composition of the simple analyzer

| # | Component | Definition |
|---|-----------|------------|
| 1 | Tokenizer | lowercase tokenizer |

If you want to define a custom analyzer similar to simple, simply declare a custom analyzer and copy simple's configuration wholesale, as in the following example:

```
// Define a custom analyzer rebuilt from simple's components
PUT custom_rebuild_simple_analyzer_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuild_simple": {
          "tokenizer": "lowercase",
          "filter": []
        }
      }
    }
  }
}

// Test request (the analyzer must be named, otherwise the index default is used)
POST custom_rebuild_simple_analyzer_index/_analyze
{
  "analyzer": "rebuild_simple",
  "text": "transimission control protocol is a transport layer protocol"
}

// Response
{
  "tokens" : [
    { "token" : "transimission", "start_offset" : 0,  "end_offset" : 13, "type" : "word", "position" : 0 },
    { "token" : "control",       "start_offset" : 14, "end_offset" : 21, "type" : "word", "position" : 1 },
    { "token" : "protocol",      "start_offset" : 22, "end_offset" : 30, "type" : "word", "position" : 2 },
    { "token" : "is",            "start_offset" : 31, "end_offset" : 33, "type" : "word", "position" : 3 },
    { "token" : "a",             "start_offset" : 34, "end_offset" : 35, "type" : "word", "position" : 4 },
    { "token" : "transport",     "start_offset" : 36, "end_offset" : 45, "type" : "word", "position" : 5 },
    { "token" : "layer",         "start_offset" : 46, "end_offset" : 51, "type" : "word", "position" : 6 },
    { "token" : "protocol",      "start_offset" : 52, "end_offset" : 60, "type" : "word", "position" : 7 }
  ]
}
```

