Analyzers: Introduction and Usage

2024-07-11 17:23 | Source: compiled from the web

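What is an analyzer

An analyzer is a tool that breaks a piece of user-supplied text into multiple terms according to a defined set of rules.

Example: The best 3-points shooter is Curry!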

Common built-in analyzers

standard analyzer
simple analyzer
whitespace analyzer
stop analyzer
language analyzer
pattern analyzer

standard analyzer

The standard analyzer is the default analyzer; it is used whenever no analyzer is specified.
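The _analyze requests in this article can also be issued from a script rather than from a REST client. The following is a minimal sketch using only the Python standard library. The function names build_analyze_request and analyze are hypothetical, and the sketch assumes a local node reachable over plain HTTP at localhost:9200 with no authentication.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://localhost:9200"):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        # A JSON body needs an explicit JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text):
    """Send the request and return the token strings (needs a running node)."""
    with urllib.request.urlopen(build_analyze_request(analyzer, text)) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]


# Building the request does not touch the network.
req = build_analyze_request("standard", "The best 3-points shooter is Curry!")
print(req.get_method(), req.full_url)
```

With a node running, analyze("standard", "The best 3-points shooter is Curry!") would return the tokens produced by the standard analyzer.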

POST localhost:9200/_analyze

{ "analyzer": "standard", "text": "The best 3-points shooter is Curry!" }

simple analyzer

The simple analyzer splits the text into terms whenever it encounters a character that is not a letter, and it lowercases every term.
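To make the rule concrete, here is a rough Python approximation of that behavior. It is a sketch for illustration only, not the Elasticsearch implementation, and the function name simple_analyze is hypothetical.

```python
import re


def simple_analyze(text):
    """Approximate the simple analyzer: keep runs of letters, lowercase them."""
    # [^\W\d_]+ matches runs of letters only (word characters minus digits and "_"),
    # so digits and punctuation act as separators and are dropped.
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]


print(simple_analyze("The best 3-points shooter is Curry!"))
# ['the', 'best', 'points', 'shooter', 'is', 'curry']
```

Note that the digit 3 disappears entirely, because it is not a letter.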

POST localhost:9200/_analyze

{ "analyzer": "simple", "text": "The best 3-points shooter is Curry!" }

whitespace analyzer

The whitespace analyzer splits the text into terms whenever it encounters a whitespace character.
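A rough Python equivalent, again only a sketch with a hypothetical function name, shows that terms keep their original case and punctuation:

```python
def whitespace_analyze(text):
    """Approximate the whitespace analyzer: split on runs of whitespace only."""
    # str.split() with no argument splits on any whitespace run;
    # nothing is lowercased and punctuation stays attached to the term.
    return text.split()


print(whitespace_analyze("The best 3-points shooter is Curry!"))
# ['The', 'best', '3-points', 'shooter', 'is', 'Curry!']
```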

POST localhost:9200/_analyze

{ "analyzer": "whitespace", "text": "The best 3-points shooter is Curry!" }

stop analyzer

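The stop analyzer is very similar to the simple analyzer. The only difference is that it also removes stop words; by default it uses the english stop word list.

stopwords: a predefined list of stop words, such as the, a, an, this, of, at, and so on.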
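The combination of letter-based splitting, lowercasing, and stop word removal can be sketched in Python as follows. The function name stop_analyze is hypothetical, and the word set below is written out from memory of the default English list, so treat it as an assumption and check it against your Elasticsearch version.

```python
import re

# Assumed default English stop words (verify against your version).
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def stop_analyze(text, stop_words=ENGLISH_STOP_WORDS):
    """Approximate the stop analyzer: simple-analyzer terms minus stop words."""
    # Same splitting as the simple analyzer: runs of letters, lowercased.
    terms = [term.lower() for term in re.findall(r"[^\W\d_]+", text)]
    return [term for term in terms if term not in stop_words]


print(stop_analyze("The best 3-points shooter is Curry!"))
# ['best', 'points', 'shooter', 'curry']
```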

POST localhost:9200/_analyze

{ "analyzer": "stop", "text": "The best 3-points shooter is Curry!" }

language analyzer

Language analyzers are analyzers for specific languages; for example, english is the English analyzer. Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

POST localhost:9200/_analyze

{ "analyzer": "english", "text": "The best 3-points shooter is Curry!" }

pattern analyzer

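The pattern analyzer uses a regular expression to split the text into terms. The default regular expression is \W+ (one or more non-word characters).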
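The splitting step can be illustrated with Python's re module. This is only a sketch with a hypothetical function name: it assumes the analyzer lowercases terms by default, and Python's \W is Unicode-aware, so results may differ from the Java regex engine for non-ASCII text.

```python
import re


def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Approximate the pattern analyzer: split on a regex, optionally lowercase."""
    # re.split can yield empty strings at the edges (for example after the
    # trailing "!"), so those are filtered out.
    terms = [term for term in re.split(pattern, text) if term]
    return [term.lower() for term in terms] if lowercase else terms


print(pattern_analyze("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```

Unlike the simple analyzer, the digit 3 survives as its own term, because digits are word characters.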

POST localhost:9200/_analyze

{ "analyzer": "pattern", "text": "The best 3-points shooter is Curry!" }

Choosing an analyzer

PUT localhost:9200/my_index

{ "settings": { "analysis": { "analyzer": { "my_analyzer": { "type": "whitespace" } } } }, "mappings": { "properties": { "name": { "type": "text" }, "team_name": { "type": "text" }, "position": { "type": "text" }, "play_year": { "type": "long" }, "jerse_no": { "type": "keyword" }, "title": { "type": "text", "analyzer": "my_analyzer" } } } }

PUT localhost:9200/my_index/_doc/1

{ "name": "库里", "team_name": "勇士", "position": "控球后卫", "play_year": 10, "jerse_no": "30", "title": "The best 3-points shooter is Curry!" }
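Because the title field is mapped to my_analyzer (of type whitespace), a match query on title only hits when a query term is identical to an indexed term, including case and punctuation. The sketch below imitates that comparison in plain Python. The function names are hypothetical, and it assumes the query text is analyzed with the same analyzer as the field and that a match succeeds when any single term matches.

```python
def whitespace_terms(text):
    """Terms as the whitespace analyzer would produce them."""
    return text.split()


def matches(field_text, query_text):
    """True if any analyzed query term equals an analyzed field term."""
    indexed = set(whitespace_terms(field_text))
    return any(term in indexed for term in whitespace_terms(query_text))


title = "The best 3-points shooter is Curry!"
print(matches(title, "Curry!"))  # True: "Curry!" was indexed as-is
print(matches(title, "curry"))   # False: case differs and "!" is missing
print(matches(title, "Curry"))   # False: the indexed term includes "!"
```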

POST localhost:9200/my_index/_search

{ "query": { "match": { "title": "Curry!" } } }

Using common Chinese analyzers

With the default standard analyzer (tested with Postman):

POST localhost:9200/_analyze

{ "analyzer": "standard", "text": "火箭明年总冠军" }

Result

{ "tokens": [
  { "token": "火", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
  { "token": "箭", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
  { "token": "明", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
  { "token": "年", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
  { "token": "总", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 },
  { "token": "冠", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 },
  { "token": "军", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 }
] }

The standard analyzer splits the Chinese text into single characters, which is rarely useful for search.

Common analyzers

smartCN: a simple analyzer for Chinese or mixed Chinese-English text
IK analyzer: a smarter, more user-friendly Chinese analyzer

smartCN analyzer

Install from the bin directory: sh elasticsearch-plugin install analysis-smartcn

Verify

Restart Elasticsearch after installation.

POST localhost:9200/_analyze

{ "analyzer": "smartcn", "text": "火箭明年总冠军" }

Result

{ "tokens": [
  { "token": "火箭", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 },
  { "token": "明年", "start_offset": 2, "end_offset": 4, "type": "word", "position": 1 },
  { "token": "总", "start_offset": 4, "end_offset": 5, "type": "word", "position": 2 },
  { "token": "冠军", "start_offset": 5, "end_offset": 7, "type": "word", "position": 3 }
] }

Uninstall: sh elasticsearch-plugin remove analysis-smartcn

IK analyzer

Download: https://github.com/medcl/elasticsearch-analysis-ik/releases

Install: unzip the release into the plugins directory.

Verify

Restart Elasticsearch after installation.

POST localhost:9200/_analyze

{ "analyzer": "ik_max_word", "text": "火箭明年总冠军" }
