Introduction to Analyzers and How to Use Them


2024-07-11 17:23 | Source: compiled from the web | Views: 265

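What is an analyzer

An analyzer is a tool that takes a piece of text entered by the user and, following a defined set of rules, breaks it into individual terms.

Example: The best 3-points shooter is Curry!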

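Common built-in analyzers

standard analyzer
simple analyzer
whitespace analyzer
stop analyzer
language analyzer
pattern analyzer

standard analyzer

The standard analyzer is the default: it is used whenever no analyzer is specified. It splits text on word boundaries (following the Unicode Text Segmentation rules), drops most punctuation, and lowercases each term.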
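The `_analyze` endpoint used throughout this article can also be called from a script instead of Postman. The following is a minimal sketch using only the Python standard library. It assumes an unsecured node at `http://localhost:9200`, as in the article's requests; the helper names `build_analyze_request` and `analyze` are hypothetical and not part of any Elasticsearch client.

```python
import json
import urllib.request


def build_analyze_request(analyzer, text, base_url="http://localhost:9200"):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text, base_url="http://localhost:9200"):
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]


# Usage (needs a running node at base_url, so it is left as a comment):
# analyze("standard", "The best 3-points shooter is Curry!")
```

Keeping request construction separate from sending makes it easy to inspect the exact URL and body before anything goes over the network.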

POST localhost:9200/_analyze

{ "analyzer": "standard", "text": "The best 3-points shooter is Curry!" }

simple analyzer

The simple analyzer splits the text into terms whenever it encounters a character that is not a letter, and it lowercases every term.
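To make the rule concrete, here is a rough Python approximation of that behavior. It is only a sketch for building intuition, not the Elasticsearch implementation, and the function name `simple_analyze` is made up for this example. The regular expression treats "a word character that is neither a digit nor an underscore" as a letter, which is close to, but not guaranteed identical to, the letter test Elasticsearch applies.

```python
import re


def simple_analyze(text):
    """Approximate the simple analyzer: keep runs of letters, lowercased."""
    # [^\W\d_] = a word character that is not a digit and not "_".
    return [term.lower() for term in re.findall(r"[^\W\d_]+", text)]


print(simple_analyze("The best 3-points shooter is Curry!"))
# ['the', 'best', 'points', 'shooter', 'is', 'curry']
```

Note that the digit `3` disappears entirely, because digits are not letters.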

POST localhost:9200/_analyze

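{ "analyzer": "simple", "text": "The best 3-points shooter is Curry!" }

whitespace analyzer

The whitespace analyzer splits the text into terms whenever it encounters a whitespace character. It does not lowercase terms or strip punctuation.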
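The same behavior can be imitated in one line of Python, which is a convenient way to predict what this analyzer will emit. This is an illustrative approximation only; `whitespace_analyze` is a hypothetical name.

```python
def whitespace_analyze(text):
    """Approximate the whitespace analyzer: split on runs of whitespace only."""
    # str.split() with no argument splits on any whitespace and drops empties.
    return text.split()


print(whitespace_analyze("The best 3-points shooter is Curry!"))
# ['The', 'best', '3-points', 'shooter', 'is', 'Curry!']
```

Case and punctuation survive, so the last term is `Curry!` rather than `curry`.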

POST localhost:9200/_analyze

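{ "analyzer": "whitespace", "text": "The best 3-points shooter is Curry!" }

stop analyzer

The stop analyzer is very similar to the simple analyzer. The only difference is that it also removes stop words; by default it uses the English stop word list.

Stop words come from a predefined list, for example: the, a, an, this, of, at, and so on.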
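As a sketch of how stop word removal changes the output, the following Python approximates the stop analyzer as "letter runs, lowercased, minus stop words". The word list is the default English set as documented for Elasticsearch to the best of my knowledge; treat it as an assumption and check it against your version. The name `stop_analyze` is hypothetical.

```python
import re

# Assumed default English stop words (verify against your Elasticsearch version).
ENGLISH_STOP_WORDS = frozenset(
    "a an and are as at be but by for if in into is it no not of on or such "
    "that the their then there these they this to was will with".split()
)


def stop_analyze(text, stop_words=ENGLISH_STOP_WORDS):
    """Approximate the stop analyzer: simple-analyzer terms minus stop words."""
    terms = [term.lower() for term in re.findall(r"[^\W\d_]+", text)]
    return [term for term in terms if term not in stop_words]


print(stop_analyze("The best 3-points shooter is Curry!"))
# ['best', 'points', 'shooter', 'curry']
```

Compared with the simple analyzer, `the` and `is` are gone.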

POST localhost:9200/_analyze

{ "analyzer": "stop", "text": "The best 3-points shooter is Curry!" }

language analyzer

A language analyzer targets one specific language, for example english for English text. Built-in languages: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai

POST localhost:9200/_analyze

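{ "analyzer": "english", "text": "The best 3-points shooter is Curry!" }

pattern analyzer

The pattern analyzer uses a regular expression to split the text into terms. The default regular expression is \W+ (one or more non-word characters).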
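The effect of splitting on `\W+` can be tried out directly with Python's `re` module. This is a sketch, not the Elasticsearch code path: `pattern_analyze` is a hypothetical name, lowercasing is included because the pattern analyzer lowercases terms by default, and Python's `\W` is Unicode-aware, so results may differ from Elasticsearch for non-ASCII text.

```python
import re


def pattern_analyze(text, pattern=r"\W+", lowercase=True):
    """Approximate the pattern analyzer: split on a regex, drop empty terms."""
    terms = [term for term in re.split(pattern, text) if term]
    return [term.lower() for term in terms] if lowercase else terms


print(pattern_analyze("The best 3-points shooter is Curry!"))
# ['the', 'best', '3', 'points', 'shooter', 'is', 'curry']
```

Unlike the simple analyzer, digits count as word characters here, so `3` is kept as its own term.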

POST localhost:9200/_analyze

{ "analyzer": "pattern", "text": "The best 3-points shooter is Curry!" }

Choosing an analyzer

PUT localhost:9200/my_index

{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": { "type": "whitespace" }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text" },
      "team_name": { "type": "text" },
      "position": { "type": "text" },
      "play_year": { "type": "long" },
      "jerse_no": { "type": "keyword" },
      "title": { "type": "text", "analyzer": "my_analyzer" }
    }
  }
}

PUT localhost:9200/my_index/_doc/1

{ "name": "库里", "team_name": "勇士", "position": "控球后卫", "play_year": 10, "jerse_no": "30", "title": "The best 3-points shooter is Curry!" }
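Because `title` is mapped to `my_analyzer` (a whitespace analyzer), its terms are stored with case and punctuation intact, and a `match` query on `title` is, by default, analyzed the same way at search time. The sketch below imitates that matching in plain Python to show which query strings would hit the document. It is an approximation for intuition only, and `whitespace_terms` and `match_title` are hypothetical helper names.

```python
def whitespace_terms(text):
    """Terms a whitespace analyzer would produce, as a set."""
    return set(text.split())


def match_title(indexed_title, query_text):
    """True if at least one query term equals an indexed term (OR semantics)."""
    return bool(whitespace_terms(indexed_title) & whitespace_terms(query_text))


title = "The best 3-points shooter is Curry!"
print(match_title(title, "Curry!"))  # True: "Curry!" is an indexed term
print(match_title(title, "curry"))   # False: case and "!" both differ
```

This is why the choice of analyzer matters: with the standard analyzer on `title`, a lowercase `curry` would be expected to match instead.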

POST localhost:9200/my_index/_search

{ "query": { "match": { "title": "Curry!" } } }

Using common Chinese analyzers

With the default standard analyzer (tested with Postman):

POST localhost:9200/_analyze

{ "analyzer": "standard", "text": "火箭明年总冠军" }

Result

{
  "tokens": [
    { "token": "火", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "箭", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "明", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "年", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "总", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "冠", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "军", "start_offset": 6, "end_offset": 7, "type": "<IDEOGRAPHIC>", "position": 6 }
  ]
}

The standard analyzer turns every Chinese character into its own token, so no actual words are recognized.

Common Chinese analyzers

smartCN: a simple analyzer for Chinese or mixed Chinese-English text
IK analyzer: a smarter, more user-friendly Chinese analyzer

smartCN analyzer

Install (run from the bin directory): sh elasticsearch-plugin install analysis-smartcn

Verify

Restart Elasticsearch after installation.

POST localhost:9200/_analyze

{ "analyzer": "smartcn", "text": "火箭明年总冠军" }

Result

{
  "tokens": [
    { "token": "火箭", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 },
    { "token": "明年", "start_offset": 2, "end_offset": 4, "type": "word", "position": 1 },
    { "token": "总", "start_offset": 4, "end_offset": 5, "type": "word", "position": 2 },
    { "token": "冠军", "start_offset": 5, "end_offset": 7, "type": "word", "position": 3 }
  ]
}

Uninstall: sh elasticsearch-plugin remove analysis-smartcn
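When comparing analyzers, it helps to reduce an `_analyze` response to just its token strings and to check that the offsets line up with the input. The sketch below does this for the smartcn response shown in this section, pasted in as a literal so that no running node is needed; `token_texts` and `offsets_are_consistent` are hypothetical helper names.

```python
import json

text = "火箭明年总冠军"

# The smartcn response from this section, embedded as a string literal.
smartcn_response = json.loads("""
{ "tokens": [
  { "token": "火箭", "start_offset": 0, "end_offset": 2, "type": "word", "position": 0 },
  { "token": "明年", "start_offset": 2, "end_offset": 4, "type": "word", "position": 1 },
  { "token": "总",   "start_offset": 4, "end_offset": 5, "type": "word", "position": 2 },
  { "token": "冠军", "start_offset": 5, "end_offset": 7, "type": "word", "position": 3 }
] }
""")


def token_texts(response):
    """Return only the token strings from an _analyze response."""
    return [token["token"] for token in response["tokens"]]


def offsets_are_consistent(response, source_text):
    """Check that each token equals the slice of the input at its offsets."""
    return all(
        source_text[t["start_offset"]:t["end_offset"]] == t["token"]
        for t in response["tokens"]
    )


print(token_texts(smartcn_response))                   # ['火箭', '明年', '总', '冠军']
print(offsets_are_consistent(smartcn_response, text))  # True
print(list(text))  # per-character split, as the standard analyzer produced
```

smartcn yields four word-level tokens where the standard analyzer yielded seven single-character tokens for the same input.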

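IK analyzer

Download: https://github.com/medcl/elasticsearch-analysis-ik/releases

Install: unzip the release into the plugins directory (use the release that matches your Elasticsearch version).

Verify

Restart Elasticsearch after installation.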

POST localhost:9200/_analyze

{ "analyzer": "ik_max_word", "text": "火箭明年总冠军" }
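Besides `ik_max_word`, which emits as many candidate words as it can find, the IK plugin also registers a coarser `ik_smart` analyzer. A commonly suggested pairing is `ik_max_word` at index time and `ik_smart` at search time. The sketch below only builds such an index body in Python; the field name `title` is borrowed from the earlier example, `build_ik_index_body` is a hypothetical helper, and whether this pairing suits your data is an assumption to verify.

```python
import json


def build_ik_index_body(field="title"):
    """Index body that analyzes one text field with IK (requires the IK plugin)."""
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "analyzer": "ik_max_word",     # fine-grained at index time
                    "search_analyzer": "ik_smart"  # coarser at search time
                }
            }
        }
    }


# The printed JSON could be sent as the body of a PUT to an index URL.
print(json.dumps(build_ik_index_body(), indent=2))
```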

 


