What to Know About Chinese Search in ES: Analyzers and Chinese Analyzers


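The core function of Elasticsearch (ES) is data retrieval. It is commonly used to build internal search engines, or to implement the coarse-ranking stage over large-scale data in a recommendation recall pipeline.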

Tokenization in ES

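Tokenization means using an Analyzer to split a doc into individual terms (keywords). In ES it happens both when the index is built and when data is searched: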

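When the inverted index is built, each term points to the docs that contain that term. At search time, the query string is likewise split into individual terms, which are then looked up in the index.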
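The two roles of tokenization described here can be sketched in a few lines of Python. This is only a toy model, not how ES stores its index: `tokenize`, `build_inverted_index`, and `search` are hypothetical names, and the whitespace tokenizer is a stand-in for a real Analyzer.

```python
from collections import defaultdict


def tokenize(text):
    # Stand-in analyzer (assumption): lowercase, then split on whitespace.
    return text.lower().split()


def build_inverted_index(docs):
    # Index time: map each term to the set of doc ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    # Search time: tokenize the query the same way, then collect
    # every doc that contains at least one of the query terms.
    hits = set()
    for term in tokenize(query):
        hits |= index.get(term, set())
    return hits


docs = {1: "quick brown fox", 2: "lazy brown dog", 3: "quick dog"}
index = build_inverted_index(docs)
print(index["brown"])              # doc ids containing "brown"
print(search(index, "Quick fox"))  # doc ids matching either term
```

The same `tokenize` function is used on both sides. If index-time and search-time tokenization disagree, a query term may never match the terms stored in the index.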

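Put simply, data retrieval in ES consists of three main steps: tokenization, computing similarity scores from the tokenization result, and returning the results sorted by score from high to low, truncated to the requested length. This article focuses on the tokenization step for Chinese text.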

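Flowchart: query → tokenization → compute similarity scores from the tokenization result → return the results sorted by score from high to low, truncated to the requested length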
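The three steps can be sketched as a small Python function. The scoring here is a deliberately naive stand-in (a count of query-term occurrences), chosen only to keep the example short; ES's own relevance scoring (BM25 by default in recent versions) is considerably more involved. All function names are hypothetical.

```python
def tokenize(text):
    # Step 1 stand-in (assumption): lowercase, then split on whitespace.
    return text.lower().split()


def score(query_terms, doc_terms):
    # Step 2 stand-in (assumption): count how often the distinct
    # query terms occur in the doc. This is NOT the scoring ES uses.
    return sum(doc_terms.count(term) for term in set(query_terms))


def retrieve(docs, query, size):
    # Step 3: sort by score from high to low (ties broken by doc id)
    # and return only the first `size` results.
    query_terms = tokenize(query)
    scored = []
    for doc_id, text in docs.items():
        s = score(query_terms, tokenize(text))
        if s > 0:
            scored.append((doc_id, s))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:size]


docs = {
    1: "quick brown fox",
    2: "lazy brown dog",
    3: "quick quick dog",
}
results = retrieve(docs, "quick dog", size=2)
print(results)  # list of (doc_id, score) pairs, best match first
```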

Built-in ES Analyzers

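ES ships with a number of commonly used built-in analyzers. (In NLP this component is usually called a tokenizer. ES uses the term Analyzer because, beyond splitting text into tokens, it also applies further text-analysis steps.)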

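Example sentence: the quick brown-foxes jumped over the lazy dog’s bone. This English sentence is a classic example in natural language processing. It is a pangram: it contains all 26 letters of the alphabet.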
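To make the tokenizer/analyzer distinction concrete: in ES an analyzer is a pipeline of character filters, a tokenizer, and token filters. The sketch below imitates that shape in plain Python on the example sentence. It is an illustration under loud assumptions: `simple_tokenizer`, `lowercase_filter`, and `analyze` are hypothetical names, and the regular expression is a rough ASCII-only stand-in, not the Unicode Text Segmentation algorithm that the real analyzers rely on.

```python
import re

SENTENCE = "the quick brown-foxes jumped over the lazy dog’s bone."


def simple_tokenizer(text):
    # Tokenizer stand-in (assumption): runs of ASCII letters or digits,
    # optionally joined by an apostrophe, so "dog’s" stays one token
    # while "brown-foxes" is split at the hyphen.
    return re.findall(r"[A-Za-z0-9]+(?:['’][A-Za-z0-9]+)*", text)


def lowercase_filter(tokens):
    # Token filter: the extra "text analysis" step applied after tokenizing.
    return [token.lower() for token in tokens]


def analyze(text):
    # Analyzer = tokenizer followed by token filters.
    return lowercase_filter(simple_tokenizer(text))


tokens = analyze(SENTENCE)
print(tokens)
```

On a running cluster, the real output for any analyzer can be inspected with the `_analyze` API instead of a local approximation like this one.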

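Analyzer

Purpose

Input

Example result

Standard Analyzer

The standard analyzer, which is also the default analyzer. It is based on the Unicode Text Segmentation algorithm. It is well suited to English and works for most languages.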