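Contents
Analyzers
1 Normalization: standardizing text to improve recall
2 Character filters (character filter): preprocessing before tokenization that removes unwanted characters
3 Token filters (token filter): stop words, tense conversion, case conversion, synonym conversion, filler-word handling, and so on
4 Tokenizer (tokenizer): splitting text into terms
5 Common analyzers and tokenizers
6 Custom analyzer (custom analyzer)
7 Chinese analyzer: IK
Installation and deployment
IK file descriptions
The two analyzers IK provides
Hot update

Analyzers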
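1 Normalization: standardizing text to improve recall
# Normalization
GET _analyze
{
  "text": "Mr. Ma is an excellent teacher",
  "analyzer": "english"
}

2 Character filters (character filter): preprocessing before tokenization that removes unwanted characters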
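HTML Strip Character Filter (type: html_strip). The escaped_tags parameter lists the HTML tags that should be kept.
# HTML Strip Character Filter
# Test data: <p>I&apos;m so <a>happy</a>!</p>
DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <a>happy</a>!</p>"
}

Mapping Character Filter (type: mapping)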
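To make the effect of a mapping character filter concrete, here is a minimal pure-Python sketch of the idea: each `key => value` rule replaces occurrences of the key in the raw text before tokenization, and when several keys match at the same position the longest one wins. The function name `apply_mapping_char_filter` is hypothetical, and this is an illustration of the behavior, not Elasticsearch's implementation.

```python
def apply_mapping_char_filter(text: str, mappings: list[str]) -> str:
    """Apply "key => value" replacement rules to raw text (illustrative only)."""
    # Parse rules written in the same textual form the mapping filter accepts.
    rules = {}
    for rule in mappings:
        key, _, value = rule.partition("=>")
        rules[key.strip()] = value.strip()

    # Try longer keys first so the longest match at a position wins.
    keys = sorted(rules, key=len, reverse=True)

    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if key and text.startswith(key, i):
                out.append(rules[key])
                i += len(key)
                break
        else:
            # No rule matched at this position: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


masked = apply_mapping_char_filter("你就是个垃圾滚", ["滚 => *", "垃 => *", "圾 => *"])
print(masked)  # 你就是个***
```

Because the console example pairs the filter with the `keyword` tokenizer, the whole masked string comes back as a single token, which makes the character filter's effect easy to see in isolation.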
# Mapping Character Filter
DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "滚 => *",
            "垃 => *",
            "圾 => *"
          ]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "你就是个垃圾滚"
}

Pattern Replace Character Filter (type: pattern_replace)
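The regular expression used for phone-number masking in this section can be tried outside Elasticsearch. The sketch below uses Python's `re` module as a stand-in: the pattern keeps the first three and last four digits in capture groups and replaces the middle four with asterisks. The function name `mask_phone_numbers` is hypothetical; note that the Java-style replacement `$1****$2` is written `\1****\2` in Python.

```python
import re

# Same pattern as the console example: 3 digits kept, 4 digits masked, 4 digits kept.
PHONE_PATTERN = re.compile(r"(\d{3})\d{4}(\d{4})")


def mask_phone_numbers(text: str) -> str:
    """Mask the middle four digits of any 11-digit run found in the text."""
    # Python uses \1 and \2 for group references where Java uses $1 and $2.
    return PHONE_PATTERN.sub(r"\1****\2", text)


print(mask_phone_numbers("您的手机号是17611001200"))  # 您的手机号是176****1200
```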
# Pattern Replace Character Filter
# Example phone number: 17611001200
DELETE my_index
PUT my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d{3})\\d{4}(\\d{4})",
          "replacement": "$1****$2"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      }
    }
  }
}
GET my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "您的手机号是17611001200"
}

3 Token filters (token filter): stop words, tense conversion, case conversion, synonym conversion, filler-word handling, and so on. For example: has => have, him => he, apples => apple; the/oh/a are dropped.
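The examples named above (has => have, him => he, apples => apple, dropping the/oh/a) can be pictured as a chain of small functions that each take a token list and return a new one. This is a simplified Python sketch of that idea only; the lookup tables and function names are hypothetical, and real token filters such as stemmers work very differently from a fixed dictionary.

```python
# Hypothetical lookup tables built from the examples in the text.
STOP_WORDS = {"the", "oh", "a"}
NORMAL_FORMS = {"has": "have", "him": "he", "apples": "apple"}


def lowercase(tokens: list[str]) -> list[str]:
    """Case conversion: fold every token to lower case."""
    return [t.lower() for t in tokens]


def remove_stop_words(tokens: list[str]) -> list[str]:
    """Stop words: drop tokens that carry little meaning."""
    return [t for t in tokens if t not in STOP_WORDS]


def normalize_forms(tokens: list[str]) -> list[str]:
    """Tense/plural conversion, reduced here to a dictionary lookup."""
    return [NORMAL_FORMS.get(t, t) for t in tokens]


def apply_filters(tokens: list[str], filters) -> list[str]:
    """Run the token filters in order; each sees the previous one's output."""
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


tokens = "Oh the teacher has apples for him".split()
print(apply_filters(tokens, [lowercase, remove_stop_words, normalize_forms]))
# ['teacher', 'have', 'apple', 'for', 'he']
```

Order matters: if `remove_stop_words` ran before `lowercase`, the capitalized `Oh` would not match the lower-case stop list and would survive.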
# Token filter: synonyms loaded from a file
DELETE test_index
PUT /test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym_graph",
          "synonyms_path": "analysis/synonym.txt"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "ik_max_word",
          "filter": ["my_synonym"]
        }
      }
    }
  }
}
GET test_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["蒙丢丢,大G,霸道,daG"]
}
GET test_index/_analyze
{
  "analyzer": "ik_max_word",
  "text": ["奔驰G级"]
}
# Token filter: synonyms defined inline
DELETE test_index
PUT /test_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": ["赵,钱,孙,李=>吴", "周=>王"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["my_synonym"]
        }
      }
    }
  }
}
GET test_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["赵,钱,孙,李", "周"]
}
# Case conversion
GET test_index/_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase"],
  "text": ["AASD ASDA SDASD ASDASD"]
}
GET test_index/_analyze
{
  "tokenizer": "standard",
  "filter": ["uppercase"],
  "text": ["asdasd asd asg dsfg gfhjsdf asfdg g"]
}
# Conditional filter: uppercase only the tokens shorter than 5 characters
GET test_index/_analyze
{
  "tokenizer": "standard",
  "filter": [
    {
      "type": "condition",
      "filter": ["uppercase"],
      "script": {
        "source": "token.getTerm().length() < 5"
      }
    }
  ],
  "text": ["asdasd asd asg dsfg gfhjsdf asfdg g"]
}
# Stop words
DELETE test_index
PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "standard",
          "stopwords": ["me", "you"]
        }
      }
    }
  }
}
GET test_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["Teacher me and you in the china"]
}

4 Tokenizer (tokenizer): splitting text into terms
# Tokenizer
GET test_index/_analyze
{
  "tokenizer": "ik_max_word",
  "text": ["我爱北京天安门", "天安门上太阳升"]
}

5 Common analyzers and tokenizers
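standard analyzer: the default analyzer. Its Chinese support is poor because it splits Chinese text character by character.
pattern tokenizer: uses a regular expression to match the separators and splits the text into terms at those points.
simple pattern tokenizer: uses a regular expression to match the terms themselves; faster than the pattern tokenizer.
whitespace analyzer: splits on whitespace only, so for example Tim_cookie is kept as one token.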
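The difference between matching separators and matching terms is easiest to see side by side. The following Python sketch imitates the splitting strategies described above with `str.split`, `re.split`, and `re.findall`. All function names are hypothetical and the regular expressions are only examples; Elasticsearch's own tokenizers are Lucene-based and handle many more cases.

```python
import re


def whitespace_tokenize(text: str) -> list[str]:
    """Whitespace strategy: split on runs of whitespace, nothing else."""
    return text.split()


def pattern_tokenize(text: str, separator_pattern: str = r"\W+") -> list[str]:
    """Pattern strategy: the regex matches the SEPARATORS between terms."""
    return [t for t in re.split(separator_pattern, text) if t]


def simple_pattern_tokenize(text: str, token_pattern: str) -> list[str]:
    """Simple-pattern strategy: the regex matches the TERMS themselves."""
    return re.findall(token_pattern, text)


def per_character_tokenize(text: str) -> list[str]:
    """Roughly what happens to Chinese text when it is split character by character."""
    return list(text)


print(whitespace_tokenize("Tim_cookie likes apples"))
# ['Tim_cookie', 'likes', 'apples']
print(pattern_tokenize("What is,this?"))
# ['What', 'is', 'this']
print(simple_pattern_tokenize("fd-786-335-514-x", r"[0-9]{3}"))
# ['786', '335', '514']
print(per_character_tokenize("天安门"))
# ['天', '安', '门']
```

The last call shows why per-character splitting hurts Chinese search: a word such as 天安门 is broken into three unrelated single-character terms, which is the gap a dictionary-based analyzer like IK is meant to fill.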
6 Custom analyzer (custom analyzer)
char_filter: built-in or custom character filters.
token filter: built-in or custom token filters.
tokenizer: a built-in or custom tokenizer.
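The three components run in a fixed order: character filters rewrite the raw text, the tokenizer splits it into terms, and token filters then modify the term list. A small Python sketch of that composition, loosely modeled on the custom analyzer defined in this section, is shown here. The `CustomAnalyzer` class and the helper functions are hypothetical names for illustration, not an Elasticsearch API, and the HTML-stripping step is left out to keep the sketch short.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CustomAnalyzer:
    """Illustrative pipeline: char filters -> tokenizer -> token filters."""
    tokenizer: Callable[[str], list[str]]
    char_filters: list[Callable[[str], str]] = field(default_factory=list)
    token_filters: list[Callable[[list[str]], list[str]]] = field(default_factory=list)

    def analyze(self, text: str) -> list[str]:
        # 1. Character filters operate on the raw string.
        for char_filter in self.char_filters:
            text = char_filter(text)
        # 2. Exactly one tokenizer turns the string into tokens.
        tokens = self.tokenizer(text)
        # 3. Token filters run in the order they are listed.
        for token_filter in self.token_filters:
            tokens = token_filter(tokens)
        return tokens


def map_symbols(text: str) -> str:
    """Stand-in for a mapping char filter: '&' becomes 'and', '|' becomes 'or'."""
    return text.replace("&", "and").replace("|", "or")


def split_on_punctuation(text: str) -> list[str]:
    """Stand-in for a pattern tokenizer whose separators are [ ,.!?]."""
    return [t for t in re.split(r"[ ,.!?]", text) if t]


def drop_stop_words(tokens: list[str]) -> list[str]:
    stop_words = {"is", "in", "the", "a", "at", "for"}
    return [t for t in tokens if t not in stop_words]


def lowercase(tokens: list[str]) -> list[str]:
    return [t.lower() for t in tokens]


analyzer = CustomAnalyzer(
    tokenizer=split_on_punctuation,
    char_filters=[map_symbols],
    token_filters=[drop_stop_words, lowercase],
)
print(analyzer.analyze("What is this & that | in the end?"))
# ['what', 'this', 'and', 'that', 'or', 'end']
```

As in the console definition, the stop-word filter is listed before `lowercase`, so a capitalized stop word such as `The` would slip through; swapping the two filters is one way to avoid that.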
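# Custom analyzer
DELETE custom_analysis
PUT custom_analysis
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["& => and", "| => or"]
        },
        "html_strip_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      },
      "filter": {
        "my_stopword": {
          "type": "stop",
          "stopwords": ["is", "in", "the", "a", "at", "for"]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "[ ,.!?]"
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["my_char_filter", "html_strip_char_filter"],
          "filter": ["my_stopword", "lowercase"],
          "tokenizer": "my_tokenizer"
        }
      }
    }
  }
}
GET custom_analysis/_analyze
{
  "analyzer": "my_analyzer",
  "text": ["What is ,<a>as.df</a> ssp in ? &</p> | is ! in the a at for "]
}

7 Chinese analyzer: IK

Installation and deployment
IK download: https://github.com/medcl/elasticsearch-analysis-ik
GitHub accelerator: https://github.com/fhefh2015/Fast-GitHub
Create the plugin directory: cd your-es-root/plugins/ && mkdir ik
Unzip the plugin into your-es-root/plugins/ik
Restart Elasticsearch

IK file descriptions
IKAnalyzer.cfg.xml: the IK configuration file
main.dic: the main dictionary
stopword.dic: English stop words; these are not added to the inverted index
Special dictionaries:
quantifier.dic: units of measure and similar words
suffix.dic: administrative units
surname.dic: Chinese family names
preposition.dic: filler words and particles
Custom dictionaries: internet slang, buzzwords, newly coined words, and so on

The two analyzers IK provides
ik_max_word: splits the text at the finest granularity. For example, “中华人民共和国国歌” becomes “中华人民共和国,中华人民,中华,华人,人民共和国,人民,人,民,共和国,共和,和,国国,国歌”. It exhausts every possible combination and is suited to term queries.
ik_smart: splits the text at the coarsest granularity. For example, “中华人民共和国国歌” becomes “中华人民共和国,国歌”. It is suited to phrase queries.

Hot update
Remote dictionary file
Advantage: easy to get started.
Disadvantages:
Managing the dictionary is inconvenient: you have to edit files on disk directly, and searching through them is cumbersome.
File reads and writes have no dedicated optimization, so performance is poor.
It adds an extra layer of interface calls and network transfer.
IK reading from a database
MySQL driver version compatibility:
https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-versions.html
https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-versions.html
Driver download: https://mvnrepository.com/artifact/mysql/mysql-connector-java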
GET custom_analysis/_analyze
{
  "analyzer": "ik_max_word",
  "text": ["我爱中华人民共和国"]
}
GET custom_analysis/_analyze
{
  "analyzer": "ik_max_word",
  "text": ["蒙丢丢,大G,霸道,渣男,渣女,奥巴马"]
}
GET custom_analysis/_analyze
{
  "analyzer": "ik_max_word",
  "text": ["吴磊,美国,日本,澳大利亚"]
}