

news 2026/5/4 0:50:10

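Contents

ES Analyzers in Detail
  Basic Concepts
  When Analysis Happens
  Components of an Analyzer
  Tokenizer
  Token Filter
    Stop Words
    Synonyms
  Character Filter
    HTML Strip Character Filter
    Mapping Character Filter
    Pattern Replace Character Filter
Relevance in Detail
  What Is Relevance
  Relevance Algorithms
    TF-IDF
    BM25
  Viewing TF-IDF Through the Explain API
  Boosting Query

ES Analyzers in Detail

Basic Concepts

What is often loosely called a "tokenizer" is officially named a text analyzer. As the name suggests, it is a way of analyzing text: following predefined rules, it splits the original document into smaller terms. How fine-grained those terms are depends on the analyzer's rules.

When Analysis Happens

Analysis takes place at two points, index time and search time.

Index time: when a document is written and the inverted index is built. The analysis logic is determined by the analyzer mapping parameter.
Search time: when a search is executed. Analysis is applied only to the search terms.

Components of an Analyzer

Tokenizer: defines the logic for splitting text into terms.
Token Filter: processes the individual terms produced by the tokenizer.
Character Filter: processes individual characters.

At run time the character filters are applied first, then the tokenizer, then the token filters.

Note: an analyzer never changes the source data. Analysis only affects what goes into the inverted index or how the search terms are interpreted.

Tokenizer

The tokenizer is one of the core components of an analyzer. Its job is to split the original text into fine-grained pieces. Each piece is called a term. You can think of a tokenizer as a predefined set of splitting rules. Elasticsearch ships with many built-in tokenizers, and the default is standard.

Token Filter

Token filters process the terms after tokenization is complete, for example converting case, removing stop words, or handling synonyms. Elasticsearch also ships with many built-in token filters, which cover most day-to-day needs. Third-party and self-developed filters are supported as well.

```
GET _analyze
{
  "filter": ["lowercase"],
  "text": "WWW ELASTIC ORG CN"
}

GET _analyze
{
  "tokenizer": "standard",
  "filter": ["uppercase"],
  "text": ["www.elastic.org.cn", "www elastic org cn"]
}
```

Stop Words

Stop words are terms that are dropped after tokenization. The stop word list can be customized.

English stop words (english): a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, there, these, they, this, to, was, will, with.

CJK stop words (cjk): a, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, s, such, t, that, the, their, then, there, these, they, this, to, was, will, with, www.

```
DELETE test_token_filter_stop

PUT test_token_filter_stop
{
  "settings": {
    "analysis": {
      "filter": {
        "my_filter": {
          "type": "stop",
          "stopwords": ["www"],
          "ignore_case": true
        }
      }
    }
  }
}

GET test_token_filter_stop/_analyze
{
  "tokenizer": "standard",
  "filter": ["my_filter"],
  "text": ["What www WWW are you doing"]
}
```

Synonyms

Synonym rules can be written in two ways:

a, b, c => d: a, b, and c are replaced by d.
a, b, c, d: a, b, c, and d are treated as equivalent.

The example below uses the first form. To make the three words equivalent instead, the rule would be "good, nice, excellent".

```
PUT test_token_filter_synonym
{
  "settings": {
    "analysis": {
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": ["good, nice => excellent"]
        }
      }
    }
  }
}

GET test_token_filter_synonym/_analyze
{
  "tokenizer": "standard",
  "filter": ["my_synonym"],
  "text": ["good"]
}
```

Character Filter

Character filters preprocess the text before tokenization and remove unwanted characters. In the template below, index_name and char_filter_type are placeholders.

```
PUT index_name
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "char_filter_type"
        }
      }
    }
  }
}
```

type: the name of the character filter type to use. It can be one of html_strip, mapping, or pattern_replace.

HTML Strip Character Filter

This character filter removes HTML tags and decodes HTML entities (for example, the &apos; in the request below). Here type is set to html_strip, and escaped_tags keeps only the a tag.

```
PUT test_html_strip_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["a"]
        }
      }
    }
  }
}

GET test_html_strip_filter/_analyze
{
  "tokenizer": "standard",
  "char_filter": ["my_char_filter"],
  "text": ["<p>I&apos;m so <a>happy</a>!</p>"]
}
```

Parameter escaped_tags: the HTML tags to keep.

Mapping Character Filter

This filter replaces specific characters with specified characters according to mapping rules you define. In the example, each character listed in mappings is replaced with *, which masks the offensive words in the sample text. The example reuses the index name test_html_strip_filter, so the existing index is deleted first.

```
DELETE test_html_strip_filter

PUT test_html_strip_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": ["滚 => *", "垃 => *", "圾 => *"]
        }
      }
    }
  }
}

GET test_html_strip_filter/_analyze
{
  "char_filter": ["my_char_filter"],
  "text": "你就是个垃圾滚"
}
```

Pattern Replace Character Filter

This filter replaces text that matches a regular expression. The example masks the middle four digits of an 11-digit mobile number; pattern is the regular expression and replacement refers to its capture groups.

```
PUT text_pattern_replace_filter
{
  "settings": {
    "analysis": {
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d{3})\\d{4}(\\d{4})",
          "replacement": "$1****$2"
        }
      }
    }
  }
}

GET text_pattern_replace_filter/_analyze
{
  "char_filter": ["my_char_filter"],
  "text": "您的手机号是18868686688"
}
```

Relevance in Detail

Search is a conversation between the user and the search engine, and what the user cares about is the relevance of the results:

1. Can all the relevant content be found?
2. How much irrelevant content is returned?
3. Are the documents scored sensibly?
4. Is the ranking balanced against business requirements?

What Is Relevance

The relevance score of a search describes how well a document matches the query. ES computes a score (_score) for every result that matches the query. Scoring is ultimately about ordering: the documents that best fit the user's need should come first.

How relevance is measured:

1. Precision: return as few irrelevant documents as possible.
2. Recall: return as many relevant documents as possible.
3. Ranking: whether the results can be ordered by relevance.

Relevance Algorithms

Before ES 5 the default relevance scoring used TF-IDF. It now uses BM25.

TF-IDF

TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining.

[Figure: Lucene's TF-IDF scoring formula]

TF is the term frequency. The more often the search term appears in a document, the higher the relevance.

TF = number of times the term appears in the document / total number of terms in the document

IDF is the inverse document frequency. It reflects how often each search term appears across the index: the more often, the lower the relevance. Words such as "是", "的", and "在" appear in almost every document and carry little meaning, so the weight of terms that occur frequently across many documents is reduced.

IDF = log(total number of documents in the corpus / (number of documents containing the term + 1))

The field-length norm: a search term appearing in a short title field carries more weight than the same term appearing in a long content field.

These three factors (term frequency, inverse document frequency, and field-length norm) are calculated and stored at index time. They are then combined to compute the weight of a single term in a particular document.

BM25

BM25 is an improvement on TF-IDF. With TF-IDF, the larger TF(t) becomes, the larger the overall score becomes, without bound. BM25 addresses this: as TF(t) grows, the score approaches a fixed value instead of increasing indefinitely. BM25 has been the default algorithm since ES 5.

[Figure: BM25 formula]

Viewing TF-IDF Through the Explain API

The requests below index four documents and then ask ES to explain how each score was computed. On ES 5 and later, where BM25 is the default, the explanation lists the BM25 components.

```
PUT /test_score/_bulk
{"index":{"_id":1}}
{"content":"we use Elasticsearch to power the search"}
{"index":{"_id":2}}
{"content":"we like elasticsearch"}
{"index":{"_id":3}}
{"content":"The scoring of documents is calculated by the scoring formula"}
{"index":{"_id":4}}
{"content":"you know,for search"}

GET /test_score/_search
{
  "explain": true,
  "query": {
    "match": {
      "content": "elasticsearch"
    }
  }
}

GET /test_score/_explain/2
{
  "query": {
    "match": {
      "content": "elasticsearch"
    }
  }
}
```

Boosting Query

Boosting is one way to control relevance. You can influence the query results by setting a boost value on a field. The meaning of the boost parameter:

1. When boost > 1, the relative weight of the score is raised.
2. When 0 < boost < 1, the relative weight of the score is lowered.
3. When boost < 0, the clause contributes a negative score (Elasticsearch 7.0 and later reject negative boost values).

Use case: you want results that contain certain content to rank lower rather than disappear altogether.

```
POST /blogs/_bulk
{"index":{"_id":1}}
{"title":"Apple iPad","content":"Apple iPad,Apple iPad"}
{"index":{"_id":2}}
{"title":"Apple iPad,Apple iPad","content":"Apple iPad"}

GET /blogs/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "title": {
              "query": "apple,ipad",
              "boost": 1
            }
          }
        },
        {
          "match": {
            "content": {
              "query": "apple,ipad",
              "boost": 4
            }
          }
        }
      ]
    }
  }
}
```

Example: product information about Apple (the company) should be shown first.

```
POST /news/_bulk
{"index":{"_id":1}}
{"content":"Apple Mac"}
{"index":{"_id":2}}
{"content":"Apple iPad"}
{"index":{"_id":3}}
{"content":"Apple employee like Apple Pie and Apple Juice"}

GET /news/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "content": "apple"
        }
      }
    }
  }
}
```

Use must_not to exclude documents that are not about Apple products:

```
GET /news/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "content": "apple"
        }
      },
      "must_not": {
        "match": {
          "content": "pie"
        }
      }
    }
  }
}
```

Use negative_boost to lower relevance. If you are unhappy with some of the returned results but do not want to exclude them (as must_not would), consider the negative_boost of the boosting query.

1. negative_boost applies to the negative part of the query.
2. When scores are computed, a document that matches only the positive query keeps its score unchanged; a document that also matches the negative query has its score multiplied by negative_boost.
3. negative_boost takes a value from 0 to 1.0, for example 0.3.

```
GET /news/_search
{
  "query": {
    "boosting": {
      "positive": {
        "match": {
          "content": "apple"
        }
      },
      "negative": {
        "match": {
          "content": "pie"
        }
      },
      "negative_boost": 0.2
    }
  }
}
```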
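The effect of negative_boost on ranking can be illustrated with a little arithmetic. The Python sketch below is a toy model, not Elasticsearch's implementation: the function name boosting_score and the positive-query scores assigned to the three /news documents are hypothetical values chosen only to show the demotion.

```python
def boosting_score(positive_score, matches_negative, negative_boost):
    """Toy model of the final score a boosting query gives one document."""
    if not 0.0 <= negative_boost <= 1.0:
        raise ValueError("negative_boost is expected to be between 0 and 1.0")
    # A document that also matches the negative query is demoted, not removed.
    if matches_negative:
        return positive_score * negative_boost
    return positive_score


# Hypothetical (positive score, matches "pie") pairs for the /news documents.
docs = {
    "Apple Mac": (0.20, False),
    "Apple iPad": (0.20, False),
    "Apple employee like Apple Pie and Apple Juice": (0.25, True),
}

# Rank by the demoted score, highest first, using negative_boost = 0.2.
ranked = sorted(
    docs,
    key=lambda text: boosting_score(docs[text][0], docs[text][1], 0.2),
    reverse=True,
)
```

Under these assumed scores the "Apple Pie" document would rank first on the positive query alone, but after multiplication by 0.2 it falls behind the two product documents while still appearing in the results.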
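To make the difference between TF-IDF and BM25 described in the relevance section concrete, here is a minimal Python sketch. The tf and idf functions follow the two formulas given in the TF-IDF section; bm25_tf is the term-frequency part of the commonly cited BM25 formula, with k1 = 1.2 and b = 0.75 (Elasticsearch's default BM25 parameters). The function names and the simple tokenize helper are illustrative assumptions, and Lucene's actual scoring differs in details.

```python
import math
import re


def tokenize(text):
    # Rough stand-in for an analyzer: lowercase, then keep runs of letters.
    return re.findall(r"[a-z]+", text.lower())


def tf(term, tokens):
    # TF = occurrences of the term / total number of terms in the document
    return tokens.count(term) / len(tokens)


def idf(term, corpus):
    # IDF = log(total documents / (documents containing the term + 1))
    containing = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / (containing + 1))


def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # Term-frequency part of the commonly cited BM25 formula.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1) / (freq + norm)


# The four sample documents from the Explain API section.
corpus = [tokenize(text) for text in (
    "we use Elasticsearch to power the search",
    "we like elasticsearch",
    "The scoring of documents is calculated by the scoring formula",
    "you know,for search",
)]

tf_idf = [tf("elasticsearch", doc) * idf("elasticsearch", corpus) for doc in corpus]

# BM25 saturation: the value keeps rising with frequency but stays below k1 + 1.
saturation = [bm25_tf(freq, 10, 10) for freq in (1, 10, 100, 1000)]
```

In this sketch the shorter document "we like elasticsearch" gets a higher TF than the longer one, and the values in saturation increase ever more slowly toward k1 + 1 = 2.2, which is the bounded behavior the BM25 section describes.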
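The three analyzer components described in the first half of the article (character filter, tokenizer, token filter) run as a pipeline. The Python sketch below imitates that order with toy functions. All names here are hypothetical, the regex tokenizer is only a rough stand-in for the standard tokenizer, and nothing in it calls Elasticsearch.

```python
import re


def mapping_char_filter(text, mappings):
    # Character filter: replace each mapped string before tokenizing.
    for src, dst in mappings.items():
        text = text.replace(src, dst)
    return text


def simple_tokenizer(text):
    # Tokenizer: split the text into runs of word characters.
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    # Token filter: normalize case.
    return [token.lower() for token in tokens]


def stop_filter(tokens, stopwords):
    # Token filter: drop stop words.
    return [token for token in tokens if token not in stopwords]


def analyze(text, mappings, stopwords):
    text = mapping_char_filter(text, mappings)   # 1. character filters
    tokens = simple_tokenizer(text)              # 2. tokenizer
    tokens = lowercase_filter(tokens)            # 3. token filters, in order
    return stop_filter(tokens, stopwords)


terms = analyze("What www WWW are you doing", {"&": " and "}, {"www"})
```

Unlike the stop filter example in the article, which keeps the original case and relies on ignore_case, this sketch lowercases the terms before removing stop words. The order of token filters matters in the same way in a real analyzer.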


http://www.hkea.cn/news/14521598/