

news 2026/4/23 14:22:01

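In the first part, we discussed the prefix query, a query-time approach to autocomplete. In this post we look at n-grams, an index-time approach that generates additional tokens after basic tokenization so that prefix matching is faster later, at query time. Before that, let's see what an n-gram is. According to Wikipedia, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. For a more detailed introduction to n-grams, see the earlier article "Elasticsearch: Ngrams, edge ngrams, and shingles".

Yes, it is that simple: just a sequence of text. The "n" items are "n" characters in the case of character-level n-grams and "n" words in the case of word-level n-grams. Word-level n-grams are also called shingles. Depending on the value of "n", they are further classified as uni-grams (n=1), bi-grams (n=2), tri-grams (n=3), and so on. The examples below make this clearer:

    Character n-grams for input string "harry":
    n = 1 : [h, a, r, r, y]
    n = 2 : [ha, ar, rr, ry]
    n = 3 : [har, arr, rry]

    Word n-grams for input string "harry potter and the goblet of fire":
    n = 1 : [harry, potter, and, the, goblet, of, fire]
    n = 2 : [harry potter, potter and, and the, the goblet, goblet of, of fire]
    n = 3 : [harry potter and, potter and the, and the goblet, the goblet of, goblet of fire]

In this post we discuss two n-gram-based approaches: first the edge n-gram tokenizer, then the built-in search-as-you-type field type, which also uses n-gram tokenization internally. The additional tokens are written to the inverted index when a document is indexed, which minimizes search-time latency. At query time Elasticsearch only has to compare the input with these tokens, unlike the prefix query approach, where it has to check whether each individual token starts with the given input.

Edge n-gram tokenizer

As we have already seen, text fields are analyzed and stored in the inverted index. Tokenization is the second step of this three-step analysis process: it runs after character filters and before token filters. The edge n-gram tokenizer is one of the built-in tokenizers available in Elasticsearch. It first breaks the given text into tokens and then, for each token, generates character-level n-grams anchored at the start of the token.

Let's create an index for movies, this time using the edge n-gram tokenizer:

    PUT /movies
    {
      "settings": {
        "analysis": {
          "analyzer": {
            "custom_edge_ngram_analyzer": {
              "type": "custom",
              "tokenizer": "customized_edge_tokenizer",
              "filter": ["lowercase"]
            }
          },
          "tokenizer": {
            "customized_edge_tokenizer": {
              "type": "edge_ngram",
              "min_gram": 2,
              "max_gram": 10,
              "token_chars": ["letter", "digit"]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "title": {
            "type": "text",
            "analyzer": "custom_edge_ngram_analyzer"
          }
        }
      }
    }

In the prefix query example we did not pass an analyzer parameter to any field in the mapping and relied on the default standard analyzer. Above, we first created a custom analyzer, custom_edge_ngram_analyzer, and gave it a custom tokenizer, customized_edge_tokenizer, of type edge_ngram. The edge_ngram tokenizer can be customized with the following parameters:

min_gram ⇒ minimum number of characters in a gram. Defaults to 1, similar to the uni-gram example above.
max_gram ⇒ maximum number of characters in a gram. Defaults to 2, similar to the bi-gram example above.
token_chars ⇒ character classes to keep in a token. When Elasticsearch encounters a character that does not belong to any of the listed classes, it uses that character as a break point for a new token. Supported character classes include letter, digit, punctuation, symbol, and whitespace.

In the mapping above we keep letters and digits as part of a token. If we pass the input string "harry potter: Deathly Hallows", Elasticsearch breaks on whitespace and punctuation and (after the lowercase filter) works with the words [harry, potter, deathly, hallows], emitting the edge n-grams of each.

Let's use the _analyze API to test how our custom edge n-gram analyzer behaves:

    GET /movies/_analyze
    {
      "field": "title",
      "text": "Harry Potter and the Order of the Phoenix"
    }

The command above returns:

    [ha, har, harr, harry, po, pot, pott, potte, potter, an, and, th, the,
     or, ord, orde, order, of, th, the, ph, pho, phoe, phoen, phoeni, phoenix]

To keep things concise, I have not included the actual response, which contains an array of objects, one per gram, with metadata about that gram. In any case, as can be observed, our custom analyzer works as designed: it emits grams for the passed string, lowercased and with lengths within the min/max settings.

Let's index some movies to test the autocomplete behaviour:

    POST /movies/_doc
    {
      "title": "Harry Potter and the Half-Blood Prince"
    }

    POST /movies/_doc
    {
      "title": "Harry Potter and the Deathly Hallows – Part 1"
    }

Edge n-grammed fields also support what this article calls infix matching: the typed prefixes can match words anywhere in the title, not only the first one. For example, you can match the document titled "harry potter and the deathly hallows" by passing "har" as well as "death". This makes the approach suitable for autocomplete implementations where the words in the input text have no fixed order.

    GET /movies/_search?filter_path=**.hits
    {
      "query": {
        "match": {
          "title": {
            "query": "deathly"
          }
        }
      }
    }

The command above returns:

    {
      "hits": {
        "hits": [
          {
            "_index": "movies",
            "_id": "fb-HHIsByaLf0EuT7s0I",
            "_score": 4.0647593,
            "_source": {
              "title": "Harry Potter and the Deathly Hallows – Part 1"
            }
          }
        ]
      }
    }

    GET /movies/_search?filter_path=**.hits
    {
      "query": {
        "match": {
          "title": {
            "query": "harry pot"
          }
        }
      }
    }

The command above returns:

    {
      "hits": {
        "hits": [
          {
            "_index": "movies",
            "_id": "fL-HHIsByaLf0EuT5M2i",
            "_score": 1.1879652,
            "_source": {
              "title": "Harry Potter and the Half-Blood Prince"
            }
          },
          {
            "_index": "movies",
            "_id": "fb-HHIsByaLf0EuT7s0I",
            "_score": 1.1377401,
            "_source": {
              "title": "Harry Potter and the Deathly Hallows – Part 1"
            }
          }
        ]
      }
    }

    GET /movies/_search?filter_path=**.hits
    {
      "query": {
        "match": {
          "title": {
            "query": "potter har"
          }
        }
      }
    }

The command above returns:

    {
      "hits": {
        "hits": [
          {
            "_index": "movies",
            "_id": "fL-HHIsByaLf0EuT5M2i",
            "_score": 1.3746086,
            "_source": {
              "title": "Harry Potter and the Half-Blood Prince"
            }
          },
          {
            "_index": "movies",
            "_id": "fb-HHIsByaLf0EuT7s0I",
            "_score": 1.3159354,
            "_source": {
              "title": "Harry Potter and the Deathly Hallows – Part 1"
            }
          }
        ]
      }
    }

By default, a search query on an analyzed field (title in the example above) also runs the analyzer on the search terms. If you specify the search terms as "deathly potter" and expect only the second document to match, you will be surprised, because both documents match. The search string "deathly potter" is tokenized too, producing "deathly" and "potter" as separate tokens. Although "Harry Potter and the Deathly Hallows – Part 1" matches with the highest score, the query tokens are matched individually, which gives us both documents as results. If you think this may cause problems, you can also specify a separate analyzer for the search query.

So edge n-grams overcome the limitations of the prefix query by storing additional tokens in the inverted index, which minimizes query-time latency. These additional tokens do, however, take up extra space on the nodes and can degrade performance. We should be careful when choosing which fields to n-gram, because the values of some fields can be of unbounded size and may bloat the index.

Search_as_you_type

The search_as_you_type data type was introduced in Elasticsearch 7.2 and is designed to provide out-of-the-box support for autocomplete. Like the edge n-gram approach, it does most of the work at index time by generating additional tokens that optimize autocomplete queries. When a field is mapped as search_as_you_type, additional subfields are created for it internally. Let's change the type of the title field to search_as_you_type:

    DELETE movies

    PUT /movies
    {
      "mappings": {
        "properties": {
          "title": {
            "type": "search_as_you_type",
            "max_shingle_size": 3
          }
        }
      }
    }

For the title property in the index above, three subfields are created in addition to the root field. The subfields use the shingle token filter. Shingles are nothing but groups of consecutive words (word n-grams, as shown above).

Root title field ⇒ analyzed with the analyzer provided in the mapping, or with the default if none is provided.
title._2gram ⇒ splits the title into parts of two words each, that is, shingles of size 2.
title._3gram ⇒ splits the title into parts of three words each.
title._index_prefix ⇒ applies further edge n-gram tokenization to the tokens generated under title._3gram.

We can test this behaviour with our favourite _analyze API:

    GET movies/_analyze
    {
      "text": "Harry Potter and the Goblet of Fire",
      "field": "title"
    }

The command above returns:

    harry, potter, and, the, goblet, of, fire

    GET movies/_analyze
    {
      "text": "Harry Potter and the Goblet of Fire",
      "field": "title._2gram"
    }

The command above returns:

    harry potter, potter and, and the, the goblet, goblet of, of fire

    GET movies/_analyze
    {
      "text": "Harry Potter and the Goblet of Fire",
      "field": "title._3gram"
    }

The command above returns:

    harry potter and, potter and the, and the goblet, the goblet of, goblet of fire

    GET movies/_analyze
    {
      "text": "Harry Potter and the Goblet of Fire",
      "field": "title._index_prefix"
    }

The command above returns:

    h, ha, har, harr, harry, harry[space], harry p, harry po, harry pot, harry pott, harry potte, harry potter, harry potter[space], harry potter a, harry potter an, harry potter and,
    p, po, pot, pott, potte, potter, potter[space], potter a, potter an, potter and, potter and[space], potter and t, potter and th, potter and the,
    a, an, and, and[space], and t, and th, and the, and the[space], and the g, and the go, and the gob, and the gobl, and the goble, and the goblet,
    t, th, the, the[space], the g, the go, the gob, the gobl, the goble, the goblet, the goblet[space], the goblet o, the goblet of,
    g, go, gob, gobl, goble, goblet, goblet o, goblet of, goblet of[space], goblet of f, goblet of fi, goblet of fir, goblet of fire,
    o, of, of[space], of f, of fi, of fir, of fire, of fire[space],
    f, fi, fir, fire, fire[space], fire[space][space]

How many subfields are created is determined by the max_shingle_size parameter, which defaults to 3 and can be set to 2, 3, or 4. search_as_you_type is a text-like field, so it supports the other options we use for text fields, such as analyzer, index, store, and search_analyzer.

As you must have guessed by now, it supports both prefix and infix matching. At query time we need to use a multi_match query, because we also need to target its subfields:

    POST movies/_doc
    {
      "title": "Harry Potter and the Goblet of Fire"
    }

    GET /movies/_search?filter_path=**.hits
    {
      "query": {
        "multi_match": {
          "query": "the goblet",
          "type": "bool_prefix",
          "analyzer": "keyword",
          "fields": [
            "title",
            "title._2gram",
            "title._3gram"
          ]
        }
      }
    }

The query above returns:

    {
      "hits": {
        "hits": [
          {
            "_index": "movies",
            "_id": "fr-sHIsByaLf0EuThM2C",
            "_score": 3,
            "_source": {
              "title": "Harry Potter and the Goblet of Fire"
            }
          }
        ]
      }
    }

Here we set the query type to bool_prefix. The query matches documents whose title contains the terms in any order, but documents in which the terms appear in the same order as in the query text rank higher. In the example above we passed "the goblet" as the query text, so a document titled "the goblet of fire" ranks higher than a document titled "fire goblet". We also set the query analyzer to keyword, so that our query text "the goblet" is not analyzed and is matched as is. Without it, a document titled "Harry Potter and the Deathly Hallows – Part 1" would match as well as the one titled "Harry Potter and the Goblet of Fire", because the separately analyzed token "the" occurs in both titles.

This is not the only way to query a search_as_you_type field, but it is certainly a good fit for our autocomplete use case. Like edge n-grams, search_as_you_type overcomes the limitations of the prefix query approach by storing data optimized for autocomplete. With this approach, too, we have to be careful about what we store in such a field, because extra space is needed to hold the n-gram tokens.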
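The three kinds of n-grams discussed in this article (character n-grams, word n-grams or shingles, and edge n-grams) can be reproduced outside Elasticsearch with a few lines of Python, which can help when reasoning about how many extra tokens a field will produce. The following is a minimal sketch, not Elasticsearch's implementation: the function names are made up for illustration, and the `token_chars: [letter, digit]` setting is approximated with an ASCII-only regular expression.

```python
import re
from typing import List


def char_ngrams(text: str, n: int) -> List[str]:
    """All contiguous character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text: str, n: int) -> List[str]:
    """All contiguous word sequences (shingles) of length n."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def edge_ngrams(text: str, min_gram: int = 1, max_gram: int = 2) -> List[str]:
    """Lowercased prefixes of each word, between min_gram and max_gram long.

    Assumption: words are runs of ASCII letters and digits, which only
    approximates token_chars = [letter, digit] for non-ASCII input.
    """
    grams: List[str] = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        token = token.lower()
        longest = min(max_gram, len(token))
        for size in range(min_gram, longest + 1):
            grams.append(token[:size])
    return grams


if __name__ == "__main__":
    print(char_ngrams("harry", 2))
    print(word_ngrams("harry potter and the goblet of fire", 3))
    print(edge_ngrams("Harry Potter and the Order of the Phoenix", 2, 10))
```

With `min_gram=2` and `max_gram=10`, the last call produces the same list of grams as the `_analyze` example for the custom edge n-gram analyzer. The length of that list gives a rough feel for how quickly the token count grows with `max_gram` and title length.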


http://www.hkea.cn/news/14382422/