[DOCS] Reformat common grams token filter (#48426)
Parent: 356066ce6a
Commit: 77acbc4fa9

[[analysis-common-grams-tokenfilter]]
=== Common grams token filter
++++
<titleabbrev>Common grams</titleabbrev>
++++

Generates https://en.wikipedia.org/wiki/Bigram[bigrams] for a specified set of
common words.

For example, you can specify `is` and `the` as common words. This filter then
converts the tokens `[the, quick, fox, is, brown]` to `[the, the_quick, quick,
fox, fox_is, is, is_brown, brown]`.

You can use the `common_grams` filter in place of the
<<analysis-stop-tokenfilter,stop token filter>> when you don't want to
completely ignore common words.

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html[CommonGramsFilter].

[[analysis-common-grams-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request creates bigrams for `is`
and `the`:

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : [
    "common_grams", {
      "type": "common_grams",
      "common_words": ["is", "the"]
    }
  ],
  "text" : "the quick fox is brown"
}
--------------------------------------------------

The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ the, the_quick, quick, fox, fox_is, is, is_brown, brown ]
--------------------------------------------------

/////////////////////
[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "the_quick",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "gram",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "quick",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "fox",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "fox_is",
      "start_offset" : 10,
      "end_offset" : 16,
      "type" : "gram",
      "position" : 2,
      "positionLength" : 2
    },
    {
      "token" : "is",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "is_brown",
      "start_offset" : 14,
      "end_offset" : 22,
      "type" : "gram",
      "position" : 3,
      "positionLength" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 17,
      "end_offset" : 22,
      "type" : "word",
      "position" : 4
    }
  ]
}
--------------------------------------------------
/////////////////////

[[analysis-common-grams-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
`common_grams` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>:

[source,console]
--------------------------------------------------
PUT /common_grams_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": ["common_grams"]
        }
      },
      "filter": {
        "common_grams": {
          "type": "common_grams",
          "common_words": ["a", "is", "the"]
        }
      }
    }
  }
}
--------------------------------------------------
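
As a quick check, you could run the <<indices-analyze,analyze API>> against the
new index with the `index_grams` analyzer. This request is a sketch added for
illustration rather than part of the original example:

[source,console]
--------------------------------------------------
GET /common_grams_example/_analyze
{
  "analyzer": "index_grams",
  "text": "the quick fox is brown"
}
--------------------------------------------------
// TEST[continued]

Because `a`, `is`, and `the` are the common words, the output should match the
token stream shown in the earlier example:
`[ the, the_quick, quick, fox, fox_is, is, is_brown, brown ]`.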

[[analysis-common-grams-tokenfilter-configure-parms]]
==== Configurable parameters

`common_words`::
+
--
(Required+++*+++, array of strings)
A list of tokens. The filter generates bigrams for these tokens.

Either this or the `common_words_path` parameter is required.
--

`common_words_path`::
+
--
(Required+++*+++, string)
Path to a file containing a list of tokens. The filter generates bigrams for
these tokens.

This path must be absolute or relative to the `config` location. The file must
be UTF-8 encoded. Each token in the file must be separated by a line break.

Either this or the `common_words` parameter is required. For an example that
uses this parameter, see the sketch after this parameter list.
--

`ignore_case`::
(Optional, boolean)
If `true`, matches for common words are case-insensitive.
Defaults to `false`.

`query_mode`::
+
--
(Optional, boolean)
If `true`, the filter excludes the following tokens from the output:

* Unigrams for common words
* Unigrams for terms followed by common words

Defaults to `false`. We recommend enabling this parameter for
<<search-analyzer,search analyzers>>.

For example, you can enable this parameter and specify `is` and `the` as
common words. This filter converts the tokens `[the, quick, fox, is, brown]` to
`[the_quick, quick, fox_is, is_brown]`.
--
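
If you maintain the list of common words in a file, you can reference it with
the `common_words_path` parameter instead of listing the words inline. The
following request is a sketch: the index name `common_grams_file_example`, the
filter name `common_grams_from_file`, and the file
`analysis/example_common_words.txt` are hypothetical, and the file must exist
under the node's `config` directory:

[source,console]
--------------------------------------------------
PUT /common_grams_file_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": ["common_grams_from_file"]
        }
      },
      "filter": {
        "common_grams_from_file": {
          "type": "common_grams",
          "common_words_path": "analysis/example_common_words.txt" <1>
        }
      }
    }
  }
}
--------------------------------------------------
<1> Hypothetical path, resolved relative to the `config` directory. The file
must be UTF-8 encoded with one common word per line.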

[[analysis-common-grams-tokenfilter-customize]]
==== Customize

To customize the `common_grams` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

For example, the following request creates a custom `common_grams` filter with
`ignore_case` and `query_mode` set to `true`:

[source,console]
--------------------------------------------------
PUT /common_grams_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "index_grams": {
          "tokenizer": "whitespace",
          "filter": ["common_grams_query"]
        }
      },
      "filter": {
        "common_grams_query": {
          "type": "common_grams",
          "common_words": ["a", "is", "the"],
          "ignore_case": true,
          "query_mode": true
        }
      }
    }
  }
}
--------------------------------------------------
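
To see the effect of `query_mode`, you could run the analyze API against this
index. This follow-up request is a sketch added for illustration, not part of
the original example:

[source,console]
--------------------------------------------------
GET /common_grams_example/_analyze
{
  "analyzer": "index_grams",
  "text": "the quick fox is brown"
}
--------------------------------------------------
// TEST[continued]

Because `query_mode` is `true`, unigrams for common words and for terms
followed by a common word are dropped, so the output should match the token
stream given in the `query_mode` parameter description:
`[ the_quick, quick, fox_is, is_brown ]`.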