[DOCS] Reformat common grams token filter (#48426)

This commit is contained in:
James Rodewig 2019-10-30 08:40:11 -04:00
parent 356066ce6a
commit 77acbc4fa9
1 changed file with 155 additions and 95 deletions


[[analysis-common-grams-tokenfilter]]
=== Common grams token filter
++++
<titleabbrev>Common grams</titleabbrev>
++++

Generates https://en.wikipedia.org/wiki/Bigram[bigrams] for a specified set of
common words.

For example, you can specify `is` and `the` as common words. This filter then
converts the tokens `[the, quick, fox, is, brown]` to `[the, the_quick, quick,
fox, fox_is, is, is_brown, brown]`.

You can use the `common_grams` filter in place of the
<<analysis-stop-tokenfilter,stop token filter>> when you don't want to
completely ignore common words.

This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/commongrams/CommonGramsFilter.html[CommonGramsFilter].
[[analysis-common-grams-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request creates bigrams for `is`
and `the`:

[source,console]
--------------------------------------------------
GET /_analyze
{
  "tokenizer" : "whitespace",
  "filter" : [
    {
      "type": "common_grams",
      "common_words": ["is", "the"]
    }
  ],
  "text" : "the quick fox is brown"
}
--------------------------------------------------
The filter produces the following tokens:

[source,text]
--------------------------------------------------
[ the, the_quick, quick, fox, fox_is, is, is_brown, brown ]
--------------------------------------------------

/////////////////////

[source,console-result]
--------------------------------------------------
{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "the_quick",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "gram",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "quick",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "fox",
      "start_offset" : 10,
      "end_offset" : 13,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "fox_is",
      "start_offset" : 10,
      "end_offset" : 16,
      "type" : "gram",
      "position" : 2,
      "positionLength" : 2
    },
    {
      "token" : "is",
      "start_offset" : 14,
      "end_offset" : 16,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "is_brown",
      "start_offset" : 14,
      "end_offset" : 22,
      "type" : "gram",
      "position" : 3,
      "positionLength" : 2
    },
    {
      "token" : "brown",
      "start_offset" : 17,
      "end_offset" : 22,
      "type" : "word",
      "position" : 4
    }
  ]
}
--------------------------------------------------
/////////////////////
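The `start_offset`, `end_offset`, `position`, and `positionLength` attributes in the response can be reproduced with a short sketch. This is an illustrative model with made-up helper names, assuming the whitespace tokenizer splits on single spaces: a bigram spans from the start of its first token to the end of its second, shares the first token's position, and has a `positionLength` of 2.

```python
def whitespace_offsets(text):
    """Split on single spaces, tracking (token, start, end) char offsets."""
    tokens, start = [], 0
    for word in text.split(" "):
        tokens.append((word, start, start + len(word)))
        start += len(word) + 1  # skip the separating space
    return tokens

def common_grams_attrs(text, common_words):
    """Model the response attributes as (token, start, end, type,
    position, positionLength) tuples."""
    tokens = whitespace_offsets(text)
    out = []
    for i, (tok, s, e) in enumerate(tokens):
        out.append((tok, s, e, "word", i, 1))
        if i + 1 < len(tokens):
            nxt, _, ne = tokens[i + 1]
            if tok in common_words or nxt in common_words:
                # The gram starts where its first token starts and ends
                # where its second token ends.
                out.append((f"{tok}_{nxt}", s, ne, "gram", i, 2))
    return out

for row in common_grams_attrs("the quick fox is brown", {"is", "the"}):
    print(row)
# ('the', 0, 3, 'word', 0, 1)
# ('the_quick', 0, 9, 'gram', 0, 2)
# ...
```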
[[analysis-common-grams-tokenfilter-analyzer-ex]]
==== Add to an analyzer
The following <<indices-create-index,create index API>> request uses the
`common_grams` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>:
[source,console]
--------------------------------------------------
PUT /common_grams_example
{
"settings": {
"analysis": {
"analyzer": {
"index_grams": {
"tokenizer": "whitespace",
"filter": ["common_grams"]
}
},
"filter": {
"common_grams": {
"type": "common_grams",
"common_words": ["a", "is", "the"]
}
}
}
}
}
--------------------------------------------------
[[analysis-common-grams-tokenfilter-configure-parms]]
==== Configurable parameters
`common_words`::
+
--
(Required+++*+++, array of strings)
A list of tokens. The filter generates bigrams for these tokens.
Either this or the `common_words_path` parameter is required.
--
`common_words_path`::
+
--
(Required+++*+++, string)
Path to a file containing a list of tokens. The filter generates bigrams for
these tokens.
This path must be absolute or relative to the `config` location. The file must
be UTF-8 encoded. Each token in the file must be separated by a line break.
Either this or the `common_words` parameter is required.
--
`ignore_case`::
(Optional, boolean)
If `true`, matches for common words are case-insensitive.
Defaults to `false`.
`query_mode`::
+
--
(Optional, boolean)
If `true`, the filter excludes the following tokens from the output:
* Unigrams for common words
* Unigrams for terms followed by common words
Defaults to `false`. We recommend enabling this parameter for
<<search-analyzer,search analyzers>>.
For example, you can enable this parameter and specify `is` and `the` as
common words. This filter converts the tokens `[the, quick, fox, is, brown]` to
`[the_quick, quick, fox_is, is_brown]`.
--
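The `query_mode` output above can be modeled by extending the bigram idea: emit a bigram for each adjacent pair containing a common word, then keep a unigram only when it does not begin an emitted bigram (or, for the final token, end one). This is an illustrative approximation with made-up names, not Lucene's actual implementation:

```python
def common_grams_query(tokens, common_words):
    """Approximate query_mode: bigrams for pairs containing a common word;
    drop unigrams that begin a bigram, plus a final unigram that ends one."""
    pairs = {i: f"{tokens[i]}_{tokens[i + 1]}"
             for i in range(len(tokens) - 1)
             if tokens[i] in common_words or tokens[i + 1] in common_words}
    out = []
    last = len(tokens) - 1
    for i, tok in enumerate(tokens):
        # Keep the unigram only if no bigram starts here, and a trailing
        # unigram only if no bigram ends on it.
        if i not in pairs and not (i == last and i - 1 in pairs):
            out.append(tok)
        if i in pairs:
            out.append(pairs[i])
    return out

print(common_grams_query(["the", "quick", "fox", "is", "brown"], {"is", "the"}))
# ['the_quick', 'quick', 'fox_is', 'is_brown']
```

The unigram `quick` survives because no emitted bigram starts with it, which preserves matching for uncommon terms.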
[[analysis-common-grams-tokenfilter-customize]]
==== Customize
To customize the `common_grams` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.
For example, the following request creates a custom `common_grams` filter with
`ignore_case` and `query_mode` set to `true`:
[source,console]
--------------------------------------------------
PUT /common_grams_example
{
"settings": {
"analysis": {
"analyzer": {
"index_grams": {
"tokenizer": "whitespace",
"filter": ["common_grams_query"]
}
},
"filter": {
"common_grams_query": {
"type": "common_grams",
"common_words": ["a", "is", "the"],
"ignore_case": true,
"query_mode": true
}
}
}
}
}
--------------------------------------------------