[DOCS] Reformat stop token filter (#53059)

Makes the following changes to the `stop` token filter docs:

* Updates description
* Adds a link to the related Lucene filter
* Adds detailed analyze snippet
* Updates custom analyzer and custom filter snippets
* Adds a list of predefined stop words by language

Co-authored-by: ScottieL <36999642+ScottieL@users.noreply.github.com>
This commit is contained in:
James Rodewig 2020-03-03 13:22:52 -05:00 committed by GitHub
parent 9ad9ad7a6b
commit cf87724ff6
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
2 changed files with 341 additions and 45 deletions

View File

@ -2,6 +2,7 @@
= Text analysis
:lucene-analysis-docs: https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis
:lucene-stop-word-link: https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis
[partintro]
--

View File

@ -4,79 +4,374 @@
<titleabbrev>Stop</titleabbrev>
++++
A token filter of type `stop` that removes stop words from token
streams.
Removes https://en.wikipedia.org/wiki/Stop_words[stop words] from a token
stream.
The following are settings that can be set for a `stop` token filter
type:
When not customized, the filter removes the following English stop words by
default:
`a`, `an`, `and`, `are`, `as`, `at`, `be`, `but`, `by`, `for`, `if`, `in`,
`into`, `is`, `it`, `no`, `not`, `of`, `on`, `or`, `such`, `that`, `the`,
`their`, `then`, `there`, `these`, `they`, `this`, `to`, `was`, `will`, `with`
In addition to English, the `stop` filter supports predefined
<<analysis-stop-tokenfilter-stop-words-by-lang,stop word lists for several
languages>>. You can also specify your own stop words as an array or file.
The `stop` filter uses Lucene's
{lucene-analysis-docs}/StopFilter.html[StopFilter].
[[analysis-stop-tokenfilter-analyze-ex]]
==== Example
The following analyze API request uses the `stop` filter to remove the stop words
`a` and `the` from `a quick fox jumps over the lazy dog`:
[source,console]
----
GET /_analyze
{
"tokenizer": "standard",
"filter": [ "stop" ],
"text": "a quick fox jumps over the lazy dog"
}
----
The filter produces the following tokens:
[source,text]
----
[ quick, fox, jumps, over, lazy, dog ]
----
////
[source,console-result]
----
{
"tokens": [
{
"token": "quick",
"start_offset": 2,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "fox",
"start_offset": 8,
"end_offset": 11,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "jumps",
"start_offset": 12,
"end_offset": 17,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "over",
"start_offset": 18,
"end_offset": 22,
"type": "<ALPHANUM>",
"position": 4
},
{
"token": "lazy",
"start_offset": 27,
"end_offset": 31,
"type": "<ALPHANUM>",
"position": 6
},
{
"token": "dog",
"start_offset": 32,
"end_offset": 35,
"type": "<ALPHANUM>",
"position": 7
}
]
}
----
////
[[analysis-stop-tokenfilter-analyzer-ex]]
==== Add to an analyzer
The following <<indices-create-index,create index API>> request uses the `stop`
filter to configure a new <<analysis-custom-analyzer,custom analyzer>>.
[source,console]
----
PUT /my_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "whitespace",
"filter": [ "stop" ]
}
}
}
}
}
----
[[analysis-stop-tokenfilter-configure-parms]]
==== Configurable parameters
[horizontal]
`stopwords`::
+
--
(Optional, string or array of strings)
Language value, such as `_arabic_` or `_thai_`. Defaults to
<<english-stop-words,`_english_`>>.
A list of stop words to use. Defaults to `_english_` stop words.
Each language value corresponds to a predefined list of stop words in Lucene.
See <<analysis-stop-tokenfilter-stop-words-by-lang>> for supported language
values and their stop words.
Also accepts an array of stop words.
For an empty list of stop words, use `_none_`.
--
`stopwords_path`::
+
--
(Optional, string)
Path to a file that contains a list of stop words to remove.
A path (either relative to `config` location, or absolute) to a stopwords
file configuration. Each stop word should be in its own "line" (separated
by a line break). The file must be UTF-8 encoded.
This path must be absolute or relative to the `config` location, and the file
must be UTF-8 encoded. Each stop word in the file must be separated by a line
break.
--
`ignore_case`::
Set to `true` to lower case all words first. Defaults to `false`.
(Optional, boolean)
If `true`, stop word matching is case insensitive. For example, if `true`, a
stop word of `the` matches and removes `The`, `THE`, or `the`. Defaults to
`false`.
`remove_trailing`::
+
--
(Optional, boolean)
If `true`, the last token of a stream is removed if it's a stop word. Defaults
to `true`.
Set to `false` in order to not ignore the last term of a search if it is a
stop word. This is very useful for the completion suggester as a query
like `green a` can be extended to `green apple` even though you remove
stop words in general. Defaults to `true`.
This parameter should be `false` when using the filter with a
<<completion-suggester,completion suggester>>. This would ensure a query like
`green a` matches and suggests `green apple` while still removing other stop
words.
--
The `stopwords` parameter accepts either an array of stopwords:
[[analysis-stop-tokenfilter-customize]]
==== Customize
To customize the `stop` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.
For example, the following request creates a custom case-insensitive `stop`
filter that removes stop words from the <<english-stop-words,`_english_`>> stop
words list:
[source,console]
------------------------------------
----
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"my_stop": {
"type": "stop",
"stopwords": ["and", "is", "the"]
}
}
"settings": {
"analysis": {
"analyzer": {
"default": {
"tokenizer": "whitespace",
"filter": [ "my_custom_stop_words_filter" ]
}
},
"filter": {
"my_custom_stop_words_filter": {
"type": "stop",
"ignore_case": true
}
}
}
}
}
------------------------------------
----
or a predefined language-specific list:
You can also specify your own list of stop words. For example, the following
request creates a custom case-sensitive `stop` filter that removes only the stop
words `and`, `is`, and `the`:
[source,console]
------------------------------------
----
PUT /my_index
{
"settings": {
"analysis": {
"filter": {
"my_stop": {
"type": "stop",
"stopwords": "_english_"
}
}
"settings": {
"analysis": {
"analyzer": {
"default": {
"tokenizer": "whitespace",
"filter": [ "my_custom_stop_words_filter" ]
}
},
"filter": {
"my_custom_stop_words_filter": {
"type": "stop",
"ignore_case": true,
"stopwords": [ "and", "is", "the" ]
}
}
}
}
}
------------------------------------
----
Elasticsearch provides the following predefined list of languages:
[[analysis-stop-tokenfilter-stop-words-by-lang]]
==== Stop words by language
`_arabic_`, `_armenian_`, `_basque_`, `_bengali_`, `_brazilian_`, `_bulgarian_`,
`_catalan_`, `_czech_`, `_danish_`, `_dutch_`, `_english_`, `_estonian_`, `_finnish_`,
`_french_`, `_galician_`, `_german_`, `_greek_`, `_hindi_`, `_hungarian_`,
`_indonesian_`, `_irish_`, `_italian_`, `_latvian_`, `_norwegian_`, `_persian_`,
`_portuguese_`, `_romanian_`, `_russian_`, `_sorani_`, `_spanish_`,
`_swedish_`, `_thai_`, `_turkish_`.
The following list contains supported language values for the `stopwords`
parameter and a link to their predefined stop words in Lucene.
For the empty stopwords list (to disable stopwords) use: `_none_`.
[[arabic-stop-words]]
`_arabic_`::
{lucene-stop-word-link}/ar/stopwords.txt[Arabic stop words]
[[armenian-stop-words]]
`_armenian_`::
{lucene-stop-word-link}/hy/stopwords.txt[Armenian stop words]
[[basque-stop-words]]
`_basque_`::
{lucene-stop-word-link}/eu/stopwords.txt[Basque stop words]
[[bengali-stop-words]]
`_bengali_`::
{lucene-stop-word-link}/bn/stopwords.txt[Bengali stop words]
[[brazilian-stop-words]]
`_brazilian_` (Brazilian Portuguese)::
{lucene-stop-word-link}/br/stopwords.txt[Brazilian Portuguese stop words]
[[bulgarian-stop-words]]
`_bulgarian_`::
{lucene-stop-word-link}/bg/stopwords.txt[Bulgarian stop words]
[[catalan-stop-words]]
`_catalan_`::
{lucene-stop-word-link}/ca/stopwords.txt[Catalan stop words]
[[cjk-stop-words]]
`_cjk_` (Chinese, Japanese, and Korean)::
{lucene-stop-word-link}/cjk/stopwords.txt[CJK stop words]
[[czech-stop-words]]
`_czech_`::
{lucene-stop-word-link}/cz/stopwords.txt[Czech stop words]
[[danish-stop-words]]
`_danish_`::
{lucene-stop-word-link}/snowball/danish_stop.txt[Danish stop words]
[[dutch-stop-words]]
`_dutch_`::
{lucene-stop-word-link}/snowball/dutch_stop.txt[Dutch stop words]
[[english-stop-words]]
`_english_`::
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/en/EnglishAnalyzer.java#L46[English stop words]
[[estonian-stop-words]]
`_estonian_`::
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis/et/stopwords.txt[Estonian stop words]
[[finnish-stop-words]]
`_finnish_`::
{lucene-stop-word-link}/snowball/finnish_stop.txt[Finnish stop words]
[[french-stop-words]]
`_french_`::
{lucene-stop-word-link}/snowball/french_stop.txt[French stop words]
[[galician-stop-words]]
`_galician_`::
{lucene-stop-word-link}/gl/stopwords.txt[Galician stop words]
[[german-stop-words]]
`_german_`::
{lucene-stop-word-link}/snowball/german_stop.txt[German stop words]
[[greek-stop-words]]
`_greek_`::
{lucene-stop-word-link}/el/stopwords.txt[Greek stop words]
[[hindi-stop-words]]
`_hindi_`::
{lucene-stop-word-link}/hi/stopwords.txt[Hindi stop words]
[[hungarian-stop-words]]
`_hungarian_`::
{lucene-stop-word-link}/snowball/hungarian_stop.txt[Hungarian stop words]
[[indonesian-stop-words]]
`_indonesian_`::
{lucene-stop-word-link}/id/stopwords.txt[Indonesian stop words]
[[irish-stop-words]]
`_irish_`::
{lucene-stop-word-link}/ga/stopwords.txt[Irish stop words]
[[italian-stop-words]]
`_italian_`::
{lucene-stop-word-link}/snowball/italian_stop.txt[Italian stop words]
[[latvian-stop-words]]
`_latvian_`::
{lucene-stop-word-link}/lv/stopwords.txt[Latvian stop words]
[[lithuanian-stop-words]]
`_lithuanian_`::
{lucene-stop-word-link}/lt/stopwords.txt[Lithuanian stop words]
[[norwegian-stop-words]]
`_norwegian_`::
{lucene-stop-word-link}/snowball/norwegian_stop.txt[Norwegian stop words]
[[persian-stop-words]]
`_persian_`::
{lucene-stop-word-link}/fa/stopwords.txt[Persian stop words]
[[portuguese-stop-words]]
`_portuguese_`::
{lucene-stop-word-link}/snowball/portuguese_stop.txt[Portuguese stop words]
[[romanian-stop-words]]
`_romanian_`::
{lucene-stop-word-link}/ro/stopwords.txt[Romanian stop words]
[[russian-stop-words]]
`_russian_`::
{lucene-stop-word-link}/snowball/russian_stop.txt[Russian stop words]
[[sorani-stop-words]]
`_sorani_`::
{lucene-stop-word-link}/ckb/stopwords.txt[Sorani stop words]
[[spanish-stop-words]]
`_spanish_`::
{lucene-stop-word-link}/snowball/spanish_stop.txt[Spanish stop words]
[[swedish-stop-words]]
`_swedish_`::
{lucene-stop-word-link}/snowball/swedish_stop.txt[Swedish stop words]
[[thai-stop-words]]
`_thai_`::
{lucene-stop-word-link}/th/stopwords.txt[Thai stop words]
[[turkish-stop-words]]
`_turkish_`::
{lucene-stop-word-link}/tr/stopwords.txt[Turkish stop words]