[DOCS] Reformat length token filter docs (#49805)

* Adds a title abbreviation
* Updates the description and adds a Lucene link
* Reformats the parameters section
* Adds analyze, custom analyzer, and custom filter snippets

Relates to #44726.
This commit is contained in:
James Rodewig 2019-12-04 09:58:19 -05:00
parent 6dcb7fa50e
commit 87a73b6bdf
1 changed files with 165 additions and 11 deletions

View File

@ -1,16 +1,170 @@
[[analysis-length-tokenfilter]]
=== Length Token Filter
=== Length token filter
++++
<titleabbrev>Length</titleabbrev>
++++
A token filter of type `length` that removes words that are too long or
too short for the stream.
Removes tokens shorter or longer than specified character lengths.
For example, you can use the `length` filter to exclude tokens shorter than 2
characters and tokens longer than 5 characters.
The following are settings that can be set for a `length` token filter
type:
This filter uses Lucene's
https://lucene.apache.org/core/{lucene_version_path}/analyzers-common/org/apache/lucene/analysis/miscellaneous/LengthFilter.html[LengthFilter].
[cols="<,<",options="header",]
|===========================================================
|Setting |Description
|`min` |The minimum number. Defaults to `0`.
|`max` |The maximum number. Defaults to `Integer.MAX_VALUE`, which is `2^31-1` or 2147483647.
|===========================================================
[TIP]
====
The `length` filter removes entire tokens. If you'd prefer to shorten tokens to
a specific length, use the <<analysis-truncate-tokenfilter,`truncate`>> filter.
====
[[analysis-length-tokenfilter-analyze-ex]]
==== Example
The following <<indices-analyze,analyze API>> request uses the `length`
filter to remove tokens longer than 4 characters:
[source,console]
--------------------------------------------------
GET _analyze
{
"tokenizer": "whitespace",
"filter": [
{
"type": "length",
"min": 0,
"max": 4
}
],
"text": "the quick brown fox jumps over the lazy dog"
}
--------------------------------------------------
The filter produces the following tokens:
[source,text]
--------------------------------------------------
[ the, fox, over, the, lazy, dog ]
--------------------------------------------------
/////////////////////
[source,console-result]
--------------------------------------------------
{
"tokens": [
{
"token": "the",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "fox",
"start_offset": 16,
"end_offset": 19,
"type": "word",
"position": 3
},
{
"token": "over",
"start_offset": 26,
"end_offset": 30,
"type": "word",
"position": 5
},
{
"token": "the",
"start_offset": 31,
"end_offset": 34,
"type": "word",
"position": 6
},
{
"token": "lazy",
"start_offset": 35,
"end_offset": 39,
"type": "word",
"position": 7
},
{
"token": "dog",
"start_offset": 40,
"end_offset": 43,
"type": "word",
"position": 8
}
]
}
--------------------------------------------------
/////////////////////
[[analysis-length-tokenfilter-analyzer-ex]]
==== Add to an analyzer
The following <<indices-create-index,create index API>> request uses the
`length` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.
[source,console]
--------------------------------------------------
PUT length_example
{
"settings": {
"analysis": {
"analyzer": {
"standard_length": {
"tokenizer": "standard",
"filter": [ "length" ]
}
}
}
}
}
--------------------------------------------------
[[analysis-length-tokenfilter-configure-parms]]
==== Configurable parameters
`min`::
(Optional, integer)
Minimum character length of a token. Shorter tokens are excluded from the
output. Defaults to `0`.
`max`::
(Optional, integer)
Maximum character length of a token. Longer tokens are excluded from the output.
Defaults to `Integer.MAX_VALUE`, which is `2^31-1` or `2147483647`.
[[analysis-length-tokenfilter-customize]]
==== Customize
To customize the `length` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.
For example, the following request creates a custom `length` filter that removes
tokens shorter than 2 characters and tokens longer than 10 characters:
[source,console]
--------------------------------------------------
PUT length_custom_example
{
"settings": {
"analysis": {
"analyzer": {
"whitespace_length_2_to_10_char": {
"tokenizer": "whitespace",
"filter": [ "length_2_to_10_char" ]
}
},
"filter": {
"length_2_to_10_char": {
"type": "length",
"min": 2,
"max": 10
}
}
}
}
}
--------------------------------------------------