[DOCS] Reformat `word_delimiter` token filter (#53387)
Makes the following changes to the `word_delimiter` token filter docs:

* Adds a warning admonition recommending the `word_delimiter_graph` filter instead. This warning includes a link to the deprecated Lucene `WordDelimiterFilter`.
* Updates the description
* Adds detailed analyze snippet
* Adds custom analyzer and custom filter snippets
* Reorganizes and updates parameter documentation
++++
<titleabbrev>Word delimiter</titleabbrev>
++++

[WARNING]
====
We recommend using the
<<analysis-word-delimiter-graph-tokenfilter,`word_delimiter_graph`>> instead of
the `word_delimiter` filter.

The `word_delimiter` filter can produce invalid token graphs. See
<<analysis-word-delimiter-graph-differences>>.

The `word_delimiter` filter also uses Lucene's
{lucene-analysis-docs}/miscellaneous/WordDelimiterFilter.html[WordDelimiterFilter],
which is marked as deprecated.
====

Splits tokens at non-alphanumeric characters. The `word_delimiter` filter
also performs optional token normalization based on a set of rules. By default,
the filter uses the following rules:

* Split tokens at non-alphanumeric characters.
  The filter uses these characters as delimiters.
  For example: `Super-Duper` -> `Super`, `Duper`
* Remove leading or trailing delimiters from each token.
  For example: `XL---42+'Autocoder'` -> `XL`, `42`, `Autocoder`
* Split tokens at letter case transitions.
  For example: `PowerShot` -> `Power`, `Shot`
* Split tokens at letter-number transitions.
  For example: `XL500` -> `XL`, `500`
* Remove the English possessive (`'s`) from the end of each token.
  For example: `Neil's` -> `Neil`

[TIP]
====
The `word_delimiter` filter was designed to remove punctuation from complex
identifiers, such as product IDs or part numbers. For these use cases, we
recommend using the `word_delimiter` filter with the
<<analysis-keyword-tokenizer,`keyword`>> tokenizer.

Avoid using the `word_delimiter` filter to split hyphenated words, such as
`wi-fi`. Because users often search for these words both with and without
hyphens, we recommend using the
<<analysis-synonym-graph-tokenfilter,`synonym_graph`>> filter instead.
====
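
As a sketch of that synonym-based alternative (illustrative only; the inline
filter definition and synonym list below are placeholders, not examples from
this page), a `synonym_graph` filter can make the hyphenated and unhyphenated
forms match one another:

[source,console]
----
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "synonym_graph",
      "synonyms": [ "wi-fi, wifi" ] <1>
    }
  ],
  "text": "wi-fi"
}
----
<1> Placeholder synonym set; in practice, list the variants your users search for.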

[[analysis-word-delimiter-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the
`word_delimiter` filter to split `Neil's-Super-Duper-XL500--42+AutoCoder`
into normalized tokens using the filter's default rules:

[source,console]
----
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [ "word_delimiter" ],
  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
}
----

The filter produces the following tokens:

[source,txt]
----
[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]
----

////
[source,console-result]
----
{
  "tokens": [
    {
      "token": "Neil",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    },
    {
      "token": "Super",
      "start_offset": 7,
      "end_offset": 12,
      "type": "word",
      "position": 1
    },
    {
      "token": "Duper",
      "start_offset": 13,
      "end_offset": 18,
      "type": "word",
      "position": 2
    },
    {
      "token": "XL",
      "start_offset": 19,
      "end_offset": 21,
      "type": "word",
      "position": 3
    },
    {
      "token": "500",
      "start_offset": 21,
      "end_offset": 24,
      "type": "word",
      "position": 4
    },
    {
      "token": "42",
      "start_offset": 26,
      "end_offset": 28,
      "type": "word",
      "position": 5
    },
    {
      "token": "Auto",
      "start_offset": 29,
      "end_offset": 33,
      "type": "word",
      "position": 6
    },
    {
      "token": "Coder",
      "start_offset": 33,
      "end_offset": 38,
      "type": "word",
      "position": 7
    }
  ]
}
----
////

[[analysis-word-delimiter-tokenfilter-analyzer-ex]]
==== Add to an analyzer

The following <<indices-create-index,create index API>> request uses the
`word_delimiter` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.

[source,console]
----
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "word_delimiter" ]
        }
      }
    }
  }
}
----
|
||||
|
||||
[WARNING]
|
||||
====
|
||||
Avoid using the `word_delimiter` filter with tokenizers that remove punctuation,
|
||||
such as the <<analysis-standard-tokenizer,`standard`>> tokenizer. This could
|
||||
prevent the `word_delimiter` filter from splitting tokens correctly. It can also
|
||||
interfere with the filter's configurable parameters, such as `catenate_all` or
|
||||
`preserve_original`. We recommend using the
|
||||
<<analysis-keyword-tokenizer,`keyword`>> or
|
||||
<<analysis-whitespace-tokenizer,`whitespace`>> tokenizer instead.
|
||||
====
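
To see the difference directly, you can compare tokenizer output with a request
along these lines (an illustrative sketch, not one of this page's examples).
The `standard` tokenizer discards the hyphens before the filter runs, so there
is no hyphenated original left for `preserve_original` to emit:

[source,console]
----
GET /_analyze
{
  "tokenizer": "standard", <1>
  "filter": [
    {
      "type": "word_delimiter",
      "preserve_original": true
    }
  ],
  "text": "super-duper-xl-500"
}
----
<1> Swapping in the `keyword` tokenizer keeps the full string intact, letting
the filter emit `super-duper-xl-500` as the preserved original.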

[[word-delimiter-tokenfilter-configure-parms]]
==== Configurable parameters

`catenate_all`::
+
--
(Optional, boolean)
If `true`, the filter produces catenated tokens for chains of alphanumeric
characters separated by non-alphabetic delimiters. For example:
`super-duper-xl-500` -> [ `super`, **`superduperxl500`**, `duper`, `xl`, `500`
]. Defaults to `false`.

[WARNING]
====
When used for search analysis, catenated tokens can cause problems for the
<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
rely on token position for matching. Avoid setting this parameter to `true` if
you plan to use these queries.
====
--
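
You can experiment with this parameter using an inline filter definition, as in
the following sketch, which should return the catenated `superduperxl500` token
alongside the individual parts:

[source,console]
----
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter",
      "catenate_all": true
    }
  ],
  "text": "super-duper-xl-500"
}
----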

`catenate_numbers`::
+
--
(Optional, boolean)
If `true`, the filter produces catenated tokens for chains of numeric characters
separated by non-alphabetic delimiters. For example: `01-02-03` ->
[ `01`, **`010203`**, `02`, `03` ]. Defaults to `false`.

[WARNING]
====
When used for search analysis, catenated tokens can cause problems for the
<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
rely on token position for matching. Avoid setting this parameter to `true` if
you plan to use these queries.
====
--

`catenate_words`::
+
--
(Optional, boolean)
If `true`, the filter produces catenated tokens for chains of alphabetical
characters separated by non-alphabetic delimiters. For example: `super-duper-xl`
-> [ `super`, **`superduperxl`**, `duper`, `xl` ]. Defaults to `false`.

[WARNING]
====
When used for search analysis, catenated tokens can cause problems for the
<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
rely on token position for matching. Avoid setting this parameter to `true` if
you plan to use these queries.
====
--

`generate_number_parts`::
(Optional, boolean)
If `true`, the filter includes tokens consisting of only numeric characters in
the output. If `false`, the filter excludes these tokens from the output.
Defaults to `true`.

`generate_word_parts`::
(Optional, boolean)
If `true`, the filter includes tokens consisting of only alphabetical characters
in the output. If `false`, the filter excludes these tokens from the output.
Defaults to `true`.
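
These parameters are often combined with the `catenate_*` parameters. For
instance, the following sketch (illustrative only) disables both generated
parts so that only the catenated token should remain, leaving `wifi4000` as
the sole output token:

[source,console]
----
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter",
      "generate_word_parts": false,
      "generate_number_parts": false,
      "catenate_all": true
    }
  ],
  "text": "wi-fi-4000"
}
----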

`preserve_original`::
(Optional, boolean)
If `true`, the filter includes the original version of any split tokens in the
output. This original version includes non-alphanumeric delimiters. For example:
`super-duper-xl-500` -> [ **`super-duper-xl-500`**, `super`, `duper`, `xl`,
`500` ]. Defaults to `false`.
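
For example, the following illustrative request pairs `preserve_original` with
the `keyword` tokenizer so the unsplit original survives tokenization:

[source,console]
----
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter",
      "preserve_original": true
    }
  ],
  "text": "super-duper-xl-500"
}
----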

`protected_words`::
(Optional, array of strings)
Array of tokens the filter won't split.

`protected_words_path`::
+
--
(Optional, string)
Path to a file that contains a list of tokens the filter won't split.

This path must be absolute or relative to the `config` location, and the file
must be UTF-8 encoded. Each token in the file must be separated by a line
break.
--
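
For example, in the following sketch (the protected token list is a
placeholder), `wi-fi` should pass through unsplit while `SD500` is still
divided into `SD` and `500`:

[source,console]
----
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "word_delimiter",
      "protected_words": [ "wi-fi" ] <1>
    }
  ],
  "text": "wi-fi SD500"
}
----
<1> Placeholder list; `protected_words_path` accepts the same tokens, one per line.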

`split_on_case_change`::
(Optional, boolean)
If `true`, the filter splits tokens at letter case transitions. For example:
`camelCase` -> [ `camel`, `Case` ]. Defaults to `true`.

`split_on_numerics`::
(Optional, boolean)
If `true`, the filter splits tokens at letter-number transitions. For example:
`j2se` -> [ `j`, `2`, `se` ]. Defaults to `true`.

`stem_english_possessive`::
(Optional, boolean)
If `true`, the filter removes the English possessive (`'s`) from the end of each
token. For example: `O'Neil's` -> [ `O`, `Neil` ]. Defaults to `true`.

`type_table`::
+
--
(Optional, array of strings)
Array of custom type mappings for characters. This allows you to map
non-alphanumeric characters as numeric or alphanumeric to avoid splitting on
those characters.

For example, the following array maps the plus (`+`) and hyphen (`-`) characters
as alphanumeric, which means they won't be treated as delimiters:

`[ "+ => ALPHA", "- => ALPHA" ]`

Supported types include:

* `ALPHA` (Alphabetical)
* `ALPHANUM` (Alphanumeric)
* `DIGIT` (Numeric)
* `LOWER` (Lowercase alphabetical)
* `SUBWORD_DELIM` (Non-alphanumeric delimiter)
* `UPPER` (Uppercase alphabetical)
--
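
As an inline sketch, mapping the hyphen to `ALPHA` should keep a hyphenated
token such as `wi-fi` intact rather than splitting it into `wi` and `fi`:

[source,console]
----
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter",
      "type_table": [ "- => ALPHA" ]
    }
  ],
  "text": "wi-fi"
}
----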

`type_table_path`::
+
--
(Optional, string)
Path to a file that contains custom type mappings for characters. This allows
you to map non-alphanumeric characters as numeric or alphanumeric to avoid
splitting on those characters.

For example, the contents of this file may contain the following:

[source,txt]
----
# Map the $, %, '.', and ',' characters to DIGIT
# This might be useful for financial data.
$ => DIGIT
% => DIGIT
. => DIGIT
\\u002C => DIGIT

# in some cases you might not want to split on ZWJ
# this also tests the case where we need a bigger byte[]
# see http://en.wikipedia.org/wiki/Zero-width_joiner
\\u200D => ALPHANUM
----

Supported types include:

* `ALPHA` (Alphabetical)
* `ALPHANUM` (Alphanumeric)
* `DIGIT` (Numeric)
* `LOWER` (Lowercase alphabetical)
* `SUBWORD_DELIM` (Non-alphanumeric delimiter)
* `UPPER` (Uppercase alphabetical)

This file path must be absolute or relative to the `config` location, and the
file must be UTF-8 encoded. Each mapping in the file must be separated by a line
break.
--
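
To reference such a file from a filter definition, a configuration along these
lines can be used (a sketch; the index name and the `analysis/type_table.txt`
path are placeholders, and the file must exist under the `config` directory):

[source,console]
----
PUT /my_type_table_index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter",
          "type_table_path": "analysis/type_table.txt" <1>
        }
      }
    }
  }
}
----
<1> Placeholder path, resolved relative to the `config` location.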

[[analysis-word-delimiter-tokenfilter-customize]]
==== Customize

To customize the `word_delimiter` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.

For example, the following request creates a `word_delimiter`
filter that uses the following rules:

* Split tokens at non-alphanumeric characters, _except_ the hyphen (`-`)
  character.
* Remove leading or trailing delimiters from each token.
* Do _not_ split tokens at letter case transitions.
* Do _not_ split tokens at letter-number transitions.
* Remove the English possessive (`'s`) from the end of each token.

[source,console]
----
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "my_custom_word_delimiter_filter" ]
        }
      },
      "filter": {
        "my_custom_word_delimiter_filter": {
          "type": "word_delimiter",
          "type_table": [ "- => ALPHA" ],
          "split_on_case_change": false,
          "split_on_numerics": false,
          "stem_english_possessive": true
        }
      }
    }
  }
}
----
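
To check the effect of these rules, you can run the custom analyzer against
sample text with a follow-up request such as the following (illustrative; it
assumes the index above was created):

[source,console]
----
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Neil's-Super-Duper-XL500--42+AutoCoder"
}
----

Because the hyphen is mapped to `ALPHA` and the case and letter-number splits
are disabled, hyphenated runs stay together instead of being broken apart as in
the earlier default-rules example.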