[DOCS] Reformat `word_delimiter_graph` token filter (#53170) (#53272)

Makes the following changes to the `word_delimiter_graph` token filter
docs:

* Updates the Lucene experimental admonition.
* Updates description
* Adds analyze snippet
* Adds custom analyzer and custom filter snippets
* Reorganizes and updates parameter list
* Expands and updates section re: differences between `word_delimiter`
  and `word_delimiter_graph`
James Rodewig 2020-03-09 06:45:44 -04:00 committed by GitHub
parent aafc0409a9
commit 28cb4a167d
4 changed files with 621 additions and 81 deletions

@@ -4,105 +4,496 @@
<titleabbrev>Word delimiter graph</titleabbrev>
++++
experimental[This functionality is marked as experimental in Lucene]
Splits tokens at non-alphanumeric characters. The `word_delimiter_graph` filter
also performs optional token normalization based on a set of rules. By default,
the filter uses the following rules:

* Split tokens at non-alphanumeric characters.
  The filter uses these characters as delimiters.
  For example: `Super-Duper` -> `Super`, `Duper`
* Remove leading or trailing delimiters from each token.
  For example: `XL---42+'Autocoder'` -> `XL`, `42`, `Autocoder`
* Split tokens at letter case transitions.
  For example: `PowerShot` -> `Power`, `Shot`
* Split tokens at letter-number transitions.
  For example: `XL500` -> `XL`, `500`
* Remove the English possessive (`'s`) from the end of each token.
  For example: `Neil's` -> `Neil`

The `word_delimiter_graph` filter uses Lucene's
{lucene-analysis-docs}/miscellaneous/WordDelimiterGraphFilter.html[WordDelimiterGraphFilter].

[TIP]
====
The `word_delimiter_graph` filter was designed to remove punctuation from
complex identifiers, such as product IDs or part numbers. For these use cases,
we recommend using the `word_delimiter_graph` filter with the
<<analysis-keyword-tokenizer,`keyword`>> tokenizer.

Avoid using the `word_delimiter_graph` filter to split hyphenated words, such as
`wi-fi`. Because users often search for these words both with and without
hyphens, we recommend using the
<<analysis-synonym-graph-tokenfilter,`synonym_graph`>> filter instead.
====

[[analysis-word-delimiter-graph-tokenfilter-analyze-ex]]
==== Example

The following <<indices-analyze,analyze API>> request uses the
`word_delimiter_graph` filter to split `Neil's Super-Duper-XL500--42+AutoCoder`
into normalized tokens using the filter's default rules:
[source,console]
----
GET /_analyze
{
  "tokenizer": "whitespace",
  "filter": [ "word_delimiter_graph" ],
  "text": "Neil's Super-Duper-XL500--42+AutoCoder"
}
----
The filter produces the following tokens:
[source,txt]
----
[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]
----
////
[source,console-result]
----
{
  "tokens" : [
    {
      "token" : "Neil",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "Super",
      "start_offset" : 7,
      "end_offset" : 12,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "Duper",
      "start_offset" : 13,
      "end_offset" : 18,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "XL",
      "start_offset" : 19,
      "end_offset" : 21,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "500",
      "start_offset" : 21,
      "end_offset" : 24,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "42",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "word",
      "position" : 5
    },
    {
      "token" : "Auto",
      "start_offset" : 29,
      "end_offset" : 33,
      "type" : "word",
      "position" : 6
    },
    {
      "token" : "Coder",
      "start_offset" : 33,
      "end_offset" : 38,
      "type" : "word",
      "position" : 7
    }
  ]
}
----
////
[[analysis-word-delimiter-graph-tokenfilter-analyzer-ex]]
==== Add to an analyzer
The following <<indices-create-index,create index API>> request uses the
`word_delimiter_graph` filter to configure a new
<<analysis-custom-analyzer,custom analyzer>>.
[source,console]
----
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "word_delimiter_graph" ]
        }
      }
    }
  }
}
----
[WARNING]
====
Avoid using the `word_delimiter_graph` filter with tokenizers that remove
punctuation, such as the <<analysis-standard-tokenizer,`standard`>> tokenizer.
This could prevent the `word_delimiter_graph` filter from splitting tokens
correctly. It can also interfere with the filter's configurable parameters, such
as <<word-delimiter-graph-tokenfilter-catenate-all,`catenate_all`>> or
<<word-delimiter-graph-tokenfilter-preserve-original,`preserve_original`>>. We
recommend using the <<analysis-keyword-tokenizer,`keyword`>> or
<<analysis-whitespace-tokenizer,`whitespace`>> tokenizer instead.
====
[[word-delimiter-graph-tokenfilter-configure-parms]]
==== Configurable parameters
[[word-delimiter-graph-tokenfilter-adjust-offsets]]
`adjust_offsets`::
+
--
(Optional, boolean)
If `true`, the filter adjusts the offsets of split or catenated tokens to better
reflect their actual position in the token stream. Defaults to `true`.
[WARNING]
====
Set `adjust_offsets` to `false` if your analyzer uses filters, such as the
<<analysis-trim-tokenfilter,`trim`>> filter, that change the length of tokens
without changing their offsets. Otherwise, the `word_delimiter_graph` filter
could produce tokens with illegal offsets.
====
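
For reference, here is a minimal sketch of a custom filter that disables offset
adjustment because the analyzer also runs the `trim` filter. The index name
`my-wdg-index` and filter name `my_wdg` are only illustrative and do not appear
elsewhere on this page:

[source,console]
----
PUT /my-wdg-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_wdg": {
          "type": "word_delimiter_graph",
          "adjust_offsets": false
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "trim", "my_wdg" ] <1>
        }
      }
    }
  }
}
----
<1> Hypothetical filter chain for this sketch: `trim` can change token lengths
without changing offsets, so `adjust_offsets` is set to `false`.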
--
[[word-delimiter-graph-tokenfilter-catenate-all]]
`catenate_all`::
+
--
(Optional, boolean)
If `true`, the filter produces catenated tokens for chains of alphanumeric
characters separated by non-alphabetic delimiters. For example:
`super-duper-xl-500` -> [**`superduperxl500`**, `super`, `duper`, `xl`, `500` ].
Defaults to `false`.
[WARNING]
====
Setting this parameter to `true` produces multi-position tokens, which are not
supported by indexing.
If this parameter is `true`, avoid using this filter in an index analyzer or
use the <<analysis-flatten-graph-tokenfilter,`flatten_graph`>> filter after
this filter to make the token stream suitable for indexing.
When used for search analysis, catenated tokens can cause problems for the
<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
rely on token position for matching. Avoid setting this parameter to `true` if
you plan to use these queries.
====
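
As a rough illustration (this request is a sketch, not one of the reference
examples on this page), you can pass an inline filter definition to the
analyze API to see the catenated token alongside the split tokens:

[source,console]
----
GET /_analyze
{
  "tokenizer": "keyword",
  "filter": [
    {
      "type": "word_delimiter_graph",
      "catenate_all": true <1>
    }
  ],
  "text": "super-duper-xl-500"
}
----
<1> Inline, request-only filter definition used for this sketch.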
--
[[word-delimiter-graph-tokenfilter-catenate-numbers]]
`catenate_numbers`::
+
--
(Optional, boolean)
If `true`, the filter produces catenated tokens for chains of numeric characters
separated by non-alphabetic delimiters. For example: `01-02-03` ->
[**`010203`**, `01`, `02`, `03` ]. Defaults to `false`.
[WARNING]
====
Setting this parameter to `true` produces multi-position tokens, which are not
supported by indexing.
If this parameter is `true`, avoid using this filter in an index analyzer or
use the <<analysis-flatten-graph-tokenfilter,`flatten_graph`>> filter after
this filter to make the token stream suitable for indexing.
When used for search analysis, catenated tokens can cause problems for the
<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
rely on token position for matching. Avoid setting this parameter to `true` if
you plan to use these queries.
====
--
[[word-delimiter-graph-tokenfilter-catenate-words]]
`catenate_words`::
+
--
(Optional, boolean)
If `true`, the filter produces catenated tokens for chains of alphabetical
characters separated by non-alphabetic delimiters. For example: `super-duper-xl`
-> [**`superduperxl`**, `super`, `duper`, `xl`]. Defaults to `false`.
[WARNING]
====
Setting this parameter to `true` produces multi-position tokens, which are not
supported by indexing.
If this parameter is `true`, avoid using this filter in an index analyzer or
use the <<analysis-flatten-graph-tokenfilter,`flatten_graph`>> filter after
this filter to make the token stream suitable for indexing.
When used for search analysis, catenated tokens can cause problems for the
<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
rely on token position for matching. Avoid setting this parameter to `true` if
you plan to use these queries.
====
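
If you do want catenated tokens at index time, a sketch like the following
(with illustrative index and filter names) appends the `flatten_graph` filter
so the stream can be indexed:

[source,console]
----
PUT /my-catenate-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_catenating_filter": {
          "type": "word_delimiter_graph",
          "catenate_words": true
        }
      },
      "analyzer": {
        "my_index_analyzer": {
          "tokenizer": "keyword",
          "filter": [ "my_catenating_filter", "flatten_graph" ] <1>
        }
      }
    }
  }
}
----
<1> `flatten_graph` runs after the graph-producing filter, making the stream
safe to index.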
--
`generate_number_parts`::
(Optional, boolean)
If `true`, the filter includes tokens consisting of only numeric characters in
the output. If `false`, the filter excludes these tokens from the output.
Defaults to `true`.

`generate_word_parts`::
(Optional, boolean)
If `true`, the filter includes tokens consisting of only alphabetical characters
in the output. If `false`, the filter excludes these tokens from the output.
Defaults to `true`.
[[word-delimiter-graph-tokenfilter-preserve-original]]
`preserve_original`::
+
--
(Optional, boolean)
If `true`, the filter includes the original version of any split tokens in the
output. This original version includes non-alphanumeric delimiters. For example:
`super-duper-xl-500` -> [**`super-duper-xl-500`**, `super`, `duper`, `xl`, `500`
]. Defaults to `false`.

[WARNING]
====
Setting this parameter to `true` produces multi-position tokens, which are not
supported by indexing.

If this parameter is `true`, avoid using this filter in an index analyzer or
use the <<analysis-flatten-graph-tokenfilter,`flatten_graph`>> filter after
this filter to make the token stream suitable for indexing.
====
--
`protected_words`::
(Optional, array of strings)
Array of tokens the filter won't split.
`protected_words_path`::
+
--
(Optional, string)
Path to a file that contains a list of tokens the filter won't split.
This path must be absolute or relative to the `config` location, and the file
must be UTF-8 encoded. Each token in the file must be separated by a line
break.
--
`split_on_case_change`::
(Optional, boolean)
If `true`, the filter splits tokens at letter case transitions. For example:
`camelCase` -> [ `camel`, `Case` ]. Defaults to `true`.
`split_on_numerics`::
(Optional, boolean)
If `true`, the filter splits tokens at letter-number transitions. For example:
`j2se` -> [ `j`, `2`, `se` ]. Defaults to `true`.
`stem_english_possessive`::
(Optional, boolean)
If `true`, the filter removes the English possessive (`'s`) from the end of each
token. For example: `O'Neil's` -> [ `O`, `Neil` ]. Defaults to `true`.
`type_table`::
+
--
(Optional, array of strings)
Array of custom type mappings for characters. This allows you to map
non-alphanumeric characters as numeric or alphanumeric to avoid splitting on
those characters.

For example, the following array maps the plus (`+`) and hyphen (`-`) characters
as alphanumeric, which means they won't be treated as delimiters:

`["+ => ALPHA", "- => ALPHA"]`

Supported types include:

* `ALPHA` (Alphabetical)
* `ALPHANUM` (Alphanumeric)
* `DIGIT` (Numeric)
* `LOWER` (Lowercase alphabetical)
* `SUBWORD_DELIM` (Non-alphanumeric delimiter)
* `UPPER` (Uppercase alphabetical)
--
`type_table_path`::
+
--
(Optional, string)
Path to a file that contains custom type mappings for characters. This allows
you to map non-alphanumeric characters as numeric or alphanumeric to avoid
splitting on those characters.
For example, the file may contain the following:
[source,txt]
----
# Map the $, %, '.', and ',' characters to DIGIT
# This might be useful for financial data.
$ => DIGIT
% => DIGIT
. => DIGIT
\\u002C => DIGIT
# in some cases you might not want to split on ZWJ
# this also tests the case where we need a bigger byte[]
# see http://en.wikipedia.org/wiki/Zero-width_joiner
\\u200D => ALPHANUM
----
Supported types include:
* `ALPHA` (Alphabetical)
* `ALPHANUM` (Alphanumeric)
* `DIGIT` (Numeric)
* `LOWER` (Lowercase alphabetical)
* `SUBWORD_DELIM` (Non-alphanumeric delimiter)
* `UPPER` (Uppercase alphabetical)
This file path must be absolute or relative to the `config` location, and the
file must be UTF-8 encoded. Each mapping in the file must be separated by a line
break.
--
[[analysis-word-delimiter-graph-tokenfilter-customize]]
==== Customize
To customize the `word_delimiter_graph` filter, duplicate it to create the basis
for a new custom token filter. You can modify the filter using its configurable
parameters.
For example, the following request creates a `word_delimiter_graph`
filter that uses the following rules:
* Split tokens at non-alphanumeric characters, _except_ the hyphen (`-`)
character.
* Remove leading or trailing delimiters from each token.
* Do _not_ split tokens at letter case transitions.
* Do _not_ split tokens at letter-number transitions.
* Remove the English possessive (`'s`) from the end of each token.
[source,console]
----
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "my_custom_word_delimiter_graph_filter" ]
        }
      },
      "filter": {
        "my_custom_word_delimiter_graph_filter": {
          "type": "word_delimiter_graph",
          "type_table": [ "- => ALPHA" ],
          "split_on_case_change": false,
          "split_on_numerics": false,
          "stem_english_possessive": true
        }
      }
    }
  }
}
----
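
As a quick, hypothetical check of the configuration above, you could run the
custom analyzer through the analyze API. Because the hyphen is mapped as
alphanumeric, a token such as `wi-fi` should pass through unsplit:

[source,console]
----
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Neil's wi-fi" <1>
}
----
<1> Sample text chosen for this sketch; it is not part of the reference example
above.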
[[analysis-word-delimiter-graph-differences]]
==== Differences between `word_delimiter_graph` and `word_delimiter`
Both the `word_delimiter_graph` and
<<analysis-word-delimiter-tokenfilter,`word_delimiter`>> filters produce tokens
that span multiple positions when any of the following parameters are `true`:
* <<word-delimiter-graph-tokenfilter-catenate-all,`catenate_all`>>
* <<word-delimiter-graph-tokenfilter-catenate-numbers,`catenate_numbers`>>
* <<word-delimiter-graph-tokenfilter-catenate-words,`catenate_words`>>
* <<word-delimiter-graph-tokenfilter-preserve-original,`preserve_original`>>
However, only the `word_delimiter_graph` filter assigns multi-position tokens a
`positionLength` attribute, which indicates the number of positions a token
spans. This ensures the `word_delimiter_graph` filter always produces valid token
https://en.wikipedia.org/wiki/Directed_acyclic_graph[graphs].
The `word_delimiter` filter does not assign multi-position tokens a
`positionLength` attribute. This means it produces invalid graphs for streams
including these tokens.
While indexing does not support token graphs containing multi-position tokens,
queries, such as the <<query-dsl-match-query-phrase,`match_phrase`>> query, can
use these graphs to generate multiple sub-queries from a single query string.
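
As a sketch of how that plays out in practice (the index, analyzer, and field
names here are illustrative and not taken from this page), you can flatten the
graph for the index analyzer and leave the graph-producing filter unflattened
for the search analyzer:

[source,console]
----
PUT /my-graph-index
{
  "settings": {
    "analysis": {
      "filter": {
        "my_wdg": {
          "type": "word_delimiter_graph",
          "preserve_original": true
        }
      },
      "analyzer": {
        "my_index_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "my_wdg", "flatten_graph" ] <1>
        },
        "my_search_analyzer": {
          "tokenizer": "whitespace",
          "filter": [ "my_wdg" ] <2>
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "my_index_analyzer",
        "search_analyzer": "my_search_analyzer"
      }
    }
  }
}
----
<1> At index time, `flatten_graph` collapses the multi-position tokens so the
stream can be indexed.
<2> At search time, queries such as `match_phrase` can consume the token graph
directly.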
To see how token graphs produced by the `word_delimiter` and
`word_delimiter_graph` filters differ, check out the following example.
.*Example*
[%collapsible]
====
[[analysis-word-delimiter-graph-basic-token-graph]]
*Basic token graph*
Both the `word_delimiter` and `word_delimiter_graph` filters produce the
following token graph for `PowerShot2000` when the following parameters are
`false`:
* <<word-delimiter-graph-tokenfilter-catenate-all,`catenate_all`>>
* <<word-delimiter-graph-tokenfilter-catenate-numbers,`catenate_numbers`>>
* <<word-delimiter-graph-tokenfilter-catenate-words,`catenate_words`>>
* <<word-delimiter-graph-tokenfilter-preserve-original,`preserve_original`>>
This graph does not contain multi-position tokens. All tokens span only one
position.
image::images/analysis/token-graph-basic.svg[align="center"]
[[analysis-word-delimiter-graph-wdg-token-graph]]
*`word_delimiter_graph` graph with a multi-position token*
The `word_delimiter_graph` filter produces the following token graph for
`PowerShot2000` when `catenate_words` is `true`.
This graph correctly indicates the catenated `PowerShot` token spans two
positions.
image::images/analysis/token-graph-wdg.svg[align="center"]
[[analysis-word-delimiter-graph-wd-token-graph]]
*`word_delimiter` graph with a multi-position token*
When `catenate_words` is `true`, the `word_delimiter` filter produces
the following token graph for `PowerShot2000`.
Note that the catenated `PowerShot` token should span two positions but only
spans one in the token graph, making it invalid.
image::images/analysis/token-graph-wd.svg[align="center"]
====
