Makes the following changes to the `word_delimiter_graph` token filter docs:

* Updates the Lucene experimental admonition
* Updates description
* Adds analyze snippet
* Adds custom analyzer and custom filter snippets
* Reorganizes and updates parameter list
* Expands and updates section re: differences between `word_delimiter` and `word_delimiter_graph`
parent
aafc0409a9
commit
28cb4a167d
|
@@ -4,91 +4,358 @@
|
||||||
<titleabbrev>Word delimiter graph</titleabbrev>
|
<titleabbrev>Word delimiter graph</titleabbrev>
|
||||||
++++
|
++++
|
||||||
|
|
||||||
experimental[This functionality is marked as experimental in Lucene]
|
Splits tokens at non-alphanumeric characters. The `word_delimiter_graph` filter
|
||||||
|
also performs optional token normalization based on a set of rules. By default,
|
||||||
|
the filter uses the following rules:
|
||||||
|
|
||||||
Named `word_delimiter_graph`, it splits words into subwords and performs
|
* Split tokens at non-alphanumeric characters.
|
||||||
optional transformations on subword groups. Words are split into
|
The filter uses these characters as delimiters.
|
||||||
subwords with the following rules:
|
For example: `Super-Duper` -> `Super`, `Duper`
|
||||||
|
* Remove leading or trailing delimiters from each token.
|
||||||
|
For example: `XL---42+'Autocoder'` -> `XL`, `42`, `Autocoder`
|
||||||
|
* Split tokens at letter case transitions.
|
||||||
|
For example: `PowerShot` -> `Power`, `Shot`
|
||||||
|
* Split tokens at letter-number transitions.
|
||||||
|
For example: `XL500` -> `XL`, `500`
|
||||||
|
* Remove the English possessive (`'s`) from the end of each token.
|
||||||
|
For example: `Neil's` -> `Neil`
|
||||||
|
|
||||||
* split on intra-word delimiters (by default, all non alpha-numeric
|
The `word_delimiter_graph` filter uses Lucene's
|
||||||
characters).
|
{lucene-analysis-docs}/miscellaneous/WordDelimiterGraphFilter.html[WordDelimiterGraphFilter].
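The default rules listed above can be approximated in a few lines of Python. This is only an illustrative sketch, not Lucene's implementation: the function name `default_split` and both regular expressions are assumptions made for this example, it handles ASCII characters only, and it splits the input on whitespace first to stand in for the `whitespace` tokenizer.

```python
import re

# Assumption: a possessive is "'s" that follows a letter and is followed by
# a delimiter or the end of the token.
POSSESSIVE = re.compile(r"(?<=[A-Za-z])'[sS](?![A-Za-z0-9])")

# A subword is an uppercase run with an optional lowercase tail, a lowercase
# run, or a digit run. Anything else acts as a delimiter and is dropped.
SUBWORD = re.compile(r"[A-Z]+[a-z]*|[a-z]+|[0-9]+")


def default_split(text):
    """Approximate the filter's default rules on whitespace-separated tokens."""
    tokens = []
    for token in text.split():
        token = POSSESSIVE.sub("", token)       # remove the English possessive
        tokens.extend(SUBWORD.findall(token))   # split at delimiters and transitions
    return tokens
```

Because the pattern only starts a new subword at a lowercase-to-uppercase boundary, `PowerShot` becomes `Power` and `Shot`, while an all-uppercase run such as `XL` stays in one piece.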
|
||||||
* "Wi-Fi" -> "Wi", "Fi"
|
|
||||||
* split on case transitions: "PowerShot" -> "Power", "Shot"
|
|
||||||
* split on letter-number transitions: "SD500" -> "SD", "500"
|
|
||||||
* leading and trailing intra-word delimiters on each subword are
|
|
||||||
ignored: "//hello---there, 'dude'" -> "hello", "there", "dude"
|
|
||||||
* trailing "'s" are removed for each subword: "O'Neil's" -> "O", "Neil"
|
|
||||||
|
|
||||||
Unlike the `word_delimiter`, this token filter correctly handles positions for
|
[TIP]
|
||||||
multi terms expansion at search-time when any of the following options
|
====
|
||||||
are set to true:
|
The `word_delimiter_graph` filter was designed to remove punctuation from
|
||||||
|
complex identifiers, such as product IDs or part numbers. For these use cases,
|
||||||
|
we recommend using the `word_delimiter_graph` filter with the
|
||||||
|
<<analysis-keyword-tokenizer,`keyword`>> tokenizer.
|
||||||
|
|
||||||
* `preserve_original`
|
Avoid using the `word_delimiter_graph` filter to split hyphenated words, such as
|
||||||
* `catenate_numbers`
|
`wi-fi`. Because users often search for these words both with and without
|
||||||
* `catenate_words`
|
hyphens, we recommend using the
|
||||||
* `catenate_all`
|
<<analysis-synonym-graph-tokenfilter,`synonym_graph`>> filter instead.
|
||||||
|
====
|
||||||
|
|
||||||
Parameters include:
|
[[analysis-word-delimiter-graph-tokenfilter-analyze-ex]]
|
||||||
|
==== Example
|
||||||
|
|
||||||
`generate_word_parts`::
|
The following <<indices-analyze,analyze API>> request uses the
|
||||||
If `true` causes parts of words to be
|
`word_delimiter_graph` filter to split `Neil's Super-Duper-XL500--42+AutoCoder`
|
||||||
generated: "PowerShot" -> "Power" "Shot". Defaults to `true`.
|
into normalized tokens using the filter's default rules:
|
||||||
|
|
||||||
|
[source,console]
|
||||||
|
----
|
||||||
|
GET /_analyze
|
||||||
|
{
|
||||||
|
"tokenizer": "whitespace",
|
||||||
|
"filter": [ "word_delimiter_graph" ],
|
||||||
|
"text": "Neil's Super-Duper-XL500--42+AutoCoder"
|
||||||
|
}
|
||||||
|
----
|
||||||
|
|
||||||
|
The filter produces the following tokens:
|
||||||
|
|
||||||
|
[source,txt]
|
||||||
|
----
|
||||||
|
[ Neil, Super, Duper, XL, 500, 42, Auto, Coder ]
|
||||||
|
----
|
||||||
|
|
||||||
|
////
|
||||||
|
[source,console-result]
|
||||||
|
----
|
||||||
|
{
|
||||||
|
"tokens" : [
|
||||||
|
{
|
||||||
|
"token" : "Neil",
|
||||||
|
"start_offset" : 0,
|
||||||
|
"end_offset" : 4,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 0
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"token" : "Super",
|
||||||
|
"start_offset" : 7,
|
||||||
|
"end_offset" : 12,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 1
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"token" : "Duper",
|
||||||
|
"start_offset" : 13,
|
||||||
|
"end_offset" : 18,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 2
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"token" : "XL",
|
||||||
|
"start_offset" : 19,
|
||||||
|
"end_offset" : 21,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 3
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"token" : "500",
|
||||||
|
"start_offset" : 21,
|
||||||
|
"end_offset" : 24,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 4
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"token" : "42",
|
||||||
|
"start_offset" : 26,
|
||||||
|
"end_offset" : 28,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 5
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"token" : "Auto",
|
||||||
|
"start_offset" : 29,
|
||||||
|
"end_offset" : 33,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 6
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"token" : "Coder",
|
||||||
|
"start_offset" : 33,
|
||||||
|
"end_offset" : 38,
|
||||||
|
"type" : "word",
|
||||||
|
"position" : 7
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
|
----
|
||||||
|
////
|
||||||
|
|
||||||
|
[[analysis-word-delimiter-tokenfilter-analyzer-ex]]
|
||||||
|
==== Add to an analyzer
|
||||||
|
|
||||||
|
The following <<indices-create-index,create index API>> request uses the
|
||||||
|
`word_delimiter_graph` filter to configure a new
|
||||||
|
<<analysis-custom-analyzer,custom analyzer>>.
|
||||||
|
|
||||||
|
[source,console]
|
||||||
|
----
|
||||||
|
PUT /my_index
|
||||||
|
{
|
||||||
|
"settings": {
|
||||||
|
"analysis": {
|
||||||
|
"analyzer": {
|
||||||
|
"my_analyzer": {
|
||||||
|
"tokenizer": "whitespace",
|
||||||
|
"filter": [ "word_delimiter_graph" ]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
----
|
||||||
|
|
||||||
|
[WARNING]
|
||||||
|
====
|
||||||
|
Avoid using the `word_delimiter_graph` filter with tokenizers that remove
|
||||||
|
punctuation, such as the <<analysis-standard-tokenizer,`standard`>> tokenizer.
|
||||||
|
This could prevent the `word_delimiter_graph` filter from splitting tokens
|
||||||
|
correctly. It can also interfere with the filter's configurable parameters, such
|
||||||
|
as <<word-delimiter-graph-tokenfilter-catenate-all,`catenate_all`>> or
|
||||||
|
<<word-delimiter-graph-tokenfilter-preserve-original,`preserve_original`>>. We
|
||||||
|
recommend using the <<analysis-keyword-tokenizer,`keyword`>> or
|
||||||
|
<<analysis-whitespace-tokenizer,`whitespace`>> tokenizer instead.
|
||||||
|
====
|
||||||
|
|
||||||
|
[[word-delimiter-graph-tokenfilter-configure-parms]]
|
||||||
|
==== Configurable parameters
|
||||||
|
|
||||||
|
[[word-delimiter-graph-tokenfilter-adjust-offsets]]
|
||||||
|
`adjust_offsets`::
|
||||||
|
+
|
||||||
|
--
|
||||||
|
(Optional, boolean)
|
||||||
|
If `true`, the filter adjusts the offsets of split or catenated tokens to better
|
||||||
|
reflect their actual position in the token stream. Defaults to `true`.
|
||||||
|
|
||||||
|
[WARNING]
|
||||||
|
====
|
||||||
|
Set `adjust_offsets` to `false` if your analyzer uses filters, such as the
|
||||||
|
<<analysis-trim-tokenfilter,`trim`>> filter, that change the length of tokens
|
||||||
|
without changing their offsets. Otherwise, the `word_delimiter_graph` filter
|
||||||
|
could produce tokens with illegal offsets.
|
||||||
|
====
|
||||||
|
--
|
||||||
|
|
||||||
|
[[word-delimiter-graph-tokenfilter-catenate-all]]
|
||||||
|
`catenate_all`::
|
||||||
|
+
|
||||||
|
--
|
||||||
|
(Optional, boolean)
|
||||||
|
If `true`, the filter produces catenated tokens for chains of alphanumeric
|
||||||
|
characters separated by non-alphanumeric delimiters. For example:
|
||||||
|
`super-duper-xl-500` -> [**`superduperxl500`**, `super`, `duper`, `xl`, `500` ].
|
||||||
|
Defaults to `false`.
|
||||||
|
|
||||||
|
[WARNING]
|
||||||
|
====
|
||||||
|
Setting this parameter to `true` produces multi-position tokens, which are not
|
||||||
|
supported by indexing.
|
||||||
|
|
||||||
|
If this parameter is `true`, avoid using this filter in an index analyzer or
|
||||||
|
use the <<analysis-flatten-graph-tokenfilter,`flatten_graph`>> filter after
|
||||||
|
this filter to make the token stream suitable for indexing.
|
||||||
|
|
||||||
|
When used for search analysis, catenated tokens can cause problems for the
|
||||||
|
<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
|
||||||
|
rely on token position for matching. Avoid setting this parameter to `true` if
|
||||||
|
you plan to use these queries.
|
||||||
|
====
|
||||||
|
--
|
||||||
|
|
||||||
|
[[word-delimiter-graph-tokenfilter-catenate-numbers]]
|
||||||
|
`catenate_numbers`::
|
||||||
|
+
|
||||||
|
--
|
||||||
|
(Optional, boolean)
|
||||||
|
If `true`, the filter produces catenated tokens for chains of numeric characters
|
||||||
|
separated by non-alphanumeric delimiters. For example: `01-02-03` ->
|
||||||
|
[**`010203`**, `01`, `02`, `03` ]. Defaults to `false`.
|
||||||
|
|
||||||
|
[WARNING]
|
||||||
|
====
|
||||||
|
Setting this parameter to `true` produces multi-position tokens, which are not
|
||||||
|
supported by indexing.
|
||||||
|
|
||||||
|
If this parameter is `true`, avoid using this filter in an index analyzer or
|
||||||
|
use the <<analysis-flatten-graph-tokenfilter,`flatten_graph`>> filter after
|
||||||
|
this filter to make the token stream suitable for indexing.
|
||||||
|
|
||||||
|
When used for search analysis, catenated tokens can cause problems for the
|
||||||
|
<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
|
||||||
|
rely on token position for matching. Avoid setting this parameter to `true` if
|
||||||
|
you plan to use these queries.
|
||||||
|
====
|
||||||
|
--
|
||||||
|
|
||||||
|
[[word-delimiter-graph-tokenfilter-catenate-words]]
|
||||||
|
`catenate_words`::
|
||||||
|
+
|
||||||
|
--
|
||||||
|
(Optional, boolean)
|
||||||
|
If `true`, the filter produces catenated tokens for chains of alphabetical
|
||||||
|
characters separated by non-alphanumeric delimiters. For example: `super-duper-xl`
|
||||||
|
-> [**`superduperxl`**, `super`, `duper`, `xl`]. Defaults to `false`.
|
||||||
|
|
||||||
|
[WARNING]
|
||||||
|
====
|
||||||
|
Setting this parameter to `true` produces multi-position tokens, which are not
|
||||||
|
supported by indexing.
|
||||||
|
|
||||||
|
If this parameter is `true`, avoid using this filter in an index analyzer or
|
||||||
|
use the <<analysis-flatten-graph-tokenfilter,`flatten_graph`>> filter after
|
||||||
|
this filter to make the token stream suitable for indexing.
|
||||||
|
|
||||||
|
When used for search analysis, catenated tokens can cause problems for the
|
||||||
|
<<query-dsl-match-query-phrase,`match_phrase`>> query and other queries that
|
||||||
|
rely on token position for matching. Avoid setting this parameter to `true` if
|
||||||
|
you plan to use these queries.
|
||||||
|
====
|
||||||
|
--
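To make the three `catenate_*` parameters concrete, the following Python sketch reproduces the examples given for them. It is a simplified model under stated assumptions, not the filter's actual algorithm: the `catenate` function and its keyword arguments are hypothetical names, subwords are ASCII-only, and the catenated token is simply emitted before the parts it was built from.

```python
import re
from itertools import groupby


def subwords(token):
    """Split a token into alphabetic and numeric parts, dropping delimiters."""
    return re.findall(r"[A-Za-z]+|[0-9]+", token)


def catenate(token, words=False, numbers=False, all_parts=False):
    """Emit catenated tokens followed by the parts they were built from."""
    parts = subwords(token)
    out = []
    if all_parts and len(parts) > 1:
        out.append("".join(parts))              # models catenate_all
    # Group consecutive parts into runs of numeric or alphabetic subwords.
    for is_numeric, run in groupby(parts, key=str.isdigit):
        run = list(run)
        wanted = numbers if is_numeric else words
        if wanted and len(run) > 1:
            out.append("".join(run))            # models catenate_numbers / catenate_words
        out.extend(run)
    return out
```

In this model each catenated token shares its starting point with the first part it covers, which is why the real filter has to record how many positions such a token spans.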
|
||||||
|
|
||||||
`generate_number_parts`::
|
`generate_number_parts`::
|
||||||
If `true` causes number subwords to be
|
(Optional, boolean)
|
||||||
generated: "500-42" -> "500" "42". Defaults to `true`.
|
If `true`, the filter includes tokens consisting of only numeric characters in
|
||||||
|
the output. If `false`, the filter excludes these tokens from the output.
|
||||||
|
Defaults to `true`.
|
||||||
|
|
||||||
`catenate_words`::
|
`generate_word_parts`::
|
||||||
If `true` causes maximum runs of word parts to be
|
(Optional, boolean)
|
||||||
catenated: "wi-fi" -> "wifi". Defaults to `false`.
|
If `true`, the filter includes tokens consisting of only alphabetical characters
|
||||||
|
in the output. If `false`, the filter excludes these tokens from the output.
|
||||||
`catenate_numbers`::
|
Defaults to `true`.
|
||||||
If `true` causes maximum runs of number parts to
|
|
||||||
be catenated: "500-42" -> "50042". Defaults to `false`.
|
|
||||||
|
|
||||||
`catenate_all`::
|
|
||||||
If `true` causes all subword parts to be catenated:
|
|
||||||
"wi-fi-4000" -> "wifi4000". Defaults to `false`.
|
|
||||||
|
|
||||||
`split_on_case_change`::
|
|
||||||
If `true` causes "PowerShot" to be two tokens;
|
|
||||||
("Power-Shot" remains two parts regards). Defaults to `true`.
|
|
||||||
|
|
||||||
|
[[word-delimiter-graph-tokenfilter-preserve-original]]
|
||||||
`preserve_original`::
|
`preserve_original`::
|
||||||
If `true` includes original words in subwords:
|
+
|
||||||
"500-42" -> "500-42" "500" "42". Defaults to `false`.
|
--
|
||||||
|
(Optional, boolean)
|
||||||
|
If `true`, the filter includes the original version of any split tokens in the
|
||||||
|
output. This original version includes non-alphanumeric delimiters. For example:
|
||||||
|
`super-duper-xl-500` -> [**`super-duper-xl-500`**, `super`, `duper`, `xl`, `500`
|
||||||
|
]. Defaults to `false`.
|
||||||
|
|
||||||
`split_on_numerics`::
|
[WARNING]
|
||||||
If `true` causes "j2se" to be three tokens; "j"
|
====
|
||||||
"2" "se". Defaults to `true`.
|
Setting this parameter to `true` produces multi-position tokens, which are not
|
||||||
|
supported by indexing.
|
||||||
|
|
||||||
`stem_english_possessive`::
|
If this parameter is `true`, avoid using this filter in an index analyzer or
|
||||||
If `true` causes trailing "'s" to be
|
use the <<analysis-flatten-graph-tokenfilter,`flatten_graph`>> filter after
|
||||||
removed for each subword: "O'Neil's" -> "O", "Neil". Defaults to `true`.
|
this filter to make the token stream suitable for indexing.
|
||||||
|
====
|
||||||
Advance settings include:
|
--
|
||||||
|
|
||||||
`protected_words`::
|
`protected_words`::
|
||||||
A list of protected words from being delimiter.
|
(Optional, array of strings)
|
||||||
Either an array, or also can set `protected_words_path` which resolved
|
Array of tokens the filter won't split.
|
||||||
to a file configured with protected words (one on each line).
|
|
||||||
Automatically resolves to `config/` based location if exists.
|
|
||||||
|
|
||||||
`adjust_offsets`::
|
`protected_words_path`::
|
||||||
By default, the filter tries to output subtokens with adjusted offsets
|
+
|
||||||
to reflect their actual position in the token stream. However, when
|
--
|
||||||
used in combination with other filters that alter the length or starting
|
(Optional, string)
|
||||||
position of tokens without changing their offsets
|
Path to a file that contains a list of tokens the filter won't split.
|
||||||
(e.g. <<analysis-trim-tokenfilter,`trim`>>) this can cause tokens with
|
|
||||||
illegal offsets to be emitted. Setting `adjust_offsets` to false will
|
This path must be absolute or relative to the `config` location, and the file
|
||||||
stop `word_delimiter_graph` from adjusting these internal offsets.
|
must be UTF-8 encoded. Each token in the file must be separated by a line
|
||||||
|
break.
|
||||||
|
--
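The file format described for `protected_words_path` (UTF-8, one token per line) is simple enough to sketch in Python. Everything named here is an assumption for illustration: the helper functions, the `protected_words.txt` file name, and the exact-match lookup are not taken from the filter's source, and the fallback split only handles ASCII delimiters and letter-number transitions.

```python
import re
import tempfile
from pathlib import Path


def load_protected_words(path):
    """Read one protected token per line from a UTF-8 file, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return {line.strip() for line in text.splitlines() if line.strip()}


def split_unless_protected(token, protected):
    """Return the token unchanged if it is protected; otherwise split it."""
    if token in protected:                      # assumption: exact, case-sensitive match
        return [token]
    return re.findall(r"[A-Za-z]+|[0-9]+", token)


def demo():
    """Write a hypothetical protected words file and apply it to three tokens."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "protected_words.txt"
        path.write_text("wi-fi\nXL500\n", encoding="utf-8")
        protected = load_protected_words(path)
    return [split_unless_protected(t, protected) for t in ("wi-fi", "XL500", "super-duper")]
```

Calling `demo()` leaves the two protected tokens intact and splits only `super-duper`.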
|
||||||
|
|
||||||
|
`split_on_case_change`::
|
||||||
|
(Optional, boolean)
|
||||||
|
If `true`, the filter splits tokens at letter case transitions. For example:
|
||||||
|
`camelCase` -> [ `camel`, `Case`]. Defaults to `true`.
|
||||||
|
|
||||||
|
`split_on_numerics`::
|
||||||
|
(Optional, boolean)
|
||||||
|
If `true`, the filter splits tokens at letter-number transitions. For example:
|
||||||
|
`j2se` -> [ `j`, `2`, `se` ]. Defaults to `true`.
|
||||||
|
|
||||||
|
`stem_english_possessive`::
|
||||||
|
(Optional, boolean)
|
||||||
|
If `true`, the filter removes the English possessive (`'s`) from the end of each
|
||||||
|
token. For example: `O'Neil's` -> [ `O`, `Neil` ]. Defaults to `true`.
|
||||||
|
|
||||||
`type_table`::
|
`type_table`::
|
||||||
A custom type mapping table, for example (when configured
|
+
|
||||||
using `type_table_path`):
|
--
|
||||||
|
(Optional, array of strings)
|
||||||
|
Array of custom type mappings for characters. This allows you to map
|
||||||
|
non-alphanumeric characters as numeric or alphanumeric to avoid splitting on
|
||||||
|
those characters.
|
||||||
|
|
||||||
[source,type_table]
|
For example, the following array maps the plus (`+`) and hyphen (`-`) characters
|
||||||
--------------------------------------------------
|
as alphabetical, which means they won't be treated as delimiters:
|
||||||
|
|
||||||
|
`["+ => ALPHA", "- => ALPHA"]`
|
||||||
|
|
||||||
|
Supported types include:
|
||||||
|
|
||||||
|
* `ALPHA` (Alphabetical)
|
||||||
|
* `ALPHANUM` (Alphanumeric)
|
||||||
|
* `DIGIT` (Numeric)
|
||||||
|
* `LOWER` (Lowercase alphabetical)
|
||||||
|
* `SUBWORD_DELIM` (Non-alphanumeric delimiter)
|
||||||
|
* `UPPER` (Uppercase alphabetical)
|
||||||
|
--
|
||||||
|
|
||||||
|
`type_table_path`::
|
||||||
|
+
|
||||||
|
--
|
||||||
|
(Optional, string)
|
||||||
|
Path to a file that contains custom type mappings for characters. This allows
|
||||||
|
you to map non-alphanumeric characters as numeric or alphanumeric to avoid
|
||||||
|
splitting on those characters.
|
||||||
|
|
||||||
|
For example, this file may contain the following:
|
||||||
|
|
||||||
|
[source,txt]
|
||||||
|
----
|
||||||
# Map the $, %, '.', and ',' characters to DIGIT
|
# Map the $, %, '.', and ',' characters to DIGIT
|
||||||
# This might be useful for financial data.
|
# This might be useful for financial data.
|
||||||
$ => DIGIT
|
$ => DIGIT
|
||||||
|
@@ -100,9 +367,133 @@ Advance settings include:
|
||||||
# this also tests the case where we need a bigger byte[]
|
# this also tests the case where we need a bigger byte[]
|
||||||
# see http://en.wikipedia.org/wiki/Zero-width_joiner
|
# see http://en.wikipedia.org/wiki/Zero-width_joiner
|
||||||
\\u200D => ALPHANUM
|
\\u200D => ALPHANUM
|
||||||
--------------------------------------------------
|
----
|
||||||
|
|
||||||
NOTE: Using a tokenizer like the `standard` tokenizer may interfere with
|
Supported types include:
|
||||||
the `catenate_*` and `preserve_original` parameters, as the original
|
|
||||||
string may already have lost punctuation during tokenization. Instead,
|
* `ALPHA` (Alphabetical)
|
||||||
you may want to use the `whitespace` tokenizer.
|
* `ALPHANUM` (Alphanumeric)
|
||||||
|
* `DIGIT` (Numeric)
|
||||||
|
* `LOWER` (Lowercase alphabetical)
|
||||||
|
* `SUBWORD_DELIM` (Non-alphanumeric delimiter)
|
||||||
|
* `UPPER` (Uppercase alphabetical)
|
||||||
|
|
||||||
|
This file path must be absolute or relative to the `config` location, and the
|
||||||
|
file must be UTF-8 encoded. Each mapping in the file must be separated by a line
|
||||||
|
break.
|
||||||
|
--
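The `character => TYPE` mapping syntax used by `type_table` and `type_table_path` can be illustrated with a small Python parser. This is a sketch under assumptions rather than the filter's own parsing code: the function names are hypothetical, escape sequences such as `\\u200D` are not handled, and only delimiter splitting is modelled (case and letter-number transitions are ignored).

```python
TYPES = {"ALPHA", "ALPHANUM", "DIGIT", "LOWER", "SUBWORD_DELIM", "UPPER"}


def parse_type_table(lines):
    """Parse 'character => TYPE' mappings, skipping blank lines and # comments."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        char, _, type_name = (part.strip() for part in line.partition("=>"))
        if len(char) != 1 or type_name not in TYPES:
            raise ValueError(f"bad mapping: {line!r}")
        table[char] = type_name
    return table


def is_delimiter(ch, table):
    """A mapped character is a delimiter only if mapped to SUBWORD_DELIM."""
    if ch in table:
        return table[ch] == "SUBWORD_DELIM"
    return not ch.isalnum()                     # default: non-alphanumerics delimit


def split_on_delimiters(token, table):
    """Split a token at delimiter characters, dropping the delimiters."""
    parts, current = [], []
    for ch in token:
        if is_delimiter(ch, table):
            if current:
                parts.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        parts.append("".join(current))
    return parts
```

With the `+` and `-` mappings from the array example, `super-duper+xl/500` is split only at the unmapped `/` character in this model.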
|
||||||
|
|
||||||
|
[[analysis-word-delimiter-graph-tokenfilter-customize]]
|
||||||
|
==== Customize
|
||||||
|
|
||||||
|
To customize the `word_delimiter_graph` filter, duplicate it to create the basis
|
||||||
|
for a new custom token filter. You can modify the filter using its configurable
|
||||||
|
parameters.
|
||||||
|
|
||||||
|
For example, the following request creates a `word_delimiter_graph`
|
||||||
|
filter that uses the following rules:
|
||||||
|
|
||||||
|
* Split tokens at non-alphanumeric characters, _except_ the hyphen (`-`)
|
||||||
|
character.
|
||||||
|
* Remove leading or trailing delimiters from each token.
|
||||||
|
* Do _not_ split tokens at letter case transitions.
|
||||||
|
* Do _not_ split tokens at letter-number transitions.
|
||||||
|
* Remove the English possessive (`'s`) from the end of each token.
|
||||||
|
|
||||||
|
[source,console]
|
||||||
|
----
|
||||||
|
PUT /my_index
|
||||||
|
{
|
||||||
|
"settings": {
|
||||||
|
"analysis": {
|
||||||
|
"analyzer": {
|
||||||
|
"my_analyzer": {
|
||||||
|
"tokenizer": "whitespace",
|
||||||
|
"filter": [ "my_custom_word_delimiter_graph_filter" ]
|
||||||
|
}
|
||||||
|
},
|
||||||
|
"filter": {
|
||||||
|
"my_custom_word_delimiter_graph_filter": {
|
||||||
|
"type": "word_delimiter_graph",
|
||||||
|
"type_table": [ "- => ALPHA" ],
|
||||||
|
"split_on_case_change": false,
|
||||||
|
"split_on_numerics": false,
|
||||||
|
"stem_english_possessive": true
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
----
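As a rough mental model of the custom rules in this request, the Python sketch below keeps hyphens inside tokens, skips case and letter-number splitting, and still removes the English possessive. It is an approximation written for this example (the `custom_split` name and both patterns are assumptions, ASCII only), so check real output with the analyze API rather than relying on it.

```python
import re

# Assumption: with "-" mapped to ALPHA, a possessive must be followed by a
# remaining delimiter or the end of the token.
POSSESSIVE = re.compile(r"(?<=[A-Za-z])'[sS](?![A-Za-z0-9-])")

# Letters, digits, and hyphens all belong to a token; everything else delimits.
SUBWORD = re.compile(r"[A-Za-z0-9-]+")


def custom_split(text):
    """Approximate the custom filter on whitespace-separated tokens."""
    tokens = []
    for token in text.split():
        token = POSSESSIVE.sub("", token)
        tokens.extend(SUBWORD.findall(token))
    return tokens
```

For the input `Neil's Super-Duper-XL500--42+AutoCoder`, the sketch returns `Neil`, `Super-Duper-XL500--42`, and `AutoCoder`, because only the `+` character still acts as a delimiter.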
|
||||||
|
|
||||||
|
[[analysis-word-delimiter-graph-differences]]
|
||||||
|
==== Differences between `word_delimiter_graph` and `word_delimiter`
|
||||||
|
|
||||||
|
Both the `word_delimiter_graph` and
|
||||||
|
<<analysis-word-delimiter-tokenfilter,`word_delimiter`>> filters produce tokens
|
||||||
|
that span multiple positions when any of the following parameters are `true`:
|
||||||
|
|
||||||
|
* <<word-delimiter-graph-tokenfilter-catenate-all,`catenate_all`>>
|
||||||
|
* <<word-delimiter-graph-tokenfilter-catenate-numbers,`catenate_numbers`>>
|
||||||
|
* <<word-delimiter-graph-tokenfilter-catenate-words,`catenate_words`>>
|
||||||
|
* <<word-delimiter-graph-tokenfilter-preserve-original,`preserve_original`>>
|
||||||
|
|
||||||
|
However, only the `word_delimiter_graph` filter assigns multi-position tokens a
|
||||||
|
`positionLength` attribute, which indicates the number of positions a token
|
||||||
|
spans. This ensures the `word_delimiter_graph` filter always produces valid token
|
||||||
|
https://en.wikipedia.org/wiki/Directed_acyclic_graph[graphs].
|
||||||
|
|
||||||
|
The `word_delimiter` filter does not assign multi-position tokens a
|
||||||
|
`positionLength` attribute. This means it produces invalid graphs for streams
|
||||||
|
including these tokens.
|
||||||
|
|
||||||
|
While indexing does not support token graphs containing multi-position tokens,
|
||||||
|
queries, such as the <<query-dsl-match-query-phrase,`match_phrase`>> query, can
|
||||||
|
use these graphs to generate multiple sub-queries from a single query string.
|
||||||
|
|
||||||
|
To see how token graphs produced by the `word_delimiter` and
|
||||||
|
`word_delimiter_graph` filters differ, check out the following example.
|
||||||
|
|
||||||
|
.*Example*
|
||||||
|
[%collapsible]
|
||||||
|
====
|
||||||
|
|
||||||
|
[[analysis-word-delimiter-graph-basic-token-graph]]
|
||||||
|
*Basic token graph*
|
||||||
|
|
||||||
|
Both the `word_delimiter` and `word_delimiter_graph` filters produce the following token
|
||||||
|
graph for `PowerShot2000` when the following parameters are `false`:
|
||||||
|
|
||||||
|
* <<word-delimiter-graph-tokenfilter-catenate-all,`catenate_all`>>
|
||||||
|
* <<word-delimiter-graph-tokenfilter-catenate-numbers,`catenate_numbers`>>
|
||||||
|
* <<word-delimiter-graph-tokenfilter-catenate-words,`catenate_words`>>
|
||||||
|
* <<word-delimiter-graph-tokenfilter-preserve-original,`preserve_original`>>
|
||||||
|
|
||||||
|
This graph does not contain multi-position tokens. All tokens span only one
|
||||||
|
position.
|
||||||
|
|
||||||
|
image::images/analysis/token-graph-basic.svg[align="center"]
|
||||||
|
|
||||||
|
[[analysis-word-delimiter-graph-wdg-token-graph]]
|
||||||
|
*`word_delimiter_graph` graph with a multi-position token*
|
||||||
|
|
||||||
|
The `word_delimiter_graph` filter produces the following token graph for
|
||||||
|
`PowerShot2000` when `catenate_words` is `true`.
|
||||||
|
|
||||||
|
This graph correctly indicates the catenated `PowerShot` token spans two
|
||||||
|
positions.
|
||||||
|
|
||||||
|
image::images/analysis/token-graph-wdg.svg[align="center"]
|
||||||
|
|
||||||
|
[[analysis-word-delimiter-graph-wd-token-graph]]
|
||||||
|
*`word_delimiter` graph with a multi-position token*
|
||||||
|
|
||||||
|
When `catenate_words` is `true`, the `word_delimiter` filter produces
|
||||||
|
the following token graph for `PowerShot2000`.
|
||||||
|
|
||||||
|
Note that the catenated `PowerShot` token should span two positions but only
|
||||||
|
spans one in the token graph, making it invalid.
|
||||||
|
|
||||||
|
image::images/analysis/token-graph-wd.svg[align="center"]
|
||||||
|
|
||||||
|
====
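The effect of a missing `positionLength` can also be shown by enumerating the paths through each token graph. The Python sketch below is illustrative only: the `Token` class, the `paths` function, and the position values for `PowerShot2000` are assumptions chosen to match the description above, not output captured from either filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    term: str
    position: int
    position_length: int = 1    # number of positions the token spans


def paths(tokens):
    """List every path through the token graph from position 0 to the end."""
    end = max(t.position + t.position_length for t in tokens)

    def walk(pos):
        if pos == end:
            yield []
            return
        for t in tokens:
            if t.position == pos:
                for rest in walk(pos + t.position_length):
                    yield [t.term] + rest

    return [" ".join(p) for p in walk(0)]


# Modelled after word_delimiter_graph: the catenated token spans two positions.
valid = [Token("PowerShot", 0, 2), Token("Power", 0), Token("Shot", 1), Token("2000", 2)]

# Modelled after word_delimiter: the catenated token spans only one position.
invalid = [Token("PowerShot", 0), Token("Power", 0), Token("Shot", 1), Token("2000", 2)]
```

Here `paths(valid)` yields `PowerShot 2000` and `Power Shot 2000`, while `paths(invalid)` yields `PowerShot Shot 2000` in place of the first path. That spurious sequence is the kind of problem position-sensitive queries run into with an invalid graph.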
|
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 26 KiB |
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 37 KiB |
File diff suppressed because one or more lines are too long
After Width: | Height: | Size: 37 KiB |