107 lines
2.2 KiB
Plaintext
107 lines
2.2 KiB
Plaintext
[[analysis-keyword-tokenizer]]
|
|
=== Keyword Tokenizer
|
|
|
|
The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
|
|
is given and outputs the exact same text as a single term. It can be combined
|
|
with token filters to normalise output, e.g. lower-casing email addresses.
|
|
|
|
[float]
|
|
=== Example output
|
|
|
|
[source,console]
|
|
---------------------------
|
|
POST _analyze
|
|
{
|
|
"tokenizer": "keyword",
|
|
"text": "New York"
|
|
}
|
|
---------------------------
|
|
|
|
/////////////////////
|
|
|
|
[source,console-result]
|
|
----------------------------
|
|
{
|
|
"tokens": [
|
|
{
|
|
"token": "New York",
|
|
"start_offset": 0,
|
|
"end_offset": 8,
|
|
"type": "word",
|
|
"position": 0
|
|
}
|
|
]
|
|
}
|
|
----------------------------
|
|
|
|
/////////////////////
|
|
|
|
|
|
The above sentence would produce the following term:
|
|
|
|
[source,text]
|
|
---------------------------
|
|
[ New York ]
|
|
---------------------------
|
|
|
|
[discrete]
|
|
[[analysis-keyword-tokenizer-token-filters]]
|
|
=== Combine with token filters
|
|
You can combine the `keyword` tokenizer with token filters to normalise
|
|
structured data, such as product IDs or email addresses.
|
|
|
|
For example, the following <<indices-analyze,analyze API>> request uses the
|
|
`keyword` tokenizer and <<analysis-lowercase-tokenfilter,`lowercase`>> filter to
|
|
convert an email address to lowercase.
|
|
|
|
[source,console]
|
|
---------------------------
|
|
POST _analyze
|
|
{
|
|
"tokenizer": "keyword",
|
|
"filter": [ "lowercase" ],
|
|
"text": "john.SMITH@example.COM"
|
|
}
|
|
---------------------------
|
|
|
|
/////////////////////
|
|
|
|
[source,console-result]
|
|
----------------------------
|
|
{
|
|
"tokens": [
|
|
{
|
|
"token": "john.smith@example.com",
|
|
"start_offset": 0,
|
|
"end_offset": 22,
|
|
"type": "word",
|
|
"position": 0
|
|
}
|
|
]
|
|
}
|
|
----------------------------
|
|
|
|
/////////////////////
|
|
|
|
|
|
The request produces the following token:
|
|
|
|
[source,text]
|
|
---------------------------
|
|
[ john.smith@example.com ]
|
|
---------------------------
|
|
|
|
|
|
[float]
|
|
=== Configuration
|
|
|
|
The `keyword` tokenizer accepts the following parameters:
|
|
|
|
[horizontal]
|
|
`buffer_size`::
|
|
|
|
The number of characters read into the term buffer in a single pass.
|
|
Defaults to `256`. The term buffer will grow by this size until all the
|
|
text has been consumed. It is advisable not to change this setting.
|
|
|