[[analysis-keyword-tokenizer]]
=== Keyword tokenizer
++++
<titleabbrev>Keyword</titleabbrev>
++++

The `keyword` tokenizer is a ``noop'' tokenizer that accepts whatever text it
is given and outputs the exact same text as a single term. It can be combined
with token filters to normalise output, e.g. lower-casing email addresses.

[discrete]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "keyword",
  "text": "New York"
}
---------------------------

/////////////////////
[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "New York",
      "start_offset": 0,
      "end_offset": 8,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------
/////////////////////

The request produces the following term:

[source,text]
---------------------------
[ New York ]
---------------------------

[discrete]
[[analysis-keyword-tokenizer-token-filters]]
=== Combine with token filters

You can combine the `keyword` tokenizer with token filters to normalise
structured data, such as product IDs or email addresses.

For example, the following <<indices-analyze,analyze API>> request uses the
`keyword` tokenizer and <<analysis-lowercase-tokenfilter,`lowercase`>> filter to
convert an email address to lowercase.

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "keyword",
  "filter": [ "lowercase" ],
  "text": "john.SMITH@example.COM"
}
---------------------------

/////////////////////
[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "john.smith@example.com",
      "start_offset": 0,
      "end_offset": 22,
      "type": "word",
      "position": 0
    }
  ]
}
----------------------------
/////////////////////

The request produces the following token:

[source,text]
---------------------------
[ john.smith@example.com ]
---------------------------

[discrete]
=== Configuration

The `keyword` tokenizer accepts the following parameters:

[horizontal]
`buffer_size`::

    The number of characters read into the term buffer in a single pass.
    Defaults to `256`. The term buffer will grow by this size until all the
    text has been consumed. It is advisable not to change this setting.
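
For illustration, here is a minimal sketch of configuring the tokenizer in a
custom analyzer. The index name `my-index-000001`, the analyzer name
`my_analyzer`, and the tokenizer name `my_tokenizer` are placeholders, and the
`buffer_size` shown simply restates the default.

[source,console]
---------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer",
          "filter": [ "lowercase" ]
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "keyword",
          "buffer_size": 256
        }
      }
    }
  }
}

POST my-index-000001/_analyze
{
  "analyzer": "my_analyzer",
  "text": "New York"
}
---------------------------

The second request runs the analyze API with the custom analyzer to verify
that the input is emitted as a single lower-cased term.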