[[analysis-whitespace-tokenizer]]
=== Whitespace tokenizer
++++
<titleabbrev>Whitespace</titleabbrev>
++++

The `whitespace` tokenizer breaks text into terms whenever it encounters a
whitespace character.

[discrete]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------

/////////////////////
[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 1
    },
    {
      "token": "QUICK",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 2
    },
    {
      "token": "Brown-Foxes",
      "start_offset": 12,
      "end_offset": 23,
      "type": "word",
      "position": 3
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "word",
      "position": 4
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "word",
      "position": 5
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "word",
      "position": 6
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "word",
      "position": 7
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "word",
      "position": 8
    },
    {
      "token": "bone.",
      "start_offset": 51,
      "end_offset": 56,
      "type": "word",
      "position": 9
    }
  ]
}
----------------------------

/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]
---------------------------

[discrete]
=== Configuration

The `whitespace` tokenizer accepts the following parameters:

[horizontal]
`max_token_length`::
The maximum token length. If a token exceeds this length, it is split at
`max_token_length` intervals. Defaults to `255`.
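
[discrete]
=== Example configuration

To change `max_token_length`, define a custom tokenizer of type `whitespace`
in the index settings. The request below is a minimal sketch of such a
configuration; the index name `my-index-000001` and the `my_analyzer` and
`my_tokenizer` names are illustrative placeholders, not required values:

[source,console]
---------------------------
PUT my-index-000001
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "whitespace",
          "max_token_length": 10
        }
      }
    }
  }
}
---------------------------

With `max_token_length` set to `10`, a longer token such as `Brown-Foxes`
(11 characters) would be split at 10-character intervals, yielding
`Brown-Foxe` and `s`.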