OpenSearch/docs/reference/analysis/tokenizers/letter-tokenizer.asciidoc

124 lines
2.3 KiB
Plaintext

[[analysis-letter-tokenizer]]
=== Letter Tokenizer
The `letter` tokenizer breaks text into terms whenever it encounters a
character which is not a letter. It does a reasonable job for most European
languages, but does a terrible job for some Asian languages, where words are
not separated by spaces.
[float]
=== Example output
[source,js]
---------------------------
POST _analyze
{
"tokenizer": "letter",
"text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE
/////////////////////
[source,js]
----------------------------
{
"tokens": [
{
"token": "The",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "QUICK",
"start_offset": 6,
"end_offset": 11,
"type": "word",
"position": 1
},
{
"token": "Brown",
"start_offset": 12,
"end_offset": 17,
"type": "word",
"position": 2
},
{
"token": "Foxes",
"start_offset": 18,
"end_offset": 23,
"type": "word",
"position": 3
},
{
"token": "jumped",
"start_offset": 24,
"end_offset": 30,
"type": "word",
"position": 4
},
{
"token": "over",
"start_offset": 31,
"end_offset": 35,
"type": "word",
"position": 5
},
{
"token": "the",
"start_offset": 36,
"end_offset": 39,
"type": "word",
"position": 6
},
{
"token": "lazy",
"start_offset": 40,
"end_offset": 44,
"type": "word",
"position": 7
},
{
"token": "dog",
"start_offset": 45,
"end_offset": 48,
"type": "word",
"position": 8
},
{
"token": "s",
"start_offset": 49,
"end_offset": 50,
"type": "word",
"position": 9
},
{
"token": "bone",
"start_offset": 51,
"end_offset": 55,
"type": "word",
"position": 10
}
]
}
----------------------------
// TESTRESPONSE
/////////////////////
The above sentence would produce the following terms:
[source,text]
---------------------------
[ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]
---------------------------
[float]
=== Configuration
The `letter` tokenizer is not configurable.