mirror of
https://github.com/honeymoose/OpenSearch.git
synced 2025-02-05 20:48:22 +00:00
5da9e5dcbc
* Docs: Improved tokenizer docs. Added descriptions and runnable examples
* Addressed Nik's comments
* Added TESTRESPONSEs for all tokenizer examples
* Added TESTRESPONSEs for all analyzer examples too
* Added docs, examples, and TESTRESPONSES for character filters
* Skipping two tests: one interprets "$1" as a stack variable (same problem exists with the REST tests), the other because the "took" value is always different
* Fixed tests with "took"
* Fixed failing tests and removed preserve_original from fingerprint analyzer
124 lines
2.3 KiB
Plaintext
[[analysis-letter-tokenizer]]
=== Letter Tokenizer

The `letter` tokenizer breaks text into terms whenever it encounters a
character which is not a letter. It does a reasonable job for most European
languages, but does a terrible job for some Asian languages, where words are
not separated by spaces.

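As a rough illustration of the splitting rule described above, the following
Python sketch approximates the `letter` tokenizer by grouping consecutive
letters. It is not the tokenizer's actual implementation: the `letter_tokenize`
helper is hypothetical, and it assumes that Python's `str.isalpha()` is a close
enough stand-in for the tokenizer's notion of a letter. The second call shows
why text written without spaces between words comes back as a single term.

```python
from itertools import groupby


def letter_tokenize(text):
    """Approximate the `letter` tokenizer: keep each maximal run of letters.

    Hypothetical helper for illustration only; it assumes str.isalpha()
    matches the tokenizer's definition of a letter.
    """
    return [
        "".join(run)
        for is_letter, run in groupby(text, key=str.isalpha)
        if is_letter  # drop runs of digits, spaces, and punctuation
    ]


english = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
print(letter_tokenize(english))

# Every character here counts as a letter and nothing separates the words,
# so the whole phrase is returned as one term.
chinese = "我爱北京天安门"
print(letter_tokenize(chinese))
```
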
[float]
=== Example output

[source,js]
---------------------------
POST _analyze
{
  "tokenizer": "letter",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
---------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "QUICK",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 1
    },
    {
      "token": "Brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "word",
      "position": 2
    },
    {
      "token": "Foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "word",
      "position": 3
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "word",
      "position": 4
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "word",
      "position": 5
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "word",
      "position": 6
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "word",
      "position": 7
    },
    {
      "token": "dog",
      "start_offset": 45,
      "end_offset": 48,
      "type": "word",
      "position": 8
    },
    {
      "token": "s",
      "start_offset": 49,
      "end_offset": 50,
      "type": "word",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "word",
      "position": 10
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////


The above sentence would produce the following terms:

[source,text]
---------------------------
[ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]
---------------------------

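Each term in an `_analyze` response also carries character offsets and a
position. The Python sketch below shows how those fields relate to the input
text by rebuilding them for the example sentence. The `letter_tokens` helper is
hypothetical and only approximates the tokenizer: it assumes `str.isalpha()`
identifies letters, and it counts offsets in Python string indices, which
agree with the reported offsets for plain ASCII text like this sentence.

```python
def letter_tokens(text):
    """Return _analyze-style token dicts for each run of letters in text.

    Hypothetical helper for illustration; not the tokenizer's own code.
    """
    tokens = []
    start = None  # index where the current run of letters began
    # A trailing space acts as a sentinel so the final run is flushed.
    for i, ch in enumerate(text + " "):
        if ch.isalpha():
            if start is None:
                start = i
        elif start is not None:
            tokens.append({
                "token": text[start:i],
                "start_offset": start,
                "end_offset": i,  # exclusive, like the response above
                "type": "word",
                "position": len(tokens),
            })
            start = None
    return tokens


sentence = "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
tokens = letter_tokens(sentence)
for token in tokens:
    print(token)
```
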
[float]
=== Configuration

The `letter` tokenizer is not configurable.