OpenSearch/docs/reference/analysis/tokenizers/thai-tokenizer.asciidoc
Clinton Gormley 5da9e5dcbc Docs: Improved tokenizer docs (#18356)
* Docs: Improved tokenizer docs

Added descriptions and runnable examples

* Addressed Nik's comments

* Added TESTRESPONSEs for all tokenizer examples

* Added TESTRESPONSEs for all analyzer examples too

* Added docs, examples, and TESTRESPONSES for character filters

* Skipping two tests:

One interprets "$1" as a stack variable - same problem exists with the REST tests

The other because the "took" value is always different

* Fixed tests with "took"

* Fixed failing tests and removed preserve_original from fingerprint analyzer
2016-05-19 19:42:23 +02:00

107 lines
2.2 KiB
Plaintext

[[analysis-thai-tokenizer]]
=== Thai Tokenizer
The `thai` tokenizer segments Thai text into words, using the Thai
segmentation algorithm included with Java. Text in other languages in general
will be treated the same as the
<<analysis-standard-tokenizer,`standard` tokenizer>>.
WARNING: This tokenizer may not be supported by all JREs. It is known to work
with Sun/Oracle and OpenJDK. If your application needs to be fully portable,
consider using the {plugins}/analysis-icu-tokenizer.html[ICU Tokenizer] instead.
[float]
=== Example output
[source,js]
---------------------------
POST _analyze
{
"tokenizer": "thai",
"text": "การที่ได้ต้องแสดงว่างานดี"
}
---------------------------
// CONSOLE
/////////////////////
[source,js]
----------------------------
{
"tokens": [
{
"token": "การ",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "ที่",
"start_offset": 3,
"end_offset": 6,
"type": "word",
"position": 1
},
{
"token": "ได้",
"start_offset": 6,
"end_offset": 9,
"type": "word",
"position": 2
},
{
"token": "ต้อง",
"start_offset": 9,
"end_offset": 13,
"type": "word",
"position": 3
},
{
"token": "แสดง",
"start_offset": 13,
"end_offset": 17,
"type": "word",
"position": 4
},
{
"token": "ว่า",
"start_offset": 17,
"end_offset": 20,
"type": "word",
"position": 5
},
{
"token": "งาน",
"start_offset": 20,
"end_offset": 23,
"type": "word",
"position": 6
},
{
"token": "ดี",
"start_offset": 23,
"end_offset": 25,
"type": "word",
"position": 7
}
]
}
----------------------------
// TESTRESPONSE
/////////////////////
The above sentence would produce the following terms:
[source,text]
---------------------------
[ การ, ที่, ได้, ต้อง, แสดง, ว่า, งาน, ดี ]
---------------------------
[float]
=== Configuration
The `thai` tokenizer is not configurable.