OpenSearch/docs/reference/analysis/tokenizers/thai-tokenizer.asciidoc

105 lines
2.2 KiB
Plaintext

[[analysis-thai-tokenizer]]
=== Thai Tokenizer
The `thai` tokenizer segments Thai text into words, using the Thai
segmentation algorithm included with Java. Text in other languages in general
will be treated the same as the
<<analysis-standard-tokenizer,`standard` tokenizer>>.
WARNING: This tokenizer may not be supported by all JREs. It is known to work
with Sun/Oracle and OpenJDK. If your application needs to be fully portable,
consider using the {plugins}/analysis-icu-tokenizer.html[ICU Tokenizer] instead.
[float]
=== Example output
[source,console]
---------------------------
POST _analyze
{
"tokenizer": "thai",
"text": "การที่ได้ต้องแสดงว่างานดี"
}
---------------------------
/////////////////////
[source,console-result]
----------------------------
{
"tokens": [
{
"token": "การ",
"start_offset": 0,
"end_offset": 3,
"type": "word",
"position": 0
},
{
"token": "ที่",
"start_offset": 3,
"end_offset": 6,
"type": "word",
"position": 1
},
{
"token": "ได้",
"start_offset": 6,
"end_offset": 9,
"type": "word",
"position": 2
},
{
"token": "ต้อง",
"start_offset": 9,
"end_offset": 13,
"type": "word",
"position": 3
},
{
"token": "แสดง",
"start_offset": 13,
"end_offset": 17,
"type": "word",
"position": 4
},
{
"token": "ว่า",
"start_offset": 17,
"end_offset": 20,
"type": "word",
"position": 5
},
{
"token": "งาน",
"start_offset": 20,
"end_offset": 23,
"type": "word",
"position": 6
},
{
"token": "ดี",
"start_offset": 23,
"end_offset": 25,
"type": "word",
"position": 7
}
]
}
----------------------------
/////////////////////
The above sentence would produce the following terms:
[source,text]
---------------------------
[ การ, ที่, ได้, ต้อง, แสดง, ว่า, งาน, ดี ]
---------------------------
[float]
=== Configuration
The `thai` tokenizer is not configurable.