Analysis: Add additional Analyzers, Tokenizers, and TokenFilters from Lucene
Add `irish` analyzer
Add `sorani` analyzer (Kurdish)
Add `classic` tokenizer: specific to english text and tries to recognize hostnames, companies, acronyms, etc.
Add `thai` tokenizer: segments thai text into words.
Add `classic` tokenfilter: cleans up acronyms and possessives from classic tokenizer
Add `apostrophe` tokenfilter: removes text after apostrophe and the apostrophe itself
Add `german_normalization` tokenfilter: umlaut/sharp S normalization
Add `hindi_normalization` tokenfilter: accounts for hindi spelling differences
Add `indic_normalization` tokenfilter: accounts for different unicode representations in Indian languages
Add `sorani_normalization` tokenfilter: normalizes kurdish text
Add `scandinavian_normalization` tokenfilter: normalizes Norwegian, Danish, Swedish text
Add `scandinavian_folding` tokenfilter: much more aggressive form of `scandinavian_normalization`
Add additional languages to stemmer tokenfilter: `galician`, `minimal_galician`, `irish`, `sorani`, `light_nynorsk`, `minimal_nynorsk`
Add support access to default Thai stopword set "_thai_"
Fix some bugs and broken links in documentation.
Closes #5935
2014-07-02 14:59:18 -04:00

[[analysis-thai-tokenizer]]
=== Thai tokenizer
++++
<titleabbrev>Thai</titleabbrev>
++++

The `thai` tokenizer segments Thai text into words, using the Thai
segmentation algorithm included with Java. In general, text in other languages
is treated the same way the
<<analysis-standard-tokenizer,`standard` tokenizer>> would treat it.

WARNING: This tokenizer may not be supported by all JREs. It is known to work
with Sun/Oracle and OpenJDK. If your application needs to be fully portable,
consider using the {plugins}/analysis-icu-tokenizer.html[ICU Tokenizer] instead.

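As a rough illustration of what "the Thai segmentation algorithm included with
Java" can look like in practice, the sketch below calls the JDK's
`java.text.BreakIterator` word instance with a Thai locale. It assumes that
this is the facility being referred to. The class name
`ThaiSegmentationSketch`, the `Token` holder, and the rule for skipping
segments without letters or digits are hypothetical choices made for the
example, not the tokenizer's actual implementation.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: word segmentation with the JDK's BreakIterator.
class ThaiSegmentationSketch {

    // Hypothetical holder that mirrors the fields of an _analyze token.
    static final class Token {
        final String text;
        final int startOffset;
        final int endOffset;
        final int position;

        Token(String text, int startOffset, int endOffset, int position) {
            this.text = text;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.position = position;
        }

        @Override
        public String toString() {
            return text + " [" + startOffset + "," + endOffset + ") pos=" + position;
        }
    }

    static List<Token> segment(String text) {
        // Word-boundary iterator for the Thai locale, provided by the JDK.
        BreakIterator words = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        words.setText(text);

        List<Token> tokens = new ArrayList<>();
        int position = 0;
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            // Assumption for this sketch: keep only segments that contain
            // at least one letter or digit, which drops spaces and punctuation.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(new Token(segment, start, end, position++));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token token : segment("การที่ได้ต้องแสดงว่างานดี")) {
            System.out.println(token);
        }
    }
}
```

The boundaries printed for Thai input depend on the Thai support shipped with
the JRE in use, which is the portability concern raised in the warning above.
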
[float]
=== Example output

[source,console]
---------------------------
POST _analyze
{
  "tokenizer": "thai",
  "text": "การที่ได้ต้องแสดงว่างานดี"
}
---------------------------

/////////////////////

[source,console-result]
----------------------------
{
  "tokens": [
    {
      "token": "การ",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "ที่",
      "start_offset": 3,
      "end_offset": 6,
      "type": "word",
      "position": 1
    },
    {
      "token": "ได้",
      "start_offset": 6,
      "end_offset": 9,
      "type": "word",
      "position": 2
    },
    {
      "token": "ต้อง",
      "start_offset": 9,
      "end_offset": 13,
      "type": "word",
      "position": 3
    },
    {
      "token": "แสดง",
      "start_offset": 13,
      "end_offset": 17,
      "type": "word",
      "position": 4
    },
    {
      "token": "ว่า",
      "start_offset": 17,
      "end_offset": 20,
      "type": "word",
      "position": 5
    },
    {
      "token": "งาน",
      "start_offset": 20,
      "end_offset": 23,
      "type": "word",
      "position": 6
    },
    {
      "token": "ดี",
      "start_offset": 23,
      "end_offset": 25,
      "type": "word",
      "position": 7
    }
  ]
}
----------------------------

/////////////////////

The above sentence would produce the following terms:

[source,text]
---------------------------
[ การ, ที่, ได้, ต้อง, แสดง, ว่า, งาน, ดี ]
---------------------------

[float]
=== Configuration

The `thai` tokenizer is not configurable.