OpenSearch/docs/reference/analysis/tokenizers/simplepatternsplit-tokenizer.asciidoc
Clinton Gormley ff4a2519f2 Update experimental labels in the docs (#25727)
Relates https://github.com/elastic/elasticsearch/issues/19798

Removed experimental label from:
* Painless
* Diversified Sampler Agg
* Sampler Agg
* Significant Terms Agg
* Terms Agg document count error and execution_hint
* Cardinality Agg precision_threshold
* Pipeline Aggregations
* index.shard.check_on_startup
* index.store.type (added warning)
* Preloading data into the file system cache
* foreach ingest processor
* Field caps API
* Profile API

Added experimental label to:
* Moving Average Agg Prediction


Changed experimental to beta for:
* Adjacency matrix agg
* Normalizers
* Tasks API
* Index sorting

Labelled experimental in Lucene:
* ICU plugin custom rules file
* Flatten graph token filter
* Synonym graph token filter
* Word delimiter graph token filter
* Simple pattern tokenizer
* Simple pattern split tokenizer

Replaced experimental label with warning that details may change in the future:
* Analysis explain output format
* Segments verbose output format
* Percentile Agg compression and HDR Histogram
* Percentile Rank Agg HDR Histogram
2017-07-18 14:06:22 +02:00


[[analysis-simplepatternsplit-tokenizer]]
=== Simple Pattern Split Tokenizer

experimental[This functionality is marked as experimental in Lucene]

The `simple_pattern_split` tokenizer uses a regular expression to split the
input into terms at pattern matches. It supports a more limited set of regular
expression features than the <<analysis-pattern-tokenizer,`pattern`>>
tokenizer, but the tokenization is generally faster.

This tokenizer does not produce terms from the matches themselves. To produce
terms from matches using patterns in the same restricted regular expression
subset, see the <<analysis-simplepattern-tokenizer,`simple_pattern`>>
tokenizer.

This tokenizer uses {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
For an explanation of the supported features and syntax, see <<regexp-syntax,Regular Expression Syntax>>.

The default pattern is the empty string, which never splits the input and so
produces one term containing the full input. This tokenizer should always be
configured with a non-default pattern.
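
To see the default behavior, the following request is a minimal sketch that
assumes the tokenizer can be referenced by its built-in name in the `_analyze`
API, with no `pattern` configured. Given the default described above, it should
return the whole input as a single term, `an_underscored_phrase`.

[source,js]
----------------------------
POST _analyze
{
  "tokenizer": "simple_pattern_split",
  "text": "an_underscored_phrase"
}
----------------------------
// CONSOLE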

[float]
=== Configuration

The `simple_pattern_split` tokenizer accepts the following parameters:

[horizontal]
`pattern`::
    A {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.
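
The `pattern` parameter can also be tried out without creating an index. The
following request is an illustrative sketch: it assumes an inline tokenizer
definition in the `_analyze` API, and both the pattern `_+` and the sample text
are chosen here only for illustration. The pattern is meant to treat a run of
one or more underscores as a single separator.

[source,js]
----------------------------
POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern_split",
    "pattern": "_+"
  },
  "text": "an__underscored___phrase"
}
----------------------------
// CONSOLE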

[float]
=== Example configuration

This example configures the `simple_pattern_split` tokenizer to split the input
text on underscores.

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "an_underscored_phrase"
}
----------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens" : [
    {
      "token" : "an",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "underscored",
      "start_offset" : 3,
      "end_offset" : 14,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "phrase",
      "start_offset" : 15,
      "end_offset" : 21,
      "type" : "word",
      "position" : 2
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces these terms:

[source,text]
---------------------------
[ an, underscored, phrase ]
---------------------------
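
For comparison, the `simple_pattern` tokenizer works in the other direction: it
emits the matches rather than the text between them. The following request is
a sketch, not part of the original example. It assumes an inline tokenizer
definition in the `_analyze` API and uses the pattern `[^_]+` (a run of
characters other than an underscore), which is intended to yield the same three
terms from the same input.

[source,js]
----------------------------
POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern",
    "pattern": "[^_]+"
  },
  "text": "an_underscored_phrase"
}
----------------------------
// CONSOLE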