[[analysis-simplepattern-tokenizer]]
=== Simple Pattern Tokenizer

experimental[This functionality is marked as experimental in Lucene]

The `simple_pattern` tokenizer uses a regular expression to capture matching
text as terms. The set of regular expression features it supports is more
limited than the <<analysis-pattern-tokenizer,`pattern`>> tokenizer, but the
tokenization is generally faster.

This tokenizer does not support splitting the input on a pattern match, unlike
the <<analysis-pattern-tokenizer,`pattern`>> tokenizer. To split on pattern
matches using the same restricted regular expression subset, see the
<<analysis-simplepatternsplit-tokenizer,`simple_pattern_split`>> tokenizer.
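
For comparison, the following is a minimal sketch of the split-on-match
behaviour (it assumes the inline tokenizer definitions accepted by the
`_analyze` API; the `-` pattern is purely illustrative):

[source,js]
----------------------------
POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern_split",
    "pattern": "-"
  },
  "text": "fd-786-335-514-x"
}
----------------------------
// CONSOLE

Because the hyphens are consumed as separators, this would produce the text
between matches (`fd`, `786`, `335`, `514`, `x`) rather than the matches
themselves.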

This tokenizer uses {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
For an explanation of the supported features and syntax, see <<regexp-syntax,Regular Expression Syntax>>.

The default pattern is the empty string, which produces no terms. This
tokenizer should always be configured with a non-default pattern.
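
To illustrate, a request like the following, which leaves `pattern` unset (a
minimal sketch, again assuming inline tokenizer definitions in `_analyze`),
would return an empty token list:

[source,js]
----------------------------
POST _analyze
{
  "tokenizer": {
    "type": "simple_pattern"
  },
  "text": "fd-786-335-514-x"
}
----------------------------
// CONSOLE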

[float]
=== Configuration

The `simple_pattern` tokenizer accepts the following parameters:

[horizontal]
`pattern`::

    {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.

[float]
=== Example configuration

This example configures the `simple_pattern` tokenizer to produce terms that
are three-digit numbers:
[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern",
          "pattern": "[0123456789]{3}"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "fd-786-335-514-x"
}
----------------------------
// CONSOLE

/////////////////////
[source,js]
----------------------------
{
  "tokens" : [
    {
      "token" : "786",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "335",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "514",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "word",
      "position" : 2
    }
  ]
}
----------------------------
// TESTRESPONSE
/////////////////////

The above example produces these terms:

[source,text]
---------------------------
[ 786, 335, 514 ]
---------------------------