
[[analysis-simplepattern-tokenizer]]
=== Simple Pattern Tokenizer

experimental[]

The `simplepattern` tokenizer uses a regular expression to capture matching
text as terms. The set of regular expression features it supports is more
limited than that of the <<analysis-pattern-tokenizer,`pattern`>> tokenizer, but
tokenization is generally faster.

This tokenizer does not support splitting the input on a pattern match, unlike
the <<analysis-pattern-tokenizer,`pattern`>> tokenizer. To split on pattern
matches using the same restricted regular expression subset, see the
<<analysis-simplepatternsplit-tokenizer,`simplepatternsplit`>> tokenizer.
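
The difference between capturing matches as terms and splitting on matches can
be sketched outside the cluster. The snippet below is only an illustration: it
uses Python's `re` module, which is not the Lucene regular expression engine,
and the variable names are hypothetical. Dropping empty pieces after the split
is an assumption of this sketch.

```python
import re

text = "fd-786-335-514-x"
pattern = r"[0-9]{3}"  # three consecutive digits

# Match-as-terms: every match of the pattern becomes a term.
matched_terms = re.findall(pattern, text)

# Split-on-match: the text between matches becomes the terms.
# Empty pieces are dropped here (an assumption made for this sketch).
split_terms = [piece for piece in re.split(pattern, text) if piece]

print(matched_terms)
print(split_terms)
```

The first list corresponds to what a match-based tokenizer keeps, the second to
what a split-based tokenizer keeps, for the same pattern and input.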

This tokenizer uses {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
For an explanation of the supported features and syntax, see <<regexp-syntax,Regular Expression Syntax>>.

The default pattern is the empty string, which produces no terms. This
tokenizer should always be configured with a non-default pattern.

[float]
=== Configuration

The `simplepattern` tokenizer accepts the following parameters:

[horizontal]
`pattern`::
    A {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression]. Defaults to the empty string.

[float]
=== Example configuration

This example configures the `simplepattern` tokenizer to produce terms that are
three-digit numbers:

[source,js]
----------------------------
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simplepattern",
          "pattern": "[0123456789]{3}"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "fd-786-335-514-x"
}
----------------------------
// CONSOLE

/////////////////////

[source,js]
----------------------------
{
  "tokens" : [
    {
      "token" : "786",
      "start_offset" : 3,
      "end_offset" : 6,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "335",
      "start_offset" : 7,
      "end_offset" : 10,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "514",
      "start_offset" : 11,
      "end_offset" : 14,
      "type" : "word",
      "position" : 2
    }
  ]
}
----------------------------
// TESTRESPONSE

/////////////////////

The above example produces these terms:

[source,text]
---------------------------
[ 786, 335, 514 ]
---------------------------
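
To see where the terms, offsets, and positions come from, the matching can be
emulated in a few lines. This is a minimal sketch under stated assumptions, not
the tokenizer's implementation: it relies on Python's `re` module instead of
Lucene regular expressions (the two agree for this simple pattern but differ in
general), and `simple_pattern_tokens` is a hypothetical helper name.

```python
import re


def simple_pattern_tokens(pattern, text):
    """Emulate match-as-terms tokenization (illustrative only)."""
    if not pattern:
        # The default (empty) pattern produces no terms.
        return []
    tokens = []
    for match in re.finditer(pattern, text):
        if not match.group(0):
            # Skip empty matches (an assumption made for this sketch).
            continue
        tokens.append({
            "token": match.group(0),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "type": "word",
            "position": len(tokens),
        })
    return tokens


tokens = simple_pattern_tokens(r"[0123456789]{3}", "fd-786-335-514-x")
for token in tokens:
    print(token["token"], token["start_offset"], token["end_offset"], token["position"])
```

Each term's `start_offset` and `end_offset` are the character offsets of the
match in the input text, and `position` counts the terms in the order they are
emitted.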