104 lines
2.5 KiB
Plaintext
104 lines
2.5 KiB
Plaintext
[[analysis-simplepattern-tokenizer]]
|
|
=== Simple Pattern Tokenizer
|
|
|
|
experimental[This functionality is marked as experimental in Lucene]
|
|
|
|
The `simple_pattern` tokenizer uses a regular expression to capture matching
|
|
text as terms. The set of regular expression features it supports is more
|
|
limited than the <<analysis-pattern-tokenizer,`pattern`>> tokenizer, but the
|
|
tokenization is generally faster.
|
|
|
|
This tokenizer does not support splitting the input on a pattern match, unlike
|
|
the <<analysis-pattern-tokenizer,`pattern`>> tokenizer. To split on pattern
|
|
matches using the same restricted regular expression subset, see the
|
|
<<analysis-simplepatternsplit-tokenizer,`simple_pattern_split`>> tokenizer.
|
|
|
|
This tokenizer uses {lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expressions].
|
|
For an explanation of the supported features and syntax, see <<regexp-syntax,Regular Expression Syntax>>.
|
|
|
|
The default pattern is the empty string, which produces no terms. This
|
|
tokenizer should always be configured with a non-default pattern.
|
|
|
|
[float]
|
|
=== Configuration
|
|
|
|
The `simple_pattern` tokenizer accepts the following parameters:
|
|
|
|
[horizontal]
|
|
`pattern`::
|
|
{lucene-core-javadoc}/org/apache/lucene/util/automaton/RegExp.html[Lucene regular expression], defaults to the empty string.
|
|
|
|
[float]
|
|
=== Example configuration
|
|
|
|
This example configures the `simple_pattern` tokenizer to produce terms that are
|
|
three-digit numbers
|
|
|
|
[source,console]
|
|
----------------------------
|
|
PUT my_index
|
|
{
|
|
"settings": {
|
|
"analysis": {
|
|
"analyzer": {
|
|
"my_analyzer": {
|
|
"tokenizer": "my_tokenizer"
|
|
}
|
|
},
|
|
"tokenizer": {
|
|
"my_tokenizer": {
|
|
"type": "simple_pattern",
|
|
"pattern": "[0123456789]{3}"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
}
|
|
|
|
POST my_index/_analyze
|
|
{
|
|
"analyzer": "my_analyzer",
|
|
"text": "fd-786-335-514-x"
|
|
}
|
|
----------------------------
|
|
|
|
/////////////////////
|
|
|
|
[source,console-result]
|
|
----------------------------
|
|
{
|
|
"tokens" : [
|
|
{
|
|
"token" : "786",
|
|
"start_offset" : 3,
|
|
"end_offset" : 6,
|
|
"type" : "word",
|
|
"position" : 0
|
|
},
|
|
{
|
|
"token" : "335",
|
|
"start_offset" : 7,
|
|
"end_offset" : 10,
|
|
"type" : "word",
|
|
"position" : 1
|
|
},
|
|
{
|
|
"token" : "514",
|
|
"start_offset" : 11,
|
|
"end_offset" : 14,
|
|
"type" : "word",
|
|
"position" : 2
|
|
}
|
|
]
|
|
}
|
|
----------------------------
|
|
|
|
/////////////////////
|
|
|
|
The above example produces these terms:
|
|
|
|
[source,text]
|
|
---------------------------
|
|
[ 786, 335, 514 ]
|
|
---------------------------
|